ESP Technical Roadmap


This blog post provides an overview of ESP's technical roadmap, offering a path forward for our proposed and ongoing scientific work, which is focused on using AI to decode non-human animal communication.

Machine learning (ML) has proven to be a powerful tool for learning latent representations of data (LeCun et al., 2015). For example, learned representations of human languages have enabled translation between different languages without the use of dictionaries (Artetxe et al., 2017; Lample et al., 2017). More recently, ML has been able to generate realistic images based on text descriptions (DALL-E 2), as well as predict and imitate the words we speak (AudioLM).

We hypothesize that ML models can provide new insights into non-human communication, and believe that discoveries fueled by these new techniques have the potential to transform our relationship with the rest of nature. We also expect that many of the techniques we develop will be useful tools for processing data in applied conservation settings (Tuia et al., 2022).

The Challenge

Unraveling meaning in non-human animal (henceforth, animal) communication is a hard problem which biologists have studied for many decades (Bradbury and Vehrencamp, 2011; Kershenbaum et al., 2014). Even for human languages, capturing their full diversity is still an open problem for ML (Joshi et al., 2020). Animals stretch our intuitions about communication even further. For example:

Using animal communication data in ML also presents a challenge. Much of ML uses data that humans generate to share on the web, so the data is easy to collect and intuitively formatted for us (e.g., text captions on images). Data from other animals are difficult to record, and in order to annotate and structure these data, we rely on the small number of scientists who have expertise with a given species. Moreover, we can only evaluate model predictions based on what is already understood about that animal’s communication and behavior. Finally, our research, from data collection to model evaluation, must be subject to thorough ethical constraints to prevent harm to the animals involved.

Our Approach 

Given these challenges, we focus on areas where scientists can use ML-based methods to:

  1. reveal and explore patterns in data, and generate data-driven hypotheses about animal communication behavior; and
  2. perform experiments with a level of nuance that is inaccessible using traditional methods.

In particular, our research focuses on ML techniques which help address the following questions, each a central aspect of what “decoding” means to us:

  • Under what conditions does an animal produce a signal?
  • How do signals influence the behavior of receivers?
  • What structure in a signal is relevant for a receiver’s behavioral response?
  • How do communication strategies compare between species, and between populations?

These questions give rise to specific hypotheses in the different projects we work on with a variety of scientific partners.

In this roadmap, we lay out the key areas of research we believe will be required to make significant progress toward two-way communication with other species. This research is divided into four major areas: Data, Fundamentals, Decode, and Communicate. There are dependencies between these areas and progress in one supports the others, so we work across all four concurrently. We hope that techniques proposed here can inform communication with other non-human species, whether plants, fungi, or others yet to be discovered.


Data

Data form the heart of any machine learning project. The data we choose to focus on guide all other aspects of our research. We aim to identify multimodal datasets that are promising for the application of ML to decoding animal communication, and support making them widely available.


As sensors and recording devices have become smaller and more efficient, the amount of data recorded by biologists is ever-increasing. However, these new troves of recordings are often too vast to process and analyze manually. Increasingly, researchers turn to machine learning in order to solve the problems associated with data of this scale (Tuia et al., 2022; Stowell, 2021).

For ESP, this increased scale of data enables the use of powerful deep learning models. In addition to using publicly available datasets such as FSD50K and xeno-canto, we access large datasets related to animal communication and behavior through over 40 partnerships with biologists and institutions worldwide (and growing). We work with the permission of data providers to make the data, annotations, and code publicly accessible, in order to promote a more open standard of data availability than exists at present in animal communication research (Baker and Vincent, 2019). 


Like humans, animals perceive and interact with the world through multiple modalities, such as vision, sound, smell, and touch. We predict that decoding animal communication will require data that reflect the multimodal nature of experience. Sources of multimodal data may include, but are not limited to, animal-borne bio-loggers, camera traps, third-person video recordings, and web crawls. In the video below, Dr. Ari Friedlaender deploys a video bio-logger on a humpback whale.

Focusing on multimodal data carries an additional benefit: enabling certain strategies for training models in the absence of human annotation. For example, a model can be trained to predict the data recorded in one modality, conditioned on the data recorded in another modality (Aytar et al., 2016; Arandjelović and Zisserman, 2016).
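As a rough illustration of this idea, the toy sketch below fits a simple linear predictor of one modality from another, using only the pairing of the two recordings as supervision. The feature names (call loudness, body acceleration) and all values are hypothetical and chosen purely for illustration; models used in practice would be deep networks operating on raw sensor data.

```python
def fit_cross_modal_predictor(source, target):
    """Fit a least-squares line that predicts `target` from `source`.

    The training signal is the pairing of the two modalities,
    not a human-provided label.
    """
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    var = sum((s - mean_s) ** 2 for s in source)
    slope = cov / var
    intercept = mean_t - slope * mean_s
    return slope, intercept


def predict(model, value):
    """Predict the target-modality feature for one source-modality value."""
    slope, intercept = model
    return slope * value + intercept


# Hypothetical paired measurements from a single bio-logger deployment:
# call loudness (audio modality) and body acceleration (motion modality).
call_loudness = [0.1, 0.4, 0.5, 0.9]
body_acceleration = [0.3, 0.9, 1.1, 1.9]

model = fit_cross_modal_predictor(call_loudness, body_acceleration)
predicted_acceleration = predict(model, 0.7)
```

No annotation is involved at any point: the motion recording itself plays the role of the label for the audio recording.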


Fundamentals

There is a set of fundamental data-related challenges along the path to decoding animal communication. These include event detection, individual identification, source separation, and denoising (Sainburg and Gentner, 2021). ESP aims to contribute tools that can enable us and the broader community to solve these challenges for animal communication across various taxa.


A benchmark is a collection of tasks and datasets designed to measure the performance of ML algorithms in a standardized manner. Benchmarks spur the development of new methods and serve as a proxy for measuring the progress in a field of research. For example, benchmarks have served this purpose well in computer vision (ImageNet), human language (SuperGLUE), and speech processing (SUPERB).
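To make the definition concrete, here is a minimal sketch of a benchmark harness: a fixed set of tasks, a single shared metric, and a function that scores any model on all of them in the same way. The task names, the toy inputs (peak frequencies in kHz), and both models are hypothetical and do not correspond to any existing benchmark.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def run_benchmark(model, tasks):
    """Score one model on every task using the same metric."""
    scores = {}
    for name, (inputs, labels) in tasks.items():
        predictions = [model(x) for x in inputs]
        scores[name] = accuracy(predictions, labels)
    return scores


# Hypothetical tasks: each name maps to (inputs, labels).
# Inputs are toy peak frequencies in kHz, not real recordings.
tasks = {
    "site_a": ([2.0, 8.0, 7.5, 1.5], ["owl", "wren", "wren", "owl"]),
    "site_b": ([1.0, 6.0, 0.5, 9.0], ["owl", "wren", "owl", "wren"]),
}


def threshold_model(khz):
    """Toy classifier: high-pitched calls are labeled 'wren'."""
    return "wren" if khz > 5.0 else "owl"


def baseline_model(khz):
    """Trivial baseline that ignores its input."""
    return "owl"


threshold_scores = run_benchmark(threshold_model, tasks)
baseline_scores = run_benchmark(baseline_model, tasks)
```

Because both models are scored on identical data with an identical metric, their results can be compared directly, which is the property that working in silos lacks.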

Benchmarks for fundamental challenges in animal communication research are largely absent, meaning researchers are working in silos without a way of comparing results. By developing benchmarks in collaboration with our partners, beginning with Hagiwara et al. (2022), we provide common standards for researchers developing new methods in this field. Additionally, we expect that these benchmarks will serve to draw the attention of the ML community to challenges arising in biology.

Foundation Models

Foundation models have recently become dominant in many domains of ML. These models are trained on large amounts of data, typically in a self-supervised manner (Bommasani et al., 2022). In part due to the scale of data, these models can perform difficult predictive and generative tasks (e.g., Brown et al., 2020; Radford et al., 2021). These models are also useful for domains with less data, as they often exhibit state-of-the-art performance when fine-tuned for specific applications (e.g., Chen et al., 2020). More generally, models trained using self-supervision do not rely on the availability of human annotations. This makes self-supervision well-suited for problems in bioacoustics and animal behavior, where annotated data are scarce.

Beginning with the AVES model (Hagiwara, 2022), ESP will develop large foundation models focused on biological data. We will release them as open-source tools for the wider research communities. These models will give us powerful general tools for addressing fundamental challenges related to the decoding of animal communication. 


Decode

Working with our partners, we develop and apply new ML-powered methods for better understanding animal communication. In particular, we focus on methods for the discovery of latent patterns and for data interpretation.

Animal communication is influenced by factors which challenge today’s machine learning systems (Bisk et al., 2020), such as long-term temporal context, social structure, selective pressures, and life history. We intend for scientists to use the ML methods we develop within a holistic approach that accounts for the wider biological context.


We develop self-supervised ML models for discovering patterns in animal behavior data. These techniques will help reveal patterns in when animals send signals, as well as how these signals influence behavior. For example, ESP is developing models to predict an individual’s motion conditioned on that individual’s vocal behavior (and vice versa), potentially revealing when an animal’s vocalizations are predictive of its behavior.

By selecting the correct way to complete an animal’s vocalization, given its motions and its previous vocalizations, a model can learn associations between behaviors across modalities.

Another self-supervised approach, called clustering, focuses on grouping similar data. ESP is currently adapting deep clustering techniques (e.g. Ji et al., 2019) to automatically generate an inventory of an animal’s behavior (automated ethogram discovery) from motion bio-logger data, and plans to apply similar techniques to vocal behavior. Quantifying these behavioral repertoires can enable comparative studies between species and populations.
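The core idea of clustering can be sketched with plain k-means, a much simpler stand-in for the deep clustering techniques mentioned above. In this toy example, each point is a hypothetical summary of one window of motion data (mean acceleration, depth change), and the resulting cluster labels serve as a crude two-behavior inventory. The features, values, and deterministic initialization are assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means (Lloyd's algorithm) with deterministic initialization."""
    # Use the first k points as initial centroids (an illustrative choice).
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical motion windows: (mean acceleration, depth change).
windows = [
    (0.10, 0.00),  # resting-like
    (2.00, 1.80),  # lunging-like
    (0.20, 0.10),
    (2.10, 2.00),
    (0.15, 0.05),
    (1.90, 1.90),
]

assignments, centroids = kmeans(windows, k=2)
```

No behavior labels are supplied; the grouping emerges from the data alone, and a scientist would then inspect each cluster to decide which behavior, if any, it corresponds to.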

Because these methods are intended to reveal new patterns in data, we will validate their performance in well-understood contexts, including in benchmark datasets with human annotations.


ML can enable new ways to interpret how vocal signals vary in the wild, and how this variation relates to signal meaning. ESP is developing techniques inspired by recent work in computer vision (Lang et al., 2021) that visualizes the signal variations that are most important for ML-based classification. An analogous approach, sketched in the next figure, could help scientists form hypotheses for how a signal’s acoustic structure relates to its communicative function.

Mockup of a tool for the exploration of acoustic features of animal vocal communication. In one potential use case, a supervised classification model predicts which individual made the recorded vocalization. The sliders at the right alter the acoustic features of the vocalization which were highly influential to the classifier’s prediction. The scientist can explore these features by adjusting the sliders, and investigate how to discriminate between vocalizations coming from different individuals. This could help reveal how individuals encode identity in their signature vocalizations (e.g., Vergara and Mikus, 2018; Wanker et al., 2005; Jansen et al., 2012).

Models which make cross-modal predictions often do so by learning to represent both modalities in a single shared representation of the data (e.g. Radford et al., 2021). These shared representations provide a second method of data interpretation: if two different communication systems are grounded in a shared modality, that modality can be used as a ‘pivot’ to translate between those systems. As an example, population-specific vocalizations or ‘dialects’ have been documented for several species (Henry et al., 2015; Podos and Warren, 2007). Predicting an animal’s motion based on its vocalizations may enable translation between the dialects of two populations, by using motion as the pivot. Recently, a similar approach was used in the human domain to extend self-supervised translation with monolingual data by pivoting through images (Dinh et al., 2022).
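The pivot idea can be sketched as a nearest-neighbor lookup in the shared modality. In the toy example below, each call type is paired with a motion vector, and a call from one population is "translated" to the call in the other population whose motion is closest. The call names, the motion features (speed, turning rate), and the lookup tables are all hypothetical; in practice the mapping from vocalization to motion would be a learned model rather than a table.

```python
import math


def translate_via_pivot(call, source_to_motion, target_to_motion):
    """Return the target-population call whose associated motion is
    closest to the motion associated with the source call."""
    pivot = source_to_motion[call]
    return min(
        target_to_motion,
        key=lambda candidate: math.dist(pivot, target_to_motion[candidate]),
    )


# Hypothetical call types for two populations, each paired with the
# mean motion vector (speed, turning rate) observed alongside that call.
population_a = {"A1": (0.1, 0.0), "A2": (1.5, 0.2), "A3": (0.4, 1.8)}
population_b = {"B1": (1.4, 0.3), "B2": (0.5, 1.7), "B3": (0.0, 0.1)}

translations = {
    call: translate_via_pivot(call, population_a, population_b)
    for call in population_a
}
```

Such a lookup proposes only candidate correspondences between dialects. Whether two calls matched in this way actually share a function would still need to be tested, for example through the playback experiments described later in this roadmap.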

Finally, simulation and reinforcement learning in physically realistic virtual environments may allow us to reproduce and examine aspects of animal communication and behavior in a controlled environment (e.g., Whitehead and Lusseau, 2012). Simulation can also be leveraged for data augmentation when the training signals are sparse (de Melo et al., 2021; Winkler et al., 2022), as well as for studying the emergence and the evolution of communication systems (Lazaridou and Baroni, 2020).


Communicate

Playback experiments are a popular tool to test hypotheses about communication in the context of animal behavior. In a playback experiment, an experimenter presents communication signals to an animal, and observes the receiver’s behavioral response. Such experiments can reveal how animals perceptually discriminate between signals (e.g. Sayigh et al., 2017) and how communication alters behavior (e.g. Seyfarth, Cheney, and Marler, 1980). ESP aims to design ML models that enable nuanced playback experiments, and to work with partners on developing ethical frameworks that promote positive use of ML-generated communication.

Signal Generation

In acoustic playbacks, vocalizations are typically recorded before an experiment (McGregor, 2000). Currently, modifications of these pre-recorded sounds rely on traditional audio processing techniques. We aim to introduce new ML-based models which allow for more sophisticated signal modifications (e.g. Kameoka et al., 2018; Engel et al., 2020; Karras et al., 2018).

Interactive playback experiments, in which the presentation of vocalizations varies in real time in response to the animal’s behavior, can reveal the role of ongoing interactions in driving communication (King, 2015). For human vocal communication, recent work shows how an ML model can learn to generate realistic continuations of short segments of speech (Borsos et al., 2022). Extending such an approach beyond human speech could establish highly sophisticated paradigms for interactive experiments, in which an ML model interacts directly with an animal.

New Ethical Questions

Playback experiments carry the risk of causing harm to an animal, for example by influencing their behavior in such a way that they lose a foraging or mating opportunity. For this reason, all playback experiments will be performed in partnership with biologists. Where established ethical frameworks (Cuthill, 1991; Putnam, 1995) hold, we will work within those frameworks in collaboration with biologists on their species of study.

As it has in the human domain (e.g., Meskys et al., 2019; Bender et al., 2021), the act of generating communicative signals of animals raises new ethical questions. For example, in social species with vocal learning, experiments may run the risk of altering culture by introducing novel calls that spread in wild populations (Garland and McGregor, 2020). To mitigate these risks, we will focus on captive populations, and populations whose vocal communication is not socially learned. Where established ethical frameworks do not hold, we will work with biologists, ethicists, and governance bodies to establish a set of norms, rules, and regulations.


We thank Damian Blasi, Julien Cornebise, Michelle Fournet, Steven Moran, Sebastian Ruder, Christian Rutz, Robert Seyfarth, and John Thickstun for helpful comments in the process of creating this roadmap. We also thank Peter Bermant for his input.
