What We Do

Gelada monkeys groom each other. Photo by Rieke T-bo.

Our roadmap to decode

The motivating intuition for ESP was that modern machine learning can build powerful semantic representations of language, which we can use to unlock communication with other species. We are currently developing a full technical roadmap based on the key stages outlined below.

While our projects are working towards all areas of the roadmap pictured here, we are currently primarily focused on the Data and Fundamentals stages. 

Data
The amount of data recorded from the natural world is vast and ever-increasing. Recording devices deployed by field biologists have become smaller, more efficient, and less expensive than ever, and modern storage can keep pace with the data they produce. We are working closely with individual researchers and institutions on datasets that allow for the development of machine learning models. There is a growing focus in this space on multi-modal data, which is becoming increasingly available as the technology for capturing it through animal-borne tags, camera traps, and other devices rapidly improves. This multi-modal data may include audio, video, accelerometry, and environmental variables, and it provides a rich, multi-faceted view of the context in which other species are operating; this context is critical to building semantic representations of language.

While we believe multi-modal data will be key to decoding meaning, we will also continue to work with mono-modal data, such as purely acoustic data. These data give us the opportunity to develop and test new techniques, as well as to work with partners to answer specific research questions about animal communication.

In the process of gathering datasets, we will use animals’ own sensory capabilities to guide our focus. For instance, investigating bat vocalizations will require high-frequency audio recordings, while an investigation of visual displays by birds may require ultraviolet imaging (He et al., 2022).

Fundamentals
We are also actively working to define and solve keystone challenges along the path to building the semantic representations we need to decode non-human communication. These include source separation, denoising, classification, detection, clustering, and automatic motif discovery. All these challenges benefit from “fundamentals”: standardized benchmark datasets and powerful foundation models trained through self-supervised approaches.
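One of these keystone challenges, detection, can be made concrete with a toy sketch. The fragment below is a deliberately minimal energy-threshold detector written purely for illustration (the function and parameter names are invented for this sketch, not ESP tooling): it flags frames of a synthetic recording whose energy rises well above the noise floor, a crude stand-in for locating candidate vocalizations.

```python
import numpy as np

def detect_events(signal, frame_len=1024, threshold_db=10.0):
    """Flag frames whose energy exceeds the noise floor by threshold_db.

    A toy illustration of the 'detection' challenge: locating candidate
    vocalizations in a recording. All names here are invented for this
    sketch, not ESP tooling.
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10 * np.log10(np.mean(frames**2, axis=1) + 1e-12)
    noise_floor = np.median(energy_db)  # most frames are background noise
    return np.flatnonzero(energy_db > noise_floor + threshold_db)

# Synthetic recording: 1 s of faint noise with a 2 kHz "call" in the middle.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
signal = 0.01 * rng.standard_normal(sr)
signal[6000:10000] += 0.5 * np.sin(2 * np.pi * 2000 * t[6000:10000])
event_frames = detect_events(signal)  # frames overlapping the call
```

Real recordings are far noisier than this synthetic one, which is exactly why detection remains an open challenge rather than a solved preprocessing step.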

  1. The development of public benchmarks for keystone biology tasks. 
    A benchmark is a collection of tasks and datasets designed to measure the performance of ML algorithms in a standardized manner. Benchmarks have enabled much of the progress we have seen in AI for the human domain, but they are not yet available for non-human species. Publicly available benchmarks will push research at the intersection of machine learning and biology, set a common standard for future models, and advance the entire field.
  2. The development of large foundation models using data from our biologist partners and internet-scale open data. 
    Foundation models that are trained on large amounts of data and are adaptable to a wide range of tasks are now mainstream in many domains of ML (language, vision, speech), and they typically outperform models designed and trained for individual tasks. What gives foundation models their power is self-supervised learning, which allows algorithms to learn patterns directly from massive volumes of raw data, without the need for human annotations (He et al., 2021; Liu et al., 2020; Conneau et al., 2020).

    These models have yet to be applied in bioacoustics and ethology. The scale of our partnerships and the breadth of our benchmarks lay the groundwork for ESP to build powerful, general-purpose models meant to break through the keystone challenges. Released as open-source tools for the field, these models will be our version of GPT-3 or OPT for the non-human domain.
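To make the idea of a benchmark (item 1 above) concrete: a benchmark is essentially a fixed suite of tasks plus a scoring rule, so that any model can be compared on equal footing. A minimal sketch of such a harness follows; the function, task, and variable names are invented for illustration and are not ESP tooling.

```python
def run_benchmark(model, tasks):
    """Score a model on a fixed suite of tasks.

    `tasks` maps a task name to (inputs, labels); `model` is any callable
    mapping one input to one prediction. Names invented for this sketch.
    """
    results = {}
    for name, (inputs, labels) in tasks.items():
        predictions = [model(x) for x in inputs]
        correct = sum(p == y for p, y in zip(predictions, labels))
        results[name] = correct / len(labels)
    return results

# Toy task: classify a call as long (1) or short (0) from its duration.
tasks = {"call-length": ([3, 12, 7, 20], [0, 1, 0, 1])}
baseline = lambda x: int(x > 10)
scores = run_benchmark(baseline, tasks)
```

The value of the standardization is that the same `tasks` suite can score every future model, which is what lets a benchmark track progress across an entire field.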
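Self-supervised learning (the engine behind item 2) derives its training signal from the raw data itself rather than from human labels. A deliberately tiny analogy in Python: learning to fill in a masked token purely from co-occurrence statistics gathered over unlabeled sequences. The token names and functions below are made up for illustration; real systems like HuBERT do masked prediction over learned acoustic units, not word-like tokens.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Learn next-token statistics from raw, unlabeled sequences.

    The supervision signal comes from the data itself: each token is
    'labeled' by the token that follows it. No human annotation needed.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def predict_masked(counts, prev_token):
    """Fill in a masked position with the most likely continuation."""
    return counts[prev_token].most_common(1)[0][0]

# Invented token names standing in for discretized vocal units.
corpus = [["chirp", "trill", "chirp", "trill"], ["chirp", "trill", "buzz"]]
counts = train_bigram(corpus)
```

Scaled up by many orders of magnitude, with neural networks in place of count tables, this masked-prediction pretext task is what lets foundation models learn structure from massive volumes of raw recordings.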

We are currently engaged in a number of critical AI research projects, developed in close collaboration with our partners in biology and machine learning.

  • Creating unified benchmarks and datasets, both acoustic and multi-modal, vetted by top biologists and AI researchers, to validate our experiments and accelerate the field
  • Turning motion into meaning: automatic behavior discovery from large-scale animal-borne tag data
  • Training self-supervised foundation models such as HuBERT and evaluating them against our benchmarks
  • Establishing semantic generation and editing of communication, which will ultimately allow for the creation of new signals that carry meaning and unlock two-way communication with another species
Our full technical roadmap will be published in early 2023. Stay tuned for more details!

“I was just working with a student yesterday on a section of whistles that was so dense that we decided we weren't going to be able to do anything with it... Maybe there is hope for the future!”