Introducing NatureLM-audio: An Audio-Language Foundation Model for Bioacoustics
By David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin
Key Takeaways
- Earth Species Project introduces NatureLM-audio, the first large audio-language model tailored specifically for animal sounds.
- This new model addresses the limitations of existing bioacoustics ML tools (limited to a narrow range of species and tasks) and general-purpose audio LLMs (poor domain knowledge about species and behaviors).
- On BEANS-Zero, a new benchmark developed to assess generalization in bioacoustics, NatureLM-audio demonstrated state-of-the-art, zero-shot performance across diverse tasks.
- By leveraging task transfer abilities from human audio (like speaker counting), NatureLM-audio can perform untrained tasks in bioacoustics, such as counting individuals in recordings.
- NatureLM-audio shows an ability to generalize to unseen species, showcasing emergent abilities crucial for addressing data scarcity.
- NatureLM-audio allows users to perform bioacoustic tasks using natural language prompts—no programming expertise required.
Introducing NatureLM-audio
Analyzing animal sounds unlocks profound insights into the natural world - from decoding complex communication and behaviors to monitoring the health of entire ecosystems. However, existing machine learning methods in this field are limited to a narrow range of species and tasks, hindering our ability to fully leverage the potential of bioacoustic data. On the other hand, general-purpose audio LLMs have poor domain knowledge about species and animal behavior.
To address these challenges, our team is excited to announce NatureLM-audio – the first large audio-language model tailored for animal sounds.
Trained on a newly compiled dataset combining large bioacoustic archives, human speech, and music, this model can solve a wide range of bioacoustic tasks zero-shot by simply being prompted with natural language queries and audio. From this input, it generates free-form text answers, greatly enhancing usability.
For example, NatureLM-audio can classify or detect thousands of species across diverse taxa including birds, whales, and anurans – without the need to retrain the model for each new task and without machine learning and programming expertise.
Evaluating on our new benchmark, BEANS-Zero, we find that NatureLM-audio can complete multiple tasks previously unexplored at a cross-species level, such as predicting the life-stage and simple call-types of birds, and captioning bioacoustic audio. These new capabilities show promise to accelerate data processing for communication studies across a wide range of species.
NatureLM-audio also demonstrates preliminary evidence of emergent abilities that address the pervasive issue of data scarcity in bioacoustics. Notably, the model can generalize to completely unseen species, and demonstrates task transfer from speech and music to (non-human) animals. By unifying these tasks and diverse taxa into a single, flexible model that users can interact with through natural language, we believe NatureLM-audio represents a significant step forward for bioacoustic analysis.
Building A Large Language Model for Animal Sounds
In fields like speech and music processing, large audio-language models have recently achieved remarkable success. These models excel at diverse tasks, exhibit emergent abilities to perform entirely new tasks, and offer the flexibility of language-based prompting. Typically, they are built by connecting an existing language model to an audio encoder, adapting it to process audio. This approach lets the model interpret audio inputs while leveraging the extensive knowledge already embedded in the language model. As a result, the model can take both an audio sample and a text prompt as input and generate a response in natural language. Trained in this manner, the model can learn multiple tasks simultaneously, each prompted separately with language instructions.
We adopted this approach for bioacoustics. We developed NatureLM-audio by compiling and training on a large dataset of millions of audio-text pairs. The majority of this data comes from bioacoustic archives such as Xeno-canto, iNaturalist, the Watkins Marine Mammal Sound Database, and the Animal Sound Archive. We also included general audio, human speech, and music data, aiming to transfer learned abilities from human audio processing to animal sounds. We trained NatureLM-audio on this comprehensive dataset by connecting a self-supervised pretrained audio encoder to a leading language model (LLaMA 3.1-8B).
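For readers who want a concrete picture of this pattern, the sketch below shows the standard audio-LLM bridge in PyTorch: features from a frozen, self-supervised audio encoder are projected into the language model's embedding space and prepended to the text prompt. The module names, dimensions, and the stand-in encoder are illustrative assumptions, not NatureLM-audio's actual implementation.

```python
import torch
import torch.nn as nn

class AudioToLLMBridge(nn.Module):
    """Minimal sketch of the audio-LLM pattern: audio features are
    projected into the language model's token-embedding space so the
    LLM can attend to them alongside the text prompt."""

    def __init__(self, audio_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Stand-in for a pretrained, self-supervised audio encoder
        # (e.g. an AVES-style model); kept frozen during training.
        self.audio_encoder = nn.Conv1d(1, audio_dim, kernel_size=400, stride=320)
        for p in self.audio_encoder.parameters():
            p.requires_grad = False
        # The trainable bridge: maps encoder features to the LLM embedding size.
        self.projection = nn.Linear(audio_dim, llm_dim)

    def forward(self, waveform: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples); prompt_embeds: (batch, n_tokens, llm_dim)
        feats = self.audio_encoder(waveform.unsqueeze(1))      # (batch, audio_dim, frames)
        audio_tokens = self.projection(feats.transpose(1, 2))  # (batch, frames, llm_dim)
        # The combined sequence would then be fed to a decoder-only LLM
        # (e.g. via the `inputs_embeds` argument in Hugging Face models).
        return torch.cat([audio_tokens, prompt_embeds], dim=1)

bridge = AudioToLLMBridge()
sequence = bridge(torch.randn(2, 16000), torch.randn(2, 12, 4096))
print(sequence.shape)  # torch.Size([2, 61, 4096])
```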
Evaluating NatureLM-audio Across Species and Tasks
To evaluate the performance of NatureLM-audio, we enhanced our existing benchmark, BEANS, to create BEANS-Zero (Benchmark of Animal Sounds Zero-Shot). In addition to core bioacoustic tasks, this benchmark is designed to assess the model's ability to generalize to unseen species and tasks without additional training—critical capabilities for advancing bioacoustic research. BEANS-Zero provides a standardized way to measure zero-shot performance across various bioacoustic tasks, enabling consistent comparisons and fostering progress in the field.
Zero-Shot Species Classification and Detection Across Taxa
Accurate classification and detection of animal species through their vocalizations are fundamental tasks in animal communication research. These processes are essential for understanding species presence, behavior, and population dynamics. Identifying the precise moments when species vocalize - the onset and offset of sounds - is often the critical first step in analyzing long audio recordings.
To test classification and detection, we evaluated NatureLM-audio on BEANS-Zero “zero-shot,” meaning we ran the model on all datasets without fine-tuning on the benchmark. Compared with existing large audio-language models, NatureLM-audio achieves state-of-the-art performance on most tasks in the benchmark, including important ones like classifying diverse bird and marine mammal sounds.
While previous Earth Species models, including AVES, achieved strong performance on these tasks, NatureLM-audio performs well even without fine-tuning on the individual datasets. This out-of-the-box performance means the model can be used to detect and classify a broad range of species, with no specialized machine learning skills needed for further training.
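To make the zero-shot protocol concrete, here is a minimal sketch of an evaluation loop in this style. The example fields, the `model.generate` call, and the exact-match metric are assumptions for illustration, not the BEANS-Zero or NatureLM-audio API.

```python
# Hypothetical zero-shot evaluation loop: the model receives only audio
# and a natural-language instruction, with no fine-tuning on the benchmark.
examples = [
    {"audio": "clip_0001.wav",
     "prompt": "Which species is vocalizing in this recording? "
               "Answer with the common name.",
     "target": "humpback whale"},
    # ... one entry per benchmark example
]

def zero_shot_accuracy(model, examples):
    correct = 0
    for ex in examples:
        # `model.generate` is a placeholder for an audio-LLM inference call.
        answer = model.generate(audio=ex["audio"], prompt=ex["prompt"])
        correct += int(answer.strip().lower() == ex["target"])
    return correct / len(examples)
```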
Generalizing to Unseen Species
Data scarcity is a defining challenge in bioacoustics. Many species are rare, with few recorded calls. For others, including many marine species, collecting and annotating data is arduous and expensive. To tackle this challenge, we investigated NatureLM-audio for two non-obvious abilities: classifying species unseen during training, and completing unseen tasks.
To investigate classification of unseen species, we ask the model to predict the scientific names of species held out of training in BEANS-Zero recordings. Given a call and a choice among two hundred species it has never heard, the model predicts the correct scientific name 20% of the time; random guessing would succeed only 0.5% of the time.
How can a machine learning model identify an animal it has never heard? While traditional classifiers are constrained to a fixed set of classes, NatureLM-audio responds freely in natural language text. The underlying language model encodes significant knowledge of animal species, and scientific names have a compositional, binomial structure consisting of genus and species (e.g. Homo sapiens). Both this knowledge and the name structure can be exploited: in many practical cases, getting just the genus correct is enough to pin down the species given the candidate options and the recording location. We are excited by this early result and its potential to address a fundamental challenge in the field.
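As a toy illustration of why the binomial structure helps, the snippet below grades a free-form prediction at the species and genus levels. The function names are ours, not part of any benchmark.

```python
def genus_of(binomial: str) -> str:
    """The genus is the first word of a binomial name, e.g. 'Homo' in 'Homo sapiens'."""
    return binomial.strip().split()[0].capitalize()

def match_level(predicted: str, actual: str) -> str:
    """Grade a free-form scientific-name prediction against the ground truth."""
    if predicted.strip().lower() == actual.strip().lower():
        return "species"   # exact binomial match
    if genus_of(predicted) == genus_of(actual):
        return "genus"     # right genus, wrong species epithet
    return "none"

print(match_level("Larus argentatus", "Larus argentatus"))  # species
print(match_level("Larus argentatus", "Larus fuscus"))      # genus
```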
Novel Tasks and Transfer from Speech
Going beyond the commonplace task of species prediction in bioacoustics, we evaluate four largely unstudied tasks, each with significant ecological implications.
Finer-Grained Monitoring through New Classification Abilities
- Predicting Life-Stage of Birds: For the first time, NatureLM-audio demonstrates the ability to determine the life-stage (chick, juvenile, nestling) of birds across many species. This advancement could transform ecological monitoring from simply detecting species presence to assessing both species and age distribution.
- Classifying Call Types: NatureLM-audio is trained to predict call-types at scale, differentiating between calls and songs across hundreds of bird species. Since different call-types are associated with specific behaviors, this capability enhances ethology and communication studies using unstructured data.
Supervision for these tasks is extracted from bioacoustic metadata as semi-structured text – we believe this approach unlocks large amounts of underutilized bioacoustic information currently stored as unstructured text and field notes attached to audio recordings.
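As a sketch of how such semi-structured metadata can yield training targets, the snippet below parses a made-up annotation string. Real archive metadata varies widely, so the field names and patterns here are purely illustrative.

```python
import re

# Made-up metadata line in the spirit of recording-archive annotations.
metadata = "Turdus merula | type: alarm call | stage: juvenile | quality: A"

def extract_labels(meta: str) -> dict:
    """Pull semi-structured supervision out of free-text recording metadata."""
    labels = {}
    call = re.search(r"type:\s*([\w\s-]+?)\s*(\||$)", meta)
    stage = re.search(r"stage:\s*(\w+)", meta)
    if call:
        labels["call_type"] = call.group(1)
    if stage:
        labels["life_stage"] = stage.group(1)
    return labels

print(extract_labels(metadata))
# {'call_type': 'alarm call', 'life_stage': 'juvenile'}
```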
Captioning Bioacoustic Audio
As a generative model, NatureLM-audio uniquely enables captioning of bioacoustic audio - generating descriptive annotations of animal sounds. This opens up future applications like characterizing call-types and behaviors, or even captioning nature documentaries. Captioning quality still has significant room to improve, but we are excited by this new avenue for exploration.
Transfer from Speech and Music
To address data scarcity in bioacoustics, we investigated the model's ability to generalize to completely unseen tasks. We evaluated NatureLM-audio on counting Zebra Finch individuals – a task it was not explicitly trained on. Surprisingly, the model achieved 38.3% accuracy, compared to random chance and baselines at 25%. We hypothesized that this performance stems from training on speech and music tasks, such as counting the number of human speakers in a recording. An ablation study confirmed the hypothesis: without training on speech and music, the model's performance on counting birds drops to random levels. NatureLM-audio thus transfers its ability to count human speakers to counting vocalizing birds. This finding opens up exciting avenues for addressing data shortages in bioacoustics by transferring abilities learned from more abundant human speech and music datasets.
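Because the model answers in free text, scoring a counting task requires pulling a number out of its response. The snippet below shows one simple way to do that; the example answers and the scoring are illustrative, not the evaluation code behind the reported numbers.

```python
import re

def parse_count(answer: str):
    """Extract the first integer from a free-form model answer, if any."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None

# Toy scoring against ground-truth counts; chance level in the
# four-way setup described above is 1/4 = 25%.
answers = ["I can hear 3 distinct individuals.", "There are 2 birds.", "4"]
truths = [3, 2, 1]
accuracy = sum(parse_count(a) == t for a, t in zip(answers, truths)) / len(truths)
print(accuracy)  # 0.666...
```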
What’s Next for NatureLM?
NatureLM-audio promises to address some of the persistent challenges of applying machine learning in bioacoustics.
With results like generalization to unseen species and transfer from speech and music to bioacoustics, we hope to address the data shortages posed by challenging field collection conditions and rare species. With new abilities like life-stage and call-type prediction in birds, large-scale acoustic monitoring could provide finer-grained insights into populations, ecosystem health, and animal behavior. And with a single model able to complete a wide variety of bioacoustic tasks without retraining, we hope to lower the barriers to entry for using machine learning for the benefit of biodiversity.
We will continue to scale this approach by incorporating additional modalities – such as visual data and accelerometer readings from animal-borne tags. We will open-source the code soon, and we plan to develop an intuitive UI that gives ethologists and conservation biologists direct access to the model for use with their own data. As we advance this technology, we remain vigilant about ethical considerations, addressing potential biases in species representation and mitigating risks of misuse, such as tracking endangered wildlife. We're committed to responsibly deploying these tools to foster a richer understanding of the communication systems of our fellow species and to strengthen conservation efforts.
To learn more, see the preprint on arXiv here or check out our demo page.