What Matters for Bioacoustic Encoding: A Practical Training Recipe for Building Generalizable Models


Key Takeaways:
- Our large-scale empirical study evaluated 19 models across 26 datasets to understand what matters in training generalizable bioacoustic encoders.
- We found that a two-stage approach (self-supervised pre-training followed by supervised post-training) on a mix of bioacoustic and general audio data yields state-of-the-art performance across a range of tasks.
- Data diversity is crucial: models trained on general-audio sources in addition to bioacoustic data outperform those limited to bioacoustic data alone. That’s because diverse audio helps the models learn fundamental sound characteristics that aren't unique to specific animal calls.
- As part of the evaluation, we developed an expanded set of benchmark tasks for ethology including individual identification and vocal repertoire discovery.
- We will open-source all model checkpoints to guide and accelerate future work in bioacoustics.
Today, most AI models for bioacoustics are designed for narrow purposes – usually for a specific taxon, like birds, or for limited tasks like detection and species classification. Current models like BirdNET and Perch perform well within their domains, but struggle with new species, environments, or research questions they weren’t trained for. Most of these models have been trained with supervised learning, which requires large amounts of high-quality labelled data that aren’t always available and often require intensive manual annotation.
The most impactful applications for conservation and ethology require flexibility beyond the environments models were trained in, for tasks like identifying individuals, characterizing the vocal repertoires of understudied species, or detecting distress calls in noisy environments. Ethologists often work with limited datasets that lack the high-quality annotations required for supervised learning.
This challenge motivates much of our Foundations work at ESP, including the development of NatureLM-audio, which was designed to generalize across taxa and tasks. We’ve already seen it perform well at detecting human speech within frog recordings – something it was never explicitly trained to do.
To understand how to build better generalizable models for bioacoustics, we conducted a large-scale study investigating which factors matter most when training a generalizable bioacoustic encoder: a type of model that learns general representations of animal sounds that can be reused across species and research questions. We’ve published the results in our recent paper, What Matters for Bioacoustic Encoding.
The Central Question
The primary goal of this study was to discover what really matters when training a bioacoustic encoder to generalize effectively across diverse species and tasks. To answer this, we tested combinations of the following training “ingredients” (a small sketch of the resulting experiment grid follows the list):
- Model architecture: Convolutional Neural Networks (CNNs) vs. Transformers
- Training data: Bioacoustic data only vs. a mix of bioacoustic and general audio
- Training paradigm: Self-supervised vs. supervised learning
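To make the study design concrete, here is a small, hypothetical sketch of the experiment grid these ingredients imply. The option names (and the combined SSL-then-supervised paradigm) are illustrative assumptions, not the paper’s exact configuration names:

```python
# Hypothetical sketch of the ablation grid implied by the three ingredients.
# Option names are illustrative; the paper's exact configurations differ.
from itertools import product

architectures = ["cnn", "transformer"]
data_mixes = ["bio_only", "bio_plus_general_audio"]
paradigms = ["supervised", "self_supervised", "ssl_then_supervised"]

# Each combination defines one candidate training run to evaluate downstream.
for arch, mix, paradigm in product(architectures, data_mixes, paradigms):
    print(f"run: {arch} / {mix} / {paradigm}")
```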
Below, we summarize the main findings from this study as a practical guide to strategies for building more robust bioacoustic models.
Methodology
We trained and evaluated 19 model variants with different architectures and training strategies. Each model was tested across 26 datasets spanning four tasks: species classification, detection, individual identification, and vocal repertoire discovery. The figure below summarizes our empirical study design:

Models
The models we evaluated included several general-audio encoders as baselines, such as BEATs, EAT, and EfficientNet, alongside state-of-the-art bioacoustic encoders like BirdNET and Perch. We also included SurfPerch (trained on diverse taxa), BirdAVES (which uses a self-supervised approach), and NatureBEATs (the BEATs encoder extracted from NatureLM-audio) to further analyze how different training choices influence performance.

Training Data
We trained the models on both general audio (from AudioSet) and a collection of bioacoustic datasets covering a diverse range of taxa, detailed below:

We were also interested in understanding the impact of data diversity. We compared models trained exclusively on bioacoustic data with those that incorporated general audio to see how much this broader mix improved performance.
Expanded Evaluation Benchmarks
Current bioacoustic benchmarks like BEANS and BirdSET focus on species classification (identifying the focal species in a recording) and detection (identifying when a relevant animal vocalization occurs within a recording). While these are fundamental tasks, they don’t fully test the quality of an encoder’s understanding of the deeper, more nuanced patterns in animal communication. To address this, we added two new tasks that are also essential for real-world ethology research:
- Individual identification: Can the model distinguish between different individuals of the same species based on their vocalizations alone?
- Vocal repertoire discovery: Can the model identify and differentiate the distinct vocalizations/call types that make up a species’ vocal repertoire?
The full details of evaluation benchmarks and datasets can be found in Table 2 of our paper.
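To give a flavor of how tasks like these can be scored from a frozen encoder’s embeddings, here is a minimal sketch: a linear probe for individual identification and unsupervised clustering for repertoire discovery. The probe and metric choices are our illustrative assumptions, not necessarily the exact protocol used in the paper:

```python
# Illustrative scoring of the two new tasks from frozen-encoder embeddings.
# Probe type and metrics are assumptions, not the paper's exact protocol.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, adjusted_rand_score

def individual_id_score(train_emb, train_ids, test_emb, test_ids):
    """Individual ID: fit a linear probe on embeddings to separate individuals."""
    probe = LogisticRegression(max_iter=1000).fit(train_emb, train_ids)
    return accuracy_score(test_ids, probe.predict(test_emb))

def repertoire_score(call_emb, call_type_labels):
    """Repertoire discovery: cluster call embeddings without labels, then
    check agreement with annotated call types (higher = better separation)."""
    k = len(set(call_type_labels))
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(call_emb)
    return adjusted_rand_score(call_type_labels, clusters)
```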
Results: A Practical Training Recipe
We found that a two-stage training recipe, trained on a mix of broad bioacoustic and general audio data, performed best across this expanded benchmark suite.

This two-stage approach combines the best of both worlds. Self-supervised learning excels at capturing general, robust features from raw audio, making it particularly useful for handling out-of-distribution data – sounds or species the model hasn’t encountered before. On the other hand, supervised learning performs much better in-domain and can sharpen performance for more targeted tasks.
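As a rough illustration, here is a compressed PyTorch-style sketch of what the two stages can look like. The masked-reconstruction objective, the `encoder.embed_dim` attribute, and all hyperparameters are stand-in assumptions rather than the paper’s exact setup:

```python
# Compressed sketch of the two-stage recipe. The masked-reconstruction
# objective and hyperparameters are stand-ins, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_self_supervised(encoder, unlabeled_loader, n_mels, epochs=1):
    """Stage 1: self-supervised pre-training on unlabeled bio + general audio.
    Assumes encoder maps (batch, time, n_mels) -> (batch, time, embed_dim)."""
    decoder = nn.Linear(encoder.embed_dim, n_mels)
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.AdamW(params, lr=1e-4)
    for _ in range(epochs):
        for spec in unlabeled_loader:                   # log-mel spectrograms
            mask = torch.rand(spec.shape[:2]) < 0.3     # hide 30% of frames
            masked = spec.masked_fill(mask.unsqueeze(-1), 0.0)
            recon = decoder(encoder(masked))
            loss = F.mse_loss(recon[mask], spec[mask])  # score masked frames only
            opt.zero_grad()
            loss.backward()
            opt.step()

def posttrain_supervised(encoder, labeled_loader, num_classes, epochs=1):
    """Stage 2: supervised post-training on labelled bioacoustic data."""
    head = nn.Linear(encoder.embed_dim, num_classes)
    params = list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=1e-5)            # gentler lr after SSL
    for _ in range(epochs):
        for spec, label in labeled_loader:
            emb = encoder(spec).mean(dim=1)             # pool over time frames
            loss = F.cross_entropy(head(emb), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
```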

Data Diversity
We also found that diversity in the training data is important: models trained on a mix of bioacoustic and general audio outperformed those trained on bioacoustic data alone.

For example, EffNetB0-all, which was trained on bioacoustic recordings combined with AudioSet, achieved stronger performance than EffNetB0-bio (trained on bioacoustic data alone) on almost every metric. Incorporating general audio not only improves performance, it also greatly expands the amount of data available for training bioacoustic models, helping overcome the challenge of data scarcity in bioacoustics.
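As a hypothetical illustration of one way to build such a mix, the sketch below combines a bioacoustic dataset with general audio using standard PyTorch utilities, so that each batch draws a fixed proportion from each source. The sampling scheme and ratio here are our assumptions; the paper’s actual mixing strategy may differ:

```python
# Hypothetical data-mixing sketch; the paper's actual ratios/strategy may differ.
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(bio_ds, general_ds, bio_fraction=0.5, batch_size=64):
    """Draw ~bio_fraction of each batch from bioacoustic data and the rest
    from general audio, regardless of the raw dataset sizes."""
    combined = ConcatDataset([bio_ds, general_ds])
    weights = ([bio_fraction / len(bio_ds)] * len(bio_ds)
               + [(1 - bio_fraction) / len(general_ds)] * len(general_ds))
    sampler = WeightedRandomSampler(weights, num_samples=len(combined),
                                    replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```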
Foundations for Future Research
While there have been various efforts to build and evaluate bioacoustic encoders, the field has lacked a shared framework and consistency across methods, findings, and evaluation. We hope our comprehensive evaluation addresses this gap, providing a clear roadmap for building generalizable bioacoustic encoders. By sharing our findings and releasing model checkpoints, we aim to accelerate progress in the field, enabling others to build on what we’ve learned rather than start from scratch.
Future work includes exploring optimal data mixes and ratios for specific tasks or taxa, investigating transfer learning across taxa, and scaling up with larger datasets. Having a generalizable and robust encoder is foundational to the rest of our work at ESP, as it will only make our other models (such as NatureLM-audio and Voxaboxen) stronger as well. As machine learning in bioacoustics continues to evolve, we believe these foundational building blocks will allow researchers to tackle increasingly complex and important questions in animal communication and conservation.