Voxaboxen: New Tool Supports Annotation of Large Audio Files


Benjamin Hoffman and Maddie Cusimano, ESP Senior Research Scientists

We are excited to release Voxaboxen, a new machine learning tool for research in bioacoustics. Based upon Earth Species Project's bioacoustics foundation model AVES, Voxaboxen is designed to detect and classify animal vocalizations in recorded audio, with the high level of temporal specificity that is required for studies focused on communication behavior. In this blog post, we will give an overview of why and how we designed Voxaboxen, as well as how to apply Voxaboxen to your own data. We anticipate this new tool will support and advance research by providing a straightforward way to automate annotation of large audio files.

Carrion crow calling by Alexis Lours

What does Voxaboxen do?

Scientists who work with recordings of animal sounds rely on machine learning methods to speed up the annotation of raw audio data. Most often, these methods use a segmentation framework, which divides a recording into short chunks and then predicts whether a species of interest is making sound during each chunk. However, animals often communicate using sequences of discrete vocal elements, and two or more individuals frequently exchange vocalizations in a duet or chorus. The temporal details of these exchanges are difficult to capture using segmentation methods (Figure 1). To overcome this shortcoming, we built Voxaboxen, which treats each vocalization as its own separate event (Figure 2). To do this, Voxaboxen takes inspiration from object detection frameworks used in computer vision.
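To make the contrast concrete, here is a minimal sketch in Python. It is not part of Voxaboxen: the function names, the 0.5-second chunk length, and the three example calls are all hypothetical. It labels fixed-length chunks the way a segmentation method might, then converts the positive chunks back into intervals.

```python
def chunk_labels(events, total_dur, chunk_dur):
    """Label each fixed-length chunk True if any event overlaps it."""
    n_chunks = int(round(total_dur / chunk_dur))
    labels = []
    for i in range(n_chunks):
        start, end = i * chunk_dur, (i + 1) * chunk_dur
        labels.append(any(on < end and off > start for on, off in events))
    return labels


def merge_chunks(labels, chunk_dur):
    """Turn runs of positive chunks back into (onset, offset) intervals."""
    intervals = []
    run_start = None
    # A trailing False closes any run that reaches the end of the recording
    for i, active in enumerate(labels + [False]):
        if active and run_start is None:
            run_start = i
        elif not active and run_start is not None:
            intervals.append((run_start * chunk_dur, i * chunk_dur))
            run_start = None
    return intervals


# Three hypothetical calls (in seconds): two close together, one later
events = [(0.10, 0.40), (0.45, 0.80), (2.10, 2.30)]
labels = chunk_labels(events, total_dur=3.0, chunk_dur=0.5)
recovered = merge_chunks(labels, chunk_dur=0.5)
print(len(events), "events in,", len(recovered), "intervals out")
```

In this toy case the first two calls fall into adjacent chunks and come back as a single one-second interval, so both their count and their individual onsets and offsets are lost. An event-based description keeps all three calls separate.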

Figure 1: Segmentation-based methods can detect whether animal vocalizations are present, but struggle to capture the onsets and offsets of nearby vocalizations.

Audio of carrion crow (Corvus corone) provided by Vittorio Baglione, Daniela Canestrari, Mark Johnson, Víctor Moreno-González, Eva Trapote Villalaín, and Christian Rutz.

Figure 2: Voxaboxen treats each vocal element as its own event, giving a more detailed description of how vocalizations are sequenced in time.

Transformer Model

Voxaboxen uses AVES as its backbone. AVES is a large transformer-based model trained with self-supervision on 360 hours of animal sound data. Because of this self-supervised pre-training step, we expect AVES-based models like Voxaboxen to achieve high performance on bioacoustics-related tasks after fine tuning. For example, on the Benchmark of Animal Sounds (BEANS), AVES-based classifiers achieved state-of-the-art performance on eight out of ten bioacoustic datasets.

In technical terms, Voxaboxen consists of the AVES backbone, followed by a linear prediction head. This prediction head has N+2 output channels, where N is the number of classes present in the dataset. One channel, analogous to the "objectness" channel in some object detection frameworks, detects the start sample of audio events. It is trained to minimize a masked focal loss term inspired by CornerNet. A second channel, trained to minimize L1 loss, predicts the duration of the audio event. The remaining N channels predict the class of the event and are trained to minimize a masked weighted cross-entropy loss term.
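As an illustration of how the N+2 output channels could be turned into discrete events, here is a minimal sketch in Python. The decoding rule (keeping local maxima of the detection channel above a threshold), the function name, the threshold, and the 50 Hz frame rate are assumptions made for this example. They are not taken from the Voxaboxen code base.

```python
def decode_events(det_probs, durations, class_scores, frame_rate, threshold=0.5):
    """Convert per-frame head outputs into (onset_s, offset_s, class_index) events.

    det_probs:    per-frame probability that an event starts at that frame
    durations:    per-frame predicted event duration, in seconds
    class_scores: per-frame list of N class scores
    """
    events = []
    n_frames = len(det_probs)
    for t in range(n_frames):
        p = det_probs[t]
        left = det_probs[t - 1] if t > 0 else float("-inf")
        right = det_probs[t + 1] if t < n_frames - 1 else float("-inf")
        # Assumed decoding rule: keep local maxima above the threshold
        if p >= threshold and p > left and p >= right:
            onset = t / frame_rate
            scores = class_scores[t]
            # The class with the highest score wins
            label = max(range(len(scores)), key=scores.__getitem__)
            events.append((onset, onset + durations[t], label))
    return events


# Hypothetical outputs for six frames and N = 2 classes
det_probs = [0.1, 0.9, 0.2, 0.1, 0.7, 0.1]
durations = [0.0, 0.25, 0.0, 0.0, 0.5, 0.0]
class_scores = [[0, 0], [2.0, -1.0], [0, 0], [0, 0], [-0.5, 1.5], [0, 0]]
decoded = decode_events(det_probs, durations, class_scores, frame_rate=50.0)
for onset, offset, label in decoded:
    print(f"class {label}: {onset:.2f}s to {offset:.2f}s")
```

Each detected start frame supplies the onset, the duration channel at that frame supplies the offset, and the class channels at that frame supply the label. Nearby vocalizations can therefore be reported as separate events.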


We trained Voxaboxen on four bioacoustics datasets which represent a variety of animal species and recording conditions. These datasets were chosen because they contain expert annotations of bounding boxes that are precisely aligned with the onsets and offsets of vocalizations. In Table 1 we report the performance of Voxaboxen on a held-out test set from each of these datasets. 

As an informal baseline, we fine tuned an image-based Faster-RCNN object detection model on each dataset. Adapted from the Detectron2 code base, these networks were pre-trained on images from the COCO detection task and were fine tuned to detect vocalizations in spectrograms.

For each of these experiments, we performed a small grid search to choose initial learning rate and batch size. For all four datasets, we found that Voxaboxen outperformed Faster-RCNN. 



Dataset | Taxa | Number of vocalizations (Train / Val / Test) | Number of classes considered | F1@0.5 IoU
BirdVox 10h | Passeriformes spp. | 4196 / 1064 / 3763 | |
 | Suricata suricatta | 773 / 269 / 252 | |
 | Bird spp. | 6849 / 2537 / 2854 | |
 | Bird spp. | 24385 / 9937 / 18034 | |




Table 1: Macro-averaged F1 score for each model, dataset pair. To compute these scores, we matched each predicted bounding box with at most one human-annotated bounding box, subject to the condition that the intersection over union (IoU) score of the proposed match was at least 0.5. Two of these datasets (BirdVox 10h and Meerkat) were previously used in the DCASE few-shot detection task. The code for dataset formatting and for replicating these experiments can be found on our GitHub repository.
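The matching procedure described in the caption can be sketched as follows. This is a minimal Python illustration, not the evaluation code from the repository. It assumes that boxes are time intervals given as (onset, offset) in seconds and that matches are assigned greedily in order of decreasing IoU. The function names and example values are hypothetical.

```python
def iou_1d(a, b):
    """Intersection over union of two (onset, offset) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(predicted, annotated, min_iou=0.5):
    """F1 score where each prediction matches at most one annotation."""
    pairs = sorted(
        ((iou_1d(p, a), i, j)
         for i, p in enumerate(predicted)
         for j, a in enumerate(annotated)),
        reverse=True,
    )
    used_pred, used_ann = set(), set()
    true_pos = 0
    # Assumed strategy: greedy matching, best IoU first
    for score, i, j in pairs:
        if score < min_iou:
            break
        if i in used_pred or j in used_ann:
            continue
        used_pred.add(i)
        used_ann.add(j)
        true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(annotated) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


def macro_f1(pred_by_class, ann_by_class, min_iou=0.5):
    """Average the per-class F1 scores, weighting every class equally."""
    classes = sorted(set(pred_by_class) | set(ann_by_class))
    scores = [
        f1_at_iou(pred_by_class.get(c, []), ann_by_class.get(c, []), min_iou)
        for c in classes
    ]
    return sum(scores) / len(scores) if scores else 0.0


annotated = [(0.0, 1.0), (2.0, 3.0)]
predicted = [(0.1, 1.0), (2.6, 3.4), (5.0, 5.5)]
f1 = f1_at_iou(predicted, annotated)
print(f"F1@0.5 IoU: {f1:.2f}")
```

In this example only the first prediction overlaps an annotation closely enough to count as a match. That leaves one true positive, two false positives, and one false negative.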

What’s next

Detecting objects in images often relies on specialized network architectures such as region proposal networks. To create Voxaboxen, we simply added a set of linear layers at the end of the AVES backbone, and chose an appropriate training scheme. In spite of its simplicity, Voxaboxen had strong performance relative to the baseline model. This is evidence that in bioacoustics, high capacity models trained with self-supervision can be readily adapted to a variety of downstream tasks.

There is clearly room to improve Voxaboxen’s detections relative to a human’s ability to box and classify vocalizations. Earth Species Project is continuing to develop self-supervised models for encoding audio features, which will likely lead to improvements in the performance of Voxaboxen after fine tuning. In addition, we will continue to develop Voxaboxen to improve its performance and usability. If you are interested in contributing to these efforts, please contact us. If you use Voxaboxen on your data, we would like to hear about your experience.

How to use Voxaboxen

To use Voxaboxen, it is first necessary to train ("fine tune") the model on an annotated subset of the dataset of interest. For the training data, Voxaboxen accepts annotations in the form of selection tables produced by Raven Lite, which is a free and widely used bioacoustics annotation platform. Vocalizations may be categorized into several classes (e.g. species), or be labeled as “unknown” to indicate the presence of a vocalization without knowledge of the category. 
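For readers unfamiliar with the format, a Raven selection table is a tab-delimited text file with one row per annotated sound. The sketch below reads one using only the Python standard library. The name of the label column ("Annotation"), the choice to treat an empty label as "unknown", and the example rows are assumptions made for illustration. Check them against your own tables and the Voxaboxen documentation.

```python
import csv
import io


def read_selection_table(handle, label_column="Annotation"):
    """Read onset, offset, and label for each row of a Raven selection table."""
    reader = csv.DictReader(handle, delimiter="\t")
    events = []
    for row in reader:
        events.append({
            "onset": float(row["Begin Time (s)"]),
            "offset": float(row["End Time (s)"]),
            # Assumption: a missing or empty label is treated as "unknown"
            "label": row.get(label_column) or "unknown",
        })
    return events


# Hypothetical table contents; in practice, open the .txt file exported by Raven
example = (
    "Selection\tView\tChannel\tBegin Time (s)\tEnd Time (s)\t"
    "Low Freq (Hz)\tHigh Freq (Hz)\tAnnotation\n"
    "1\tSpectrogram 1\t1\t0.52\t0.91\t500.0\t4000.0\tcrow\n"
    "2\tSpectrogram 1\t1\t1.30\t1.62\t500.0\t4000.0\tunknown\n"
)
events = read_selection_table(io.StringIO(example))
for event in events:
    print(event["label"], event["onset"], event["offset"])
```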

Once fine tuned, Voxaboxen can be used to produce annotations, also in the form of Raven selection tables, for audio files not used during training. We provide a demonstration Colab notebook, which gives the specific details of this process.
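To show what such an output file looks like, the sketch below writes a list of detected events as a tab-delimited selection table. It is an illustration only and not the Voxaboxen export code. The function name, the fixed frequency bounds, the "Annotation" column name, and the example events are assumptions.

```python
import csv
import io

COLUMNS = [
    "Selection", "View", "Channel", "Begin Time (s)", "End Time (s)",
    "Low Freq (Hz)", "High Freq (Hz)", "Annotation",
]


def write_selection_table(events, handle, low_hz=0.0, high_hz=8000.0):
    """Write (onset_s, offset_s, label) events as a Raven-style selection table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for i, (onset, offset, label) in enumerate(events, start=1):
        # Assumption: every box spans the same placeholder frequency range
        writer.writerow([i, "Spectrogram 1", 1, f"{onset:.3f}", f"{offset:.3f}",
                         low_hz, high_hz, label])


# Hypothetical detections; in practice, write to a .txt file instead of a buffer
detections = [(0.52, 0.91, "crow"), (1.30, 1.62, "unknown")]
buffer = io.StringIO()
write_selection_table(detections, buffer)
print(buffer.getvalue())
```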

If you are interested in using Voxaboxen on your own animal sound data, but have little experience with deep learning, we invite you to get in touch.
