In December, my colleague Orkun B. and I attended the annual NeurIPS conference in New Orleans to co-host the ML4Audio Workshop and learn more about the newest research in machine learning. We were able to meet and network with other ML experts from academic institutions, startups, and large companies like Meta, Google, and IBM. Here are some of our takeaways from the conference!
Machine Learning for Audio Workshop
I had the opportunity to co-organize a day-long workshop titled ML4Audio alongside peers from Boston University, Hume AI, Google DeepMind, and more. The workshop featured talks and presentations by audio practitioners, engineers, and machine learning researchers. Throughout the day, we covered a wide range of topics including:
- Music information retrieval
- Acoustic event detection
- Computational paralinguistics
- Speech transcription
- Multimodal modeling
- Generative modeling of speech and other sounds
Because machine learning research in audio is limited compared to computer vision and other domains, my co-organizers and I were excited to create a space dedicated to these topics, to make new connections, and to spark new research directions within the field. We strive to offer forums where researchers can present early-stage, proof-of-concept work in audio machine learning, in contrast to the main conference tracks, where research is typically more finalized.
During the workshop, we heard from invited speakers like Rachel Bittner and Dimitra Emmanouilidou, who shared their research on audio foundation models and novel speech emotion recognition techniques, respectively. Talks like these can inspire attendees to reflect on their current implementations and explore new techniques in their own projects.
We also hosted a poster session, where authors of papers accepted at the workshop shared their work and live demonstrations. Authors of six of those papers also gave brief contributed talks over the course of the day.
As part of the workshop, we shared large audio datasets curated by Modulate and Hume AI, which I hope will spark new analyses and research in the field of machine learning for audio applications. Our goal is to encourage further research and increase accessibility to data that would normally be difficult to collect and label without large-scale infrastructure.
More Highlights from NeurIPS 2023
As members of the machine learning team at Modulate, we were grateful to meet researchers and explore their latest findings, especially the papers on computer vision, reinforcement learning, and natural language processing, as well as theoretical advances such as gradient-based methods for non-convex optimization.
One paper we found particularly interesting was Pengi: An Audio Language Model for Audio Tasks. Orkun shares his major takeaways:
We were interested to see more of the inner workings of the Pengi audio language model, as its framework differs from how we conceptualize and develop several of the subsystems that make up ToxMod. Because of ToxMod's very specific use case, we focus on applying time-frequency representations to audio samples before they are used in downstream tasks; Pengi, by contrast, employs an audio encoder to extract sequences of embeddings directly from raw audio. Pengi also uses a text encoder to determine the specific task from an input text prompt. By combining these two sets of embeddings to prompt a pre-trained language model, the paper's authors obtain a flexible model capable of handling a wide range of audio tasks without any task-specific design changes. Pretty cool, although it seems that Pengi likely requires significant computational power to run effectively. While we likely won't be directly incorporating this framework into our work on the ML team, it's still important for us to stay aware of new research and approaches in the quickly evolving field of audio machine learning.
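To make the "audio prefix + text prefix into a language model" idea concrete, here is a minimal sketch of that general pattern. This is not the Pengi authors' code: the class names, dimensions, and the tiny randomly initialized encoders and stand-in language model are all illustrative assumptions, included only to show how the pieces connect.

```python
# Minimal sketch of a Pengi-style setup: an audio encoder and a text encoder
# each produce a sequence of embeddings, which are concatenated into a prefix
# and handed to a (normally pre-trained, frozen) language model.
# All modules here are tiny stand-ins with random weights, for illustration only.
import torch
import torch.nn as nn

EMBED_DIM = 256  # shared embedding size, assumed for this sketch


class AudioEncoder(nn.Module):
    """Maps raw audio to a sequence of embeddings (stand-in for a real audio encoder)."""

    def __init__(self, frame_size=400, hop=160):
        super().__init__()
        self.frame_size, self.hop = frame_size, hop
        self.proj = nn.Linear(frame_size, EMBED_DIM)

    def forward(self, waveform):  # waveform: (batch, samples)
        frames = waveform.unfold(-1, self.frame_size, self.hop)  # (batch, T, frame_size)
        return self.proj(frames)  # (batch, T, EMBED_DIM)


class TextEncoder(nn.Module):
    """Maps a tokenized task prompt to a sequence of embeddings."""

    def __init__(self, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED_DIM)

    def forward(self, token_ids):  # token_ids: (batch, L)
        return self.embed(token_ids)  # (batch, L, EMBED_DIM)


class PrefixAudioLanguageModel(nn.Module):
    """Concatenates audio and text prefixes and feeds them to a language model."""

    def __init__(self, language_model):
        super().__init__()
        self.audio_encoder = AudioEncoder()
        self.text_encoder = TextEncoder()
        self.language_model = language_model  # assumed to accept embeddings directly

    def forward(self, waveform, prompt_ids):
        audio_prefix = self.audio_encoder(waveform)
        text_prefix = self.text_encoder(prompt_ids)
        prefix = torch.cat([audio_prefix, text_prefix], dim=1)
        return self.language_model(prefix)  # output used to generate the answer text


# Toy usage: a small transformer stands in for the pre-trained language model.
lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True), num_layers=2
)
model = PrefixAudioLanguageModel(lm)
out = model(torch.randn(1, 16000), torch.randint(0, 1000, (1, 8)))
print(out.shape)  # (1, audio_frames + prompt_length, EMBED_DIM)
```

The appeal of this pattern, as we understood it from the paper, is that the same frozen language model can serve many audio tasks simply by changing the text prompt, rather than by adding task-specific heads.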
Although machine learning for audio is less widely researched than other domains, being exposed to new approaches and lessons learned from academic researchers and colleagues across the industry helps us put our own work at Modulate into greater context. We're heading into the rest of 2024 feeling invigorated by the collaborative spirit, the wealth of knowledge exchanged, and the promising directions that lie ahead in the intersection of machine learning and audio technologies.