Forum for Artificial Intelligence
GDC 6.302/Zoom

Towards Audio Intelligence and LLMs: Simulation and Understanding

Dinesh Manocha
Professor, University of Maryland at College Park

Abstract: Audio General Intelligence—the capacity of AI agents to deeply understand and reason about all types of auditory input, including speech, environmental sounds, and music—is crucial for enabling AI to interact seamlessly and naturally with our world. Despite this importance, audio intelligence has traditionally lagged behind advancements in vision and language processing. This gap arises from significant challenges, such as limited datasets, the complexity of audio signals, and a shortage of advanced neural architectures and effective training methodologies explicitly tailored for audio. 

In this talk, we give an overview of our work over the last two decades on audio simulation and understanding, and on the development of Audio LLMs. This includes earlier work based on audio ray-tracing and scientific solvers. Next, we discuss acceleration and prediction methods based on machine learning, and we show how synthetic simulation can be used to generate training data and to design networks with geometric deep learning methods.
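To make the simulation-for-data idea concrete, the rough Python sketch below synthesizes a room impulse response for a rectangular room using a textbook image-source approximation of geometric acoustics. The function name, room dimensions, and parameters are illustrative assumptions, not the solvers discussed in the talk.

# Minimal sketch (illustrative only, not the speaker's solvers): a textbook
# image-source approximation of a room impulse response in a rectangular room.
# Synthetic impulse responses like this are one way to generate training data
# for learned acoustics models.
import numpy as np

C = 343.0   # speed of sound (m/s)
FS = 16000  # sample rate (Hz)

def shoebox_rir(room, src, mic, absorption=0.3, max_order=8, length=0.5):
    """Synthesize an impulse response of `length` seconds via image sources."""
    rir = np.zeros(int(length * FS))
    reflect = 1.0 - absorption
    Lx, Ly, Lz = room
    mic = np.asarray(mic, dtype=float)
    for nx in range(-max_order, max_order + 1):
        for ny in range(-max_order, max_order + 1):
            for nz in range(-max_order, max_order + 1):
                # Mirror the source across the walls to get image-source positions.
                img = np.array([
                    nx * Lx + (src[0] if nx % 2 == 0 else Lx - src[0]),
                    ny * Ly + (src[1] if ny % 2 == 0 else Ly - src[1]),
                    nz * Lz + (src[2] if nz % 2 == 0 else Lz - src[2]),
                ])
                dist = np.linalg.norm(img - mic)
                order = abs(nx) + abs(ny) + abs(nz)   # number of wall reflections
                sample = int(dist / C * FS)           # arrival time in samples
                if sample < rir.size:
                    # 1/r spreading loss, one absorption factor per reflection
                    rir[sample] += (reflect ** order) / max(dist, 1e-3)
    return rir

rir = shoebox_rir(room=(6.0, 4.0, 3.0), src=(1.0, 1.0, 1.5), mic=(4.0, 2.5, 1.5))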

Next, we provide an overview of audio large language models (ALLMs) used for advanced audio perception and complex reasoning. This includes GAMA, which is built with a specialized architecture, optimized audio encoding, and a novel alignment dataset. Complementing this, we introduced ReCLAP, a state-of-the-art audio-language encoder, and CompA, one of the first projects to tackle compositional reasoning in audio-language models, a critical challenge given the inherently compositional nature of audio. Towards the end, we discuss Audio Flamingo 3 and Music Flamingo, our open ALLMs that provide advanced long-audio understanding and reasoning capabilities across speech, sound, and music. We highlight their applications to audio, speech, and music understanding.
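For readers unfamiliar with ALLMs, the toy PyTorch sketch below shows the common pattern of projecting audio-encoder features into a language model's embedding space and attending over audio and text tokens jointly. The class, layer choices, and sizes are hypothetical assumptions for illustration and do not reflect the actual GAMA or Audio Flamingo architectures.

# Generic sketch of the common audio-LLM pattern (assumed for illustration):
# an audio encoder produces frame embeddings, a projection maps them into the
# language model's token-embedding space, and the model attends over the
# concatenation of audio tokens and text tokens.
import torch
import torch.nn as nn

class ToyAudioLLM(nn.Module):
    def __init__(self, audio_dim=128, llm_dim=512, vocab=32000):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, audio_dim, batch_first=True)  # stand-in encoder
        self.project = nn.Linear(audio_dim, llm_dim)   # audio features -> LLM embedding space
        self.text_embed = nn.Embedding(vocab, llm_dim)
        # A tiny bidirectional transformer stands in for the decoder-only LLM.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, audio_feats, text_ids):
        audio_h, _ = self.audio_encoder(audio_feats)   # (B, T_audio, audio_dim)
        audio_tok = self.project(audio_h)              # (B, T_audio, llm_dim)
        text_tok = self.text_embed(text_ids)           # (B, T_text, llm_dim)
        hidden = self.llm(torch.cat([audio_tok, text_tok], dim=1))
        return self.lm_head(hidden)                    # next-token logits

model = ToyAudioLLM()
logits = model(torch.randn(1, 50, 128), torch.randint(0, 32000, (1, 8)))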

About the speaker: Dinesh Manocha is the Paul Chrisman-Iribe Chair in Computer Science & ECE and a Distinguished University Professor at the University of Maryland, College Park. His research interests include virtual environments, physically-based modeling, and robotics. His group has developed numerous software packages that have become standards and are licensed to more than 60 commercial vendors. He has published more than 870 papers and supervised 61 PhD dissertations. His group has received more than 21 best paper and test-of-time awards at leading conferences in computer graphics, solid modeling, multimedia, VR, and robotics. Manocha is a Fellow of AAAI, AAAS, ACM, IEEE, and NAI, a member of the ACM SIGGRAPH and IEEE VR Academies, and a recipient of the Bézier Award from the Solid Modeling Association. He received the Distinguished Alumni Award from IIT Delhi and the Distinguished Career in Computer Science Award from the Washington Academy of Sciences. He co-founded Impulsonic, a developer of physics-based audio simulation technology, which Valve Inc. acquired in November 2016.

Visit FAI for Zoom link