
Speech Generation and Sound Understanding in the Era of Large Language Models
Abstract: LLMs have not only revolutionized text-based natural language processing, but their multimodal extensions have proven to be extremely powerful models for vision, speech, and natural sounds. By tokenizing inputs from disparate modalities and mapping them into the same input/output space as text, these models learn not only to reason over multimodal inputs, but also to generate new speech, audio, and visual outputs guided by text instructions, greatly multiplying their capabilities. In my talk, I will discuss several recent works in this direction from my lab at UT Austin.
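To make the shared token space concrete, here is a minimal Python sketch of how discrete audio codec indices might be folded into a text LLM's vocabulary. The vocabulary sizes, token IDs, and function name below are illustrative assumptions, not the interface of any particular model discussed in the talk.

```python
# Minimal sketch: mapping discrete audio codec tokens into the same
# ID space as text tokens. All sizes and IDs here are illustrative.

TEXT_VOCAB_SIZE = 32_000   # assumed subword text vocabulary size
CODEC_VOCAB_SIZE = 1_024   # assumed number of discrete codec codes

def to_shared_space(text_ids: list[int], audio_codes: list[int]) -> list[int]:
    """Offset audio codec indices past the text vocabulary so text and
    audio live in one flat token space, then concatenate."""
    assert all(0 <= c < CODEC_VOCAB_SIZE for c in audio_codes)
    audio_ids = [TEXT_VOCAB_SIZE + c for c in audio_codes]
    return text_ids + audio_ids

# Toy usage: three text tokens followed by five audio codes become one
# sequence that a single autoregressive LLM can both read and generate.
print(to_shared_space([17, 940, 2081], [3, 512, 88, 700, 12]))
```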
The first part of my talk will describe our work on VoiceCraft, a neural codec language model capable of performing voice-cloning text-to-speech synthesis, as well as targeted edits of speech recordings in which words can be arbitrarily inserted, deleted, or substituted in the waveform itself. I will also discuss our recent work on text-controllable TTS models that can manipulate not only basic attributes of the speech signal, such as pitch and speaking rate, but also more abstract, higher-level vocal styles such as "husky", "nasal", "sleepy", and so forth.
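As a rough illustration of the speech-editing setup, the sketch below treats an edit as codec-token infilling: encode the recording, carve out the span to change, and regenerate it conditioned on the new transcript and the unedited audio on both sides. The `model` and `codec` objects and their methods are hypothetical placeholders; this shows the general idea, not VoiceCraft's exact procedure.

```python
# Conceptual sketch of speech editing as codec-token infilling.
# `model` and `codec` are hypothetical placeholders, not the actual
# VoiceCraft training or decoding procedure.

def seconds_to_frames(span: tuple[float, float],
                      frame_rate: float) -> tuple[int, int]:
    """Convert a (start, end) span in seconds to codec frame indices."""
    return int(span[0] * frame_rate), int(span[1] * frame_rate)

def edit_speech(model, codec, waveform, new_text: str,
                edit_span: tuple[float, float]):
    """Replace the audio in `edit_span` (start, end in seconds) while
    leaving the surrounding recording untouched."""
    codes = codec.encode(waveform)                    # waveform -> tokens
    start, end = seconds_to_frames(edit_span, codec.frame_rate)
    prefix, suffix = codes[:start], codes[end:]
    # Generate replacement tokens conditioned on the target transcript
    # and on the untouched audio context on both sides of the span.
    infill = model.generate(text=new_text, prefix=prefix, suffix=suffix)
    return codec.decode(prefix + infill + suffix)     # tokens -> waveform
```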
In the second part of my talk, I will discuss our work on spatial sound understanding. I will introduce SpatialSoundQA, a dataset containing 800,000 ambisonic waveforms and accompanying question-answer pairs, which can be used to train and evaluate models on their ability to answer questions such as “Is the sound of the telephone further to the left than the sound of the barking dog?” I will also describe our BAT model, an extension of the LLaMA LLM that is capable of taking spatial audio recordings as input and reasoning about them using natural language.
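For a sense of what such a dataset item and model interface might look like, here is a small sketch. The field names and the `model`/`spatial_encoder` interfaces are invented for illustration and do not reflect the actual SpatialSoundQA schema or BAT implementation.

```python
# Illustrative sketch of a SpatialSoundQA-style item and BAT-style
# inference. Field names and interfaces are invented for illustration.

example = {
    "audio": "scene_00042.wav",  # ambisonic (multi-channel) recording
    "question": ("Is the sound of the telephone further to the left "
                 "than the sound of the barking dog?"),
    "answer": "Yes",
}

def answer_spatial_question(model, spatial_encoder, item: dict) -> str:
    """Encode the spatial audio, pair it with the question, and let
    the language model decode a natural-language answer."""
    audio_embeds = spatial_encoder(item["audio"])  # spatial audio features
    return model.generate(audio=audio_embeds, prompt=item["question"])
```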
Bio: David Harwath's research interests are in the areas of automatic speech recognition, spoken language understanding, and multi-modal machine learning. His work aims to develop models of speech and language that are robust, flexible, and capable of learning on the fly from multiple input modalities. He holds a B.S. in Electrical Engineering from the University of Illinois at Urbana-Champaign, and an S.M. and a Ph.D. in Computer Science, both from MIT.