Jun 26, 2023
AudioPaLM is a large language model for speech understanding and generation.
It combines a text-based language model, PaLM-2, and a speech-based language model, AudioLM, into a single multimodal architecture.
This multimodal architecture can process and generate both text and speech for use in speech recognition and speech-to-speech translation applications.
AudioPaLM inherits the linguistic knowledge found only in text-based large language models such as PaLM-2.
From AudioLM, it inherits the capacity to preserve paralinguistic information such as speaker identity and intonation.
The model performs speech translation tasks substantially better than existing systems, and it can execute zero-shot speech-to-text translation for numerous languages.
AudioPaLM also demonstrates capabilities of audio language models, such as transferring a voice across languages in response to a brief spoken prompt.
