Google Presents New Research at INTERSPEECH 2023 Conference

Google is making its presence known at the annual Conference of the International Speech Communication Association (INTERSPEECH 2023), where it is presenting over two dozen research papers. The papers highlight the company’s advances in Natural Language Processing (NLP) and speech technology, fields central to how machines understand and communicate with people.
One notable paper, “DeePMOS: Deep Posterior Mean-Opinion-Score of Speech,” introduces DeePMOS, a deep neural network approach to estimating speech signal quality. Unlike traditional methods that predict a single score, DeePMOS outputs a distribution over the mean-opinion-score (MOS), providing both an average estimate and a measure of its spread. The method performs comparably to existing approaches while being more robust to the limited and noisy ratings available from human listeners.
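To make the distributional idea concrete, here is a minimal PyTorch sketch, assuming a generic encoder that yields fixed-size utterance features; the head, dimensions, and loss choice are illustrative assumptions, not the paper’s architecture:

```python
# Minimal sketch (not DeePMOS itself): a head that predicts both a mean
# and a standard deviation for the MOS, trained so the spread reflects
# uncertainty in the (noisy, limited) human listener ratings.
import torch
import torch.nn as nn

class ProbabilisticMOSHead(nn.Module):
    """Predicts a mean and a standard deviation for an utterance's MOS."""
    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.mean = nn.Linear(feature_dim, 1)      # MOS point estimate
        self.log_std = nn.Linear(feature_dim, 1)   # log-std, keeps spread positive

    def forward(self, features: torch.Tensor):
        mu = self.mean(features)
        std = self.log_std(features).exp().clamp(min=1e-3)
        return mu, std

# Toy usage: 8 utterances, 256-dim encoder features, scores on a 1-5 scale.
head = ProbabilisticMOSHead(256)
features = torch.randn(8, 256)
scores = torch.rand(8, 1) * 4 + 1

mu, std = head(features)
loss = nn.GaussianNLLLoss()(mu, scores, std ** 2)  # fits mean and spread jointly
loss.backward()
```

Training with a Gaussian negative log-likelihood, rather than plain regression, is what lets the model report how confident it is in each predicted score.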
Another paper, “LanSER: Language-Model Supported Speech Emotion Recognition,” presents LanSER, a method for training Speech Emotion Recognition (SER) models on unlabeled data. LanSER uses large language models (LLMs) to infer weak emotion labels: a textual entailment approach scores candidate emotions against each speech transcript and selects the best-fitting label, improving label efficiency.
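As a rough illustration of the entailment idea (the model checkpoint and hypothesis template below are assumptions, not the paper’s setup), an off-the-shelf NLI classifier can score candidate emotions against a transcript:

```python
# Illustrative sketch of entailment-based weak labeling: a zero-shot NLI
# classifier scores how well each candidate emotion is entailed by the
# transcript, and the top-scoring emotion becomes the weak label.
from transformers import pipeline

# Any NLI-trained model works here; this checkpoint is just an example choice.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

transcript = "I can't believe we finally won the championship!"
candidate_emotions = ["joy", "anger", "sadness", "fear", "surprise", "neutral"]

result = classifier(
    transcript,
    candidate_labels=candidate_emotions,
    hypothesis_template="The speaker of this utterance feels {}.",
)

weak_label = result["labels"][0]   # highest-entailment emotion
print(weak_label, result["scores"][0])
```

Labels produced this way are noisy, which is why the approach is framed as weakly-supervised learning rather than a replacement for human annotation.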
Google also introduces the MD3 dataset, which captures English as spoken in India, Nigeria, and the United States. Unlike previous datasets, MD3 combines free-flowing conversation with task-based dialogues, enabling cross-dialect comparisons without suppressing the dialect features speakers naturally produce. The dataset offers valuable insights into distinctive syntax and discourse-marker usage across dialects.
In Automatic Speech Recognition (ASR), Google addresses the challenge of recognizing personal identifiers, such as names and dates, while protecting user privacy. The researchers propose injecting synthetic substitutes for these identifiers into the training data, improving recognition accuracy on such terms without weakening privacy protection.
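A hypothetical sketch of the substitution step, assuming personal identifiers have already been tagged in the training transcripts (the tags, substitute lists, and helper function are illustrative only, not the paper’s pipeline):

```python
# Replace tagged personal identifiers in training transcripts with randomly
# drawn synthetic values, so the ASR model still sees name- and date-shaped
# text during training without ever being exposed to real PII.
import random
import re

FAKE_NAMES = ["alex morgan", "priya nair", "sam okafor", "li wei"]
FAKE_DATES = ["march third", "july twenty first", "december ninth"]

SUBSTITUTES = {"NAME": FAKE_NAMES, "DATE": FAKE_DATES}

def inject_substitutes(transcript: str) -> str:
    """Replace <NAME>/<DATE> placeholders with random synthetic values."""
    def swap(match: re.Match) -> str:
        tag = match.group(1)
        return random.choice(SUBSTITUTES[tag])
    return re.sub(r"<(NAME|DATE)>", swap, transcript)

print(inject_substitutes("my name is <NAME> and my birthday is <DATE>"))
# e.g. "my name is priya nair and my birthday is july twenty first"
```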
Lastly, Google presents a model that transcribes speech into the International Phonetic Alphabet (IPA) for any language. The model achieves results comparable to, or better than, those of human annotators, making language documentation more efficient, especially for endangered languages.
These research papers presented by Google at INTERSPEECH 2023 showcase the company’s commitment to advancing NLP and speech-related technologies. The innovations introduced in these papers have the potential to drive significant improvements in speech recognition, emotion recognition, dialect analysis, and language documentation.
Definitions:
– Natural Language Processing (NLP): A subfield of artificial intelligence focused on enabling computers to understand and process human language.
– Mean-Opinion-Score (MOS): A measure of speech quality, typically the average of listener ratings on a five-point scale.
– Speech Emotion Recognition (SER): The task of automatically recognizing emotions conveyed in speech.
– Personally Identifiable Information (PII): Any data that can be used to identify an individual, such as names and dates.
– International Phonetic Alphabet (IPA): A standardized system of phonetic notation used to represent the sounds of human speech.
Sources:
– [Google AI Blog](https://ai.googleblog.com/2023/08/googles-contribution-to-interspeech-2023.html)