From Speech to Sensing

May 30, 2025

Sriram Ganapathy’s lab focuses on problems at the intersection of machine understanding of natural language and human cognition

Photo courtesy: Sriram Ganapathy

“My voice goes after what my eyes cannot reach.” American poet Walt Whitman’s classic portrayal of the human voice as an extension of the soul was more than just a symbolic metaphor. At least that is what Sriram Ganapathy, Associate Professor at the Department of Electrical Engineering (EE), IISc, would like to believe. For the past nine years, he has been running the LEAP (Learning and Extraction of Acoustic Patterns) lab, trying to make sense of human behaviour by studying speech and related acoustic signals.

The LEAP lab works on a variety of research problems, but they are all largely aimed at extracting information from human speech in complex environments. Such information can not only help us better understand speech patterns but also help train AI tools like conversational assistants. The lab also investigates how the human brain processes and interprets speech signals, hoping to uncover the cognitive and neural mechanisms involved in verbal communication.

Twenty years ago, Sriram could not have predicted the direction that his career would take. After completing his BTech in Electronics and Telecommunications from CET, Kerala, he pursued his MTech at IISc, specialising in signal processing. The MTech programme in signal processing at IISc was “one of the best of its kind in the country at that time,” he recalls. His thesis project with TV Sreenivas at the Department of Electrical Communication Engineering (ECE) was his first formal introduction to speech recognition. The goal was to create efficient algorithms to transcribe audio signals into text.

Towards the end of his MTech, he had the opportunity to attend a talk by Hynek Hermansky, a researcher working on acoustic signal processing at the Swiss Federal Institute of Technology (EPFL) in Lausanne, Switzerland, who was visiting IISc. Sriram was so fascinated by the speaker’s work that he approached him directly to ask for a chance to pursue research in his lab. He then joined Hermansky’s group as a Research Assistant at EPFL and later accompanied him to Johns Hopkins University in Baltimore, USA, where he completed his PhD at the Center for Language and Speech Processing. His doctoral research focused on developing robust signal processing techniques to extract reliable information from human speech in noisy environments.

Sriram Ganapathy (first left) during his MTech days at IISc (Photo courtesy: Sriram Ganapathy)

After completing his PhD, Sriram joined the IBM TJ Watson Research Center as a Research Staff Member, where he spent two years – an experience that helped shape his career and guide his research interests, he says. When offered a faculty position at IISc, he faced a dilemma: Should he return to India, having already transitioned into a well-settled industry role in the USA? After much deliberation, he realised his strong desire to carry out independent research and eventually convinced his family to move back with him.

In early 2016, Sriram joined IISc, and his lab began work on representation learning – a subfield of machine learning focused on extracting meaningful features from data to help computers better interpret complex information. The lab applies these techniques to speech and acoustic signals, generating compact patterns that capture subtle variations in a speaker's voice, emotion, and other characteristics. Such patterns, also referred to as "representations", are crucial for building robust models that transcribe speech and recognise speakers and emotions reliably in noisy, real-world conditions. This foundational work has also led to industrial collaborations with Samsung, Sony Research, Google, and others, often in the form of industry-supported research and post-doctoral fellowships that continue to shape the lab's research directions.
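
For readers curious about what a speech "representation" looks like in practice, the sketch below extracts a compact feature matrix from a raw audio recording using classic MFCC features from the open-source librosa library. It is only an illustration – a stand-in for the far richer representations that neural models such as those in the LEAP lab learn – and the file name is hypothetical.

```python
# A minimal, illustrative sketch of a speech "representation": a compact
# feature matrix computed from a raw waveform. Classic MFCC features stand in
# here for the learned neural representations described in the article.
import librosa
import numpy as np

def extract_representation(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_mfcc x frames) matrix of MFCC features for one utterance."""
    waveform, sample_rate = librosa.load(wav_path, sr=16000)  # mono, 16 kHz
    # Mel-frequency cepstral coefficients summarise the short-time spectral
    # envelope of the signal, frame by frame.
    return librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)

# features = extract_representation("utterance.wav")   # hypothetical file
# embedding = features.mean(axis=1)  # fixed-length vector, usable as input
#                                    # for speaker or emotion recognition
```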

One such direction is Explainable AI (XAI) – an area that focuses on making the decisions of AI models interpretable and on testing how reliable those decisions are. This is particularly important for deploying AI in high-stakes sectors like healthcare, finance, and law.

In one study, the team tested image recognition models by designing an experiment in which the models had to generate bounding boxes around the portions of images corresponding to given captions. The results showed how well the models focused on the relevant parts of the input images and how quickly they located their targets. The model developed by Sriram and his students performed significantly better than conventional models.
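
Grounding experiments of this kind are typically scored by comparing each predicted bounding box against a reference region. One common measure – used here purely as an illustration, since the study's exact metric is not described above – is the intersection-over-union (IoU) between the two boxes.

```python
# Illustrative only: scoring a predicted bounding box against a reference box
# with intersection-over-union (IoU), a standard measure in visual-grounding
# evaluations. Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0

# Example: a prediction overlapping most of the reference region.
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47
```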

In another study, Sriram and his students sought to understand how the human brain distinguishes between two speakers in a conversation based just on the sound of their voices. Participants were asked to press a button whenever they noticed that the speaker changed during a recorded dialogue. The study yielded intriguing findings, one of which was the role of language familiarity in detecting speaker changes. "If the language of the conversation is unknown to the listener, then the listener is found to detect the speaker change much more quickly," explains Sriram. "[This is because] the brain in such cases does not attempt to understand what is being said but rather focuses on the tone and voice of the speaker." Such findings could prove instrumental in improving the efficiency of cognitive and neural hearing aids.
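
To give a rough sense of how such listening experiments can be analysed, the sketch below matches each button press to a known speaker-change time and computes the resulting reaction times. It is hypothetical – the two-second response window, the example numbers, and the variable names are assumptions for illustration, not details from the study.

```python
# Hypothetical analysis sketch: match a listener's button presses to known
# speaker-change times and report reaction times in seconds.
def reaction_times(change_times, press_times, window=2.0):
    """Return reaction times for changes detected within `window` seconds."""
    rts = []
    for change in change_times:
        # Delays of all presses that fall within the response window.
        hits = [p - change for p in press_times if 0 <= p - change <= window]
        if hits:
            rts.append(round(min(hits), 3))  # earliest valid press
    return rts

# Example: speaker changes at 5 s and 12 s; button presses at 5.6 s and 13.1 s.
rts = reaction_times([5.0, 12.0], [5.6, 13.1])
print(rts)                   # [0.6, 1.1]
print(sum(rts) / len(rts))   # mean reaction time, about 0.85 s
```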

A PhD student in the LEAP lab analysing cough and speech signals from the Coswara dataset (Photo courtesy: Sriram Ganapathy)

Recently, when Sriram decided to take a sabbatical at Google DeepMind, Bangalore, he was thrown headfirst into the world of large language models (LLMs). He began there in October 2022 – shortly before OpenAI released ChatGPT. He got to learn a great deal about how LLMs are developed, evaluated, and used as research tools. The launch of ChatGPT, he says, drastically reshaped Google's research landscape. He, too, reoriented his lab's focus toward the practical challenges of deploying LLMs for speech and emotion recognition.

After returning from his sabbatical, Sriram's lab began testing how well LLMs can recognise human emotions. Traditional LLMs, he noted, often struggle to interpret emotion from speech. Recently, the team participated in a challenge hosted at INTERSPEECH 2025, a leading international conference on speech science and technology, which invited submissions of emotion-recognition models. Unfortunately, their model was still in training when the challenge closed. "Although we were 12 hours late, the organisers agreed to let us submit our results, and officially we finished fourth on the leaderboard," Sriram says. "But unofficially, we were informed that our algorithm had scored the most points and would have been on top of the leaderboard, if we had been on time."

Modern LLMs also suffer from hallucinations – generating roughly one incorrect fact for every four or five correct ones. This poses serious problems when AI is used for decision-making in sectors like finance – approving loans or adjusting interest rates, for instance. Sriram's lab is now also working on addressing these reliability issues.

Beyond these efforts, Sriram's group was instrumental in developing Coswara, an app designed to detect signs of COVID-19 from sounds made by infected individuals, such as coughing. The underlying model, first developed in 2020, was trained on audio samples collected from people across different regions of India. Sriram shares that there were discussions with the Indian Council of Medical Research (ICMR) to extend the work into a national-level study – including variables like gender specificity and co-morbidities – but the plans were eventually shelved, and the team instead published their results last year.

Sriram Ganapathy with some of his students during a hike in Himachal Pradesh (Photo courtesy: Sriram Ganapathy)

When not in the lab, Sriram’s lab members enjoy group activities, such as weekly marathons across campus or hikes in the outskirts of Bangalore. As for advice to the next generation of students, Sriram quips, “If you wish to work on LLMs, a deep foundation of the fundamentals is preferred over a shallow understanding of a large number of topics, because the LLM itself is better than humans at doing the tasks requiring shallow understanding.”