Multi-modal and multi-model Speech Recognition
Automatic Speech Recognition (ASR) is a field with everyday uses, such as in cars, in smartphones, and in tools that help disabled people live their lives more accessibly. However, most ASR research overlooks one important detail: people speak differently when there is noise. The Lombard effect describes how people’s speech changes in a noisy environment, including increased volume, more pronounced mouth movements, and changes to enunciation.
Figure 1: An example of a dataset used for testing. Participants are given headphones with noise playing through them and asked to read aloud or to speak to a person across from them.
We are looking at how state-of-the-art ASR Transformer models handle such speech, since they may not learn the intricacies of Lombard-affected speech. As such, we are currently testing how current models perform on a mix of Lombard and non-Lombard data, to compare their performance directly. Our results thus far show a small but consistent drop in performance on Lombard speech compared to non-Lombard speech across all models, even when the noise in the environment is removed. This implies that without suitable training data or robust models, performance may be lacking in real-world conditions. Additionally, we found that models can struggle to identify single letters or digits, especially when they appear in a more constructed sentence structure rather than in casual conversational or more formal speech.
This work will allow us to identify which models are the most robust and where they are weak with respect to the Lombard effect. In turn, it will allow us to develop a multi-modal, multi-model tool that can identify whether a segment of speech is Lombard or not and send Lombard segments to a specialized model for processing, while non-Lombard data is handled normally. While the research is currently focused on audio-based approaches, we are also looking into visual approaches to speech recognition.
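To make the Lombard versus non-Lombard comparison concrete, here is a minimal sketch of how such a comparison might be scored. It assumes word error rate (WER) as the metric, which is common in ASR but is not named in the text above. The function names and the example transcripts are hypothetical and purely illustrative; they are not data or results from this project.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total errors / total words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    if words == 0:
        raise ValueError("no reference words to score")
    return errors / words


# Made-up transcripts for illustration only (not study data).
plain_pairs = [("bin blue at f two now", "bin blue at f two now")]
lombard_pairs = [("bin blue at f two now", "bin blue at have to now")]

plain_wer = corpus_wer(plain_pairs)
lombard_wer = corpus_wer(lombard_pairs)
print(f"non-Lombard WER: {plain_wer:.3f}, Lombard WER: {lombard_wer:.3f}")
```

Scoring each condition separately with the same model is what allows a direct comparison. The toy pair also shows how a misrecognized single letter or digit ("f two" heard as "have to") weighs heavily on the score of a short, constructed sentence.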
This work will allow for better development of assistive tools for disabled people, such as automatic transcription of audio messages or of video such as recorded lectures or film. Additionally, more robust speech recognition would enable better everyday tools, such as more responsive smart assistants or hands-free voice control in cars.
For more information, contact Gwen Devlin (gwen.devlin.2020@uni.strath.ac.uk), PhD Student for Computer and Information Sciences at the University of Strathclyde, or Dr. Andrew Abel (andrew.abel@strath.ac.uk), Chancellor’s Fellow for Computer and Information Sciences.