Dr. Vishnu Chandrabanshi

Assistant Professor

Department of Computer Science and Engineering

Contact Details

vishnu.c@srmap.edu.in

Office Location

Education

2025
PhD in Computer Science
NIT Tiruchirappalli
India
2018
MCA
NIT Durgapur
India
2015
B.Sc IT
Nilamber Pitamber University, Medininagar
India

Experience

  • Guest Faculty, Doranda College, Ranchi, Jharkhand
  • Assistant Professor, Radha Govind University, Ramgarh, Jharkhand
  • Project Fellow, Dept. of CSE, South Asian University, New Delhi

Research Interests

  • My research interests include image and video analytics, speech processing, and biometric authentication, with a focus on developing intelligent and secure systems through advanced machine learning and deep learning techniques.
  • Currently, I am working on visual speech recognition, which has applications in assisting speech-impaired individuals, enhancing information security, improving face authentication systems, and enabling robust speech recognition in audio-visual systems.

Awards

  • UGC-NET

Memberships

No data available

Publications

  • A deep learning approach for strengthening person identification in face-based authentication systems using visual speech recognition

    Vishnu Chandrabanshi, S Domnic

    DOI Link

    Identity verification is essential in both an individual’s personal and professional life. It confirms a person’s identity for various services and establishes their legitimacy as an employee within an organization. As cybercrime evolves and becomes more sophisticated, ensuring robust and secure personal authentication has become a critical challenge. Existing face-based authentication systems typically employ deep learning models for user verification. However, these systems are susceptible to various attacks, such as presentation attacks, 3D mask attacks, and adversarial attacks, which deceive the models by manipulating digital representations of human faces. Although various liveness detection techniques have been proposed to combat face spoofing in face-based authentication systems, these systems remain vulnerable to sophisticated exploits. To counteract face spoofing, we have proposed an advanced liveness detection technique using Visual Speech Recognition (VSR). The proposed VSR model is designed to integrate seamlessly with face-based authentication systems, forming a dual authentication framework for enhanced liveness detection. The VSR model decodes silently pronounced speech from video into a textual representation by analyzing unique, unforgeable lip motion patterns. Effective liveness detection with VSR requires enhancing the accuracy of the VSR system. The proposed work employs an encoder-decoder technique to extract more robust features from lip motion. The encoder combines a three-dimensional convolutional neural network (3D-CNN) with a fusion of bidirectional gated recurrent units and long short-term memory (BiGRU-BiLSTM) to capture spatio-temporal patterns from lip movement. The decoder integrates Multi-Head Attention (MHA) with BiGRU-BiLSTM to focus on relevant features and enhance contextual understanding for more accurate text prediction. The proposed VSR system achieved a word error rate (WER) of 0.79%, a significant error reduction that outperforms existing VSR models. (A minimal sketch of this encoder-decoder appears after the publication list.)
  • Binary Authentication Protocol: A Method for Robust Facial Biometric Security Using Visual Speech Recognition

    Vishnu Chandrabanshi, S Domnic

    DOI Link

    Facial biometric systems are extensively applied in diverse sectors for person authentication and verification, primarily due to the distinctive nature of individual facial characteristics. Deep learning models are typically used in face authentication to validate people with excellent recognition accuracy. However, these systems are susceptible to a variety of cyber attacks that manipulate the digital representations of real-world faces to cheat the models. In the contemporary landscape of digital identity theft, liveness detection stands as a crucial technology. The need for enhanced security prompts the demand for a resilient system that can effectively counter face spoofing attempts and prevent unauthorized access. A Binary Authentication Protocol (BAP) technique is proposed to enhance facial biometric security in combination with Visual Speech Recognition (VSR). In the proposed method, the first verification step entails face authentication. Further, the authentication protocol involves a challenge-response method using VSR. The proposed method achieved a word error rate of 2.7% and a word recognition rate of 97.3%, surpassing existing state-of-the-art methods in VSR. The proposed scheme offers a practical and effective solution to prevent face spoofing through active liveness detection in face-based authentication systems. (A sketch of this challenge-response flow appears after the publication list.)
  • Leveraging 3D-CNN and graph neural network with attention mechanism for visual speech recognition

    Vishnu Chandrabanshi, S Domnic

    DOI Link

    Deep learning techniques have demonstrated early advancements in addressing the challenges of complex Visual Speech Recognition (VSR) tasks. Nonetheless, a persistent issue arises when distinguishing characters or words with similar pronunciations, known as homophones, which results in ambiguity. Existing VSR systems also face technical constraints due to insufficient visual data for learning short-duration phonemes like “at”, “an”, “a”, and “eight”. Moreover, cutting-edge VSR techniques perform exceptionally well when interpreting overlapping speakers; however, extending these methods to unseen speakers leads to a significant performance decline due to the limited diversity of the training dataset and substantial variations in physical attributes, such as lip shape and color, across different speakers. To address these challenges, we propose a multi-modal approach that leverages visual and landmark information to capture complex spatio-temporal patterns and improve the model's generalization capability. The model employs a multi-layered Three-Dimensional Convolutional Neural Network (3D-CNN) that extracts visual features, while a Graph Convolutional Network (GCN) captures precise landmark information for accurate lip shape localization. The extracted features are then fused and processed by a Sequence-to-Sequence (Seq2Seq) model based on the attention mechanism. The proposed model achieved WERs of 0.53% and 8.21% for the overlapping and unseen speaker categories, respectively. Notably, these results surpass existing models, demonstrating remarkable accuracy for VSR on the GRID dataset in both scenarios. (A sketch of the two-branch front end appears after the publication list.)
  • A novel framework using 3D-CNN and BiLSTM model with dynamic learning rate scheduler for visual speech recognition

    Vishnu Chandrabanshi, S Domnic

    DOI Link

    Visual Speech Recognition (VSR) is an appealing technology for predicting and analyzing spoken language based on lip movements. Previous research in this area has primarily concentrated on leveraging both audio and visual cues to achieve enhanced accuracy in speech recognition. However, existing solutions encounter significant limitations, including inadequate training data, variations in speech patterns, and visually similar homophones, which call for more comprehensive feature representations to improve accuracy. This article presents a novel deep learning model for word-level VSR. In this study, we introduce a dynamic learning rate scheduler that adapts the learning rate during model training. Additionally, we employ an optimized Three-Dimensional Convolutional Neural Network for extracting spatio-temporal features. To enhance context processing and ensure accurate mapping of input sequences to output sequences, we combine Bidirectional Long Short-Term Memory with the CTC loss function. We utilize the GRID dataset to assess word-level metrics, including Word Error Rate (WER) and Word Recognition Rate (WRR). The model achieves a 1.11% WER and a 98.89% WRR for overlapped speakers, demonstrating that our strategy outperforms existing VSR methods. Practical implications: the proposed work aims to elevate the accuracy of VSR, facilitating its seamless integration into real-time applications. The VSR model finds applications in liveness detection for person authentication, improving password security by not relying on written or spoken passcodes, underwater communication, and aiding individuals with hearing and speech impairments in the medical field. (A sketch of CTC training with a dynamic learning-rate scheduler appears after the publication list.)
  • HNet: A deep learning based hybrid network for speaker dependent visual speech recognition

    Vishnu Chandrabanshi, S Domnic

    DOI Link

    Visual Speech Recognition (VSR) is a popular area in computer vision research, attracting interest for its ability to precisely analyze lip motion and seamlessly convert it into a textual representation. VSR systems leverage visual features to augment automated speech understanding and predict text. VSR finds various applications, including enhancing speech recognition in scenarios with degraded acoustic signals, aiding individuals with hearing impairments, bolstering security by reducing reliance on text-based passwords, facilitating biometric authentication for liveness detection, and enabling underwater communication. Despite the various techniques proposed for improving the resilience and precision of automatic speech recognition, VSR still faces challenges such as homophones, gradient issues with varying sequence lengths, and the need to account for short- and long-range correlations between consecutive video frames. We have proposed a hybrid network (HNet) with a multilayered dilated three-dimensional convolutional neural network (3D-CNN). The dilated 3D-CNN facilitates spatio-temporal feature extraction. HNet integrates two bidirectional recurrent neural networks (BiGRU and BiLSTM) to process the feature sequences bidirectionally and establish temporal relationships. The fusion of BiGRU and BiLSTM capabilities allows the model to process feature sequences more comprehensively and effectively. The proposed work focuses on face-based biometric authentication for liveness detection using the VSR model to boost security against face spoofing. Existing face-based biometric systems are widely used for individual authentication and verification but remain vulnerable to 3D masks and adversarial attacks. The VSR system can be added to existing face-based verification systems as a second-level authentication technique to verify a person's liveness. The VSR system operates on a challenge-response technique, where a person silently pronounces a passcode displayed on the screen. Effectiveness is assessed using word error rate (WER), which matches the pronounced passcode against the one presented on the screen. Overall, the proposed work aims to enhance the accuracy of VSR so that it can be combined with existing face-based authentication systems. The proposed system outperforms existing VSR systems, obtaining a 1.3% WER. The significance of the proposed hybrid model is that it efficiently captures temporal dependencies, enhances context embedding, improves robustness to input variability, reduces information loss, and improves accuracy in modeling and analyzing passcode pronunciation patterns. (A sketch of a dilated 3D convolution block appears after the publication list.)
  • Applying Cognitive and Neural Network Approach over Control Flow Graph for Software Defect Prediction

    Kumar Rajnish, Vandana Bhattacharjee, Vishnu Chandrabanshi

    DOI Link
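
The encoder-decoder described in "A deep learning approach for strengthening person identification..." can be pictured with a minimal PyTorch sketch. Everything below is illustrative: the layer widths, kernel sizes, pooling, and the use of nn.MultiheadAttention for the MHA stage are assumptions, not the paper's reported configuration.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, feat=128):
            super().__init__()
            # 3D-CNN front end: (B, 1, T, H, W) -> spatio-temporal features
            self.cnn = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.MaxPool3d((1, 2, 2)),
                nn.Conv3d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep T, pool space to 4x4
            )
            # parallel bidirectional branches, fused by concatenation
            self.bigru = nn.GRU(64 * 16, feat, bidirectional=True, batch_first=True)
            self.bilstm = nn.LSTM(64 * 16, feat, bidirectional=True, batch_first=True)

        def forward(self, x):                        # x: (B, 1, T, H, W)
            f = self.cnn(x)                          # (B, 64, T, 4, 4)
            f = f.permute(0, 2, 1, 3, 4).flatten(2)  # (B, T, 1024)
            g, _ = self.bigru(f)                     # (B, T, 2*feat)
            l, _ = self.bilstm(f)                    # (B, T, 2*feat)
            return torch.cat([g, l], dim=-1)         # fused BiGRU-BiLSTM features

    class Decoder(nn.Module):
        def __init__(self, feat=128, vocab=28):      # e.g. 26 letters + space + blank
            super().__init__()
            self.mha = nn.MultiheadAttention(4 * feat, num_heads=4, batch_first=True)
            self.out = nn.Linear(4 * feat, vocab)

        def forward(self, enc):
            ctx, _ = self.mha(enc, enc, enc)         # attention over the frame axis
            return self.out(ctx)                     # per-frame character logits

    video = torch.randn(2, 1, 75, 50, 100)           # two clips of 75 lip-crop frames
    print(Decoder()(Encoder()(video)).shape)         # torch.Size([2, 75, 28])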
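
A hedged sketch of the Binary Authentication Protocol's challenge-response flow follows; face_verify, vsr_decode, and capture_video are hypothetical stand-ins for the real face and VSR models, the word list is invented, and the WER threshold is arbitrary.

    import random

    WORDS = ["blue", "seven", "alpha", "north", "nine", "delta"]

    def wer(ref, hyp):
        # word error rate: edit distance between word sequences / reference length
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            for j in range(len(h) + 1):
                if i == 0:
                    d[i][j] = j
                elif j == 0:
                    d[i][j] = i
                else:
                    d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                                  d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
        return d[-1][-1] / max(len(r), 1)

    def authenticate(face_verify, vsr_decode, capture_video, threshold=0.1):
        # step 1: display a randomly generated passcode as the challenge
        passcode = " ".join(random.sample(WORDS, 3))
        video = capture_video(passcode)      # user silently mouths the passcode
        # step 2: ordinary face verification on the captured frames
        if not face_verify(video):
            return False
        # step 3: liveness -- the decoded lip movements must match the challenge
        return wer(passcode, vsr_decode(video)) <= threshold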
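
For "Leveraging 3D-CNN and graph neural network with attention mechanism...", the two-branch front end might be sketched as below. The 20-node chain graph over lip landmarks, the layer sizes, and the simple row-normalised graph convolution are assumptions rather than the published design; the fused output would feed the attention-based Seq2Seq model.

    import torch
    import torch.nn as nn

    class GraphConv(nn.Module):
        # one graph convolution layer: H' = ReLU(A_hat @ H @ W)
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin = nn.Linear(in_dim, out_dim)

        def forward(self, h, a_hat):                 # h: (B, T, N, in_dim)
            return torch.relu(self.lin(a_hat @ h))

    class TwoBranchFrontEnd(nn.Module):
        def __init__(self, d=64):
            super().__init__()
            self.cnn = nn.Sequential(                # visual branch over lip crops
                nn.Conv3d(1, d, (3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)))
            self.gcn = GraphConv(2, d)               # landmark branch over (x, y) nodes

        def forward(self, frames, landmarks, adj):
            # frames: (B, 1, T, H, W); landmarks: (B, T, N, 2); adj: (N, N)
            a_hat = adj / adj.sum(-1, keepdim=True)  # row-normalised adjacency
            v = self.cnn(frames).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, d)
            g = self.gcn(landmarks, a_hat).mean(2)   # pool graph nodes -> (B, T, d)
            return torch.cat([v, g], -1)             # fused input for the Seq2Seq stage

    # chain graph linking 20 neighbouring lip landmarks, plus self-loops
    adj = torch.eye(20) + torch.diag(torch.ones(19), 1) + torch.diag(torch.ones(19), -1)
    out = TwoBranchFrontEnd()(torch.randn(2, 1, 75, 50, 100),
                              torch.randn(2, 75, 20, 2), adj)
    print(out.shape)                                 # torch.Size([2, 75, 128])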
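
For "A novel framework using 3D-CNN and BiLSTM model with dynamic learning rate scheduler...", note that WRR is simply 100% minus WER (98.89% = 100% - 1.11%). The toy loop below pairs CTC loss with a plateau-based scheduler; the paper's exact scheduler and network are not given on this page, so ReduceLROnPlateau and the tiny linear model are stand-ins.

    import torch
    import torch.nn as nn

    # toy stand-in for the 3D-CNN + BiLSTM network: frame features -> char logits
    model = nn.Sequential(nn.Linear(96, 64), nn.ReLU(), nn.Linear(64, 28))
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=2)

    for epoch in range(5):
        feats = torch.randn(75, 4, 96)               # (T, batch, feature) dummy batch
        log_probs = model(feats).log_softmax(-1)     # CTC expects log-probabilities
        targets = torch.randint(1, 28, (4, 6))       # label indices; 0 is the blank
        loss = ctc(log_probs, targets,
                   torch.full((4,), 75), torch.full((4,), 6))
        opt.zero_grad()
        loss.backward()
        opt.step()
        # the "dynamic" part: shrink the learning rate when the loss plateaus
        sched.step(loss.item())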
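
HNet's front end relies on dilated 3D convolutions to widen the temporal receptive field without shortening the sequence. One plausible block, with channel counts and the dilation schedule assumed for illustration:

    import torch
    import torch.nn as nn

    class Dilated3DBlock(nn.Module):
        def __init__(self, cin, cout, dilation):
            super().__init__()
            # dilate only the time axis; padding keeps T, H and W unchanged
            self.conv = nn.Conv3d(cin, cout, kernel_size=3,
                                  dilation=(dilation, 1, 1),
                                  padding=(dilation, 1, 1))

        def forward(self, x):
            return torch.relu(self.conv(x))

    # dilations 1, 2, 4 give a temporal receptive field of 15 frames
    # (1 + 2*(1 + 2 + 4)) while preserving length for the BiGRU-BiLSTM stage
    net = nn.Sequential(Dilated3DBlock(1, 32, 1),
                        Dilated3DBlock(32, 32, 2),
                        Dilated3DBlock(32, 32, 4))
    print(net(torch.randn(1, 1, 75, 50, 100)).shape)  # torch.Size([1, 32, 75, 50, 100])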

Patents

Projects

Scholars

Interests

  • Biometric Authentication
  • Image & Video Analytics
  • Speech Processing

Thought Leaderships

There are no Thought Leaderships associated with this faculty.

Top Achievements
