Dr Bidhan Barai

Assistant Professor

Department of Computer Science and Engineering

Contact Details

bidhan.b@srmap.edu.in

Office Location

Homi J Bhabha Block, Level 3, Cubicle No. 30

Education

  • 2022 – Ph.D. – Jadavpur University, India
  • 2011 – M.Tech. – Jadavpur University, India
  • 2007 – B.E. – Jadavpur University, India

Experience

  • 2012-2014 – Project Linked Person – CVPR & ISRU (formerly BIRU) Division, Indian Statistical Institute, Kolkata
  • March 2022 – July 2022 – Assistant Professor, Swami Vivekananda Institute of Science & Technology, Kolkata
  • August 2022 – February 2025 – Assistant Professor, CSE(AI&ML), Techno Main Salt Lake, Kolkata

Research Interests

  • Design and implementation of machine learning and pattern recognition algorithms (such as SVMs and GMMs) to identify or verify a person from their speech characteristics, with an emphasis on improving robustness by extracting discriminative features (such as MFCCs, i-vectors, and x-vectors); a minimal pipeline sketch follows this list.
  • Manipulation and analysis of digital images using algorithms to extract information and classify objects. Application of ML models (such as CNN, RNN, and R-CNN) to tasks such as object detection, segmentation, and recognition.
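
To make the first interest concrete, here is a minimal sketch of a closed-set speaker identification pipeline of the kind described above: MFCCs are extracted per utterance, one GMM is trained per enrolled speaker, and a test utterance is assigned to the speaker whose model gives the maximum log-likelihood. The file names and parameter values (13 MFCCs, 16 diagonal-covariance mixtures) are illustrative assumptions, not settings taken from the publications below.

    # Minimal closed-set speaker identification sketch: MFCC features plus one
    # GMM per enrolled speaker, scored by maximum log-likelihood (MLL).
    # Requires: pip install numpy librosa scikit-learn
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_features(wav_path, sr=16000, n_mfcc=13):
        """Load audio and return an (n_frames, n_mfcc) MFCC matrix."""
        y, sr = librosa.load(wav_path, sr=sr)
        # librosa returns (n_mfcc, n_frames); transpose so rows are frame vectors.
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    def train_speaker_models(enroll_wavs, n_components=16):
        """enroll_wavs: dict speaker_id -> list of wav paths. Returns dict of GMMs."""
        models = {}
        for speaker, paths in enroll_wavs.items():
            feats = np.vstack([mfcc_features(p) for p in paths])
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag",
                                  reg_covar=1e-3)  # guards against the covariance
            models[speaker] = gmm.fit(feats)       # singularity problem noted below
        return models

    def identify(models, wav_path):
        """Return the speaker whose GMM gives the highest average log-likelihood."""
        feats = mfcc_features(wav_path)
        scores = {spk: gmm.score(feats) for spk, gmm in models.items()}
        return max(scores, key=scores.get)

    # Hypothetical usage (file names are placeholders):
    # models = train_speaker_models({"spk01": ["spk01_a.wav"], "spk02": ["spk02_a.wav"]})
    # print(identify(models, "unknown.wav"))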

Awards & Fellowships

  • 2009 – Qualified Graduate Aptitude Test in Engineering (GATE) – MHRD, India
  • 2015-2020 – RGNF Fellowship for Ph.D. – UGC, India

Publications

  • Towards Classifying Bird Sounds Using a Deep Transfer Learning Model

    Dey S., Ghosh S., Mondal S., Harh A., Bandhu S., Barai B., Singh P.K.

    Conference paper, Lecture Notes in Networks and Systems, 2026, DOI Link

    The conservation of bird biodiversity relies on accurately identifying and classifying species, which is often time-consuming and requires specialized knowledge. Recent advances in deep learning, particularly convolutional neural networks (CNNs), have made it possible to detect species passively from acoustic signals, even in challenging environments. This paper presents a high-performance deep CNN model based on the VGG-16 architecture for the passive classification of bird sounds from Short-Time Fourier Transform (STFT) spectrograms, achieving 97.31% accuracy on the BirdCLEF 2022 dataset and 98.41% on the Cornell Birdcall Identification dataset. The model discriminates between species even in complex soundscapes with overlapping recordings. The framework also employs a tool-based consensus mechanism to sharpen the focus on relevant features, improving classification accuracy for rare and endangered species. The method is also effective in various phonological and language-processing tasks and enhances the model's robustness, making it suitable for real-world applications.
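
As a hedged illustration of the STFT-plus-CNN recipe this abstract describes (not the authors' code), the sketch below converts an audio clip into a log-magnitude STFT spectrogram and attaches a new classification head to an ImageNet-pretrained VGG-16 backbone. The sampling rate, input size, and class count are assumptions.

    # Illustrative sketch (assumptions, not the paper's code): log-STFT
    # spectrogram fed to a VGG-16 backbone for bird-sound classification.
    # Requires: pip install numpy librosa tensorflow
    import numpy as np
    import librosa
    import tensorflow as tf

    NUM_CLASSES = 50   # placeholder: number of species in the classification subset

    def audio_to_spectrogram(wav_path, sr=32000, n_fft=1024, hop=512):
        """Return a (224, 224, 3) log-magnitude STFT image for VGG-16."""
        y, sr = librosa.load(wav_path, sr=sr)
        spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
        spec = librosa.amplitude_to_db(spec, ref=np.max)        # log scale
        spec = (spec - spec.min()) / (spec.max() - spec.min())  # scale to 0..1
        img = tf.image.resize(spec[..., None], (224, 224))      # H x W x 1
        return tf.repeat(img, 3, axis=-1)                       # replicate to RGB

    # VGG-16 backbone with a new classification head (transfer learning).
    base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3))
    base.trainable = False   # freeze the pretrained convolutional layers
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])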
  • An Efficient Deep Convolutional Neural Network for Automated Bird Sound Classification

    Harh A., Bandhu S., Barai B., Das N., Singh P.K.

    Conference paper, Lecture Notes in Networks and Systems, 2025, DOI Link

    Classification of bird species from their chirping aids biodiversity monitoring, conservation, and ecological research. In this work, we examine the effectiveness of a Convolutional Neural Network (CNN) for this task, incorporating the layers of a well-established transfer learning model, InceptionResNetV2. Our research is based on the 'Bird Song' dataset, obtained from Xeno-Canto via Kaggle, and the 'British Bird Song' dataset. We use the Short-Time Fourier Transform (STFT) to extract key auditory properties from these datasets; transforming the audio files into chromagrams allows image-based classification approaches to be applied to the audio data. With our custom CNN architecture, we outperform a number of current methods, attaining accuracy rates of 94.46% and 97.02% on the respective datasets. Our research offers insight into the effectiveness and applicability of bird sound classification, and the performance of our customized architecture demonstrates the potential of tailored solutions in this domain. The results pave the way for future developments in birdsong research and have implications for understanding ecosystems, identifying bird species, monitoring the environment, and protecting wildlife.
  • A Hybrid Deep Learning Framework for Text-Independent Automatic Speaker Recognition System

    Harh A., Bandhu S., Barai B., Singh P.K.

    Conference paper, Lecture Notes in Networks and Systems, 2024, DOI Link

    A variety of handcrafted features and techniques for Speaker Recognition (SR), which combines Speaker Identification (SI) and Speaker Verification (SV), have been developed over the past five decades. Automatic speaker recognition (ASR), also referred to as voice biometric recognition, is one method of human biometric identification. ASR plays a crucial part in many applications, including voice assistants and transcription services, and has attracted considerable attention lately. This paper develops a hybrid deep learning framework for a text-independent ASR system. The framework integrates Gated Recurrent Units in a Recurrent Neural Network (RNN-GRU) with a Connectionist Temporal Classification (CTC) loss on top of the layers of a 2D Convolutional Neural Network (2D-CNN). The model has been evaluated on the benchmark LJSpeech dataset using Error Rate (ER) as the performance metric. The achieved ER of about 16–17% after 20 epochs underscores promising progress in ASR technology, setting the stage for continued advancements in this field.
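
Below is a hedged sketch of the kind of 2D-CNN plus bidirectional-GRU plus CTC architecture the abstract outlines; the layer sizes, vocabulary, and fixed frame count are assumptions rather than the published configuration.

    # Illustrative 2D-CNN + GRU + CTC model sketch (assumed shapes and sizes,
    # not the paper's exact configuration). Input: (frames, freq_bins, 1).
    import tensorflow as tf

    VOCAB_SIZE = 28    # assumption: a-z, space, apostrophe; +1 CTC blank below
    MAX_FRAMES = 256   # assumption: spectrograms padded/truncated to this length

    inputs = tf.keras.Input(shape=(MAX_FRAMES, 128, 1), name="spectrogram")
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2))(x)  # pool frequency only,
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2))(x)  # keep time resolution
    # Collapse the frequency axis so each time step becomes one feature vector.
    x = tf.keras.layers.Reshape((MAX_FRAMES, 32 * 64))(x)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, return_sequences=True))(x)
    logits = tf.keras.layers.Dense(VOCAB_SIZE + 1)(x)      # +1 for the CTC blank

    model = tf.keras.Model(inputs, logits)

    def ctc_loss(labels, logits, label_len, logit_len):
        """Wrapper around tf.nn.ctc_loss; blank index is the last logit."""
        return tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels, logits=logits,
            label_length=label_len, logit_length=logit_len,
            logits_time_major=False, blank_index=VOCAB_SIZE))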
  • A Chain Code-Based Methodology for Loop Closure Detection in Digital Images

    Das R., Barai B., Singh P.K.

    Conference paper, Lecture Notes in Networks and Systems, 2024, DOI Link

    In this research, we introduce a technique for real-time detection of the loop(s) present in digital images. The proposed model is based on the chain code method and is implemented on two self-prepared datasets. A distinctive type of contour is used to find the loops in source images. Real-world applications of the proposed method demonstrate its effectiveness, and the results show strong performance, while also indicating that the usefulness of such descriptors can vary greatly depending on the assessment criteria used. The method achieved 95% accuracy on our dataset. The technique is applicable in various domains, including digital cinematography, astronomy, biometric security, and building extraction from overhead satellite photos. The major objective has been achieved, since the proposed model outperforms previously published works.
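
The paper's exact algorithm is not reproduced here; as a toy illustration of the underlying chain-code idea, the sketch below walks a contour using 8-direction Freeman codes and reports a closed loop when the walk returns to its starting pixel.

    # Toy Freeman chain-code walk (illustrative only, not the paper's method):
    # follow 8-direction codes from a start pixel and detect a closed loop when
    # the walk returns to where it began.

    # Direction 0..7 -> (row, col) offsets, counter-clockwise from east.
    OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
               (0, -1), (1, -1), (1, 0), (1, 1)]

    def is_closed_loop(start, chain_code):
        """Return True if the chain code brings the walk back to `start`."""
        r, c = start
        for d in chain_code:
            dr, dc = OFFSETS[d]
            r, c = r + dr, c + dc
        return (r, c) == start

    # A tiny square traced clockwise from its top-left corner:
    # east (0), south (6), west (4), north (2).
    print(is_closed_loop((0, 0), [0, 6, 4, 2]))  # True: the contour closes
    print(is_closed_loop((0, 0), [0, 0, 6]))     # False: open path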
  • An empirical study on analysis window functions for text-independent speaker recognition

    Barai B., Das N., Basu S., Nasipuri M.

    Article, International Journal of Speech Technology, 2023, DOI Link

    This paper describes the effect of analysis window functions on the performance of Mel Frequency Cepstral Coefficient (MFCC) based speaker recognition (SR). The MFCCs of a speech signal are extracted from fixed-length frames using the Short Time Fourier Analysis (STFA) technique, where an appropriate analysis window function is required to extract frames from the complete speech signal of a speaker prior to STFA. The resulting frames yield the MFCC feature vectors that uniquely represent the speaker in the feature space. For recognition, Vector Quantization (VQ), Gaussian Mixture Model (GMM), and Universal Background Model GMM (UBM-GMM) based classifiers are used and compared. State-of-the-art MFCC feature extraction generally uses the Hamming (abbreviated Ham in parts of this paper) window function, but here we also examine the effect of other window functions on SR, such as the rectangular, Hann, B-spline, polynomial, adjustable, hybrid, and Lanczos windows, and evaluate text-independent speaker identification (SI). We also use a voice activity detector (VAD) to discard silence frames before STFA; removing silence frames improves SR performance because the MFCCs of silent frames contaminate the MFCC feature space. The IITG MV SR database contains speech recorded by different devices, namely D01, H01, T01, M01, and M02, in different environments, languages, and sessions, which is why the database is called multi-variability. We observe that the VQ classifier performs better than the GMM-based classifiers on this database, while the VQ-GMM and VQ-UBM-GMM classifiers and their combination suffer from the singularity problem of the covariance matrix. We therefore evaluate device D01 with all the classifiers, and use the three classifiers GMM, UBM-GMM, and VQ for the remaining four recording devices (H01, T01, M01, M02), because all other classifiers suffer from the covariance-matrix singularity problem in SI. VQ provides the highest accuracy for all the devices.
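
To make the experimental knob concrete, here is a small sketch (independent of the paper's setup) of the step the study varies: slicing a signal into fixed-length frames and multiplying each frame by a chosen analysis window before the FFT. The frame length, hop size, and window names are illustrative.

    # Framing + analysis windowing: the step this study varies before the FFT.
    # The window choice (rectangular/"boxcar", Hamming, Hann, ...) is the knob.
    # Requires: pip install numpy scipy
    import numpy as np
    from scipy.signal import get_window

    def windowed_frames(signal, frame_len=400, hop=160, window_name="hamming"):
        """Return an (n_frames, frame_len) array of windowed frames."""
        win = get_window(window_name, frame_len)  # e.g. "boxcar", "hamming", "hann"
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop : i * hop + frame_len]
                           for i in range(n_frames)])
        return frames * win                       # broadcast window over frames

    # Compare the spectrum of one frame under different windows (illustrative):
    sig = np.random.randn(16000)                  # stand-in for 1 s of 16 kHz speech
    for name in ("boxcar", "hamming", "hann"):
        spec = np.abs(np.fft.rfft(windowed_frames(sig, window_name=name)[0]))
        print(name, float(spec.max()))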
  • Closed-set speaker identification using VQ and GMM based models

    Barai B., Chakraborty T., Das N., Basu S., Nasipuri M.

    Article, International Journal of Speech Technology, 2022, DOI Link

    An array of features and methods has been developed over the past six decades for Speaker Identification (SI) and Speaker Verification (SV), jointly known as Speaker Recognition (SR). Mel Frequency Cepstral Coefficients (MFCCs) are generally used as feature vectors because they give higher accuracy than other features. This paper focuses on a comparative study of state-of-the-art SR techniques along with their design challenges, robustness issues, and performance evaluation methods. Rigorous experiments have been performed and discussed in detail using the Gaussian Mixture Model (GMM) and variations such as the Universal Background Model (UBM), Vector Quantization (VQ), and VQ-based UBM-GMM (VQ-UBM-GMM). Other popular methods are included for comparison only, namely Linear Discriminant Analysis (LDA), Probabilistic LDA (PLDA), Gaussian PLDA (GPLDA), Multi-condition GPLDA (MGPLDA), and the Identity Vector (i-vector). Three popular audio datasets are used in the experiments: IITG-MV SR, Hyke-2011, and ELSDSR. Hyke-2011 and ELSDSR contain clean speech, while IITG-MV SR contains noisy audio with variations in channel (device), environment, and speaking style. We propose a new data mixing approach for SR to make the system independent of recording device, speaking style, and environment. The accuracy obtained for the VQ and GMM based methods varies from 99.6% to 100% on Hyke-2011 and ELSDSR, whereas the accuracy on IITG-MV SR is up to 98%. In some cases the accuracy degrades drastically due to mismatch between training and testing data, as well as the singularity problem of the GMM. The experimental results serve as a benchmark for VQ/GMM/UBM based methods on the IITG-MV SR database.
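
As a companion to the GMM sketch after the Research Interests section, here is a minimal VQ-based identification sketch under the same caveats: one k-means codebook per speaker, with the test utterance assigned to the codebook yielding the lowest average quantization distortion. The codebook size is an arbitrary choice.

    # Minimal VQ speaker-identification sketch (illustrative assumptions): one
    # k-means codebook per speaker; the test utterance goes to the codebook
    # with the lowest average quantization distortion.
    # Requires: pip install numpy scikit-learn
    import numpy as np
    from sklearn.cluster import KMeans

    def train_codebook(features, codebook_size=32):
        """features: (n_frames, dim) MFCC matrix -> fitted KMeans codebook."""
        return KMeans(n_clusters=codebook_size, n_init=5).fit(features)

    def distortion(codebook, features):
        """Mean squared distance from each frame to its nearest codeword."""
        dists = codebook.transform(features)      # (n_frames, codebook_size)
        return float((dists.min(axis=1) ** 2).mean())

    def identify_vq(codebooks, features):
        """codebooks: dict speaker_id -> KMeans. Lowest distortion wins."""
        return min(codebooks, key=lambda spk: distortion(codebooks[spk], features))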
  • Closed-Set Device-Independent Speaker Identification Using CNN

    Chakraborty T., Barai B., Chatterjee B., Das N., Basu S., Nasipuri M.

    Conference paper, Advances in Intelligent Systems and Computing, 2020, DOI Link

    Speaker Identification (SI) has numerous real-world applications. Traditional classifiers such as Gaussian Mixture Models (GMM), Support Vector Machines (SVM), and Hidden Markov Models (HMM) were used earlier for SI, with features such as Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC) generated first. These approaches, however, do not perform well when audio data are captured through multiple devices and recorded in different environments, i.e., under mismatch conditions. Machine Learning (ML) approaches usually provide better accuracy and have therefore become more popular; Restricted Boltzmann Machines (RBM), Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNN) are some of the ML approaches applied to SI. In this paper, a CNN is used for automatic feature extraction and speaker classification on the noisy IITG-MV dataset. The CNN performs better than the GMM, especially in the device-mismatch case.
  • VQ/GMM-based speaker identification with emphasis on language dependency

    Barai B., Das D., Das N., Basu S., Nasipuri M.

    Conference paper, Advances in Intelligent Systems and Computing, 2019, DOI Link

    The biometric recognition of humans through the speech signal is known as automatic speaker recognition (ASR) or voice biometric recognition. Plenty of acoustic features have been used in ASR so far, but among them Mel-frequency cepstral coefficients (MFCCs) and Gammatone frequency cepstral coefficients (GFCCs) are the most popular. To make ASR language- and channel-independent (when training and testing microphones or languages differ), the i-vector feature and variability compensation techniques such as linear discriminant analysis (LDA), probabilistic LDA (PLDA), and within-class covariance normalization (WCCN) are extensively used. The modeling/classification techniques in current use include Gaussian mixture models (GMMs), vector quantization (VQ), hidden Markov models (HMM), deep neural networks (DNN), and artificial neural networks (ANN). Sometimes, model-domain normalization techniques are used to compensate for unwanted variability due to language and channel mismatch between training and testing data. In this paper, we evaluate the performance of MFCC- and VQ/GMM-based ASR on the ELSDSR, Hyke-2011, and IITG-MV SR Phase-I & II databases, using maximum log-likelihood (MLL) scoring for the recognition of speakers. The experiments examine the language dependency and the environmental mismatch between training and testing data.
  • Closed-set text-independent automatic speaker recognition system using VQ/GMM

    Barai B., Das D., Das N., Basu S., Nasipuri M.

    Conference paper, Advances in Intelligent Systems and Computing, 2018, DOI Link

    Automatic speaker recognition (ASR) is a type of biometric recognition of humans, known as voice biometric recognition. Among the many acoustic features, Mel-Frequency Cepstral Coefficients (MFCCs) and Gammatone Frequency Cepstral Coefficients (GFCCs) are popularly used in ASR. The state-of-the-art modeling/classification techniques include Vector Quantization (VQ), Gaussian Mixture Models (GMMs), Hidden Markov Models (HMM), Artificial Neural Networks (ANN), and Deep Neural Networks (DNN). In this paper, we report our experimental results on three databases, namely Hyke-2011, ELSDSR, and IITG-MV SR Phase-I, using MFCCs and VQ/GMM with maximum log-likelihood (MLL) scoring for the recognition of speakers, and analyze the effect of the number of Gaussian components as well as the Mel-scale filter bank's minimum frequency. By choosing appropriate Gaussian components and minimum frequency, the accuracy is increased by 10–20% in noisy environments.
  • An ASR system using MFCC and VQ/GMM with emphasis on environmental dependency

    Barai B., Das D., Das N., Basu S., Nasipuri M.

    Conference paper, 2017 IEEE Calcutta Conference, CALCON 2017 - Proceedings, 2017, DOI Link

    Automatic speaker recognition (ASR), also known as voice biometric recognition, has remained a very popular research area for over six decades. Among the acoustic features used in ASR, Mel-frequency Cepstral Coefficients (MFCCs) and Gammatone Frequency Cepstral Coefficients (GFCCs) are the most popular. To make ASR environment-independent, Relative Spectral (RASTA) filtering is applied before feature extraction, along with normalization techniques in the feature, model, and score (classification) domains. The modeling/classification techniques in present use include Vector Quantization (VQ), Support Vector Machines (SVM), Gaussian Mixture Models (GMMs), Hidden Markov Models (HMM), Artificial Neural Networks (ANN), and Deep Neural Networks (DNN). In this paper we report our experimental results on three databases, namely Hyke-2011, ELSDSR, and IITG-MV SR Phase-I, using MFCCs and VQ/GMM with Maximum Log-Likelihood (MLL) scoring for the recognition of speakers. The experimental results under environmental mismatch for the IITG-MV SR Phase I & II databases are provided, along with an explanation of the accuracy degradation.

Interests

  • Artificial Intelligence
  • Machine Learning
  • Speaker Recognition
  • Speech Recognition
