Dr Sawan Rai

Assistant Professor

Department of Computer Science and Engineering

Contact Details

sawan.r@srmap.edu.in

Office Location

Homi J Bhabha Block, Level 2, Cabin No: 12

Education

  • 2022: PDPM IIITDM, Jabalpur, India
  • 2017: M.Tech, PDPM IIITDM Burla Sambalpur Jabalpur, India
  • 2015: B.E., R.G.P.V.

Experience

  • July 2022 to Dec. 2023: Assistant Professor at Bennett University, Greater Noida
  • Jan. 2022 to June 2022: Assistant Professor at ITER-SOA, Bhubaneswar

Research Interests

  • Source Code Documentation
  • Programming Languages

Publications

  • Large scale annotated dataset for code-mix abusive short noisy text

    Dr Sawan Rai, Paras Tiwari, C Ravindranath Chowdary

    Source Title: Language Resources and Evaluation, Quartile: Q1

    With globalization and cultural exchange around the globe, most of the population has gained knowledge of at least two languages. The bilingual user base on Social Media Platforms (SMPs) has significantly contributed to the popularity of code-mixing. However, apart from their many vital uses, SMPs also suffer from abusive text content. Identifying abusive instances in a single language is a challenging task, and it is even more challenging for code-mixed text. The abusive post detection problem is more complicated than it seems because of unseemly, noisy data and uncertain context. To analyze this content, the research community needs an appropriate dataset; a small dataset is not a suitable sample for research work. In this paper, we analyze the dimensions of Devanagari-Roman code-mixing in short noisy text and discuss the challenges of abusive instances. We propose a cost-effective methodology, with a 20.38% relevancy score, to collect and annotate code-mix abusive text instances. Our dataset is eight times the size of the related state-of-the-art dataset and is balanced, with 55.81% of instances in the abusive class and 44.19% in the non-abusive class. To verify the usefulness of the dataset, we performed experiments with traditional machine learning techniques, a traditional neural network architecture, recurrent neural network architectures, and a pre-trained Large Language Model (LLM). From these experiments, we observed the suitability of the dataset for further scientific work.

Interests

  • Deep Learning
  • Natural Language Processing
  • Software Engineering
