Large scale annotated dataset for code-mix abusive short noisy text
Tiwari P., Rai S., Chowdary C.R.
Article, Language Resources and Evaluation, 2025, DOI Link
With globalization and cultural exchange around the globe, much of the population has gained knowledge of at least two languages. The bilingual user base on Social Media Platforms (SMPs) has significantly contributed to the popularity of code-mixing. However, apart from their many vital uses, SMPs also suffer from abusive text content. Identifying abusive instances in a single language is a challenging task, and it is even more challenging for code-mixed text. The abusive post detection problem is more complicated than it seems due to its unseemly, noisy data and uncertain context. To analyze such content, the research community needs an appropriate dataset; a small dataset is not a suitable sample for this research. In this paper, we analyze the dimensions of Devanagari-Roman code-mixing in short noisy text and discuss the challenges posed by abusive instances. We propose a cost-effective methodology, with a 20.38% relevancy score, to collect and annotate code-mixed abusive text instances. Our dataset is eight times the size of the related state-of-the-art dataset and is reasonably balanced, with 55.81% of instances in the abusive class and 44.19% in the non-abusive class. We also conduct experiments to verify the usefulness of the dataset, using traditional machine learning techniques, a traditional neural network architecture, recurrent neural network architectures, and a pre-trained Large Language Model (LLM). From these experiments, we observe that the dataset is suitable for further scientific work.
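The abstract mentions baseline experiments with traditional machine learning techniques. As a hedged illustration only (the paper's exact features, models, and data format are not specified here), a character n-gram TF-IDF with logistic regression is a common starting baseline for code-mixed abusive text classification:

```python
# Illustrative baseline only; the sample posts and model choice are assumptions,
# not the paper's reported setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["tu pagal hai kya", "aaj ka din accha tha"]   # hypothetical code-mixed posts
labels = [1, 0]                                          # 1 = abusive, 0 = non-abusive

# Character n-grams are robust to the spelling variation typical of Roman-script Hindi.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["kitna bura insaan hai"]))            # hypothetical new post
```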
A Review of Existing Conversational Recommendation Systems
Zaidi S., Rai S., Juneja K.
Conference paper, 2024 2nd International Conference on Disruptive Technologies, ICDT 2024, 2024, DOI Link
ChatGPT, Alexa, Siri, and Okay Google are an indispensable part of our lives today. These assistants, referred to as Digital Assistants, enable users to communicate their choices through natural language. Digital Assistants ease the customer's task of selecting items in various applications such as movies, songs, and so on. A system that supports making choices through natural language conversation is known as a Conversational Recommender System (CoRS). A CoRS is a dialogue-based model that aims to provide customers with accurate, high-quality recommendations. The interaction-oriented method gives the customer an edge over the traditional way of seeking recommendations. Traditional recommendation systems are static in nature and derive information from the customer's past history. A CoRS mitigates the challenges faced by earlier recommendation methods, such as cold start, wherein a new user is often recommended inaccurate choices. Other issues, such as data sparsity and a lack of diversity due to outdated content to choose from, are also common. A CoRS is dynamic in nature; it delivers high-quality choices by interpreting the customer's demands one dialogue at a time. This comprehensive survey aims to give an overview of the research in progress on using conversation as a means to achieve better results for recommendation systems.
Advanced Hierarchical Topic Labeling for Short Text
Tiwari P., Tripathi A., Singh A., Rai S.
Article, IEEE Access, 2023, DOI Link
Hierarchical Topic Modeling is a probabilistic approach for discovering latent topics distributed hierarchically among documents. The discovered topics are represented by their respective topic terms, and drawing an unambiguous conclusion from a topic-term distribution is a challenge for readers. Hierarchical topic labeling eases this challenge by providing an individual, appropriate label for each topic at every level. In this work, we propose a BERT-embedding-inspired methodology for labeling hierarchical topics in short text corpora. Short texts have gained significant popularity on multiple platforms in diverse domains, but the limited information they carry makes them difficult to deal with. In our work, we use three diverse short text datasets that include both structured and unstructured instances; such diversity ensures the broad application scope of this work. Considering the relevancy of the labels, the proposed methodology has been compared against both automatic and human annotators. Our proposed methodology outperformed the benchmark with average scores of 0.4185, 49.50, and 49.16 for cosine similarity, exact match, and partial match, respectively.
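As a rough sketch of the embedding-based label scoring idea (the model name, label candidates, and scoring scheme below are assumptions for illustration, not the paper's exact pipeline), candidate labels can be ranked by the cosine similarity between their BERT-style embedding and an embedding of the topic's top terms:

```python
# Hedged sketch: rank candidate labels for one topic by embedding similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")              # assumed model choice

topic_terms = ["battery", "charger", "screen", "warranty"]   # hypothetical topic
candidates  = ["mobile accessories", "food delivery", "laptop repair"]

topic_vec = model.encode([" ".join(topic_terms)])
cand_vecs = model.encode(candidates)

scores = cosine_similarity(topic_vec, cand_vecs)[0]
print(candidates[scores.argmax()], scores)                   # best-scoring label
```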
A Mathematical Model for the Effect of Vaccination on COVID-19 Epidemic Spread
Singh A., Rai S., Bajpai M.K.
Conference paper, Lecture Notes in Electrical Engineering, 2023, DOI Link
In many parts of the world, there is growing support for the development of a COVID-19 vaccine. This brief examines the challenges that the world must face in order to effectively produce and distribute a vaccine. It advocates for the vaccine to be distributed in a fair and equitable manner in order to save as many lives as possible. This manuscript is a multi-parameter mathematical model to justify a vaccine claim based on age, co-morbidity, and income. In this time of public health crisis, it is critical to create a regulatory framework for the distribution of vaccines and the allocation of scarce healthcare resources. Vaccination is an effective way to protect vulnerable people from infectious diseases. Different age groups of the population have different disease vulnerabilities and contact frequencies. This model’s primary motivation is to maximize the effects and optimize the distribution of vaccine doses to individual groups, thereby reducing the number of infectious people during an epidemic.
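The abstract does not spell out the model's equations. Purely as an illustration of the kind of formulation alluded to (the notation below is assumed, not taken from the paper), an age-structured compartmental model with a vaccination term and a limited daily dose budget can be written as:

```latex
\begin{aligned}
\frac{dS_i}{dt} &= -\beta S_i \sum_j c_{ij}\,\frac{I_j}{N_j} \;-\; v_i(t), \\
\frac{dI_i}{dt} &= \beta S_i \sum_j c_{ij}\,\frac{I_j}{N_j} \;-\; \gamma I_i, \\
\frac{dR_i}{dt} &= \gamma I_i \;+\; v_i(t),
\qquad \text{subject to } \sum_i v_i(t) \le V_{\max},
\end{aligned}
```

where S_i, I_i, R_i are the susceptible, infectious, and recovered counts in group i, c_ij is the contact rate between groups, and v_i(t) is the number of doses allocated to group i at time t; the allocation v_i(t) is then chosen to minimize cumulative infections.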
Investigating the Application of Multi-lingual Transformer in Graph-Based Extractive Text Summarization for Hindi Text
Rai S., Belwal R.C., Sharma A.
Conference paper, Lecture Notes in Networks and Systems, 2023, DOI Link
Generating a meaningful summary for a given natural language text is one of the most challenging and popular tasks in the present era. Researchers have come up with various techniques for abstractive and extractive summarization; this experimental study focuses on extractive summarization. In graph-based extractive text summarization techniques, the sentences of the input document are used as the nodes of the graph and various similarity measures are used to weight its edges. Each node's score is determined using graph ranking algorithms, and the top-ranked nodes (sentences) are then added to the output extractive summary. In this work, we first translate a publicly available dataset into Hindi text using the Google Translate service. Next, we apply a pre-trained multi-lingual transformer to generate an embedding vector for each sentence of the document and use these embedding vectors as the nodes of the graph; the rest of the approach remains unchanged. Finally, we evaluate the generated extractive summaries on the basis of the ROUGE score. The evaluation results indicate that using a pre-trained multi-lingual transformer can be effective in generating more meaningful extractive summaries.
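As a hedged sketch of this graph-based pipeline (the embedding model, similarity measure, and ranking algorithm below are common choices assumed for illustration, not necessarily the paper's exact configuration):

```python
# Sketch: sentence embeddings -> similarity graph -> PageRank -> top-k summary.
import numpy as np
import networkx as nx
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["पहला वाक्य ...", "दूसरा वाक्य ...", "तीसरा वाक्य ..."]  # Hindi sentences

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model
emb = model.encode(sentences)

sim = cosine_similarity(emb)
np.fill_diagonal(sim, 0.0)                     # no self-loops
graph = nx.from_numpy_array(sim)               # weighted, undirected sentence graph
scores = nx.pagerank(graph, weight="weight")

k = 2
top = sorted(scores, key=scores.get, reverse=True)[:k]
summary = [sentences[i] for i in sorted(top)]  # keep original sentence order
print(summary)
```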
Is the Corpus Ready for Machine Translation? A Case Study with Python to Pseudo-Code Corpus
Rai S., Belwal R.C., Gupta A.
Article, Arabian Journal for Science and Engineering, 2023, DOI Link
The availability of data is the driving force behind most of the state-of-the-art techniques for machine translation tasks. Understandably, this availability motivates researchers to propose new techniques and to claim superiority over existing ones using suitable evaluation measures. However, the performance of the underlying learning algorithms can be greatly influenced by the correctness and consistency of the corpus. We present our investigation into the relevance of a publicly available Python-to-pseudo-code parallel corpus for the automated documentation task, and into the studies performed using this corpus. We found that the corpus had many visible issues, such as overlapping instances, inconsistent translation styles, incompleteness, and misspelled words. We show that these discrepancies can significantly influence the performance of the learning algorithms, to the extent that they could have caused previous studies to draw incorrect conclusions. We performed our experimental study using statistical machine translation and neural machine translation models, and recorded a significant difference (∼10% in BLEU score) in the models' performance after removing the issues from the corpus.
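One of the reported issues is overlap between instances. As a hedged illustration of the kind of corpus-hygiene check involved (not the paper's actual audit procedure), exact or near-exact duplicates that leak across the train/test split can be flagged by normalizing and comparing each source line:

```python
# Sketch: flag train/test leakage caused by duplicated (overlapping) instances.
def normalize(code: str) -> str:
    # Collapse whitespace so trivial formatting differences do not hide duplicates.
    return " ".join(code.split())

train_src = ["x = x + 1", "return  a+b", "print('hi')"]   # hypothetical corpus lines
test_src  = ["return a+b", "y = 0"]

train_set = {normalize(s) for s in train_src}
leaked = [s for s in test_src if normalize(s) in train_set]
print(f"{len(leaked)} overlapping test instance(s):", leaked)
```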
Extractive text summarization using clustering-based topic modeling
Belwal R.C., Rai S., Gupta A.
Article, Soft Computing, 2023, DOI Link
Text summarization is the process of converting an input document into a short form while preserving its overall meaning. Text summarization is achieved primarily in two ways, i.e., abstractive and extractive. Extractive summarizers select a few of the best sentences from the input document, while abstractive methods may modify sentence structure or introduce new sentences. The proposed approach is an extractive text summarization technique in which we extend topic modeling to be applied to multiple lower-level specialized entities (i.e., groups) embedded in a single document. Our goal is to overcome the lack-of-coherence issues found in existing summarization techniques. Topic modeling was initially proposed to model text data at the multi-document and word levels, without considering sentence modeling. It has subsequently been applied at the sentence level and used for document summarization; however, certain limitations remain, and topic modeling does not perform as expected when applied to a single document at the sentence level. To address this shortcoming, we propose a summarization approach that operates at the individual document and cluster level (instead of the sentence level). We aim to choose the best sentence from each group (containing sentences of the same kind) found in the given text, selecting the most suitable topic by evaluating the probability distribution of the words and the respective topics at the cluster level. The method is evaluated on two standard datasets and shows significant performance gains over existing text summarization techniques. Compared to other text summarization techniques, the ROUGE metrics for automatic evaluation show a considerable improvement in the F-measure, precision, and recall of the generated summary. Furthermore, a manual evaluation has demonstrated that the proposed approach outperforms the current state-of-the-art text summarization approaches.
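As a rough sketch of the cluster-then-select idea (the vectorizer, clustering algorithm, and selection score below are assumptions chosen for illustration, not the paper's exact formulation), sentences can be grouped and one representative per group picked by how strongly it matches a topic of its cluster:

```python
# Sketch: cluster sentences, then pick one representative sentence per cluster.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "The battery lasts two days on a single charge.",
    "Charging is quick with the bundled adapter.",
    "The camera struggles in low light.",
    "Night photos come out grainy and dark.",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(sentences)

n_clusters = 2
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

lda = LatentDirichletAllocation(n_components=n_clusters, random_state=0)
sent_topics = lda.fit_transform(X)            # per-sentence topic probabilities

summary = []
for c in range(n_clusters):
    idx = [i for i, l in enumerate(labels) if l == c]
    # Representative = sentence with the highest peak topic probability in its cluster.
    best = max(idx, key=lambda i: sent_topics[i].max())
    summary.append(sentences[best])
print(summary)
```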
Accurate module name prediction using similarity based and sequence generation models
Rai S., Belwal R.C., Gupta A.
Article, Journal of Ambient Intelligence and Humanized Computing, 2023, DOI Link
Software code understanding depends strongly on identifier names; therefore, software developers spend a lot of time specifying appropriate names for variables, functions, classes, and files. Manually suggesting a useful name is a time-consuming and difficult problem for developers. Various techniques have been proposed for automatic identifier name recommendation, with most of the work targeting method and class name prediction. Module names play an important role when reusing software libraries to develop new source code: a good module name communicates purpose, while an inappropriate name creates ambiguity and frustration in the developer's mind. To the best of our knowledge, there is no prior work on module name suggestion or on the analysis of module names. In this paper, we focus on module names and propose a module name prediction approach. First, we extract module files from online Python projects to create a corpus. Next, we apply preprocessing steps to prepare the data for the prediction models. We construct four similarity-based models and three sequence generation models. The sequence generation models can predict module name tokens in sequence, while the similarity-based models can only suggest pre-stored module names. Experimental results indicate that the TF-IDF model performed best among all the models, followed by the three sequence generation models.
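As a hedged sketch of the best-performing similarity-based idea (the features and preprocessing used in the paper are not reproduced here), a TF-IDF model can suggest the name of the most similar already-stored module for a new module body:

```python
# Sketch: suggest a module name by nearest-neighbour lookup over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: tokenized module bodies paired with their module names.
corpus_bodies = [
    "def read_csv def write_csv open path rows",
    "def get request def post request url headers",
]
corpus_names = ["csv_utils", "http_client"]

vec = TfidfVectorizer()
corpus_vecs = vec.fit_transform(corpus_bodies)

new_body = "def fetch url def download response status"
sims = cosine_similarity(vec.transform([new_body]), corpus_vecs)[0]
print("suggested name:", corpus_names[sims.argmax()])   # -> http_client
```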
Effect of Identifier Tokenization on Automatic Source Code Documentation
Rai S., Belwal R.C., Gupta A.
Article, Arabian Journal for Science and Engineering, 2022, DOI Link
In software development, source code documents play an essential role in program comprehension and software maintenance. Natural language descriptions and identifier names are the main parts of a source code document, and automatically generating such documents saves developers' working hours. Automatic source code documentation is a rapidly growing research area. Researchers have proposed various template-based, information retrieval (IR)-based, and learning-based techniques for automatic source code documentation, but there is not much work on preprocessing and its effect on the task. Tokenization is one of the essential preprocessing steps. We found some important flaws in the basic tokenization steps that can affect automatic source code documentation performance, and we therefore propose an updated tokenization approach that removes these flaws. We performed method name prediction and comment generation studies to analyze the effect of the updated tokenization approach and found that it helped improve the performance of automatic source code documentation: name prediction and comment generation performance improved by more than 2.5% and 3.5%, respectively, in terms of F1 score.
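The abstract does not list the specific tokenization fixes. As a hedged illustration of the kind of identifier tokenization involved (the rules shown are assumptions, not the paper's updated algorithm), the sketch below splits snake_case and camelCase identifiers while keeping acronyms and digit runs intact:

```python
# Sketch: split identifiers into sub-tokens for documentation models.
import re

def split_identifier(name: str) -> list[str]:
    # Split on underscores first (snake_case), then on camelCase boundaries,
    # keeping acronyms (e.g. "XMLParser" -> ["XML", "Parser"]) and digit runs intact.
    parts = [p for p in name.split("_") if p]
    tokens = []
    for part in parts:
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", part))
    return [t.lower() for t in tokens]

print(split_identifier("parseHTTPResponse_v2"))  # ['parse', 'http', 'response', 'v', '2']
```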
A Review on Source Code Documentation
Rai S., Belwal R.C., Gupta A.
Review, ACM Transactions on Intelligent Systems and Technology, 2022, DOI Link
Context: Coding is an incremental activity in which a developer may need to understand existing code before making suitable changes to it. Code documentation is considered one of the best practices in software development but requires significant effort from developers. Recent advances in natural language processing and machine learning have provided enough motivation to devise automated approaches for source code documentation at multiple levels. Objective: The review aims to study current code documentation practices and analyze the existing literature to provide a perspective on its preparedness to address the stated problem and the challenges that lie ahead. Methodology: We provide a detailed account of the literature in the area of automated source code documentation at different levels and critically analyze the effectiveness of the proposed approaches. This also allows us to infer gaps and challenges in addressing the problem at different levels. Findings: (1) The research community has focused mainly on method-level summarization. (2) Deep learning has dominated the past five years of this research field. (3) Researchers are regularly proposing larger corpora for source code documentation. (4) Java and Python are the most widely used programming languages in these corpora. (5) Bilingual Evaluation Understudy (BLEU) is the most favored evaluation metric among researchers.
Generating class name in sequential manner using convolution attention neural network
Rai S., Belwal R.C., Gupta A.
Article, Expert Systems with Applications, 2022, DOI Link
Software code comprehension depends strongly on identifier names; therefore, software developers spend a lot of time assigning suitable names to identifiers. Manually suggesting a good name is a time-consuming and hard problem for developers. Various techniques have been proposed for automatic identifier name recommendation, with most of the work targeting method name prediction; we found very few research works on class name recommendation. A good class name communicates the class's intent, whereas a bad one creates confusion and frustration in the developer's mind. In this paper, we first analyze an existing class name recommendation approach for dynamically typed languages, in which the nature or behavior of Python classes is represented in quantitative form using the embedding concept for heterogeneous graphs, and these embeddings are then used to suggest class names. This first approach can only suggest existing class names. Therefore, we propose a new approach based on a convolution attention model, in which we generate the class name as a token sequence instead of the whole class name at once. We use two variants of the attention mechanism: simple attention and copy attention. The copy-attention-based model is able to predict out-of-vocabulary tokens during prediction. Experimental results suggest that the convolution attention model can predict accurate class name tokens.
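The abstract notes that copy attention lets the model emit out-of-vocabulary tokens. As a hedged reminder of the general copy/pointer formulation (the standard pointer-generator-style mixture, not necessarily the paper's exact architecture), the distribution over the next name token w mixes a vocabulary distribution with the attention mass placed on matching input tokens:

```latex
P(w) \;=\; p_{\text{gen}}\, P_{\text{vocab}}(w)
      \;+\; \bigl(1 - p_{\text{gen}}\bigr) \sum_{i \,:\, x_i = w} a_i ,
```

where a_i is the attention weight on input token x_i and p_gen in [0, 1] is a learned gate; a token absent from the vocabulary can still receive probability through the second term by being copied from the input.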
Development of web browser prototype with embedded classification capability for mitigating Cross-Site Scripting attacks
Malviya V.K., Rai S., Gupta A.
Article, Applied Soft Computing, 2021, DOI Link
Mitigation of Cross-Site Scripting (XSS) with machine learning techniques has recently attracted the interest of researchers, and a large amount of research work has been reported in this domain. The lack of real-time tools based on these approaches is, however, a gap in this domain. In this work, a web browser that uses machine learning classification to mitigate XSS attacks is developed. This browser classifies webpages into malicious and non-malicious pages using features identified by observing malicious web pages and features collected from other authors' works. Classification experiments are conducted to evaluate the effectiveness of these features, and this approach is found to perform better than other proposed methods in terms of classification accuracy, precision, recall, and F1-score. The browser is implemented with the open-source WebKit engine, and experiments are conducted to assess the overhead created by the added classification functionality. The browser is found to be effective in classifying web pages in real-time browsing scenarios with very little overhead, making it better than other proposed solutions for mitigating XSS attacks. This web browser will be beneficial not only for researchers working in this domain but also for users who may become victims of XSS attacks.
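The abstract does not enumerate the features used. As a hedged sketch of the general pipeline (the feature set and classifier below are illustrative assumptions, not the paper's reported configuration), structural features can be extracted from a page's HTML and fed to a trained classifier before the page is rendered:

```python
# Sketch: extract simple structural features from HTML and classify the page.
from bs4 import BeautifulSoup
from sklearn.ensemble import RandomForestClassifier

def extract_features(html: str) -> list[float]:
    soup = BeautifulSoup(html, "html.parser")
    scripts = soup.find_all("script")
    return [
        float(len(scripts)),                                 # number of <script> tags
        float(len(soup.find_all("iframe"))),                 # number of <iframe> tags
        float(sum("eval(" in (s.string or "") for s in scripts)),  # inline eval() usage
        float(html.lower().count("onerror=")),               # inline event handlers
    ]

# Hypothetical labelled pages: 1 = malicious, 0 = benign.
pages = ["<script>eval(atob('...'))</script>", "<p>hello world</p>"]
labels = [1, 0]

clf = RandomForestClassifier(random_state=0)
clf.fit([extract_features(p) for p in pages], labels)
print(clf.predict([extract_features("<iframe src='x'></iframe><script>eval(x)</script>")]))
```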
Mind Your Tweet: Abusive Tweet Detection
Conference paper, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2021, DOI Link
The abusive post detection problem is more complicated than it seems due to its unseemly, unstructured, noisy data and unpredictable context. The learning performance of neural networks attracts researchers seeking the best-performing output; still, neural networks have some limitations when trained on noisy data. In our work, we propose an approach that combines the assets of both machine learning and neural networks to obtain the optimal result. Our approach achieves an F1 score of 92.79.
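The abstract does not specify how the two model families are combined. A minimal sketch of one common way to do it, a soft-voting ensemble that averages the class probabilities of a linear model and a small neural network (purely an assumption for illustration, not the paper's method), is shown below:

```python
# Sketch: soft-voting ensemble of a classical ML model and a neural network.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

tweets = ["you are pathetic and useless", "great match today, well played"]
labels = [1, 0]   # 1 = abusive, 0 = not abusive (hypothetical)

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr",  LogisticRegression(max_iter=1000)),
            ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
        ],
        voting="soft",   # average predicted probabilities
    ),
)
ensemble.fit(tweets, labels)
print(ensemble.predict(["what a useless performance"]))
```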
Text summarization using topic-based vector space model and semantic measure
Belwal R.C., Rai S., Gupta A.
Article, Information Processing and Management, 2021, DOI Link
The primary shortcoming associated with extractive text summarization is redundancy, where more than one sentence representing similar information is incorporated into the summary. In the last two decades, many extractive text summarization methods have been proposed, but little attention has been paid to the redundancy issue. In this paper, we propose a text summarization technique that incorporates topic modeling and a semantic measure within the vector space model to find an extractive summary of the given text. Our main objective is to address the redundancy problem associated with summarization methods and to include in the summary only those sentences that cover the maximum number of topics embedded in the given text document. We generate the topic vector of the given document by representing the sentences in an intermediate form using a vector space model and topic modeling. Moreover, to make the proposed method efficient, we incorporate a semantic similarity measure to find the relevance of each sentence. We introduce two different ways to create the topic vector from the given document, i.e., the Combined topic vector and the Individual topic vector approach. Evaluation results on two datasets show that the summaries generated by both variants (the Combined and Individual topic vector techniques) of the proposed method are closer to the human-generated summaries than those of the existing text summarization methods.
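As a hedged sketch of the topic-coverage-with-redundancy-control idea (the greedy selection rule and similarity measures below are assumptions for illustration, not the paper's Combined/Individual topic vector procedures), sentences can be picked greedily by balancing similarity to the document's topic vector against similarity to sentences already in the summary:

```python
# Sketch: greedy selection rewarding topic coverage and penalizing redundancy.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The phone's battery easily lasts a full day.",
    "Battery life is excellent even with heavy use.",
    "The display is bright and sharp outdoors.",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(sentences)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
sent_topics = lda.fit_transform(X)                    # sentence-level topic vectors
doc_topic = sent_topics.mean(axis=0, keepdims=True)   # document topic vector

selected, k, lam = [], 2, 0.7
while len(selected) < k:
    best, best_score = None, -np.inf
    for i in range(len(sentences)):
        if i in selected:
            continue
        relevance = cosine_similarity(sent_topics[i:i + 1], doc_topic)[0, 0]
        redundancy = max(
            (cosine_similarity(sent_topics[i:i + 1], sent_topics[j:j + 1])[0, 0]
             for j in selected),
            default=0.0,
        )
        score = lam * relevance - (1 - lam) * redundancy
        if score > best_score:
            best, best_score = i, score
    selected.append(best)
print([sentences[i] for i in sorted(selected)])
```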
A new graph-based extractive text summarization using keywords or topic modeling
Belwal R.C., Rai S., Gupta A.
Article, Journal of Ambient Intelligence and Humanized Computing, 2021, DOI Link
In graph-based extractive text summarization techniques, the weight assigned to the edges of the graph is the crucial parameter for sentence ranking. The weights associated with the edges are based on the similarity between sentences (nodes), and most graph-based techniques use a common-words-based similarity measure to assign them. In this paper, we propose a new graph-based summarization technique which, besides taking into account the similarity among individual sentences, also considers the similarity between the sentences and the overall (input) document. While assigning the weight to an edge of the graph, we consider two attributes. The first attribute is the similarity between the nodes that form the edge. The second attribute is a component that represents how similar the particular edge is to the topics of the overall document, for which we incorporate topic modeling. Along with these modifications, we use a semantic measure to find the similarity among the nodes. The evaluation results of the proposed method demonstrate a significant improvement in summary quality over existing text summarization techniques.
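Read this way, the edge weight combines two terms. A plausible form (the mixing coefficient and the exact similarity functions are assumptions, since the abstract does not give the formula) is:

```latex
w_{ij} \;=\; \lambda \,\operatorname{sim}\!\bigl(s_i, s_j\bigr)
        \;+\; (1 - \lambda)\,\operatorname{sim}\!\bigl(\{s_i, s_j\},\, T_D\bigr),
\qquad 0 \le \lambda \le 1,
```

where sim(s_i, s_j) is the semantic similarity between the two sentences joined by the edge and sim({s_i, s_j}, T_D) measures how well the sentence pair matches the topic distribution T_D of the whole document.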
Development of a plugin based extensible feature extraction framework
Malviya V., Rai S., Gupta A.
Conference paper, Proceedings of the ACM Symposium on Applied Computing, 2018, DOI Link
An important ingredient in the recipe for solving machine learning problems is the availability of a suitable dataset. However, such a dataset may have to be extracted from large unstructured and semi-structured data such as programming code, scripts, and text. In this work, we propose a plug-in based, extensible feature extraction framework, which we have prototyped as a tool. The proposed framework is demonstrated by extracting features from two different sources of semi-structured and unstructured data: the semi-structured data comprised web page and script-based data, whereas the other data was taken from email data for spam filtering. The usefulness of the tool was also assessed in terms of ease of programming.
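As a hedged sketch of what a plug-in based extraction framework can look like (all class and method names below are hypothetical, not those of the prototype tool), each extractor implements a small common interface and registers itself, so new feature sources can be added without touching the core:

```python
# Sketch: minimal plug-in registry for feature extractors.
from abc import ABC, abstractmethod

class FeatureExtractor(ABC):
    """Common interface every feature-extraction plug-in must implement."""
    @abstractmethod
    def extract(self, raw: str) -> dict[str, float]: ...

REGISTRY: list[FeatureExtractor] = []

def register(cls):
    REGISTRY.append(cls())        # instantiate and register the plug-in
    return cls

@register
class ScriptFeatures(FeatureExtractor):
    def extract(self, raw: str) -> dict[str, float]:
        return {"script_tags": float(raw.lower().count("<script"))}

@register
class SpamWordFeatures(FeatureExtractor):
    def extract(self, raw: str) -> dict[str, float]:
        return {"spam_words": float(sum(w in raw.lower() for w in ("free", "winner")))}

def extract_all(raw: str) -> dict[str, float]:
    features: dict[str, float] = {}
    for plugin in REGISTRY:       # every registered plug-in contributes features
        features.update(plugin.extract(raw))
    return features

print(extract_all("Congratulations, you are a winner! <script>...</script>"))
```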
Method Level Text Summarization for Java Code Using Nano-Patterns
Rai S., Gaikwad T., Jain S., Gupta A.
Conference paper, Proceedings - Asia-Pacific Software Engineering Conference, APSEC, 2017, DOI Link
Rapid growth in providing automated solutions has resulted in large code bases being developed and consumed quickly. However, maintaining code and reusing it subsequently pose some challenges. One of the best practices for handling such issues is to provide a suitable text summary of the code so that human developers can comprehend it easily, but producing such summaries can be a time-consuming and costly affair. A few efforts have been made in this direction, where the text summary of the code is generated either from the method signature or from its body. In this paper, we propose a text summarization approach for Java code that identifies code-level nano-patterns to obtain the text summary. The approach also looks for associations between these nano-patterns in a Java method and then uses template-based text generation to obtain the final text summary of the Java method. We evaluated the summaries generated by the proposed approach in a controlled experiment against three other existing approaches. Our results suggest that the summaries generated by our approach were better on the completeness and correctness criteria, and the feedback obtained during the experimental validation provided additional inputs for improving the generated text summaries on the other two criteria as well.
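As a hedged sketch of the template-based generation step (the nano-pattern names and template phrases below are illustrative examples, not the paper's actual pattern set or templates), detected nano-patterns for a method can be mapped to phrases and joined into a summary sentence:

```python
# Sketch: turn detected method-level nano-patterns into a template-based summary.
TEMPLATES = {
    "ObjectCreator": "creates new objects",
    "FieldReader":   "reads object fields",
    "Looping":       "iterates over data",
    "Exceptions":    "may throw exceptions",
}

def summarize(method_name: str, detected_patterns: list[str]) -> str:
    phrases = [TEMPLATES[p] for p in detected_patterns if p in TEMPLATES]
    if not phrases:
        return f"Method '{method_name}' has no recognized nano-patterns."
    body = phrases[0] if len(phrases) == 1 else ", ".join(phrases[:-1]) + " and " + phrases[-1]
    return f"Method '{method_name}' {body}."

print(summarize("loadUsers", ["Looping", "ObjectCreator", "Exceptions"]))
# Method 'loadUsers' iterates over data, creates new objects and may throw exceptions.
```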