The Second Visual Object Tracking Segmentation VOTS2024 Challenge Results
Kristan M., Matas J., Tokmakov P., Felsberg M., Zajc L.C., Lukezic A., Tran K.-T., Vu X.-S., Bjorklund J., Chang H.J., Fernandez G., Attari M., Chan A., Chen L., Chen X., Collins J., Cui Y., Devarapu G.S.M., Du Y., Fan H., Fan W.-C., Feng Z., Gao M., Gorthi R.K.S., Goyal R., Han J., Hatuwal B., He Z., Hu X., Huang X., Huang Y., Jiang D., Kang B., Kannappan P., Kittler J., Lai S., Li N., Li X., Li X., Liang C., Lin L., Ling H., Liu T., Liu Z., Lu H., Luo Y., Miao D., Mogollon J., Pang Z., Pochimireddy J.R., Prutyanov V., Rahmon G., Romanov A., Shi L., Siam M., Sigal L., Sivapuram A.K., Solovyev R., Kazemi E.S., Toubal I.E., Wan J., Wang L., Wang X., Wang Y., Wang Y.-X., Wang Z., Wu G., Wu Q., Wu X., Xia Z., Xie J., Xu C., Xu T., Xu Y., Xue C., Yang C., Yang J., Yang M.-H., Yu C., Yu K., Zhang C., Zhang J., Zhang Z., Zheng F., Zheng Y., Zhong B., Zhou J., Zhou J., Zhou Y., Zhou Z., Zhu G., Zhu J., Zhu X., Zunin V.
Conference paper, Lecture Notes in Computer Science, 2025, DOI Link
The Visual Object Tracking Segmentation VOTS2024 challenge is the twelfth annual tracker benchmarking activity of the VOT initiative. The challenge consolidates the new tracking setup proposed in VOTS2023, which merges short-term and long-term, as well as single-target and multiple-target, tracking, with segmentation masks as the only target location specification. Two sub-challenges are considered: the VOTS2024 standard challenge, which focuses on classical objects, and VOTSt2024, which considers objects undergoing topological transformations. Both challenges use the same performance evaluation methodology. Results of 28 submissions are presented and analyzed. A leaderboard, the participating trackers' details, the source code, the datasets, and the evaluation kit are publicly available on the challenge website (https://www.votchallenge.net/vots2024/).
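Since both sub-challenges specify the target by segmentation masks alone, per-frame mask overlap is the basic ingredient of any such evaluation. Below is a minimal sketch of a mean-IoU sequence accuracy; it is an illustration only, not the official VOTS quality measure, which additionally handles target absence and multiple targets.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two boolean segmentation masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0  # both masks empty: perfect

def sequence_accuracy(pred_masks, gt_masks) -> float:
    """Mean per-frame mask IoU over one sequence (illustrative only;
    the official VOTS measure also scores target-absence prediction)."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
```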
SA-LfV: self-annotated labeling from videos for object detection
Sivapuram A.K., Komuravelli P., Gorthi R.K.S.
Article, Machine Learning, 2025, DOI Link
In the realm of object detection, the remarkable strides made by deep neural networks over the past decade have been hampered by challenges such as data labeling and the need to capture natural variations in training samples. Existing benchmark datasets are confined to a limited set of classes and natural variations. This paper presents "SA-LfV", a novel framework designed to streamline object detection from videos with minimal human input. By utilizing basic computer vision tasks, such as image classification and single-object tracking, our method efficiently generates pseudo-labels for object detection. To ensure a rich variety of training samples, we introduce two innovative sampling strategies. The first applies density-based clustering, choosing samples that represent a wide range of scenarios. The second analyzes object movements and their mutual information, capturing diverse behaviors and appearances. The proposed object detection data labeling procedure is demonstrated on object-tracking datasets and custom-downloaded videos. Through these methods, our framework has produced a dataset with 70,000 pseudo-labeled bounding boxes across 13 object classes, significantly diversifying the data available for object detection tasks. Our experiments show that the proposed framework can effectively adapt to unlabelled ImageNet classes, indicating its potential to broaden the capabilities of object detection models. Moreover, integrating our self-annotated dataset with standard benchmark datasets leads to a notable improvement in object detection performance. This approach not only simplifies the traditionally labor-intensive process of manual labeling but also paves the way for extending object detection to a wider range of classes and applications.
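The abstract does not spell out the clustering configuration, so the sketch below illustrates the first sampling strategy with a generic recipe: DBSCAN over per-frame feature embeddings, keeping one representative frame per density cluster. The feature source and the eps and min_samples values are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_diverse_frames(embeddings: np.ndarray, eps: float = 0.5,
                          min_samples: int = 5) -> list[int]:
    """Pick one representative frame index per density cluster.

    `embeddings` is an (n_frames, d) array of per-frame features;
    the feature extractor and eps/min_samples are illustrative
    choices, not the paper's exact configuration.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    selected = []
    for label in set(labels) - {-1}:  # -1 marks DBSCAN noise points
        members = np.flatnonzero(labels == label)
        centroid = embeddings[members].mean(axis=0)
        # keep the member closest to the cluster centroid
        dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return selected
```

Selecting one frame per cluster keeps the pseudo-labeled set small while still spanning the distinct scenarios the clustering discovers.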
Towards Accurate Disease Segmentation in Plant Images: A Comprehensive Dataset Creation and Network Evaluation
Prashanth K., Harsha J.S., Kumar S.A., Srilekha J.
Conference paper, Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024, 2024, DOI Link
Automated disease segmentation in plant images plays a crucial role in identifying and mitigating the impact of plant diseases on agricultural productivity. In this study, we address the problem of Northern Leaf Blight (NLB) disease segmentation in maize plants. We present a comprehensive dataset of 1000 plant images annotated with NLB disease regions. We employ the Mask R-CNN and Cascaded Mask R-CNN models with various backbone architectures to perform NLB disease segmentation. The experimental results demonstrate the effectiveness of the models in accurately delineating NLB disease regions. Specifically, the ResNet Strikes Back-50 backbone architecture achieves the highest mean average precision (mAP) score, indicating its ability to capture intricate details of NLB disease spots. Additionally, the cascaded approach enhances segmentation accuracy compared to the single-stage Mask R-CNN models. Our findings provide valuable insights into the performance of different backbone architectures and contribute to the development of automated NLB disease segmentation methods in plant images. The generated dataset and experimental results serve as a resource for further research in plant disease segmentation and management.
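As a point of reference for the model family used, here is a minimal Mask R-CNN inference sketch with torchvision's stock ResNet-50 FPN backbone; the paper's ResNet Strikes Back backbones, NLB fine-tuning, and cascaded variant are not reproduced here, and the image path and 0.5 thresholds are placeholders.

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

# Stock COCO-pretrained Mask R-CNN; the paper fine-tunes on its own
# 1000-image NLB dataset with stronger backbones.
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

# "maize_leaf.jpg" is a placeholder path
image = convert_image_dtype(read_image("maize_leaf.jpg"), torch.float)
with torch.no_grad():
    output = model([image])[0]

# Keep confident instance masks (both 0.5 thresholds are illustrative)
keep = output["scores"] > 0.5
masks = output["masks"][keep] > 0.5  # (N, 1, H, W) boolean masks
```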
VISAL—A novel learning strategy to address class imbalance
S. S.R.V., Sivapuram A.K., Ravi V., Senthil G., Gorthi R.K.
Article, Neural Networks, 2023, DOI Link
In imbalanced data scenarios, Deep Neural Networks (DNNs) fail to generalize well on minority classes. In this letter, we propose a simple and effective learning function, Visually Interpretable Space Adjustment Learning (VISAL), to handle the imbalanced data classification task. VISAL's objective is to create more room for the generalization of minority-class samples by bringing both angular and Euclidean margins into the cross-entropy learning strategy. When evaluated on imbalanced versions of the CIFAR, Tiny ImageNet, COVIDx and IMDB reviews datasets, our proposed method outperforms state-of-the-art works by a significant margin.
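The abstract does not give VISAL's exact formulation, but the general pattern of folding a margin into cross-entropy can be sketched as follows: cosine logits with an additive angular margin on the true class (ArcFace-style). The omitted Euclidean-margin term and all hyperparameters here are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginCrossEntropy(nn.Module):
    """Cross-entropy over cosine logits with an additive angular margin.

    A generic sketch of the margin-into-cross-entropy pattern the
    VISAL abstract describes; the paper's exact angular + Euclidean
    margin formulation and hyperparameters are not reproduced here.
    """
    def __init__(self, in_features: int, num_classes: int,
                 margin: float = 0.3, scale: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))
        self.margin, self.scale = margin, scale

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cosine similarity between L2-normalized features and class weights
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin only to each sample's true-class logit
        onehot = F.one_hot(labels, cos.size(1)).bool()
        cos_m = torch.cos(theta + self.margin)
        logits = self.scale * torch.where(onehot, cos_m, cos)
        return F.cross_entropy(logits, labels)
```

For imbalanced data, the margin could be made class-dependent, larger for minority classes, to reserve the extra room for generalization that VISAL's objective describes.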
Depth camera based dataset of hand gestures
Jeeru S., Sivapuram A.K., Leon D.G., Groli J., Yeduri S.R., Cenkeramaddi L.R.
Data Paper, Data in Brief, 2022, DOI Link
The dataset contains RGB and depth video frames of various hand movements captured with the Intel RealSense Depth Camera D435. The camera has two channels, collecting RGB and depth frames at the same time. A large dataset was created for accurate classification of hand gestures against complex backgrounds. The dataset is made up of 29718 RGB and depth frames corresponding to various hand gestures, collected from different people at different time instances against complex backgrounds. Hand movements corresponding to scroll-right, scroll-left, scroll-up, scroll-down, zoom-in, and zoom-out are included. Each sequence contains 40 frames, and there are a total of 662 sequences for each gesture in the dataset. To capture all variations, the hand is oriented in various ways during capture.
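For readers reproducing the capture setup, the following sketch records a synchronized RGB + depth sequence with Intel's pyrealsense2 SDK; the 640x480 at 30 fps stream configuration is an assumption, not necessarily the settings used to record the dataset.

```python
import numpy as np
import pyrealsense2 as rs

# Capture synchronized RGB + depth frames from a RealSense D435
# (640x480 @ 30 fps is an illustrative configuration).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

try:
    for _ in range(40):  # one 40-frame gesture sequence, as in the dataset
        frames = pipeline.wait_for_frames()
        depth = np.asanyarray(frames.get_depth_frame().get_data())
        color = np.asanyarray(frames.get_color_frame().get_data())
        # ... append (color, depth) to the current gesture sequence ...
finally:
    pipeline.stop()
```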