Abstract
Recent advances in computer vision have significantly improved the performance of deep learning-based object localizers. However, these localizers rely on large-scale annotated datasets, and the high cost of annotation limits their use in a wide range of industrial applications. To address this, ÉTS researchers have developed Discriminative Pseudo-Label Sampling (DiPS), a novel approach that leverages self-supervised vision transformers to enhance object localization while eliminating the need for dense annotation of the training data.
High Costs of Annotation
Deep learning models have been widely used for object localization. However, they require dense supervision, meaning that every object in an image must be carefully labelled by a human annotator so that it can be assigned to the proper category. This process is not only time-consuming but also expensive, which limits the scalability of these methods. Weakly supervised object localization (WSOL) techniques are designed to reduce annotation costs by using only image-level class labels, which indicate the most dominant object in the image. However, these methods often struggle to localize objects accurately because they have no access to precise location annotations. The contrast between the two forms of supervision is illustrated below.
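To make that contrast concrete, here is a toy illustration of the two kinds of annotation. The field names and the bounding-box format are assumptions made for this example only; they are not taken from the paper or from any particular dataset.

```python
# Illustrative contrast (assumed formats, not from the paper) between the dense
# supervision a fully supervised localizer needs and the weak supervision WSOL uses.
dense_annotation = {              # one entry per object instance: costly to produce
    "image": "bird_0042.jpg",
    "objects": [
        {"class": "Indigo Bunting", "bbox": [34, 58, 210, 190]},  # pixel coordinates
    ],
}
weak_annotation = {               # a single image-level class label: cheap to collect
    "image": "bird_0042.jpg",
    "label": "Indigo Bunting",
}
```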
DiPS: A New Approach
DiPS introduces a novel deep-learning framework designed to train object localizers using pseudo-annotations derived from self-supervised transformers, eliminating the need for manual annotation. The pseudo-annotations, along with the image-level class labels, are used to train the model, which consists of a transformer backbone with two heads: one generating localization maps and the other producing classification scores.
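The sketch below illustrates this two-headed design. It is a minimal, hypothetical PyTorch module rather than the authors' implementation: the backbone is assumed to be a vision transformer that returns a [CLS] token followed by patch tokens, and the embedding size, number of classes, and patch-grid size are placeholder values.

```python
# Minimal sketch (not the authors' code) of a two-headed localizer: a transformer
# backbone produces tokens, one head turns patch tokens into a localization map,
# the other turns the [CLS] token into class scores.
import torch
import torch.nn as nn

class TwoHeadLocalizer(nn.Module):
    def __init__(self, backbone, embed_dim=384, num_classes=200, grid=14):
        super().__init__()
        self.backbone = backbone          # assumed: a self-supervised ViT encoder
        self.grid = grid                  # e.g., 14x14 patches for 224x224 images, patch size 16
        # Localization head: one score per patch token, later reshaped into a map
        self.loc_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, 1),
        )
        # Classification head: operates on the [CLS] token
        self.cls_head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.backbone(x)                  # (B, 1 + N, D): [CLS] token + N patch tokens
        cls_tok, patch_tok = tokens[:, 0], tokens[:, 1:]
        class_scores = self.cls_head(cls_tok)      # (B, num_classes)
        patch_scores = self.loc_head(patch_tok)    # (B, N, 1)
        loc_map = patch_scores.transpose(1, 2).reshape(-1, 1, self.grid, self.grid)
        loc_map = torch.sigmoid(loc_map)           # (B, 1, grid, grid) coarse localization map
        return loc_map, class_scores
```

In a setup like this, the classification head can be supervised with the available image-level labels, while the coarse localization map is upsampled to image resolution and supervised with the sampled pseudo-pixels described below.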
The core of DiPS is the generation of pseudo-annotations. To produce these pseudo-ground-truth annotations, we first extract attention maps from the self-supervised transformer. Because a single attention map may cover several different objects, we use these maps to perturb the images, hiding regions that do not belong to the object of interest. The perturbed images are then fed into a pre-trained classifier, and the maps that yield the highest classifier scores are selected as pseudo-labels. From these top-N maps, a representative map is chosen, and foreground and background pixels are sampled from it based on its activation values. These pseudo-pixels are used to train the localizer network, helping it identify the object. Learning from this limited set of pseudo-pixel annotations encourages the network to explore different parts of the object, and the whole procedure removes the need for manual annotation by human experts, enabling weakly supervised training. A simplified sketch of the sampling procedure is shown below.
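The following sketch condenses the sampling steps just described. It is a hedged approximation, not the authors' code: `ssl_vit_attention` and `classifier` are assumed callables (a self-supervised ViT returning per-head attention maps resized to image resolution, and a pre-trained classifier), and the masking rule, the choice of the highest-scoring map as the representative, and the quantile-based pixel sampling are illustrative simplifications.

```python
import torch

def generate_pseudo_pixels(image, ssl_vit_attention, classifier, target_class,
                           top_n=5, fg_quantile=0.9, bg_quantile=0.1):
    """image: (1, 3, H, W) tensor; returns sparse foreground/background pixel masks."""
    # 1) Attention maps from the self-supervised transformer, one per head: (heads, H, W).
    attn_maps = ssl_vit_attention(image)

    # 2) Perturb the image with each map (hide low-attention regions) and score
    #    the perturbed images with the pre-trained classifier.
    scores = []
    for m in attn_maps:
        keep = (m > m.mean()).float()                 # keep only salient regions
        perturbed = image * keep[None, None, :, :]
        with torch.no_grad():
            logits = classifier(perturbed)
        scores.append(logits.softmax(dim=-1)[0, target_class])

    # 3) Keep the top-N maps with the highest classifier confidence and pick the
    #    best-scoring one as the representative map (an illustrative simplification).
    top_idx = torch.stack(scores).topk(min(top_n, len(scores))).indices
    rep_map = attn_maps[top_idx[0]]

    # 4) Sample pseudo-pixels: strong activations -> foreground, weak -> background.
    fg = rep_map >= torch.quantile(rep_map, fg_quantile)
    bg = rep_map <= torch.quantile(rep_map, bg_quantile)
    return fg, bg   # sparse pixel-level pseudo-labels for training the localizer
```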
Results
DiPS was tested on several challenging datasets, including ILSVRC, OpenImages, and CUB-200-2011, achieving superior localization performance compared to existing methods. Furthermore, the proposed model tends to cover the full extent of the object, unlike alternative methods. For instance, DiPS outperformed state-of-the-art models on the CUB dataset, as shown in the image below. For a detailed analysis, please refer to the published article.
Additional Information
For more information on this research, please read the following paper: Murtaza, Shakeeb, et al. "DiPS: Discriminative pseudo-label sampling with self-supervised transformers for weakly supervised object localization." Image and Vision Computing 140 (2023): 104838.