Abstract
Recent advances in computer vision have significantly improved the performance of deep learning-based object localizers. However, these localizers rely on large-scale annotated datasets, and the high cost of annotation limits their use in a wide range of industrial applications. To address this, ÉTS researchers have developed Discriminative Pseudo-Label Sampling (DiPS), a novel approach that leverages self-supervised vision transformers to enhance object localization while eliminating the need for dense annotation of the training data.
High Costs of Annotation
Deep learning models have been widely used for object localization. However, they require dense supervision, meaning that every object in an image must be carefully labelled by a human annotator so that it can be assigned to the proper category. This process is not only time-consuming but also expensive, which limits the scalability of these methods. Weakly supervised object localization (WSOL) techniques are designed to reduce annotation costs by using only image-level class labels, which indicate the most dominant object in the image. However, these methods often struggle to localize objects accurately because they have no access to precise location annotations. The contrast between the two forms of supervision is illustrated below.
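To make that contrast concrete, here is a toy illustration of the two kinds of annotation. The field names and the bounding-box format are assumptions made for this example only; they are not taken from the paper or from any particular dataset.

```python
# Illustrative contrast (assumed formats, not from the paper) between the dense
# supervision a fully supervised localizer needs and the weak supervision WSOL uses.
dense_annotation = {              # one entry per object instance: costly to produce
    "image": "bird_0042.jpg",
    "objects": [
        {"class": "Indigo Bunting", "bbox": [34, 58, 210, 190]},  # pixel coordinates
    ],
}
weak_annotation = {               # a single image-level class label: cheap to collect
    "image": "bird_0042.jpg",
    "label": "Indigo Bunting",
}
```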
DiPS: A New Approach
DiPS introduces a novel deep-learning framework designed to train object localizers using pseudo-annotations derived from self-supervised transformers, eliminating the need for manual annotation. The pseudo-annotations, along with the image-level class labels, are used to train the model, which consists of a transformer backbone with two heads: one generating localization maps and the other producing classification scores.
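The sketch below illustrates this two-headed design. It is a minimal, hypothetical PyTorch module rather than the authors' implementation: the backbone is assumed to be a vision transformer that returns a [CLS] token followed by patch tokens, and the embedding size, number of classes, and patch-grid size are placeholder values.

```python
# Minimal sketch (not the authors' code) of a two-headed localizer: a transformer
# backbone produces tokens, one head turns patch tokens into a localization map,
# the other turns the [CLS] token into class scores.
import torch
import torch.nn as nn

class TwoHeadLocalizer(nn.Module):
    def __init__(self, backbone, embed_dim=384, num_classes=200, grid=14):
        super().__init__()
        self.backbone = backbone          # assumed: a self-supervised ViT encoder
        self.grid = grid                  # e.g., 14x14 patches for 224x224 images, patch size 16
        # Localization head: one score per patch token, later reshaped into a map
        self.loc_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, 1),
        )
        # Classification head: operates on the [CLS] token
        self.cls_head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.backbone(x)                  # (B, 1 + N, D): [CLS] token + N patch tokens
        cls_tok, patch_tok = tokens[:, 0], tokens[:, 1:]
        class_scores = self.cls_head(cls_tok)      # (B, num_classes)
        patch_scores = self.loc_head(patch_tok)    # (B, N, 1)
        loc_map = patch_scores.transpose(1, 2).reshape(-1, 1, self.grid, self.grid)
        loc_map = torch.sigmoid(loc_map)           # (B, 1, grid, grid) coarse localization map
        return loc_map, class_scores
```

In a setup like this, the classification head can be supervised with the available image-level labels, while the coarse localization map is upsampled to image resolution and supervised with the sampled pseudo-pixels described below.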
The core of DiPS is the generation of pseudo-annotations. To produce these pseudo-ground-truth annotations, we first extract attention maps from the self-supervised transformer. Because a single attention map may cover several different objects, we use these maps to perturb the images, hiding regions that do not belong to the object of interest. The perturbed images are then fed into a pre-trained classifier, and the maps that yield the highest classifier scores are selected as pseudo-labels. From these top-N maps, a representative map is chosen, and foreground and background pixels are sampled from it based on its activation values. These pseudo-pixels are used to train the localizer network, helping it identify the object. Learning from this limited set of pseudo-pixel annotations encourages the network to explore different parts of the object, and the whole procedure removes the need for manual annotation by human experts, enabling weakly supervised training. A simplified sketch of the sampling procedure is shown below.
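The following sketch condenses the sampling steps just described. It is a hedged approximation, not the authors' code: `ssl_vit_attention` and `classifier` are assumed callables (a self-supervised ViT returning per-head attention maps resized to image resolution, and a pre-trained classifier), and the masking rule, the choice of the highest-scoring map as the representative, and the quantile-based pixel sampling are illustrative simplifications.

```python
import torch

def generate_pseudo_pixels(image, ssl_vit_attention, classifier, target_class,
                           top_n=5, fg_quantile=0.9, bg_quantile=0.1):
    """image: (1, 3, H, W) tensor; returns sparse foreground/background pixel masks."""
    # 1) Attention maps from the self-supervised transformer, one per head: (heads, H, W).
    attn_maps = ssl_vit_attention(image)

    # 2) Perturb the image with each map (hide low-attention regions) and score
    #    the perturbed images with the pre-trained classifier.
    scores = []
    for m in attn_maps:
        keep = (m > m.mean()).float()                 # keep only salient regions
        perturbed = image * keep[None, None, :, :]
        with torch.no_grad():
            logits = classifier(perturbed)
        scores.append(logits.softmax(dim=-1)[0, target_class])

    # 3) Keep the top-N maps with the highest classifier confidence and pick the
    #    best-scoring one as the representative map (an illustrative simplification).
    top_idx = torch.stack(scores).topk(min(top_n, len(scores))).indices
    rep_map = attn_maps[top_idx[0]]

    # 4) Sample pseudo-pixels: strong activations -> foreground, weak -> background.
    fg = rep_map >= torch.quantile(rep_map, fg_quantile)
    bg = rep_map <= torch.quantile(rep_map, bg_quantile)
    return fg, bg   # sparse pixel-level pseudo-labels for training the localizer
```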
Results
DiPS was tested on several challenging datasets, including ILSVRC, OpenImages, and CUB-200-2011, achieving superior localization performance compared to existing methods. Furthermore, the proposed model tends to cover the full extent of the object, unlike alternative methods. For instance, DiPS outperformed state-of-the-art models on the CUB dataset, as shown in the image below. For a detailed analysis, please refer to the published article.
Additional Information
For more information on this research, please read the following paper: Murtaza, Shakeeb, et al. "DiPS: Discriminative pseudo-label sampling with self-supervised transformers for weakly supervised object localization." Image and Vision Computing 140 (2023): 104838.