18 January 2023

Information Technology Engineering Research and Innovation Intelligent and Autonomous Systems SYNCHROMEDIA – Multimedia Communication in Telepresence

Detecting Tables with Weakly Supervised Bounding Box Extraction


Tables in Ancient Manuscripts — A Wealth of Information

Historic documents record long-term studies in a wide range of research areas. Because these documents are scarce, the information they hold is at risk of decay and irretrievable loss. To preserve and retrieve some of the most important parts of this vast body of information, we focused on detecting the document pages that contain tables.

These graphical elements give scientists essential information in a condensed format. Finding them is an object detection task, a field that has progressed rapidly with the advent of deep-learning algorithms. One of these algorithms is Faster-RCNN [1], which we combined with a Gabor pre-processing filter [2], weakly supervised bounding box extraction [3], and pseudo-labeling to address the following challenges:

Generalizing well enough to find the images that contain tables among 32 million images
Detecting tables with various structures (figure 1)
Training deep learning algorithms with insufficient labeled data

Figure 1. Samples of tables in historic documents

Applying a Gabor filter

In the first step of our system design, we applied the Gabor filter to:

Make the data set more compatible with a Faster-RCNN-based framework.
Better discriminate the target object (table) from other parts of an image by exaggerating the gap, or white background, between text and tables.
Remove visual noise, such as ink stains.

Figure 2 shows an image preprocessed with the Gabor filter.

Figure 2. Processed image with Gabor filter
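
To make the preprocessing step more concrete, here is a minimal pure-Python sketch of a Gabor filter. It is an illustration only, not the implementation used in this research: the article does not report the filter parameters, so the kernel size, sigma, theta, lambd, gamma, and psi values below are assumptions, and the function names gabor_kernel and filter_image are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def gabor_kernel(size: int, sigma: float, theta: float, lambd: float,
                 gamma: float = 0.5, psi: float = 0.0) -> Matrix:
    """Return the real part of a size x size Gabor kernel (size must be odd).

    All parameter values are illustrative assumptions:
    sigma = width of the Gaussian envelope, theta = orientation in radians,
    lambd = wavelength of the cosine carrier, gamma = aspect ratio,
    psi = phase offset.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_image(image: Matrix, kernel: Matrix) -> Matrix:
    """Cross-correlate a grayscale image with the kernel (zero padding, same size)."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        acc += image[ii][jj] * kernel[di + half][dj + half]
            out[i][j] = acc
    return out
```

In practice a scanned page would be filtered with an optimized library routine rather than nested loops; the sketch only shows what the filter computes: a Gaussian envelope multiplied by an oriented cosine wave, which responds strongly to regular stroke and ruling patterns at the chosen orientation.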

Terms and Definitions

In this research, we used two sources of scanned historic documents as follows:

ECCO: Eighteenth-Century Collections Online is an enormous collection of historic documents with over 32 million pages. Based on when the data were collected, ECCO is divided into ECCO1 and ECCO2.
NAS: This data set contains around 0.5 million scanned document images and covers a longer time period than ECCO (1666 to 1916).

For this binary detection task, we defined two labels:

Table: Presentation of important data in text or numerical format in rows and columns to summarize information in a compact manner.
Non-table: All scanned document images without tables, such as diagrams, illustrations, maps, and images either on a blank page or on a page with text (figure 3).

Figure 3. Samples of non-tables in historic documents

Faster-RCNN

Given our data sets and the characteristics of Faster-RCNN, we chose it as the main object detection module in our research for the following reasons:

Better performance on low-resolution images
Detection of both large and small objects
One of the best balances between speed and accuracy

Weakly Supervised Bounding Box Extraction

A Faster-RCNN-based model must be trained with enough labeled data, including bounding boxes around the objects, to perform properly. But manually labeling data and extracting bounding boxes are costly procedures. To solve this issue, we introduced the weakly supervised bounding box extraction technique (figure 4), an automatic spiral learning approach. It consists of the following five phases:

Phase 1: Train the model on the table label only, which biases it toward tables.
Phase 2: Test the biased model on non-table images. Output: weak bounding boxes for non-tables.
Phase 3: Train with two labels, i.e., tables with accurate bounding boxes and non-tables with weak bounding boxes.
Phase 4: Pseudo-labeling. Test on unlabeled data to augment the training set.
Phase 5: Retrain the model with the data added in the previous phase.

Figure 4. Weakly supervised bounding box extraction
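
The five phases above can be sketched as a short training loop. This is a simplified illustration under stated assumptions, not the authors' code: train and predict are hypothetical callables standing in for a Faster-RCNN training and inference routine, the tuple layouts are invented for the example, and the min_conf threshold used to accept pseudo-labels is an assumption (the article does not say how pseudo-labels are selected).

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max)
Sample = Tuple[str, str, Box]          # (image_id, label, box)
Prediction = Tuple[str, Box, float]    # (label, box, confidence)


def spiral_training(
    tables: List[Sample],
    non_table_ids: List[str],
    unlabeled_ids: List[str],
    train: Callable[[List[Sample]], object],                   # hypothetical trainer
    predict: Callable[[object, str], Optional[Prediction]],    # hypothetical detector
    min_conf: float = 0.9,                                     # assumed threshold
):
    # Phase 1: train on tables only, which biases the model toward tables.
    model = train(tables)

    # Phase 2: run the biased model on known non-table images. Whatever box
    # it proposes is kept as a weak bounding box and relabeled "non-table".
    weak = []
    for image_id in non_table_ids:
        pred = predict(model, image_id)
        if pred is not None:
            weak.append((image_id, "non-table", pred[1]))

    # Phase 3: train with both labels (accurate table boxes + weak non-table boxes).
    train_set = tables + weak
    model = train(train_set)

    # Phase 4: pseudo-labeling. Keep confident predictions on unlabeled images.
    pseudo = []
    for image_id in unlabeled_ids:
        pred = predict(model, image_id)
        if pred is not None and pred[2] >= min_conf:
            pseudo.append((image_id, pred[0], pred[1]))

    # Phase 5: retrain on the augmented training set.
    train_set = train_set + pseudo
    model = train(train_set)
    return model, train_set
```

The point of the sketch is the data flow: no non-table bounding box is drawn by hand, because the boxes come from the model's own mistakes on images already known to contain no table.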

Results

We compared the Faster-RCNN-based model with and without weakly supervised bounding box extraction on subsets of the ECCO (a mix of ECCO1 and ECCO2) and NAS data sets:

Table 1. Results of the Faster-RCNN-based model with and without weakly supervised bounding box extraction on the subset of the ECCO data set

Table 2. Results of the Faster-RCNN-based model with and without weakly supervised bounding box extraction on the subset of the NAS data set

To detect all images with tables, we applied our model to three different data sets, which include 32 million images in total (figure 5).


Figure 5. Results of our model
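
Applying a detector to millions of pages ends with a page-level decision: does this image contain a table or not? The sketch below shows one plausible way to make that decision from per-page detections. The rule (at least one "table" detection above a confidence threshold), the 0.5 default, and the names page_has_table and select_table_pages are assumptions for illustration; the article does not specify the rule that was used.

```python
from typing import Dict, List, Tuple

Detection = Tuple[str, float]  # (label, confidence) for one predicted box


def page_has_table(detections: List[Detection], min_conf: float = 0.5) -> bool:
    """Assumed rule: a page counts as a table page if any 'table' box is confident enough."""
    return any(label == "table" and conf >= min_conf for label, conf in detections)


def select_table_pages(pages: Dict[str, List[Detection]], min_conf: float = 0.5) -> List[str]:
    """Return the ids of pages judged to contain a table, in input order."""
    return [page_id for page_id, dets in pages.items() if page_has_table(dets, min_conf)]


# Usage with made-up detections:
pages = {
    "p1": [("table", 0.9)],
    "p2": [("non-table", 0.99)],
    "p3": [("table", 0.3), ("non-table", 0.8)],
    "p4": [],
}
table_pages = select_table_pages(pages)
```

Raising min_conf trades recall for precision, which matters when the non-table class is far larger than the table class.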

Conclusion

By taking advantage of the Gabor filter and weakly supervised bounding box extraction, we prepared better input data and enough bounding boxes around the target objects for the training phase, which led to high performance at low cost. The methodology is also general and robust enough to detect tables with various layouts among 32 million scanned historical document images.

The high labor cost of extracting bounding boxes and the difficulty of performing reliably on unbalanced data sets are two common challenges in machine learning tasks. We addressed both with a spiral learning approach based on the weakly supervised bounding box extraction technique.

Additional Information

For more information on this research, please read the following research paper:

Samari, A., Piper, A., Hedley, A., Cheriet, M. (2021). Weakly Supervised Bounding Box Extraction for Unlabeled Data in Table Detection. In: Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12667. Springer, Cham. https://doi.org/10.1007/978-3-030-68787-8_25
