Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval

Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis

ICDAR 2015 "Best Student Paper"

Abstract

This paper presents a new state of the art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs). In object and scene analysis, deep neural nets are capable of learning a hierarchical chain of abstraction from pixel inputs to concise and descriptive representations. The current work explores this capacity in the realm of document analysis, and confirms that this representation strategy is superior to a variety of popular handcrafted alternatives. Extensive experiments show that (i) features extracted from CNNs are robust to compression, (ii) CNNs trained on non-document images transfer well to document analysis tasks, and (iii) enforcing region-specific feature-learning is unnecessary given sufficient training data. This work also makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories.

Paper

Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval

Citation

A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015

BibTeX format:

@inproceedings{harley2015icdar,
    title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
    author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
    booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
    year = {2015}
}

Dataset

This paper introduced the RVL-CDIP dataset.

Caffe models and setup files

Here are all of the files related to the "holistic CNN" featured in the paper (fine-tuned from ImageNet on RVL-CDIP).

File                                 Size                   md5sum
create_docnet.sh                     71 B
make_docnet_mean.sh                  126 B
docnet_mean.binaryproto              618362 B (604 KB)
train_docnet.sh                      162 B
test_docnet.sh                       245 B
docnet_train_val.prototxt            4973 B (4.9 KB)
docnet_test.prototxt                 4974 B (4.9 KB)
docnet_solver.prototxt               749 B
caffe_reference_imagenet_model       243862418 B (233 MB)   af678f0bd3cdd2437e35679d88665170
docnet_train_iter_50000.caffemodel   227736621 B (218 MB)   9c96dac588b7b35447c099cec1a08e4d

Here are the Caffe models for the region-trained CNNs. The setup files for these CNNs are nearly identical to those used for the holistic CNN.

File                   Size                   md5sum
header.caffemodel      227736621 B (218 MB)   e3844e41c9816270727517c72ef37f33
bodyLeft.caffemodel    227736621 B (218 MB)   84237765cbc994df6f3007d72328daf3
bodyRight.caffemodel   227736621 B (218 MB)   83b51bb09efab1552471e00145b678be
footer.caffemodel      227736621 B (218 MB)   6511b69549eee9d6dee84b444b40639d
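
The md5sum values listed for the model files can be used to check that a download completed without corruption. The following is a minimal sketch of such a check, not part of the released setup files. It assumes the files have been saved into a single directory under the names listed above; the names EXPECTED_MD5, md5_of_file, and verify_downloads are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listings on this page (file name -> md5sum).
EXPECTED_MD5 = {
    "caffe_reference_imagenet_model": "af678f0bd3cdd2437e35679d88665170",
    "docnet_train_iter_50000.caffemodel": "9c96dac588b7b35447c099cec1a08e4d",
    "header.caffemodel": "e3844e41c9816270727517c72ef37f33",
    "bodyLeft.caffemodel": "84237765cbc994df6f3007d72328daf3",
    "bodyRight.caffemodel": "83b51bb09efab1552471e00145b678be",
    "footer.caffemodel": "6511b69549eee9d6dee84b444b40639d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads avoid loading a 200+ MB model into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name  # assumed location of the download
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    for name, status in verify_downloads().items():
        print(f"{name}: {status}")
```

Run from the download directory, the script prints one status line per model file. A "mismatch" suggests the file should be downloaded again.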

Acknowledgements

This work was supported by NSERC Discovery and Engage grants (held by K.G.D.) and an NSERC USRA (awarded to A.W.H.). The authors thank Palomino System Innovations Inc. for posing the problem, providing data, and helpful discussions. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research.