The RVL-CDIP Dataset

Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.

Here are the classes in the dataset:

letter
memo
email
file folder
form
handwritten
invoice
advertisement
budget
news article
presentation
scientific publication
questionnaire
resume
scientific report
specification

This dataset is a subset of the IIT-CDIP Test Collection 1.0 [1], which is publicly available here. The file structure of this dataset is the same as in the IIT collection, so it is possible to refer to that dataset for OCR and additional metadata. The IIT-CDIP dataset is itself a subset of the Legacy Tobacco Document Library [2].

Download

Update (April 2022): The RVL-CDIP Dataset is now on the HuggingFace Datasets Library! The data should be very convenient to access from there. As a backup, we maintain the Google Drive links and information below.

File                 Size                  md5sum
rvl-cdip.tar.gz      38762320458B (37GB)   d641dd4866145316a1ed628b420d8b6c
labels_only.tar.gz   6359157B (6.1MB)      9d22cb1eea526a806de8f492baaa2a57
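
A downloaded archive can be checked against the md5sums listed above. The following is a minimal sketch using only the Python standard library; the helper names md5_of_file and verify_download are illustrative and not part of any official tooling, and the local paths passed to them are whatever you saved the files as.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "rvl-cdip.tar.gz": "d641dd4866145316a1ed628b420d8b6c",
    "labels_only.tar.gz": "9d22cb1eea526a806de8f492baaa2a57",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that the 37GB archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at `path` matches the listed checksum for `name`.

    `name` must be one of the keys of EXPECTED_MD5; `path` is the
    (assumed) local location of the downloaded file.
    """
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, verify_download("rvl-cdip.tar.gz", "rvl-cdip.tar.gz") should return True for an intact download saved under its original name in the current directory.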

Details

The label files list the images and their categories in the following format:

path/to/the/image.tif category

where the categories are numbered 0 to 15, in the following order:

  0. letter
  1. form
  2. email
  3. handwritten
  4. advertisement
  5. scientific report
  6. scientific publication
  7. specification
  8. file folder
  9. news article
  10. budget
  11. invoice
  12. presentation
  13. questionnaire
  14. resume
  15. memo
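
The label format described above can be read with a few lines of Python. This is a sketch under the assumption that each label file is plain text with one "path category" pair per line, separated by whitespace; the names CLASS_NAMES, parse_label_line, and read_labels are illustrative. The class names are listed in the category order given above, so that index 0 is letter and index 15 is memo.

```python
# Class names in category order (index 0 = letter, ..., index 15 = memo).
CLASS_NAMES = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]


def parse_label_line(line):
    """Split one 'path category' line into (path, category index, class name).

    Splitting from the right keeps the path intact even if it
    happens to contain spaces.
    """
    path, category = line.rsplit(None, 1)
    index = int(category)
    return path, index, CLASS_NAMES[index]


def read_labels(lines):
    """Parse an iterable of label lines, skipping blank ones."""
    return [parse_label_line(line) for line in lines if line.strip()]
```

Since an open text file is an iterable of lines, read_labels can be passed a file object directly, for instance one opened on a label file extracted from labels_only.tar.gz.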

Citation

If you use this dataset, please cite our paper:

A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015

Bibtex format:

@inproceedings{harley2015icdar,
    title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
    author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
    booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
    year = {2015}
}

License

RVL-CDIP is a subset of IIT-CDIP, which came from the Legacy Tobacco Document Library, for which license information can be found here.

References:

  1. D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, "Building a test collection for complex document information processing," in Proc. 29th Annual Int. ACM SIGIR Conference (SIGIR 2006), pp. 665-666, 2006
  2. The Legacy Tobacco Document Library (LTDL), University of California, San Francisco, 2007. http://legacy.library.ucsf.edu/.