A Collection of Face Recognition Datasets and Benchmarks as of 2019

There are many public face datasets available on the Internet for research purposes at present. In this post, I collect most of them and give each a short description so that people can select the proper one quickly. Selecting and preprocessing the datasets properly can be critical to the performance and reliability of the results.

Keywords: face-recognition, dataset, VGG-Face, VGG-Face2, CASIA-WebFace, UMDFaces, MS-Celeb-1M, MegaFace

Training Datasets

  • VGG-Face
    • From: Visual Geometry Group - University of Oxford
    • URL: http://www.robots.ox.ac.uk/~vgg/data/vgg_face/
    • Number of Identities: 2,622
    • Number of Images: N/A
    • Landmarks/Bounding Box: N/A
    • Per-subject Samples: N/A
    • Benchmark Overlap Removal: ![LFW, YouTube Faces Dataset, IARPA Janus Benchmark A]
    • Paper: O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep Face Recognition, British Machine Vision Conference, 2015.
    • Comments: Relatively Small Dataset
  • VGG-Face2 (Recommended)
    • From: Visual Geometry Group - University of Oxford
    • URL: http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/
    • Number of Identities: 9,131
    • Number of Images: 3.31 million
    • Landmarks/Bounding Box: Estimated bounding box and 5 facial landmarks
    • Per-subject Samples: 362.6
    • Benchmark Overlap Removal: N/A
    • Paper: Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman, VGGFace2: A dataset for recognising faces across pose and age, International Conference on Automatic Face and Gesture Recognition, 2018.
  • UMDFaces
    • From: Computer Vision Laboratory, the University of Maryland Institute for Advanced Computer Studies
    • URL: http://umdfaces.io/
    • Number of Identities: 8,277
    • Number of Images: 367,888
    • Landmarks/Bounding Box: N/A
    • Per-subject Samples: N/A
    • Benchmark Overlap Removal: N/A
    • Paper: Ankan Bansal, Anirudh Nanduri, Carlos D Castillo, Rajeev Ranjan, and Rama Chellappa, UMDFaces: An Annotated Face Dataset for Training Deep Networks, arXiv preprint, 2016.
      Ankan Bansal, Carlos Castillo, Rajeev Ranjan, and Rama Chellappa, The Do’s and Don’ts for CNN-based Face Verification, arXiv preprint, 2017.

Benchmarks

Some Notes

When preprocessing face datasets, we need to check whether the training datasets share identities with the testing dataset and remove the overlap. This step is often ignored by some researchers, and skipping it can inflate the benchmark results.
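A minimal sketch of such an overlap check, assuming that both the training set and the benchmark expose identities as plain name strings. The function names `normalize_name` and `remove_overlap` and the sample names are hypothetical and not part of any dataset's tooling. Matching by name only catches identities whose names agree after normalization, so it is a first pass rather than a complete solution.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse separators so that
    'Aaron_Eckhart' and 'aaron eckhart' compare equal."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", " ", ascii_name.lower()).strip()


def remove_overlap(train_identities, benchmark_identities):
    """Split training identities into (kept, dropped), dropping every
    identity whose normalized name also appears in the benchmark."""
    benchmark_keys = {normalize_name(n) for n in benchmark_identities}
    kept, dropped = [], []
    for name in train_identities:
        if normalize_name(name) in benchmark_keys:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Illustrative identity lists (made up for this example).
train = ["Aaron_Eckhart", "Zoë Saldana", "Jane Doe"]
benchmark = ["aaron eckhart", "Zoe_Saldana"]
kept, dropped = remove_overlap(train, benchmark)
```

Here `kept` holds the identities that are safe to train on and `dropped` holds the ones to exclude. Different people can share a name and the same person can appear under different spellings, so the dropped list is worth reviewing by hand.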

Why are some of the datasets usually used as benchmarks? I guess this is because they are more reliable and relatively small.

Reference