A Collection of Face Recognition Datasets and Benchmarks at Year 2019

Posted on 2019-01-02 Edited on 2024-04-22 In Research

There are many public face datasets available on the Internet for research purposes at present. In this post, I collect most of them and give each a short description so that people can select the proper one quickly. Selecting and preprocessing the datasets properly can be critical to the performance and reliability of the results.

Keywords: face-recognition, dataset, VGG-Face, VGG-Face2, CASIA-WebFace, UMDFaces, MS-Celeb-1M, MegaFace

Training Datasets

VGG-Face
- From: Visual Geometry Group - University of Oxford
- URL: http://www.robots.ox.ac.uk/~vgg/data/vgg_face/
- Number of Identities: 2,622
- Number of Images: N/A
- Landmarks/Bounding Box: N/A
- Per-subject Samples: N/A
- Benchmark Overlap Removal: LFW, YouTube Faces Dataset, IARPA Janus Benchmark A
- Paper: O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep Face Recognition, British Machine Vision Conference, 2015.
- Comments: Relatively Small Dataset

VGG-Face2 (Recommended)
- From: Visual Geometry Group - University of Oxford
- URL: http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/
- Number of Identities: 9,131
- Number of Images: 3.31 million
- Landmarks/Bounding Box: Estimated bounding box and 5 facial landmarks
- Per-subject Samples: 362.6
- Benchmark Overlap Removal: N/A
- Paper: Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman, VGGFace2: A dataset for recognising face across pose and age, International Conference on Automatic Face and Gesture Recognition, 2018.

CASIA WebFace
- From: Institute of Automation, Chinese Academy of Sciences
- URL: http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html (404 Not Found Now)
- URL-Backup:
  - Baidu Cloud: https://pan.baidu.com/s/1hQCOD4Kr66MOW0_PE8bL0w Password: y3wj
  - Google Drive: https://drive.google.com/open?id=1Of_EVz-yHV7QVWQGihYfvtny9Ne8qXVz
- Number of Identities: 10,575 +
- Number of Images: 453,453
- Landmarks/Bounding Box: N/A
- Per-subject Samples: N/A
- Benchmark Overlap Removal: N/A
- Paper: Yi, D., Lei, Z., Liao, S., & Li, S., Learning Face Representation from Scratch, arXiv preprint, 2014.

UMDFaces
- From: Computer Vision Laboratory, the University of Maryland Institute for Advanced Computer Studies
- URL: http://umdfaces.io/
- Number of Identities: 8,277
- Number of Images: 367,888
- Landmarks/Bounding Box: N/A
- Per-subject Samples: N/A
- Benchmark Overlap Removal: N/A
- Paper: Ankan Bansal, Anirudh Nanduri, Carlos D Castillo, Rajeev Ranjan, and Rama Chellappa, UMDFaces: An Annotated Face Dataset for Training Deep Networks, arXiv preprint, 2016.
  Ankan Bansal, Carlos Castillo, Rajeev Ranjan, and Rama Chellappa, The Do’s and Don’ts for CNN-based Face Verification, arXiv preprint, 2017.

MS-Celeb-1M
- From: Microsoft Research
- URL: https://www.microsoft.com/en-us/research/project/ms-celeb-1m-challenge-recognizing-one-million-celebrities-real-world/
- Number of Identities: the top 100K celebrities
- Number of Images: 10M
- Landmarks/Bounding Box: N/A
- Per-subject Samples: 100 images for each celebrity
- Benchmark Overlap Removal: N/A
- Paper: Guo, Y., Zhang, L., Hu, Y., He, X., & Gao, J. (2016, October). MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision (pp. 87-102). Springer, Cham.
- Comments: All from the Internet. Large, with a wide variety of environments, but not clean enough.
  A cleaned version from Deepglint can be found here: http://trillionpairs.deepglint.com/data
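
The summary fields used in the entries above (Number of Identities, Number of Images, Per-subject Samples) can be recomputed from a local copy of a dataset, which is a useful sanity check after downloading or cleaning. Below is a minimal sketch; it assumes you have already built a list of (identity, image path) pairs, and the function name `summarize` and the toy data are made up for illustration. As a rough check against the table, VGG-Face2 lists about 3.31 million images over 9,131 identities, which is roughly 362 images per identity and in line with the listed value.

```python
from collections import Counter


def summarize(samples):
    """Compute (number of identities, number of images, mean images per identity).

    `samples` is an iterable of (identity, image_path) pairs. How you build
    that list depends on the dataset layout and is not covered here.
    """
    per_identity = Counter(identity for identity, _ in samples)
    n_identities = len(per_identity)
    n_images = sum(per_identity.values())
    mean_per_subject = n_images / n_identities if n_identities else 0.0
    return n_identities, n_images, mean_per_subject


# Toy data: two identities, four images in total (hypothetical paths).
toy = [
    ("id_a", "id_a/1.jpg"),
    ("id_a", "id_a/2.jpg"),
    ("id_a", "id_a/3.jpg"),
    ("id_b", "id_b/1.jpg"),
]
stats = summarize(toy)
print(stats)
```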

Benchmarks

LFW
- Labeled Faces in the Wild
- URL: http://vis-www.cs.umass.edu/lfw/

CFPW
- Celebrities in Frontal-Profile in the Wild
- URL: http://www.cfpw.io/

AgeDB
- URL: https://ibug.doc.ic.ac.uk/resources/agedb/

Some Notes

When preprocessing face datasets, we need to check for overlap between the training datasets and the test dataset, since identities that appear in both can inflate the benchmark results. This step is often skipped by researchers.
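
One simple way to handle the overlap is to drop every training identity whose name also appears in the benchmark. The sketch below is only an illustration under assumptions: it assumes both datasets label identities with human-readable names, and the helper names `normalize_name` and `remove_overlap` and the example names are invented here. Name matching will miss identities that are labeled with opaque IDs or spelled differently, so treat it as a first pass rather than a complete solution.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse separators so that
    'Aaron_Eckhart' and 'aaron eckhart' compare equal."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", " ", ascii_name.lower()).strip()


def remove_overlap(train_identities, benchmark_identities):
    """Return the training identities whose normalized name does not
    appear among the benchmark identities."""
    blocked = {normalize_name(n) for n in benchmark_identities}
    return [n for n in train_identities if normalize_name(n) not in blocked]


# Toy example with made-up identity lists.
train = ["Aaron_Eckhart", "Zhang Ziyi", "Some Private Person"]
benchmark = ["aaron eckhart", "Zhang_Ziyi"]
kept = remove_overlap(train, benchmark)
print(kept)
```

After filtering, the images belonging to the removed identities should be dropped from the training set as well, not just their names.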

Why are some of the datasets usually used as benchmarks? I guess this is because they are more reliable and relatively small.
