Dataset Gallery

We will collect some popular datasets covering Scene Understanding, Image Captioning, Visual Question Answering (VQA), Object Detection, Video Analysis, etc. in this post. The post will be updated continuously.

Image Captioning

  • PASCAL 1K
    • 1,000 images & 5 sents / im
    • Images come from a dataset designed for image classification, object detection and segmentation
    • No filtering: complex scenes, varied scales and viewpoints of different objects
  • FLICKR 8K
    • 8,108 images & 5 sents / im
    • Obtained from the Flickr website by the University of Illinois at Urbana-Champaign
  • FLICKR 30K
    • An extension of the Flickr 8K dataset
  • MS COCO
    • Largest captioning dataset
    • Includes captions & object annotations
    • 328,000 images & 5 sents / im
  • Visual Genome
    • Densely-annotated dataset
    • Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes
    • 108,077 images with full annotations
    • Not very clean, needs a little pre-processing
  • Image Paragraphs
    • Designed to benchmark progress in generating paragraphs that tell a story about an image
    • 19,561 images from Visual Genome (train/val/test splits contain 14,575/2,487/2,489 images); one paragraph per image
    • On average, each image also has 50 region descriptions (short phrases describing parts of the image), 35 objects, 26 attributes, 21 relationships, and 17 question-answer pairs
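
Since several of the captioning datasets above store roughly five sentences per image, a minimal sketch of grouping captions by image may be useful. It assumes the COCO-style annotation layout (an `images` list plus an `annotations` list keyed by `image_id`); the file name and captions below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical miniature annotation dict in the COCO captions layout;
# real annotation files hold ~5 caption records per image.
coco_style = {
    "images": [{"id": 1, "file_name": "example_0001.jpg"}],
    "annotations": [
        {"image_id": 1, "caption": "A man riding a horse."},
        {"image_id": 1, "caption": "A person on horseback in a field."},
    ],
}

def captions_per_image(data):
    """Group caption strings by the image id they annotate."""
    grouped = defaultdict(list)
    for ann in data["annotations"]:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)

grouped = captions_per_image(coco_style)
```

The same grouping works for any of the 5-sents/im datasets above once their annotations are converted to this shape.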

Visual Question Answering

  • DAQUAR
    • First dataset and benchmark released for the VQA task
    • Images are from the NYU Depth V2 dataset with semantic segmentations
    • 1,449 images (795 training, 654 test), 12,468 questions (auto-generated & human-annotated)
  • COCO-QA
    • Questions automatically generated from image captions
    • 123,287 images, 78,736 train questions, 38,948 test questions
    • 4 types of questions: object, number, color, location
    • Answers are all one-word
  • VQA
    • Most widely-used VQA dataset
    • two parts: one contains images from COCO, the other contains abstract scenes
    • 204,721 COCO and 50,000 abstract images with ~5.4 questions/im
  • CLEVR
    • A diagnostic dataset for testing the reasoning ability of VQA models
    • Rendered images and automatically generated questions with functional programs and scene graphs
    • 100,000 images (70,000 train, 15,000 val, 15,000 test) with ~10 questions/im
  • Visual Genome
    • Densely-annotated dataset
    • Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes
    • 108,077 images with 1.7M grounded Q&A pairs
    • Not very clean, needs a little pre-processing
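
VQA-style datasets typically ship a questions file and an annotations file that must be joined on a shared question id. The sketch below shows that join on two toy records; the field names follow the released VQA JSON layout, but the image id, question, and answers are invented:

```python
# Toy records mimicking the two files the VQA dataset ships:
# a questions file and an annotations file, joined on "question_id".
questions = [
    {"question_id": 100, "image_id": 42, "question": "What color is the bus?"},
]
annotations = [
    {"question_id": 100, "image_id": 42, "multiple_choice_answer": "red",
     "answers": [{"answer": "red"}, {"answer": "dark red"}]},
]

def join_qa(questions, annotations):
    """Pair each question with its ground-truth answer via question_id."""
    by_id = {a["question_id"]: a for a in annotations}
    return [
        {"image_id": q["image_id"],
         "question": q["question"],
         "answer": by_id[q["question_id"]]["multiple_choice_answer"]}
        for q in questions
    ]

pairs = join_qa(questions, annotations)
```

Keeping the join explicit like this makes it easy to filter by question type or answer before training.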

Video Analysis

  • ActivityNet
    • Contains a wide range of complex human activities that are of interest to people in their daily living
    • 200 activity classes with 100 untrimmed videos per class; 1.54 activity instances per video; 648 video hours
    • Can be used for untrimmed video classification, trimmed activity classification, and activity detection
  • ActivityNet Captions (Dense-Captioning Events in Videos)
    • Videos are annotated with a series of temporally localized sentence descriptions; each sentence covers a unique segment of the video, and together they describe the multiple events that occur
    • Each of the 20k videos contains 3.65 temporally localized sentences on average, for a total of 100k sentences, where each sentence has an average length of 13.48 words
  • Atomic Visual Actions (AVA)
    • The AVA dataset densely annotates 80 atomic visual actions in 57.6k movie clips, with actions localized in space and time
    • 210k action labels, with multiple labels per human occurring frequently
    • Uses diverse, realistic video material (movies)
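
To make the "temporally localized sentences" idea concrete, here is a minimal sketch of reading one ActivityNet Captions entry. It assumes the parallel `timestamps` and `sentences` lists of the released annotation files, with each timestamp a [start, end] pair in seconds; the duration, segments, and sentences below are invented:

```python
# Toy entry in the ActivityNet Captions layout: per-video duration plus
# parallel "timestamps" and "sentences" lists (all values invented).
entry = {
    "duration": 120.0,
    "timestamps": [[0.0, 30.5], [25.0, 80.0], [75.0, 120.0]],
    "sentences": [
        "A man walks onto a stage.",
        "He begins to play the guitar.",
        "The crowd claps along.",
    ],
}

def localized_sentences(entry):
    """Zip each sentence with its [start, end] segment in seconds."""
    return list(zip(entry["timestamps"], entry["sentences"]))

def segment_lengths(entry):
    """Length in seconds of each described segment (segments may overlap)."""
    return [end - start for start, end in entry["timestamps"]]

print(segment_lengths(entry))  # [30.5, 55.0, 45.0]
```

Note that segments can overlap, which is exactly what distinguishes dense captioning from the one-caption-per-clip setting above.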

