Our dataset aims to allow simulation of robotic motion through an environment for object detection. We collect data in many scenes, which may be one or more rooms in a home or office.

Collection Procedure:

For most scenes we conduct a full scan, then move instances around the scene and collect either a full or a partial second scan. A partial scan has far fewer images and less coverage of the scene than a full scan. A number of object instances from the BigBIRD dataset are placed in every scene. The second scan usually has a different subset of BigBIRD instances than the first scan. At various locations in the scene we sample 12 images 30 degrees apart, covering a full rotation. We recommend using our visualization code with the example scene to understand the dataset. More detailed info can be found in our paper and on the Add Data page.


Currently we label 33 unique instances in our scenes. More info about these instances can be found in the instances tab above.

Check out our github for code to visualize and load our data.

When you download our dataset you will receive:

One directory for each scan of each scene holding:
  • RGB images: Lossy JPG format, 1920x1080 resolution.
  • Depth images: 16-bit PNG format, registered to the same resolution as the RGB images, then losslessly compressed using the optipng tool.
  • annotations.json: See our format in the next tab. Annotations include 2D bounding boxes and pointers that allow movement through each scene.
  • image_structs.mat: Holds data including the reconstructed camera position for each image. Used in some of our visualizations.
  • present_instance_names.txt: List of all the instances in the scene, using their names from BigBIRD.
  • special_note.txt: In some scenes an instance may appear twice, but never twice in the same image. These scenes include this file to indicate which instances appear twice. As of this writing, these scenes are Home_006_1 and Office_001_1.

Upon request, we may also provide the following:
  • Sparse/dense reconstructions of each scene
  • 3D point cloud labels of each instance in the dense reconstruction
  • Original resolution (512x424) depth images
  • Original uncompressed PNG format RGB images

Here we define our formats for bounding boxes, annotations, and image names.


Bounding box format

Each bounding box has 6 numbers: 
[xmin ymin xmax ymax instance_id difficulty]

1. xmin - minimum x value of bounding box
2. ymin - minimum y value of bounding box
3. xmax - maximum x value of bounding box
4. ymax - maximum y value of bounding box
5. instance_id - numeric id of the instance that is labeled
6. difficulty - a measure of how difficult the box 
                may be for a computer to replicate

*Currently, difficulty is just a measure of the box size, defined below.
 We hope to improve this to account for occlusion in the future. 

if box > (300x100)
    difficulty = 1;
elseif box > (200x75)
    difficulty = 2;
elseif box > (100x50)
    difficulty = 3;
elseif box > (50x30)
    difficulty = 4;
else
    difficulty = 5;
end

**The box's larger dimension must be greater than the first number, 
  and its smaller dimension greater than the second number.

Ex)  A box that is 250x80 has difficulty 2.
     A box that is 80x250 has difficulty 2.
     A box that is 250x60 has difficulty 3.
     A box that is 60x250 has difficulty 3.

**We do not guarantee the presence of boxes smaller than 50x30.
 That is, if an object's true box is < 50x30,
 then we may not have labeled that box. Many of these boxes
 are labeled, however, and any box that is labeled contains 
 the object.
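
As a rough illustration, the difficulty rule above might be written in Python as follows. The function name box_difficulty and the choice to measure the box as xmax - xmin by ymax - ymin are assumptions made for this sketch; they are not taken from the dataset's released code.

```python
# Size thresholds (larger dimension, smaller dimension) for difficulty 1-4,
# copied from the rule above. A box that fails all four gets difficulty 5.
THRESHOLDS = [(300, 100), (200, 75), (100, 50), (50, 30)]


def box_difficulty(xmin, ymin, xmax, ymax):
    """Return the size-based difficulty (1 = easiest, 5 = hardest) of a box.

    ASSUMPTION: width and height are taken as xmax - xmin and ymax - ymin.
    """
    width = xmax - xmin
    height = ymax - ymin
    # Orientation does not matter: compare the larger side to the first
    # threshold number and the smaller side to the second.
    larger, smaller = max(width, height), min(width, height)
    for level, (min_larger, min_smaller) in enumerate(THRESHOLDS, start=1):
        if larger > min_larger and smaller > min_smaller:
            return level
    return 5
```

With this sketch, a 250x80 box and an 80x250 box both come out as difficulty 2, and a 250x60 box as difficulty 3, matching the examples above.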

Annotation format

Each scene has one JSON file for all annotations, with the following format:

{
  "image_name":{
      ...
          [xmin ymin xmax ymax instance_id difficulty],
          [xmin ymin xmax ymax instance_id difficulty],
      ...
  },
  "next_image_name":{ ...

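Once the annotation file is loaded, the boxes for one instance can be collected across images. The sketch below is illustrative only: the key name "bounding_boxes" is an assumption (the key that holds the box list is not shown above), the helper boxes_for_instance is hypothetical, and the sample data is made up. In practice the dictionary would come from json.load on a scan's annotations.json.

```python
import json

# ASSUMPTION: each image entry stores its boxes under this key. Check a real
# annotations.json and adjust if the key differs.
BOXES_KEY = "bounding_boxes"


def boxes_for_instance(annotations, instance_id, boxes_key=BOXES_KEY):
    """Map image name -> boxes labeled with instance_id (images without it are skipped)."""
    matches = {}
    for image_name, entry in annotations.items():
        # Each box is [xmin, ymin, xmax, ymax, instance_id, difficulty].
        hits = [box for box in entry.get(boxes_key, []) if box[4] == instance_id]
        if hits:
            matches[image_name] = hits
    return matches


# Made-up stand-in for the contents of an annotations.json file.
example = json.loads("""
{
  "000110000010101.jpg": {"bounding_boxes": [[10, 20, 300, 200, 5, 2],
                                             [400, 50, 460, 90, 9, 4]]},
  "000110000020101.jpg": {"bounding_boxes": [[15, 25, 310, 210, 5, 2]]},
  "000110000030101.jpg": {"bounding_boxes": []}
}
""")

instance_5_boxes = boxes_for_instance(example, 5)
```

Here instance_5_boxes holds one box for each of the first two images and no entry for the third.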

Image Name Format

***Each image has a unique name.***
Our image names have 15 digits, followed by the file extension (.jpg or .png).

Example: 000120001620101.jpg 

(Digit 1) = Scene Type
        0 - Home
        1 - Office 

(Digits 2-4) = Scene Number 
        Ex. 001 = scene 1 of this type

(Digit 5) = Scan Number 
        Ex. 2 = scan 2 of this scene

(Digits 6-11) = Image Index 
       Ex. 000162 = the 162nd image captured in this scan 

(Digits 12-13) = Camera Index 
       Ex. 01 = this image was taken with Camera 1

(Digits 14-15) = Image Type 
        01 - RGB 
        02 - raw_depth (512x424)
        03 - high_res_depth (1920x1080)
        05 - improved_depths (see paper, not currently available for download)

So for image 000120001620101.jpg:

Home, Scene 1, Scan 2, Image 162, Camera 1, RGB image
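
The digit layout above can be decoded with a few string slices. This is a minimal sketch rather than the loader from our github; the names ImageName and parse_image_name are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple

# Code tables copied from the naming scheme above.
SCENE_TYPES = {"0": "Home", "1": "Office"}
IMAGE_TYPES = {
    "01": "RGB",
    "02": "raw_depth",
    "03": "high_res_depth",
    "05": "improved_depths",
}


class ImageName(NamedTuple):
    scene_type: str
    scene_number: int
    scan_number: int
    image_index: int
    camera_index: int
    image_type: str


def parse_image_name(file_name: str) -> ImageName:
    """Split a 15-digit image name (with or without extension) into its fields."""
    stem = file_name.rsplit(".", 1)[0]  # drop ".jpg" / ".png" if present
    if len(stem) != 15 or not stem.isdigit():
        raise ValueError(f"expected 15 digits, got {stem!r}")
    return ImageName(
        scene_type=SCENE_TYPES[stem[0]],      # digit 1
        scene_number=int(stem[1:4]),          # digits 2-4
        scan_number=int(stem[4]),             # digit 5
        image_index=int(stem[5:11]),          # digits 6-11
        camera_index=int(stem[11:13]),        # digits 12-13
        image_type=IMAGE_TYPES[stem[13:15]],  # digits 14-15
    )


parsed = parse_image_name("000120001620101.jpg")
```

For the example name, parsed has scene_type "Home", scene_number 1, scan_number 2, image_index 162, camera_index 1, and image_type "RGB". Unknown scene or image type codes raise a KeyError in this sketch.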


Download Our Data

Here you can download our entire dataset, excluding the few scans we have held out for testing purposes. If you don't want to download everything, check out our example scene below. An evaluation server will be online in the future. To reduce the download size, we have broken up the dataset into a few .tar files. Check out the instances tab to see examples of our instances, and download some images of them.

Example Scene (1.2GB)

Instead of downloading our entire dataset you can get an idea of what our images look like, our label format, and a general sense of how our data is organized by downloading this single example scan. Check out our github for code to visualize and load our data.
Note: This scan is also included in our full dataset.

Example scan

Description of common instances

We place a subset of our 33 common instances in each scene. When we can, we ask the owner of the home to place the objects in natural places, to avoid any bias in object placement.
We chose our instances based on instances in the BigBIRD dataset, but not all of our instances are exact matches.
Below we provide:

Instance names/ids
Instance images

advil_liqui_gels aunt_jemima_origin
bumblebee_albacore cholula_chipotle_h
crystal_hot_sauce expo_marker_red hersheys_bar honey_bunches_of_o
hunts_sauce listerine_green mahatma_rice nature_valley_gran
paper_plate pepto_bismol pringles_bbq progresso_new_engl
red_bull red_cup softsoap_clear softsoap_gold softsoap_white
tapatio_hot_sauce vo5_tea_therapy_he


Here we provide access to code to replicate the experiments in our original paper.

Instance Detection


Active Vision

Github ||| Extra data

Here are the test/train splits we use in our experiments

Train 1: Home_002_1, Home_003_1, Home_003_2, Home_004_1, Home_004_2, Home_005_1, Home_005_2, Home_006_1, Home_014_1, Home_014_2, Office_001_1
Test 1: Home_001_1, Home_001_2, Home_008_1

Train 2: Home_001_1, Home_001_2, Home_002_1, Home_004_1, Home_004_2, Home_005_1, Home_005_2, Home_006_1, Home_008_1, Home_014_1, Home_014_2
Test 2: Home_003_1, Home_003_2, Office_001_1

Train 3: Home_001_1, Home_001_2, Home_003_1, Home_003_2, Home_004_1, Home_004_2, Home_005_1, Home_005_2, Home_006_1, Home_008_1, Office_001_1
Test 3: Home_014_1, Home_014_2, Home_002_1
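
For use in scripts, the three splits can be written down as a small Python dictionary with a sanity check that no scan is in both the train and test set of a split. This is only a sketch: the names SPLITS and validate_splits are hypothetical, and each split is assumed to consist of 11 training scans followed by 3 test scans.

```python
# ASSUMPTION: each split lists 11 training scans followed by 3 test scans.
SPLITS = {
    1: {
        "train": ["Home_002_1", "Home_003_1", "Home_003_2", "Home_004_1",
                  "Home_004_2", "Home_005_1", "Home_005_2", "Home_006_1",
                  "Home_014_1", "Home_014_2", "Office_001_1"],
        "test": ["Home_001_1", "Home_001_2", "Home_008_1"],
    },
    2: {
        "train": ["Home_001_1", "Home_001_2", "Home_002_1", "Home_004_1",
                  "Home_004_2", "Home_005_1", "Home_005_2", "Home_006_1",
                  "Home_008_1", "Home_014_1", "Home_014_2"],
        "test": ["Home_003_1", "Home_003_2", "Office_001_1"],
    },
    3: {
        "train": ["Home_001_1", "Home_001_2", "Home_003_1", "Home_003_2",
                  "Home_004_1", "Home_004_2", "Home_005_1", "Home_005_2",
                  "Home_006_1", "Home_008_1", "Office_001_1"],
        "test": ["Home_014_1", "Home_014_2", "Home_002_1"],
    },
}


def validate_splits(splits):
    """Return the set of scans shared by all splits.

    Raises ValueError if a split has a scan in both train and test, or if
    the splits do not all cover the same scans.
    """
    all_scans = None
    for split_id, split in splits.items():
        train, test = set(split["train"]), set(split["test"])
        overlap = train & test
        if overlap:
            raise ValueError(f"split {split_id}: scans in both sets: {sorted(overlap)}")
        scans = train | test
        if all_scans is None:
            all_scans = scans
        elif scans != all_scans:
            raise ValueError(f"split {split_id} covers a different set of scans")
    return all_scans


scans = validate_splits(SPLITS)
```

Under the stated assumption, every split draws on the same 14 scans.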