Finding Things: Image Parsing with Regions and Per-Exemplar Detectors

Joseph Tighe and Svetlana Lazebnik
Dept. of Computer Science, University of North Carolina at Chapel Hill

Abstract: This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. The system combines region-level features with per-exemplar sliding window detectors. Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. The proposed system achieves state-of-the-art accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels.
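
To make the mask-transfer idea concrete, below is a minimal Python sketch (not the released code linked below) of how firing per-exemplar detectors might vote for their class on the pixels covered by their transferred training masks, and how those votes might be fused with region-level scores. All function names, data structures, and the weighting parameter alpha are hypothetical illustrations under assumed inputs.

import numpy as np

def transfer_masks(detections, masks, labels, image_shape, num_classes):
    """Accumulate per-pixel class scores by transferring exemplar masks.

    Hypothetical inputs, assumed for illustration:
      detections: list of (score, (y0, x0)) pairs -- a calibrated
        exemplar-SVM score and the top-left corner where that
        exemplar's window fired in the test image.
      masks: list of binary 2-D arrays, one per detection, holding the
        segmentation mask of the training exemplar that fired.
      labels: list of class indices, one per detection.
    Returns an (H, W, num_classes) array of accumulated detector scores.
    """
    H, W = image_shape
    scores = np.zeros((H, W, num_classes))
    for (score, (y0, x0)), mask, label in zip(detections, masks, labels):
        h, w = mask.shape
        y1, x1 = min(y0 + h, H), min(x0 + w, W)
        # Each detection adds its score to the pixels covered by the
        # transferred mask (clipped to the image bounds).
        scores[y0:y1, x0:x1, label] += score * mask[: y1 - y0, : x1 - x0]
    return scores

def fuse_and_label(region_scores, detector_scores, alpha=0.5):
    # One simple fusion rule: a per-pixel weighted sum of region-based
    # and detector-based evidence, followed by an argmax over classes.
    fused = alpha * region_scores + (1.0 - alpha) * detector_scores
    return fused.argmax(axis=2)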
Citation:
Joseph Tighe and Svetlana Lazebnik, "Finding Things: Image Parsing with Regions and Per-Exemplar Detectors," CVPR 2013. (PDF) (Sup) (Poster) (Code)

SIFT Flow Dataset:

Output for our entire test set: Web
Pre-Trained Exemplar SVM

LM+Sun Dataset:

Output for our entire test set: Web
Full Dataset
Pre-Trained Exemplar SVM

CamVid Dataset:

Output for our entire test set: Web

Note on CamVid training data: For the CamVid results of Section 3, our training set is not identical to that of [5, 10, 16, 26, 29]. Specifically, the 101 frames labeled at 15 Hz were also included in our training set, increasing its size from the 367 frames used in [5, 10, 16, 26, 29] to 468 frames. We do not believe this extra data significantly affects our system's accuracy, since it only adds frames that are very similar to ones already in the training set; nevertheless, the comparison to other work is not strictly fair.
See also the results and code page for our original SuperParsing system.