Learning Dense Correspondence via 3D-guided Cycle Consistency
Paper: CVPR 2016 (Oral)
Link: https://arxiv.org/abs/1604.05383
Approach
Network
- feature encoder: 8 convolution layers that extract relevant features from both input images with shared network weights;
- flow decoder: 9 fractionally-strided/up-sampling convolution (uconv) layers that assemble features from both input images and output a dense flow field;
- matchability decoder: 9 uconv layers that assemble features from both input images and output a probability map indicating whether each pixel in the source image has a correspondence in the target.
- every layer is conv+ReLU, except the last uconv of each decoder
- kernel size 3×3
- no pooling; stride = 2 convolutions/uconvs decrease/increase the spatial dimensions
- the matchability decoder's output is passed through a sigmoid for normalization
- training: the same network is applied to all three pairs \(s_1 \rightarrow r_1\), \(r_1 \rightarrow r_2\), \(s_1 \rightarrow r_2\)
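The cycle-consistency supervision requires that composing the predicted flows \(s_1 \rightarrow r_1\) and \(r_1 \rightarrow r_2\) reproduces the direct flow \(s_1 \rightarrow r_2\). A minimal NumPy sketch of flow composition and the resulting consistency error (function names and nearest-neighbor sampling are illustrative, not from the paper's code):

```python
import numpy as np

def compose_flow(flow_ab, flow_bc):
    """Compose two dense flow fields: a->b followed by b->c.

    Flows are (H, W, 2) arrays of (dx, dy) pixel displacements.
    The b->c displacement is looked up at the pixel each source
    pixel lands on (nearest-neighbor sampling for simplicity).
    """
    H, W, _ = flow_ab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Where each pixel of a lands in b (clamped to the image bounds).
    xb = np.clip(np.round(xs + flow_ab[..., 0]).astype(int), 0, W - 1)
    yb = np.clip(np.round(ys + flow_ab[..., 1]).astype(int), 0, H - 1)
    # Total displacement a->c = (a->b) + (b->c at the landing pixel).
    return flow_ab + flow_bc[yb, xb]

def cycle_consistency_error(flow_s1r1, flow_r1r2, flow_s1r2):
    """Mean Euclidean error between the composed and direct flows."""
    composed = compose_flow(flow_s1r1, flow_r1r2)
    return float(np.mean(np.linalg.norm(composed - flow_s1r2, axis=-1)))
```

For constant flows the composition is exact, so the error is zero only when the direct flow equals the sum of the two hops.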
Experiments
Training set
- real images: PASCAL3D+ dataset
- cropped from bounding box;
- rescaled to 128×128
- 3D CAD models: ShapeNet database
- render 3D models from the same viewpoint
- choose K=20 nearest models using HOG Euclidean distance
- valid training quartets per category: 80,000
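The K=20 nearest CAD models are retrieved by Euclidean distance between HOG descriptors of the real image and of the rendered models. A hedged sketch (HOG extraction itself is omitted; `hog_query` and `hog_models` are assumed precomputed feature vectors):

```python
import numpy as np

def k_nearest_models(hog_query, hog_models, k=20):
    """Return indices of the k rendered models whose HOG descriptors
    are closest in Euclidean distance to the query image's descriptor.

    hog_query: (D,) feature vector; hog_models: (N, D) matrix.
    """
    dists = np.linalg.norm(hog_models - hog_query, axis=1)
    return np.argsort(dists)[:k]
```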
Network training
- Initialization:
- feature encoder + flow decoder pathway: mimic SIFT flow by randomly sampling image pairs from the training quartets and training the network to minimize the Euclidean loss between the network prediction and the SIFT flow output on the sampled pair
- the authors also tried other initialization strategies (e.g. predicting ground-truth flows between synthetic images) and found that initializing with the SIFT flow output works best.
- Parameters:
- ADAM solver: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), lr = 0.001; step size of 50k, step multiplier of 0.5, for 200k iterations.
- batch = 40 during initialization and 10 quartets during fine-tuning.
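The "step size of 50k, step multiplier of 0.5" schedule simply halves the learning rate every 50k iterations; a one-line sketch:

```python
def step_lr(iteration, base_lr=0.001, step_size=50_000, gamma=0.5):
    """Step schedule: base_lr multiplied by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)
```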
Feature embedding
- The layout of the learned embedding appears to be viewpoint-sensitive (the network might implicitly learn that viewpoint is an important cue for correspondence/matchability tasks through the consistency training).
Keypoint transfer task
Evaluates the quality of the correspondence output.
- For each category, image pairs are exhaustively sampled from the val split (not seen during training), and each keypoint in the source image is checked for correct transfer by measuring the Euclidean distance between the correspondence prediction and the annotated ground truth (if it exists) in the target image.
- A transfer is correct when the prediction falls within \(\alpha \cdot \max(H, W)\) pixels of the ground truth, with H and W being the image height and width, respectively (both 128 pixels here).
- Metric: the percentage of correct keypoint transfers (PCK)
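The PCK criterion above can be computed directly; a sketch assuming predicted and ground-truth keypoint locations as (N, 2) pixel arrays (names are illustrative):

```python
import numpy as np

def pck(pred, gt, alpha=0.1, H=128, W=128):
    """Percentage of correct keypoint transfers (PCK).

    A transfer is correct when the predicted location falls within
    alpha * max(H, W) pixels of the annotated ground truth.
    """
    threshold = alpha * max(H, W)
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= threshold))
```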
Matchability prediction
- PASCAL-Part dataset (provides human-annotated part segment labeling)
Shape-to-image segmentation transfer