RCNN论文笔记

RCNN本身概念比较简单，但是论文中提到了许多object detection的经典方法非常有用，这些内容绝大部分在附录中，在阅读其他论文时相信会有很大帮助。

Basic Notions

Pipeline method
1. 提取独立于物体类别的region proposal
2. 使用CNN做特征提取
3. 调整CNN参数用于从候选的proposal中获取包含object的proposal(domain-specific fine-tuning)
4. SVM进行分类

Selective search 提取region proposal(非本文重点)
使用已经训练好的CNN(dataset: ILSVRC2012 classification caffe implementation) 提取proposal的特征，为了将输入标准化，强制将每个proposal resize到227*227
CNN提取的特征输入到SVM中进行分类

按照笔者个人理解，为了让CNN有detection的能力，我们需要让网络具备分辨哪些proposal可能包含object，因为selective search选取的许多proposal其实没有object或者仅包含了object的一小部分。

因此文中加了一步对CNN参数的调整(domain-specific fine-tuning),主要目的就是筛选proposal

在实验部分作者也进行了对比，结果证明使用了fine-tuning比不使用要高8%的mAP

Supervised pre-training
1. CNN on ILSVRC2012 with 2.2% error rate
2. training set: images with image-level annotations, no bounding boxes
Domain specific fine-tuning
1. N+1 classification: N object classes and 1 background
2. Use warped region proposals for tuning
3. Positive training examples: region proposals with >0.5IoU overlap(the rest as negative)
Object category classification
1. Positive examples: ground-truth boxes, Negative: region proposals with <0.3IoU overlap
2. Trade-off between SVM and Softmax
3. Different definition for training sets between fine-tuning and classification

这里作者提了两个问题 (在附录里均有解答):

1. 为什么在第二步fine-tune CNN参数使用的训练集和训练SVM分类器的训练集不同(尤其是threshold的选择不一样)?
2. 为什么不使用softmax分类器而选择SVM?

Object proposal transformations 由于CNN输入是227*227的正方形图像，因此需要对proposal进行缩放处理，文中提出了3种不同的处理方式：
1. tightest square with context(等比例缩放proposal在原始图像中的bounding square)
2. tightest square without context(等比例缩放proposal在原始图像中的bounding square,但要去掉在bounding square中而不在原始proposal中的部分)
3. 强制缩放proposal 3种变形的方式如下图中所示:

origin proposal (B) tightest square with context (C) tightest square without context (D) warp

Positive vs negative examples and softmax
1. 用于fine－tuning的数据集过少，因此将降低了positive example的要求，只要IoU大于0.5就认为可能有object在这个proposal里面，作者同时指出这样能够防止overfitting
2. 使用softmax比SVM降低3.3%的mAP，主要原因可能在于fine-tuning的训练集不那么强调box的位置，且SVM使用的nagative sample更具有区分度
Bounding-box 回归
1. 使用最后一个pooling层的输出训练线性回归模型来预测detection window
2. 实验结果表明使用BB regression效果更优

下面两点是这篇文章的核心：

High-capacity CNN to bottom-up region proposals in order to localize and segment objects(CNN用于detection)
Effective method: supervised pretraining (auxiliary data eg. classification)/fine-tuning(scarce data eg. detection)(fine-tuning的重要性不言而喻)