[Paper] Inverting and Visualizing Features for Object Detection

Here are some short reading notes on a paper that came out on arXiv this week. I follow a few RSS feeds there, and the title immediately caught my eye:

Inverting and Visualizing Features for Object Detection

by Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz and Antonio Torralba (MIT/CSAIL).

The paper

What is it about?

As the title says, it’s about feature inversion and visualization. But not just any feature: the now ubiquitous HOG (Histogram of Oriented Gradients) feature. In short, a HOG vector is obtained by slicing a region of interest into several small cells, computing the histogram of the gradient direction in each cell (usually quantized into 9 bins), and concatenating the results after some block-wise normalization.
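To make the description above concrete, here is a minimal pure-Python sketch of a HOG-like descriptor. It is a simplification, not the reference implementation: it assumes a grayscale image given as a list of rows, uses hard binning (no interpolation between bins), non-overlapping 8x8 cells and 2x2 blocks with plain L2 normalization. The names `hog_cells` and `hog_descriptor` and all default parameters are my own choices for illustration.

```python
import math

def hog_cells(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Centered differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, pi).
            angle = math.atan2(gy, gx) % math.pi
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[y // cell][x // cell][b] += mag
    return hist

def hog_descriptor(image, cell=8, bins=9, eps=1e-6):
    """Concatenate L2-normalized 2x2 blocks of neighbouring cell histograms."""
    hist = hog_cells(image, cell, bins)
    rows, cols = len(hist), len(hist[0])
    out = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hist[r][c] + hist[r][c + 1] + hist[r + 1][c] + hist[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in block) + eps * eps)
            out.extend(v / norm for v in block)
    return out

# Toy input (made up): a 16x16 image with a single vertical edge.
edge = [[0 if x < 8 else 1 for x in range(16)] for _ in range(16)]
print(len(hog_descriptor(edge)))  # 36: one 2x2 block of 9-bin histograms
```

Note that the block normalization divides by the local gradient energy, so regions with very weak gradients can still end up with histograms of noticeable size.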

The algorithms

The authors propose four algorithms for HOG inversion. All of them rely on some prior training, but they differ in speed and in the kind of results they produce:

  • LDA is easy to apply, but produces blurred estimates;
  • Ridge regression is very fast, but also yields blurred estimates;
  • Direct optimization (in a well-chosen dictionary) gives sharp but noisy results and seems much slower;
  • Sparse coding on paired dictionaries (one trained on the images, the other on the HOG features) yields results that sit somewhere between the others.
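To give an idea of why the regression flavour is fast once trained, here is a generic ridge regression sketch that learns a linear map from feature vectors to pixel vectors. It is not the paper's exact formulation, only textbook ridge regression solved through the normal equations. The helper names (`solve`, `fit_ridge`, `invert_feature`), the regularization weight and the toy data are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve a @ x = b (square a, matrix b) by Gauss-Jordan with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def fit_ridge(features, images, lam=1e-3):
    """Return W minimizing sum ||f W - x||^2 + lam ||W||^2 over training pairs (f, x)."""
    d, k = len(features[0]), len(images[0])
    # Normal equations: (F^T F + lam I) W = F^T X
    gram = [[sum(f[i] * f[j] for f in features) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [[sum(f[i] * x[j] for f, x in zip(features, images))
            for j in range(k)] for i in range(d)]
    return solve(gram, rhs)

def invert_feature(w, feature):
    """Inversion is a single matrix-vector product, hence the speed."""
    return [sum(feature[i] * w[i][j] for i in range(len(feature)))
            for j in range(len(w[0]))]

# Toy training set (made up): 2-D "features" paired with 3-pixel "images".
train_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_images = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [2.0, 3.0, 2.0]]
W = fit_ridge(train_features, train_images, lam=1e-9)
print(invert_feature(W, [2.0, 1.0]))
```

A linear map like this can only return a weighted average of what it saw during training, which is consistent with the blurred estimates reported for this family of methods.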

Mechanical Turk joins the game

In addition to measuring reconstruction quality by the correlation between the original and reconstructed images, the authors used Amazon’s Mechanical Turk. Mechanical Turk is a crowdsourcing marketplace where you can submit well-defined tasks that workers will eventually perform. This gives you manpower on a large scale even if your job is a one-shot, and leaves most of the administrative burden to Amazon.
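As an aside, the correlation-based quality measure mentioned above could be computed along the following lines. This is a sketch assuming a plain Pearson (normalized) correlation over all pixels of two same-sized grayscale images; the exact measure used in the paper may differ, and the function name `normalized_correlation` is my own.

```python
import math

def normalized_correlation(a, b):
    """Pearson correlation between two same-sized images given as lists of rows."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    mean_a = sum(fa) / len(fa)
    mean_b = sum(fb) / len(fb)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in fa) *
                    sum((y - mean_b) ** 2 for y in fb))
    # A constant image has no variance; report zero correlation in that case.
    return num / den if den > 0 else 0.0

# Toy images (made up): the second is a brighter, higher-contrast copy of the first.
original = [[0.0, 1.0], [2.0, 3.0]]
rescaled = [[2 * v + 3 for v in row] for row in original]
print(normalized_correlation(original, rescaled))
```

Because the score ignores global brightness and contrast changes, it rewards reconstructions that recover the structure of the image even when the absolute intensities are off.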

The online participants [1] were asked to classify the output of the reconstruction algorithms into the 20 classes of the standard PASCAL VOC benchmark (2011 edition). The interesting outcome of this study is the following quote from the paper:

There is strong correlation with the accuracy of humans classifying the HOG inversions with the performance of HOG based descriptors.

Said differently, this means that false positives are actually not so false: they correspond to features characteristic of a given class that arise at spurious locations (image background, an instance of another class, etc.). Understanding these false positives is crucial: it means that either the classification task is ill-defined [2] or the detector is not specific enough [3]. And there is no point in trying to build a super-kernelized-discriminative-svm-classifier to solve these cases: they correspond to actual limits of the HOG descriptors.

Miscellaneous comments

One more feature!

So, after SIFT (see this work) and LBDs such as BRIEF and FREAK (in our work here and here), it’s the turn of HOG! There seems to be a trend forming around this topic. Note that unlike the paper of Weinzaepfel et al., which was more oriented towards security and privacy, this paper and ours focus on visualizing the features in order to understand them better.

One more algorithmic approach (well, actually another 4)

Interestingly, this paper brings yet another approach to feature inversion:

  • Weinzaepfel et al. used nearest-neighbour queries in a reference dataset;
  • we used direct inversion via an inverse problem formulation;
  • this paper stands midway between the two: it relies on optimization-based methods such as LDA and sparse coding, but these approaches need to be trained on some dataset.

Since the algorithms from this paper can probably be applied to features other than HOG, you now have a real algorithmic choice if you want to invert some feature. Note however that only our algorithm does not need a dataset: it relies on the spatial structure of the LBD instead.

As a side note, the authors used their algorithm to reconstruct an image of a person in front of a very dark background. The HOG reconstruction produces a lot of detail on the almost black background, which is, I guess, a consequence of the normalization in the descriptor computation process.

In brief

It’s interesting at several levels:

  • it was developed completely independently from our work, so the timing of this pre-print clearly suggests an emerging trend: feature inversion;
  • unlike our work (one algorithm for several features), this paper studies several algorithms for one feature;
  • it yields an interesting insight into why false positives occur.

The pre-print can be found here and the project’s homepage (including code, movies and more results) is here.

  1. Should we call them Turkers?
  2. Because there is too much ambiguity between some classes. I have seen this happen before with military helicopter classification.
  3. But in this case I assume the false negative rates will be dramatic.