Computers don’t see

But we try hard to teach them.

New Types in OpenCV

Breaking in a new machine, I tried to simply recompile my thesis code against the latest version of OpenCV fetched from github. And then, horror, fury and trembling! The build scripts kept choking on unknown symbols, and even llvm could not suggest any relevant substitutions…

A quick look at the documentation page is enough to see that it is not aligned with the latest revisions… So I had to dig into the source code a bit. To spare you this somewhat tedious search, here are the main differences I found.

Happy reading, and I hope you’ll find it useful!

The new OpenCV 2.4.9 symbols

  • The various constants prefixed with CV_ disappear. For example, CV_WINDOW_KEEPRATIO becomes cv::WINDOW_KEEPRATIO, and the various CV_SORT_* become cv::SORT_*. The same goes for the CV_REDUCE_* constants, which become cv::REDUCE_*, and for the mouse/keyboard events, which go from CV_EVENT_* to cv::EVENT_*.
  • However, no rule without an exception! The color conversion codes are now prefixed with cv::COLOR_* instead of CV_*.
  • CV_CHECKBOX becomes cv::QT_CHECKBOX.
  • The CV_FOURCC() macro becomes a static function of the VideoWriter class: cv::VideoWriter::fourcc(). It goes lowercase at the same time.
  • The cv::TermCriteria structure has also evolved, but this time its changes are documented online.

A Typical Implementation of NL-means

Semi-nonlocal implementations

As can be inferred from the description, NL-means is a computationally greedy algorithm: for each pixel, the denoising process can potentially explore the whole image. Thus, it is common practice to restrict the nonlocal exploration to a limited window around each pixel1. However, since nonlocality is then achieved only inside a subwindow around each pixel, I refer to these implementations as semi-nonlocal.

It is common for an implementation of NL-means to include an additional parameter for the size of the learning window. Hence, patches are computed and compared only in a subset of the image that is close to the pixel to denoise, which makes these methods, strictly speaking, only semi-nonlocal.

This choice was originally made to lighten the computations, since otherwise denoising each single pixel would require exploring the whole image. However, as we will see in the next post, it also helps in achieving better denoising results.
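To make the restriction concrete, here is a minimal Python sketch (the helper name is mine, not from any library) of how the learning window is typically clamped to the image borders before the exploration loop:

```python
def learning_window(shape, x, y, rho):
    """Bounds of the (2*rho+1) x (2*rho+1) learning window around (x, y),
    clamped to the image borders (hypothetical helper, for illustration)."""
    h, w = shape
    return (max(x - rho, 0), min(x + rho + 1, h),
            max(y - rho, 0), min(y + rho + 1, w))

# Near the borders, the window is simply cropped:
print(learning_window((64, 64), 2, 60, 5))   # -> (0, 8, 55, 64)
```

The denoising loop for pixel (x, y) then only visits this sub-window instead of the whole image.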

Central pixel handling

During the learning window exploration, there always comes a point where the patch centered on the pixel of interest is compared with itself. Obviously, it will be assigned the maximum weight of 1, which disturbs the noise removal process.

Hence, it is common practice to replace this weight by the maximum weight found during the learning window exploration. This allows for a more effective denoising.
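As a sketch (pure Python, hypothetical function name), the central-weight trick can be written as:

```python
import math

def nlm_weights(dists, center_idx, h):
    """Exponential weights from squared patch distances. The self-distance
    is 0, so the center patch would get the maximum weight of 1; replace it
    by the maximum weight found among the other candidates."""
    w = [math.exp(-d / h**2) for d in dists]
    others = w[:center_idx] + w[center_idx + 1:]
    w[center_idx] = max(others) if others else 1.0
    return w

# The center (distance 0) now weighs as much as the best other candidate:
w = nlm_weights([9.0, 4.0, 0.0, 16.0], center_idx=2, h=2.0)
```

This way the pixel of interest still contributes to its own denoised value, but no longer dominates the average.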

Show me the code!

No examples nor code in this post in order to keep its length reasonable. The next post will contain some examples to support a somewhat disturbing claim: more non-locality in NL-means can degrade the result. The code to generate the examples will be posted on the blog’s github at the end of the series.

  1. This window can sometimes be really small o_0.

NL-means: Some Math


\[ \def\x{\bf x}\def\y{\bf y}\def\G{\mathcal{G}} \def\RR{\mathbb{R}} \def\R{\bf R} \]

In the previous episode, we have seen the main idea of NL-means:

A good denoising algorithm can be obtained by a simple weighted average of noisy pixels, provided that the weights depend on the content of patches around the pixels instead of their spatial distance.

In this post, we will put proper mathematical definitions behind those patches and weighted average. I will proceed in successive steps:

  • first define what a digital image is, and consequently patches as subparts of an image
  • then introduce a distance between patches, such that small distance values correspond to near-identical patches and large values to dissimilar patches
  • then compute weights from these distances that can be used in a denoising process
  • and finally show how to obtain denoised pixel values.

So, let’s dive in!

Mathematical framework

Digital images and patches

From now on, we suppose that we are given an image $I$ defined on a domain $\Omega \subset \RR^2$. The image has values in $\RR$ (for grayscale images) or $\RR^3$ (for color images). We write $\x, \y \in \Omega$ for two pixel locations. We define a patch extraction operator $\R$ that returns the pixels inside a square $\sqrt{d} \times \sqrt{d}$ neighbourhood around $\x$ and stacks them in a vector using lexicographic ordering:

\[ \R : \x \mapsto \R(\x) = ( I(x_1), \ldots, I(x_d) )^T \in \RR^d. \]
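In code, the operator $\R$ is just a crop followed by a reshape. A minimal numpy sketch (assuming the patch fits inside the image; here $\sqrt{d} = 2r+1$ for a patch radius $r$):

```python
import numpy as np

def extract_patch(I, x, y, r):
    """R(x): stack the (2r+1) x (2r+1) neighbourhood around (x, y) into a
    d-dimensional vector, d = (2r+1)**2, in lexicographic order."""
    return I[x - r:x + r + 1, y - r:y + r + 1].ravel()

I = np.arange(25, dtype=float).reshape(5, 5)
p = extract_patch(I, 2, 2, 1)    # 3x3 patch around the center -> d = 9
print(p)                         # [ 6.  7.  8. 11. 12. 13. 16. 17. 18.]
```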

Distance between patches

Given two patches, we define the distance between them as the usual squared Euclidean distance:

\[ d \left( \R(\x), \R(\y) \right) = \| \R(\x) - \R(\y)\|_2^2 = \sum_{i=1}^d \left( \R(\x)_i - \R(\y)_i \right)^2. \]

Weights (or similarity score)

Given the distance between two patches, one can compute their visual similarity, normalized into the interval $ [0,1] $ (ranging from no similarity to identical) by a simple exponential filtering:

\[ w(\x,\y) = \exp { \{-\frac{d(\R(\x),\R(\y))}{h^2} \} } = \exp \{ - \frac{\| \R(\x) - \R(\y) \|^2_2}{h^2} \}. \label{eq:nl-w} \]

The values of $w(\x,\y)$ are obviously between 0 and 1. The parameter $h$ controls the decay of the exponential function: for small $h$, only very similar patches will have a significant similarity score and the filter will be very selective. On the other hand, a large $h$ will give a more uniform importance to all the patches of the input image, yielding stronger denoising but also blurring the image structures more aggressively. We’ll visualize the effect of the parameters in a later post.
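A tiny numerical sketch illustrates the role of $h$ (the distance value below is arbitrary, chosen for illustration):

```python
import math

def weight(dist2, h):
    """w(x, y) = exp(-d(R(x), R(y)) / h^2), where dist2 is the squared
    Euclidean distance between the two patches."""
    return math.exp(-dist2 / h**2)

d2 = 25.0                   # some squared patch distance
print(weight(0.0, 5.0))     # identical patches -> weight 1.0
print(weight(d2, 2.0))      # small h: very selective, weight close to 0
print(weight(d2, 50.0))     # large h: nearly uniform, weight close to 1
```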

Denoising: a weighted average

Finally, the denoised pixel value is obtained by computing the weighted average over an area centered on $\x$ and with radius $\rho$:

\[ NL(\x) = \frac{1}{Z(\x)} \sum_{\y: \| \x - \y \| \leq \rho} w(\x,\y) I(\y). \label{eq:nlm}\]

This area can be seen as a learning window. Its size determines the amount of non-locality: if it extends to the whole image, we are in a fully non-local case. Usually, this distance is computed using the $\ell_\infty$-norm, thus making the learning window rectangular.

The term $Z(\x)$ is the sum of the contributions of all the pixels inside the learning window:

\[ Z(\x) = \sum_{\y: \| \x - \y \| \leq \rho} w(\x,\y). \label{eq:nl-z}\]

It acts as a normalization factor that enforces the stability of the dynamic range of the image.
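Putting the pieces together, here is a direct (and deliberately naive) numpy sketch of the full formula, with a square $\ell_\infty$ learning window of radius $\rho$; the parameter names and toy test image are mine:

```python
import numpy as np

def nl_means(I, r=1, rho=4, h=10.0):
    """Naive NL(x): weighted average over the learning window, normalized
    by Z(x). r is the patch radius, rho the learning window radius."""
    Ipad = np.pad(I, r, mode='reflect')   # so border pixels have full patches
    H, W = I.shape
    out = np.zeros_like(I, dtype=float)
    for x in range(H):
        for y in range(W):
            P = Ipad[x:x + 2*r + 1, y:y + 2*r + 1]        # reference patch R(x)
            num = Z = 0.0
            for u in range(max(x - rho, 0), min(x + rho + 1, H)):
                for v in range(max(y - rho, 0), min(y + rho + 1, W)):
                    Q = Ipad[u:u + 2*r + 1, v:v + 2*r + 1]
                    w = np.exp(-np.sum((P - Q) ** 2) / h**2)
                    num += w * I[u, v]
                    Z += w                                 # builds Z(x)
            out[x, y] = num / Z
    return out

rng = np.random.default_rng(0)
clean = np.zeros((16, 16)); clean[:, 8:] = 100.0           # sharp vertical edge
noisy = clean + rng.normal(0.0, 5.0, clean.shape)
denoised = nl_means(noisy, r=1, rho=4, h=30.0)
```

On this toy step image, the residual error with respect to the clean image drops after filtering, while the edge survives: patches from the other side of the edge get negligible weights.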

Alternate interpretation

The value $NL(\x)$ can be interpreted as the center of mass of the values of the pixels in the learning window centered on $\x$, with the weights $w(\x,\y)$.

I feel lost! Maybe you have a picture?

Yes, of course!

Let’s have a look at the following picture:

"NL-means sketch"

In this image, the red empty squares are patches centered on specific pixels that one wants to denoise. The dashed green squares are the corresponding learning windows. Semi-transparent red squares are examples of patches taken inside the learning windows.

The different cases are interesting:

  • the topmost pixel on the vertical edge shows the benefit of NL-means. Other pixels on the vertical edge will have the same black-and-white patch and will favor an output with a nice, sharp edge. Bye bye Gaussian blur!
  • the black patch in the middle shows another powerful property of NL-means: if the learning window is large enough, then NL-means is able to “jump” above the white part to find another black area and provide better denoising.
  • finally, the bottom patches show possible issues. For the leftmost (white) patch, only white patches will contribute to the denoising and the results should be good. For the rightmost patch (in the black area), some patches in the learning window will include a small white part. However, this part may not be small enough to ensure that only correct (black) pixels will be selected.

Next episodes

In the next posts, we will see how to implement the NL-means and proceed to various experiments in order to have a good understanding of the parameters. Then, we will study several strategies to accelerate this algorithm, thus addressing its main weakness.

One Half

Achievement (almost) unlocked

A quick note to say that I’ve fulfilled the first requirement towards graduation: my doctoral dissertation was accepted by the jury1 during a private oral exam. Wow! And I have to say that I am really grateful to the jury, both for the kind comments and for their criticism of my work, which will help me deliver a better manuscript.

What’s coming next?

I still have to prepare a public presentation to officially disclose the content of the thesis and graduate, which should happen in 2 or 3 months. Right now, I’m making the necessary corrections (typos, incorrect equations or statements…), and of course carrying on my daily work as an embedded DSP developer. I’m learning a lot these days, and hopefully I will be able to make something out of the Beagleboard-xm that was given to me at an ICASSP tutorial by TI!

For the thesis, since I’m not really involved in public research anymore, I’ve decided to disclose it little by little on this blog. Coming soon, an introduction to Non-Local Means (NL-means), followed by an example where it is actually better to be semi-nonlocal.

You don’t really see what I mean? Stay tuned via the blog’s feed or Twitter! The posts will be ready in a few days, with supporting code on github.

  1. Prof. J.-Ph. Thiran (EPFL - head of jury), Prof. J.-F. Aujol (IMB), Prof. P. Fua (EPFL), Prof. L. Jacques (UCL), and Prof. P. Vandergheynst (EPFL - thesis supervisor).

Introducing NL-means

Non-locality and NL-Means

Non-locality is a powerful recent paradigm in Image Processing that was introduced around 2005. Put in simple words, it consists in considering an image as a collection of highly redundant patches. In this context, pixels get described by their surrounding patch instead of their sole value, and patch relationships are deduced only from their visual similarity (thus ignoring their spatial relationship).

Non-locality comes in two flavors: exemplar-based 1 or sparsity-based. The most famous example of a sparsity-based non-local algorithm is BM3D, which is still considered a top-ranking denoising algorithm.

In this series we are interested in the very popular Non-Local Means (NL-means), which is inspired by exemplar-based algorithms. NL-means is conceptually much simpler than BM3D, yet still a very powerful denoising method. Hence, it is a very good introduction to the non-local intuition.

NL-means also inspired a lot of derivative work.

The NL-Means algorithm

The intuition

There is one big secret behind NL-means denoising:

Denoising is as simple as averaging.

The real problem is: what should be averaged?

Averaging spatially connected pixels yields blur, so it is not a good solution. Luckily, here comes non-locality!

There exists a well known and very simple denoising method: averaging. However, it is also well known that averaging pixels in an image yields a lot of undesirable blur that destroys both the noise and the texture of the image.

The reason for this is simple: pixels can’t be assumed to be measurements of the same static phenomenon just because they are spatially close inside the image! For example, this assumption will be violated along the edges of each object, or everywhere a light or color gradient exists.

The NL-means algorithm

To avoid this undesirable blur, NL-means forgets about spatial relationship between pixels. This is why it is called a non-local algorithm.

Instead of their spatial coordinates, we are left with describing pixels by their color (or gray level). Since considering only a single pixel value is error-prone2, we zoom out a bit to consider a small square around the pixel of interest. This small square is called a patch.

A patch is transformed into a feature vector simply by reshaping it as a vector. It can also be interpreted as the coordinates of the central pixel in a high dimensional space that describes all the possible image patches.

"Turning a patch into a feature vector"

Then, instead of using spatial proximity between pixels, the distance between their patches is computed. The denoised value of a given pixel is obtained as the weighted average of the other pixels of the image, where the weight (or contribution) of each pixel decreases with the distance between its patch and the reference patch of the pixel to denoise.
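To fix ideas, here is a tiny 1-D sketch of this patch-weighted average on a noisy step signal (toy data and parameter values of my own choosing):

```python
import numpy as np

noisy = np.array([1., -2., 2., -1., 99., 102., 98., 101.])  # step edge + noise
r, h = 1, 10.0                       # patch radius and filtering parameter

def patch(s, i):
    return s[i - r:i + r + 1]        # 1-D "patch" around sample i

def nlm_pixel(s, i):
    """Weighted average of the samples, weights from patch similarity."""
    num = Z = 0.0
    for j in range(r, len(s) - r):
        w = np.exp(-np.sum((patch(s, i) - patch(s, j)) ** 2) / h**2)
        num, Z = num + w * s[j], Z + w
    return num / Z

# The sample just left of the edge stays near 0: patches from the
# 100-level side are too dissimilar to contribute, so no blur appears.
print(nlm_pixel(noisy, 2))
```

A plain moving average over the same samples would instead drag this value towards the 100-level side and blur the edge.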

Next posts

In the next post, I will introduce the mathematical formulation of NL-means. Then, we will see typical implementations.

Once we have a good understanding of NL-means, we will come to the real novelty:

  • a simple experiment that shows that choosing a very nonlocal NL-means implementation can yield worse results than a local one!
  • some simple strategies to speed-up the computations while retaining the main qualities of the original NL-means.

So stay tuned!

If you don’t want to miss the next post, you can register to the blog’s feed or follow me on Twitter !

  1. Aka neighborhood filters. Actually, A. Buades cites both these families of algorithms as inspiration for NL-means.

  2. Remember we consider noisy images!

Rush Hour

The end of 2012…

Good and bad news

But not only. Starting on January 1st, 2013, I won’t be a research assistant anymore.

So, good news or bad news?

Good news is, I’ve found a job, so I have to leave. Bad news is, well, when you have a real day job, you can’t really spend a lot of time on your personal research, including writing a thesis.

So the big challenge is now: will I be able to write and defend a proper thesis before the deadline? The manuscript is due mid-January, but I have to do as much as possible before I start working. My evaluation is that half the thesis (2 chapters) is ready, but the other half still needs a lot of experiments and also a bit of rewriting. So, let’s see in two weeks how things are going! (well or badly)

By the way, more good news: I’ve finally found a name for my new nonlocal algorithm: it’s going to be DANSE (or NASH, I’m still hesitating a bit). Since it has never been published anywhere before, you’ll have to wait a little for more explanations, but I like it.

The end of the world… almost

I’m writing my thesis in a Dropbox folder (alongside my TimeMachine backups). It seems that my local Dropbox install went out of sync in some way, then never managed to rebuild the index (freezing the machine) and finally started deleting my thesis files! At this point I got really, really scared. Luckily, I got my files back from my backup drive and shut down Dropbox, but beware if you also have important files somewhere in the cloud…

The future of this blog

I still need to figure out what this blog will become in the future. It won’t disappear, for sure! (I renewed the domain name and the hosting, so everything’s all right in this regard.)

Since I won’t have an academic page anymore, I think I’ll still use it to blog about research, especially my own research that is still in the publishing pipeline. Due to the confidentiality aspects of working at a company, don’t expect too much blogging about work. In particular, I should keep the series about LBDs alive because it’s very popular and seems to be a hot1 topic at the moment. The OpenCV and CUDA tricks seemed appreciated, so maybe I will have more time for these too2! I’d also like to talk about two recent Computer Vision books I’ve almost finished reading.

See you soon!

Anyway, see you on this blog in 2013, hopefully with good news! In the meanwhile, don’t forget to celebrate New Year’s Eve, or you’ll have to wait for another year ;-)

  1. Or at least warming!

  2. And a new computer with a recent CUDA card.

[Paper] Inverting and Visualizing Features for Object Detection

Here are some short reading notes on a paper that came out on arXiv this week. I keep a few RSS feeds pointed there, and I was immediately caught by the title:

Inverting and Visualizing Features for Object Detection

by Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz and Antonio Torralba (MIT/CSAIL).

The paper

What is it about?

As the title says, it’s about feature inversion and visualization. Yes, but not just any feature: the now ubiquitous HOG feature. In short, a HOG vector is obtained by slicing a region of interest into several small cells, computing the histogram of the gradient directions in each cell (usually quantized into 9 bins), and concatenating the results after some block-wise normalization.
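To make that description more tangible, here is a toy numpy sketch of the per-cell histogram stage (illustrative only: unsigned gradients, 9 bins, and no block-wise normalization, so it is not the exact HOG pipeline):

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Concatenate magnitude-weighted orientation histograms, one per cell."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)            # unsigned, in [0, pi)
    idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    feats = []
    for i in range(0, img.shape[0] - cell + 1, cell):
        for j in range(0, img.shape[1] - cell + 1, cell):
            hist = np.zeros(bins)
            np.add.at(hist, idx[i:i + cell, j:j + cell].ravel(),
                      mag[i:i + cell, j:j + cell].ravel())   # weighted votes
            feats.append(hist)
    return np.concatenate(feats)

img = np.tile(np.arange(16.0), (16, 1))   # horizontal intensity ramp
f = hog_cells(img)
print(f.shape)                            # 4 cells x 9 bins -> (36,)
```

On this ramp image every gradient is horizontal, so all the votes of each cell fall into the first orientation bin.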

The algorithms

The authors propose 4 different algorithms for HOG inversion, which all rely on some prior training but have different speeds and produce different types of results:

  • LDA is easy to apply, but produces blurred estimates;
  • Ridge regression is very fast, but also yields blurred estimates;
  • Direct optimization (in a well-chosen dictionary) gives sharp but noisy results and seems much slower;
  • Sparse coding on paired dictionaries (one trained on the images, the other trained on the HOG features) yields results that are a middle ground between the others.
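As an illustration of why ridge regression is the fast-but-blurry option, here is a toy numpy sketch of learning a linear inverse from paired examples (fully synthetic data, with a random linear map standing in for the real HOG extractor, so this is not the paper’s actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 25))    # stand-in feature extractor: patch -> feature
X = rng.normal(size=(25, 500))   # 500 training "patches" (columns)
F = A @ X                        # their paired features

# Closed-form ridge regression mapping features back to patches:
# W = X F^T (F F^T + lam I)^(-1)
lam = 0.1
W = X @ F.T @ np.linalg.inv(F @ F.T + lam * np.eye(10))

x_new = rng.normal(size=25)
x_hat = W @ (A @ x_new)          # inverted feature: a 25-dim patch estimate

# The estimate lives in a 10-dim subspace (the rank of A), hence the blur:
# components of x_new outside that subspace cannot be recovered.
```

In the paper’s setting the pairs are real image patches and their HOG features, but the mechanics (one closed-form solve at training time, a single matrix product per inversion) explain both the speed and the blurred estimates.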

Mechanical Turk joins the game

In addition to some reconstruction quality measures based on the correlation between the original and reconstructed images, the authors used Amazon’s Mechanical Turk, a crowdsourcing marketplace where you can submit well-defined tasks that workers will eventually perform. This gives you access to large-scale manpower even if your job is a one-shot (and leaves most of the administrative burden to Amazon).

The online participants1 were asked to classify the output of the reconstruction algorithms into 20 classes taken from the standard PASCAL VOC benchmark (2011 flavor). The interesting outcome of this study is the following quote from the paper:

There is strong correlation with the accuracy of humans classifying the HOG inversions with the performance of HOG based descriptors.

Said differently, this means that false positives are actually not so false: they correspond to features characteristic of a given class that arise at spurious locations (image background, instance of another class…). Understanding these false positives is crucial: their presence means that either the classification task is ill-defined2 or the detector is not specific enough3. And there is no point trying to build a super-kernelized-discriminative-svm-classifier to solve these cases: they correspond to actual limits of the HOG descriptors.

Miscellaneous comments

One more feature !

So, after SIFT (see this work) and LBDs such as BRIEF and FREAK (in our work here and here), it’s HOG’s turn! Hence, there seems to be a trend emerging on this topic. Note that unlike the paper of Weinzaepfel et al., which was more oriented towards security and privacy, this paper and ours are more focused on visualizing the features in order to better understand them.

One more algorithmic approach (well, actually 4 more)

Interestingly, with this paper there is yet another approach for feature inversion:

  • Weinzaepfel et al. used nearest-neighbour queries in a reference dataset;
  • we used direct inversion via an inverse problem formulation;
  • this paper stands midway between the two: it uses optimization-based approaches such as LDA and sparse coding, but these approaches need to be trained on some dataset.

Since the algorithms from this paper can probably be applied to features other than HOG, you now have a real algorithmic choice if you want to invert some feature. Note however that only our algorithm does not need a dataset, relying instead on the spatial structure of the LBD.

As a side note, the authors used their algorithm to reconstruct an image with a person in front of a very dark background. The HOG reconstruction produces a lot of detail on the almost black background, which is, I guess, a consequence of the normalization in the descriptor computation process.

In brief

It’s interesting at several levels:

  • it was developed completely independently from us, so the timing of this pre-print’s appearance clearly suggests an emerging trend: feature inversion;
  • unlike our work (one algorithm for several features), this paper studies several algorithms for one feature;
  • it yields an interesting insight into the understanding of false positives.

The pre-print can be found here and the project’s homepage (including code, movies and more results) is here.

  1. Should we call them Turkers?

  2. Because there is too much ambiguity between some classes. I have seen this case before in military helicopter classification.

  3. But in this case I assume the false negative rates will be dramatic.