Title: APPROXIMATING CNNS WITH BAG-OF-LOCAL-FEATURES MODELS WORKS SURPRISINGLY WELL ON IMAGENET
Description (in my own words): Deep convolutional neural networks can easily classify partly scrambled images, which leads to conclude that they rely in large part on texture detection. This is in opposition to the common interpretation that they build a hierarchical representation of shapes.
I agree with your interpretation. I think this is the heart of the paper and you are right in stating that the actual "inference" stage is not described clearly enough.
I find this technique reminiscent of transfer learning. In this case we transfer the learning that we obtained in building semantic vectors (the paragraph vectors) that augment a short sequence of words in the task of predicting the next word.
Count me in (not sure if I can keep the pace, but I'll try!)