By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned which can reveal how the network typically represents some concepts.
Neural networks can learn to classify images more accurately than any system humans directly design. This raises a natural question: What have these networks learned that allows them to classify images so well?
Feature visualization is a thread of research that tries to answer this question by letting us “see through the eyes” of the network.
These approaches are exciting because they can make the hidden layers of networks comprehensible. These layers are the heart of how neural networks outperform more traditional approaches to machine learning, and historically we’ve had little understanding of what happens in them.
Unfortunately, visualizing activations has a major weakness — it is limited to seeing only how the network sees a single input. Because of this, it doesn’t give us a big picture view of the network. When what we want is a map of an entire forest, inspecting one tree at a time will not suffice.
There are techniques which give a more global view, but they tend to have other downsides.
For example, Karpathy’s CNN codes visualization gives a global view by arranging input images according to the network’s representation of them, but because it shows the input data itself, it doesn’t let us see the images through the network’s eyes the way feature visualization does.
In this article we introduce activation atlases to this quiver of techniques. (An example is shown at the top of this article.) Broadly speaking, we use a technique similar to the one in CNN codes, but instead of showing input data, we show feature visualizations of averaged activations. By combining these two techniques, we can get the advantages of each in one view — a global map seen through the eyes of the network.
In theory, showing the feature visualizations of the basis neurons would give us the global view of a network that we are seeking. In practice, however, neurons are rarely used by the network in isolation, and it may be difficult to understand them that way. As an analogy, while the 26 letters in the alphabet provide a basis for English, seeing how letters are commonly combined to make words gives far more insight into the concepts that can be expressed than the letters alone. Similarly, activation atlases give us a bigger picture view by showing common combinations of neurons.
These atlases not only reveal visual abstractions within a model, but later in the article we will show that they can reveal high-level misunderstandings in a model that can be exploited. For example, by looking at an activation atlas we will be able to see why a picture of a baseball can switch the classification of an image from “grey whale” to “great white shark”.
Of course, activation atlases do have limitations. In particular, they’re dependent on the distribution of the data we choose to sample activations from (in our examples, we use one million images chosen at random from the ImageNet dataset).
Before we dive into Activation Atlases, let’s briefly review how we use feature visualization to make activation vectors meaningful (“see through the network’s eyes”). This technique was introduced in Building Blocks.
Throughout this article, we’ll be focusing on a particular neural network: InceptionV1 (also known as GoogLeNet).
InceptionV1 consists of a number of layers, which we refer to as “mixed3a”, “mixed3b”, “mixed4a”, etc., and sometimes shortened to just “3a”. Each layer successively builds on the previous layers.
To visualize how InceptionV1 sees an image, the first step is to feed the image into the network and run it through to the layer of interest. Then we collect the activations — the numerical values of how much each neuron fired. If a neuron is excited by what it is shown, its activation value will be positive.
Unfortunately these vectors of activation values are just vectors of unitless numbers and not particularly interpretable by people. This is where feature visualization comes in. Roughly speaking, we can think of feature visualization as creating an idealized image of what the network thinks would produce a particular activation vector. Whereas we normally use a network to transform an image into an activation vector, in feature visualization we go in the opposite direction. Starting with an activation vector at a particular layer, we create an image through an iterative optimization process.
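To make the direction of this optimization concrete, here is a minimal sketch in PyTorch. It assumes a hypothetical `model_to_layer` function that runs an image up to the layer of interest and returns its activations; the image parameterization and regularizers the article actually relies on are omitted.

```python
import torch

def visualize_activation(model_to_layer, target_vec, steps=512, lr=0.05):
    """Optimize an image so the layer's activations align with a target vector.

    `model_to_layer(img)` is a hypothetical callable returning the activations
    of the layer of interest; `target_vec` must match their flattened length.
    """
    img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
    optimizer = torch.optim.Adam([img], lr=lr)
    target = torch.as_tensor(target_vec, dtype=torch.float32)

    for _ in range(steps):
        optimizer.zero_grad()
        acts = model_to_layer(img)                  # activations at the layer
        loss = -torch.dot(acts.flatten(), target)   # maximize alignment with the target
        loss.backward()
        optimizer.step()
    return img.detach()
```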
Because InceptionV1 is a convolutional network, there is not just one activation vector per layer per image. Instead, the same neurons are evaluated on each patch of the previous layer, so when we pass an entire image through the network each neuron is evaluated hundreds of times, once for each overlapping patch of the image. We can consider the vector of how much each neuron fired at each patch separately.
The result is a grid of feature visualizations, one for each patch. This shows us how the network sees different parts of the input image.
Activation grids show how the network sees a single image, but what if we want to see more? What if we want to understand how it reacts to millions of images?
Of course, we could look at individual activation grids for those images one by one. But looking at millions of examples doesn’t scale, and human brains aren’t good at comparing lots of examples without structure. In the same way that we need a tool like a histogram in order to understand millions of numbers, we need a way to aggregate and organize activations if we want to see meaningful patterns in millions of them.
Let’s start by collecting activations from one million images. We’ll randomly select one spatial activation per image.
Thankfully, we have modern dimensionality reduction techniques at our disposal. These algorithms, such as t-SNE and UMAP, can project our high-dimensional activation vectors into a two-dimensional layout that tries to keep points that were close in the original space close in the projection.
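A rough sketch of this sampling and projection step, assuming a hypothetical collection of per-image activation maps with shape (height, width, channels):

```python
import numpy as np
from sklearn.manifold import TSNE

def sample_spatial_activations(activation_maps, seed=0):
    """Keep the activation vector at one randomly chosen spatial position per
    image, giving a single C-dimensional sample for each image."""
    rng = np.random.default_rng(seed)
    samples = []
    for acts in activation_maps:          # each `acts` has shape (H, W, C)
        h, w, _ = acts.shape
        samples.append(acts[rng.integers(h), rng.integers(w)])
    return np.stack(samples)              # shape: (num_images, C)

# Synthetic activation maps stand in for the real network outputs here.
fake_maps = [np.random.rand(14, 14, 512) for _ in range(500)]
activations = sample_spatial_activations(fake_maps)
layout = TSNE(n_components=2).fit_transform(activations)  # (500, 2) layout
# UMAP (from the umap-learn package) is a drop-in alternative to t-SNE here.
```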
We then divide this layout into a grid, average the activation vectors that fall within each cell, and perform feature visualization on each averaged activation with the regularizations described in Feature Visualization.
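A minimal sketch of that binning and averaging step follows; the grid size here is an arbitrary choice, a point we return to in the limitations section.

```python
import numpy as np

def grid_cell_averages(layout, activations, grid_size=20):
    """Overlay a grid_size x grid_size grid on the 2D layout and average the
    activation vectors that land in each cell. Each averaged vector is what
    gets rendered as an atlas icon via feature visualization."""
    # Normalize layout coordinates to [0, 1] so cell indices are easy to compute.
    coords = (layout - layout.min(axis=0)) / (layout.ptp(axis=0) + 1e-9)
    cells = np.minimum((coords * grid_size).astype(int), grid_size - 1)

    sums, counts = {}, {}
    for cell, act in zip(map(tuple, cells), activations):
        sums[cell] = sums.get(cell, 0) + act
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: sums[cell] / counts[cell] for cell in sums}
```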
For each activation vector, we also compute an attribution vector.
The attribution vector has an entry for each class, and approximates the amount that the activation vector influenced the logit for each class.
Attribution vectors generally depend on the surrounding context.
We follow Building Blocks in how we compute attribution, and we average the attribution vectors of all the activations that fall within each grid cell.
This average attribution can be thought of as showing what classes that cell tends to support, marginalizing over contexts. At early layers, the average attribution is very small and the top classes are fairly arbitrary because low-level visual features like textures tend to not be very discriminative without context.
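As a rough illustration, the sketch below scores an activation vector against every class logit with a first-order (gradient times activation) approximation, in the spirit of the Building Blocks approach rather than an exact reproduction of it. `logits_fn` is a hypothetical function that runs the rest of the network from this layer to the class logits; the per-cell attribution shown in the atlas is then just the mean of these vectors over the activations in a cell.

```python
import torch

def attribution_vector(activation, logits_fn):
    """Approximate how much an activation vector influenced each class logit.

    Uses a first-order approximation: (gradient of the logit with respect to
    the activation) * activation, summed over the activation's entries.
    Looping over every class is slow but keeps the sketch simple."""
    activation = activation.clone().requires_grad_(True)
    logits = logits_fn(activation)            # hypothetical: layer -> class logits
    num_classes = logits.shape[-1]
    attr = torch.zeros(num_classes)
    for c in range(num_classes):
        grad, = torch.autograd.grad(logits[..., c], activation, retain_graph=True)
        attr[c] = (grad * activation).sum()
    return attr                                # one entry per class
```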
So, how well does this work? Well, let’s try applying it to a layer in the middle of InceptionV1, mixed4c.
This atlas can be a bit overwhelming at first glance — there’s a lot going on! This diversity is a reflection of the variety of abstractions and concepts the model has developed. Let’s take a tour to examine this atlas in more depth.
If we look at the top-left of the atlas, we see images which look like animal heads. There is some differentiation between different types of animals, but it seems to be more a collection of elements of generic mammals — eyes, fur, noses — rather than a collection of different classes of animals. We’ve also added labels that show which class each averaged activation most contributes to. Please note that at a layer this early in the network, these attribution labels can be somewhat chaotic in places.
As we move further down we start to see different types of fur and the backs of four-legged animals.
Below this, we find different animal legs and feet resting on different types of ground.
Below the feet we start to lose any identifiable parts of animals, and see isolated grounds and floors. We see attribution toward environments like “sandbar” and also toward things that are found on the ground, like “doormat” or “ant”.
These sandy, rocky backgrounds slowly blend into beaches and bodies of water. Here we see lakes and oceans, both above and below water. Though the network does have certain classes like “seashore”, we see attribution toward many sea animals, without any visual references to the animals themselves. While not unexpected, it is reassuring to see that the activations that are used to identify the sea for the class “seashore” are the same ones used when classifying “starfish” or “sea lion”. There is also no real distinction at this point between lakes and ocean — “lakeside” and “hippopotamus” attributions are intermingled with “starfish” and “stingray”.
Now let’s jump to the other side of the atlas, where we can see many variations of text detectors. These will be useful when identifying classes such as “menu”, “web site” or “book jacket”.
Moving upward, we see many variations of people. There are very few classes that specifically identify people in ImageNet, but people are present in lots of the images. We see attribution toward things people use (“hammer”, “flute”), clothes that people wear (“bow tie”, “maillot”) and activities that people participate in (“basketball”). There is a uniformity to the skin color in these visualizations which we suspect is a reflection of the distribution of the data used for training. (You can browse the ImageNet training data by category online: swimming trunks, diaper, band aid, lipstick, etc.)
And finally, moving back to the left, we can see round food and fruit organized mostly by colors — we see attribution toward “lemon”, “orange” and “fig”.
We can also trace curved paths through this manifold that we’ve created. Not only are regions important, but certain movements through the space seem to correspond to human interpretable qualities. With the fruit, we can trace a path that seems to correlate with the size and number of fruits in the frame.
Similarly, with people, we can trace a path that seems to correspond to how many people are in the frame, whether it’s a single person or a crowd.
With the ground detectors, we can trace a path from water to beach to rocky cliffs.
In the plants region, we can trace a path that seems to correspond to how blurry the plant is. This could possibly be used to determine relative size of objects because of the typical focal lengths of cameras. Close up photos of small insects have more opportunity for blurry background foliage than photos of larger animals, like monkeys.
It is important to note that these paths are constructed after the fact in the low-dimensional projection. They are smooth paths in this reduced projection but we don’t necessarily know how the paths operate in the original higher-dimensional activation space.
In the previous section we focused on one layer of the network, mixed4c, which is in the middle of our network. Convolutional networks are generally deep, consisting of many layers that progressively build up more powerful abstractions. In order to get a holistic view, we must look at how the model’s abstractions develop over several layers.
To start, let’s compare three layers from different areas of the network to try to get a sense for the different personalities of each — one very early layer (mixed3b), one layer from the middle (mixed4c), and the final layer (mixed5b) before the logits. We’ll focus on areas of each layer that contribute to the classification of “cabbage”.
As you move through the network, the later layers seem to get much more specific and complex. This is to be expected, as each layer builds its activations on top of the preceding layer’s activations. The later layers also tend to have larger receptive fields than the ones that precede them (meaning they are shown larger subsets of the image), so their concepts seem to encompass more of the whole of an object.
There is another phenomenon worth noting: not only are concepts being refined, but new concepts are appearing out of combinations of old ones. Below, you can see how sand and water are distinct concepts in a middle layer, mixed4c, both with strong attributions to the classification of “sandbar”. Contrast this with a later layer, mixed5b, where the two ideas seem to be fused into one activation.
Finally, if we zoom out a little, we can see how the broader shape of the activation space changes from layer to layer. By looking at similar regions in several consecutive layers, we can see concepts getting refined and differentiated: in mixed4a we see a very vague, generic blob, which gets refined into much more specific “peninsulas” by mixed4e.
Below you can browse many more of the layers of InceptionV1. You can compare the curved edge detectors of mixed4a with the bowls and cups of mixed5b. Mixed4b has some interesting text and pattern detectors, whereas mixed5a appears to use those to differentiate menus from crossword puzzles from rulers. In early layers, like mixed4b, you’ll see things that have similar textures near each other, like fabrics. In later layers, you’ll see specific types of clothing.
Looking at an atlas of all activations can be a little overwhelming, especially when you’re trying to understand how the network goes about ranking one particular class. For instance, let’s investigate how the network classifies a “fireboat”.
We’ll start by looking at an atlas for the last layer, mixed5b. Instead of showing all the activations, however, we’ll calculate the amount that each activation contributes toward a classification of “fireboat” and then map that value to the opacity of the activation icon.
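A small sketch of that masking, assuming a hypothetical array of per-cell attribution vectors: keep only the positive attribution toward the class (negative attribution is ignored here, as in the rest of this discussion) and normalize it into an opacity.

```python
import numpy as np

def class_opacities(cell_attributions, class_index, gamma=1.0):
    """Map each grid cell's attribution toward a single class (e.g. "fireboat")
    to an icon opacity in [0, 1]; cells that don't support the class fade out.
    `cell_attributions` is a hypothetical (num_cells, num_classes) array."""
    support = np.clip(cell_attributions[:, class_index], 0, None)
    if support.max() > 0:
        support = support / support.max()
    return support ** gamma   # optional gamma to sharpen or soften the contrast
```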
The layer we just looked at, mixed5b, is located just before the final classification layer so it seems reasonable that it would be closely aligned with the final classes. Let’s look at a layer a little earlier in the network, say mixed4d, and see how it differs.
Here we see a much different pattern. If we look at some more input examples, this seems entirely reasonable. It’s almost as if we can see a collection of the component concepts the network will use in later layers to classify “fireboat”. Windows + crane + water = “fireboat”.
One of the clusters, the one with windows, has strong attribution to “fireboat”, but taken on its own, it has an even stronger attribution toward “streetcar”. So, let’s go back to the atlas at mixed4d, but isolate “streetcar” and compare it to the patterns seen for “fireboat”. Let’s look more closely at the four highlighted areas: the three areas we highlighted for fireboat plus one additional area that is highly activated for streetcars.
If we zoom in, we can get a better look at what distinguishes the two classifications at this layer. (We’ve cherry-picked these examples for brevity, but you can explore all the layers and activations in detail in an explorable playground below.)
If we look at a couple of input examples, we can see how buildings and water backgrounds are an easy way to differentiate between a “fireboat” and a “streetcar”.
By isolating the activations that contribute strongly to one class and comparing them to the activations of other classes, we can see which activations are conserved among classes and which are recombined to form more complex activations in later layers. Below you can explore the activation patterns of many classes in ImageNet through several layers of InceptionV1. You can even explore negative attributions, which we ignored in this discussion.
Highlighting the class-specific activations in situ of a full atlas is helpful for seeing how that class relates to the full space of what a network “can see.” However, if we want to really isolate the activations that contribute to a specific class we can remove all the other activations rather than just dimming them, creating what we’ll call a class activation atlas. Similar to the general atlas, we run dimensionality reduction and feature visualization, but only on the activations that are attributed to the class in question.
A class activation atlas gives us a much clearer view of which detectors the network is using to rank a specific class. In the “snorkel” example we can clearly see ocean, underwater, and colorful masks.
In the previous example, we are only showing those activations whose strongest attribution is toward the class in question. This shows us activations that contribute primarily to that class, even if their overall strength is low (as with background detectors). In some cases, though, there are strong correlations that we’d like to see (like fish with snorkelers). These activations on their own might contribute more strongly to a different class than the one we’re interested in, but their presence can still contribute strongly to our class of interest. For these we need to choose a different filtering method.
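The two filtering criteria can be sketched as follows, assuming a hypothetical (num_activations, num_classes) attribution array: rank keeps activations whose strongest attribution is the class in question, while magnitude keeps the activations with the largest attribution toward that class even when another class ranks higher for them.

```python
import numpy as np

def filter_by_rank(attributions, class_index):
    """Keep activations whose *strongest* attribution is the class in question."""
    return np.argmax(attributions, axis=1) == class_index

def filter_by_magnitude(attributions, class_index, top_k=1000):
    """Keep the activations with the largest attribution toward the class, even
    if another class ranks higher for them (e.g. fish for "snorkel")."""
    order = np.argsort(-attributions[:, class_index])
    mask = np.zeros(len(attributions), dtype=bool)
    mask[order[:top_k]] = True
    return mask
```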
Using the magnitude filtering method, let’s try to compare two related classes and see if we can more easily see what distinguishes them. (We could have instead used rank, or a combination of the two, but magnitude will suffice to show us a good variety of concepts).
It can be a little hard to immediately understand all the differences between classes. To help make the comparison easier, we can combine the two views into one. We’ll plot the difference between the attributions toward “snorkel” and “scuba diver” horizontally, and use a one-dimensional t-SNE to position similar activations near each other vertically.
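A sketch of that layout, with the class indices as hypothetical stand-ins: the horizontal position is the attribution difference between the two classes, and a one-dimensional t-SNE spreads similar activations apart vertically.

```python
import numpy as np
from sklearn.manifold import TSNE

def comparison_layout(activations, attributions, class_a, class_b):
    """Two-class comparison layout: x = attribution toward class_a minus
    attribution toward class_b, y = 1D t-SNE over the activation vectors."""
    x = attributions[:, class_a] - attributions[:, class_b]
    y = TSNE(n_components=1).fit_transform(activations).ravel()
    return np.column_stack([x, y])   # shape: (num_activations, 2)
```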
In this comparison we can see some bird-like creatures and clear tubes on the left, implying a correlation with “snorkel”, and some shark-like creatures and something round, shiny, and metallic on the right, implying a correlation with “scuba diver” (this activation has a strong attribution toward the class “steam locomotive”). Let’s take an image from the ImageNet dataset labeled as “snorkel” and add something that resembles this icon to see how it affects the classification scores.
The failure mode here seems to be that the model is using its detectors for the class “steam locomotive” to identify air tanks to help classify “scuba diver”. We’ll call these “multi-use” features — detectors that react to very different concepts that are nonetheless visually similar. Let’s look at the differences between a “grey whale” and a “great white shark” to see another example of this issue.
In this example we see another detector that seems to be playing two roles: detecting the red stitching on a baseball and a shark’s white teeth and pink inner mouth. This detector also shows up in the activation atlas at layer mixed5b filtered to “great white shark”, and its attribution points toward all sorts of balls, the top one being “baseball”.
Let’s add a picture of a baseball to a picture of a “grey whale” from ImageNet and see how it affects the classification.
The results follow the pattern in previous examples pretty closely. Adding a small-sized baseball does change the top classification to “great white shark”, and as it gets bigger it overpowers the classification, so the top slot goes to “baseball”.
Let’s look at one more example: “frying pan” and “wok”.
One difference stands out here — the type of related foods present. On the right we can clearly see something resembling noodles (which have a strong attribution toward the class “carbonara”). Let’s take a picture from ImageNet labeled as “frying pan” and add an inset of some noodles.
Here the patch was not as effective at lowering the initial classification, which makes sense: the noodle-like icons were plotted closer to the center of the visualization, so they have less of a difference in attribution between the two classes. We suspect that the training set simply contained more images of woks with noodles than frying pans with noodles.
So far we’ve only shown single examples of these patches. Below we show the results for ten sample patches (each set includes the one example we explored above), run on 1,000 images from the ImageNet training set for the class in question. While they aren’t effective in all cases, they do flip the image classification to the target class in about 2 in 5 images. The success rate reaches about 1 in 2 images if we are also allowed to position the patch in the best of the four corners of the image (top left, top right, bottom left, bottom right) at the most effective size. To ensure our attack isn’t just blocking out evidence for the original class, we also compare each attack to a patch of random noise.
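A rough sketch of that evaluation loop; `classify`, the corner placement, and the patch scales are stand-ins rather than the exact procedure, and the real analysis also runs the same loop with a random-noise patch as a control.

```python
import numpy as np

def paste_patch(img, patch, corner, scale):
    """Paste a resized copy of `patch` into one corner of `img` (both are
    H x W x 3 arrays); nearest-neighbor resizing keeps the sketch dependency-free."""
    h, w = img.shape[:2]
    ph, pw = int(h * scale), int(w * scale)
    ys = np.linspace(0, patch.shape[0] - 1, ph).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, pw).astype(int)
    out = img.copy()
    y0 = 0 if corner in ("tl", "tr") else h - ph
    x0 = 0 if corner in ("tl", "bl") else w - pw
    out[y0:y0 + ph, x0:x0 + pw] = patch[ys][:, xs]
    return out

def patch_flip_rate(images, patch, classify, target_class,
                    corners=("tl", "tr", "bl", "br"), scales=(0.2, 0.3, 0.4)):
    """Fraction of images whose top-1 prediction flips to `target_class` when
    the patch is pasted at the best corner and scale. `classify(img)` is a
    hypothetical function returning the predicted class index."""
    flips = 0
    for img in images:
        flips += any(classify(paste_patch(img, patch, corner, scale)) == target_class
                     for scale in scales for corner in corners)
    return flips / len(images)
```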
Our “attacks” can be seen as part of a larger trend of research on adversarial examples and synthetic patches that exploit a model’s weaknesses.
We also want to emphasize that not all class comparisons reveal these types of patches, not all icons in the visualization have the same (or any) effectiveness, and we’ve only tested them on one model. If we wanted to find these patches more systematically, a different approach would most likely be more effective. However, the class activation atlas technique was what revealed the existence of these patches before we knew to look for them. If you’d like to explore your own comparisons and search for your own patches, we’ve provided a notebook to get you started.
Activation atlases give us a new way to peer into convolutional vision networks. They give us a global, hierarchical, and human-interpretable overview of concepts within the hidden layers. Not only does this allow us to better see the inner workings of these complicated systems, but it’s possible that it could enable new interfaces for working with images.
The vast majority of neural network research focuses on quantitative evaluations of network behavior. How accurate is the model? What’s the precision-recall curve?
While these questions can describe how a network behaves in specific situations, they don’t give us a great understanding of why it behaves the way it does. To truly understand why a network behaves the way it does, we would need to fully understand its rich inner world — its hidden layers. For example, understanding how InceptionV1 builds up a classifier for a fireboat from component parts in mixed4d can help us build confidence in our models and can surface places where a model isn’t doing what we want.
Engaging with this inner world also invites us to do deep learning research in a new way. Normally, each neural network experiment gives only a few bits of feedback — whether the loss went up or down — to inform the next round of experiments. We design architectures by almost blind trial and error, guided by vague intuitions that we build up over years. In the future, we hope that researchers will get rich feedback on what each layer in their model is doing in a way that will make our current approach seem like stumbling in the dark.
Activation atlases, as they presently stand, are inadequate to really help researchers iterate on models, in part because they aren’t comparable. If you look at atlases for two slightly different models, it’s hard to take away anything. In future work, we will explore how similar visualizations can compare models, showing similarities and differences beyond error rates.
Machine learning models are usually deployed as black boxes that automate a specific task, executing it on their own. But there’s a growing sense that there might be an alternate way for us to relate to them: that instead of increasingly automating a task, they could be used more directly by a person. One vision of this augmentation that we find particularly compelling is the idea that the internal representations neural networks learn can be repurposed as tools
We think of activation atlases as revealing a machine-learned alphabet for images — a collection of simple, atomic concepts that are combined and recombined to form much more complex visual ideas. In the same way that we use word processors to turn letters into words, and words into sentences, we can imagine a tool that would allow us to create images from a machine-learned language system for images, similar to how GAN painting lets people paint with the semantic units of a generative network.
While classification models are not generally thought of as being used to generate images, techniques like feature visualization show that the representations they learn can be turned back into images through optimization.
Such a tool would not necessarily be limited to targeting realistic images either. Techniques like style transfer show that these same representations can also be used to produce more stylized or artistic imagery.
We could also use these atlases to query large image datasets. In the same way that we probe large corpuses of text with words, we could use activation atlases to find types of images in large image datasets. Using words to search for something like a “tree” is quite powerful, but as you get more specific, human language is often ill-suited to describing specific visual characteristics. In contrast, the hidden layers of neural networks are a language optimized for the sole purpose of representing visual concepts. Instead of using the proverbial thousand words to uniquely specify the image one is seeking, we can imagine someone using the language of the activation atlas.
And lastly, we can also liken activation atlases to histograms. In the same way that traditional histograms give us good summaries of large datasets, activation atlases can be used to summarize large numbers of images.
In the examples in this article we used the same dataset for training the model as we did for collecting the activations. But if we collect the activations from a different dataset, we could use the atlas as a way of inspecting an unknown dataset. An activation atlas could show us a histogram of the learned concepts that exist within the images. Such a tool could show us the semantics of the data, not just surface-level similarities such as histograms of common pixel values.
While we are excited about the potential of activation atlases, we are even more excited at the possibility of developing similar techniques for other types of models. Imagine having an array of machine learned, but human interpretable, languages for images, audio and text.
In this section we note some of the limits and pitfalls that we’ve noticed while developing and working with activation atlases.
Activation atlases are a sample-based method, which can only show a manifold of sampled activations. First, the dataset from which those activations are sampled needs to come from the same distribution as the one we are interested in. In this article we sample from the training set because we are interested in what the model has learned to recognize. Second, we need to provide enough samples to span the full manifold we want to observe. In this article we generally used one million activations, but we found that 100,000 activations were often sufficient.
Neural network activations have an underlying compositional, combinatorial structure: we can mix together a couple hundred basis neurons to get any activation vector. Unfortunately, this space is exponentially large, which makes it hard to find the interesting activations within it. Activation atlases solve this problem by sampling the interesting activation vectors, but they completely lose the original compositional structure.
By not surfacing the compositional structure, activation atlases don’t really support us thinking about novel combinations of directions. They also necessarily can’t show a lot of connections between different parts of the atlas, and have to collapse local structure down to two dimensions.
Surfacing compositionality is deeply connected to the high-dimensional nature of the original space. As a result, it’s naturally hard to do in 2D. That said, it may be possible to partially surface compositionality, like the neuron addition diagram in Feature Visualization.
There’s no way to link two views together. For instance, when looking at two layers side-by-side, similar features appear in random locations.
It’s computationally expensive. While it depends on many factors, activation atlases generally take somewhere between several minutes and several hours, end to end.
Because it is based on dimensionality reduction, the final output can be very sensitive to the hyperparameters chosen for the reduction step. UMAP and t-SNE are both reasonable choices, but each can produce noticeably different results.
Because the choice of a grid size is somewhat arbitrary, much like choosing a histogram bin size, different grid sizes can emphasize or hide different patterns in the resulting atlas.
One might think that a true clustering algorithm, like k-means, would produce a more robust decomposition of the activation manifold. After all, the centroids produced by such an algorithm are by definition close to clusters of activations the network produces, a property that discretizing the dimensionality-reduced activations doesn’t guarantee. We experimented with different clustering techniques such as k-means, spherical k-means, and DBSCAN. However, the images produced by visualizing the resulting centroids were subjectively worse and less interpretable than those from the technique described in this article. Also, dimensionality reduction followed by 2D binning allows for multiple levels of detail while preserving spatial consistency, which is necessary for making atlases zoomable. Thus, we preferred that method over clustering in this article — but the tradeoffs between these techniques remain an open question.
Shan Carter wrote the majority of the article and performed most of the experiments. Zan Armstrong helped with the interactive diagrams and the writing. Ludwig Schubert provided technical help throughout and performed the numerical analysis of the manual patches. Ian Johnson provided inspiration for the original idea and advice throughout. Chris Olah provided essential technical contributions and substantial writing contributions throughout.
Thanks to Kevin Quealy and Sam Greydanus for substantial editing help. Thanks to Colin Raffel, Arvind Satyanarayan, Alexander Mordvintsev, and Nick Cammarata for additional feedback during development.
We’re also very grateful to Phillip Isola for stepping in as acting Distill editor for this article, and to our reviewers who took time to give us feedback, significantly improving our paper.
The photo used to illustrate a sub-manifold in the introduction was taken by Alexandru-Bogdan Ghita.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Carter, et al., "Activation Atlas", Distill, 2019.
BibTeX citation
@article{carter2019activation,
  author = {Carter, Shan and Armstrong, Zan and Schubert, Ludwig and Johnson, Ian and Olah, Chris},
  title = {Activation Atlas},
  journal = {Distill},
  year = {2019},
  note = {https://distill.pub/2019/activation-atlas},
  doi = {10.23915/distill.00015}
}