This blog post looks at the growth of computation, data, and deep learning researcher demographics to show that the field of deep learning could stagnate due to slowing growth. We will look at recent deep learning research papers which run into similar problems but also demonstrate how one could solve them. After discussing these papers, I conclude with promising research directions which face these challenges head-on.
This blog post series discusses long-term research directions and takes a critical look at short-term thinking and its pitfalls. In this first blog post of the series, I will first discuss long-term trends of data and computational power by using trends in computing and hardware. Then we look at the demographics of researchers, and I show that the fraction of researchers who do not have access to powerful computational resources is increasing rapidly. Then we have a look at the core paper of this blog post, “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”, which reveals that more data can improve predictive performance, but that this comes with a rather heavy computational burden. We will also see that, compared to specialized techniques, pre-training on more data is merely on par with respect to predictive performance.
From this, I conclude that more data is only helpful for large companies that have the computational resources to process it and that most researchers should aim for research where the limiting resource is creativity and not computational power. However, I also show that the future holds ever-growing amounts of data, which will make large datasets a requirement. Thus, we need techniques to make it feasible to process more data, but we also need techniques to make deep learning inclusive for as many researchers as possible, many of whom will come from developing countries.
After the discussion of the core paper, we have a look at possible solutions introduced in four recent papers. These articles aim to overcome these long-term trends by (1) making operations, like convolution, more efficient, (2) developing smart features so that we can use small, fast models that yield the same results as big, fat, stupid models, (3) showing how companies with substantial computational resources can use those resources to create research that benefits everyone by searching for new architectures, and (4) solving the problem of ever-growing data by pre-selecting the relevant data via information retrieval.
I will conclude by discussing what place these papers have in the long-term research directions in deep learning.
The Problem of Short-term Thinking in Deep Learning Research
This blog post series aims to foster critical thinking about deep learning research and to encourage the deep learning community to pursue research which is critical for the progress of the field. Currently, an unhealthy hype and herd mentality have gained strong traction in the field of deep learning and, in my opinion, a lot of research is becoming more and more short-sighted. This short-sightedness has mostly to do with competitive pressure from the increasing number of new students entering the field, pressure from our publish-or-perish culture, and pressure from the publish-on-arXiv-before-you-get-scooped mindset, which favors incomplete research that provides quick gains rather than research that advances the deep learning community.
Another problem is that many researchers use Twitter as their primary source for current deep learning research trends, which exacerbates these herd mentality problems. Firstly, it encourages more of the same, that is, doing and thinking about whatever is popular; secondly, it encourages following big players and big names rather than a mix of researchers, which leads to single-mindedness. Twitter is not a discussion forum where one can discuss ideas in depth and come to a conclusion that lets everyone benefit. Twitter is a platform where the big win big, and the small disappear. If the big make a mistake, everybody in the deep learning community is misled. The thing is that the big make mistakes too.
It is like the explore vs. exploit problem: If everybody just exploits, there will be no discoveries, just incremental advancements, more of the same. And I would like to believe that the world needs breakthroughs. AI can help us prosper and solve difficult problems, but only if we choose to explore more.
This blog post is no antidote to all of this, but it aims to nudge you toward analyzing research directions with a more critical eye. I hope you leave this blog post thinking about your own direction and how it relates to the long-term picture that I draw here.
The research trends discussed in this blog post series aim to (1) highlight the important but ignored research on the sidelines of deep learning, or (2) raise problems that make very popular deep learning research evidently short-sighted or naive. I do not try to glorify a rogue mindset here: Being defiant for the sake of being defiant has no merit. I also do not want to say that all major research directions are garbage: Most popular research is popular because it is important. What I want is to help you foster a critical mindset and long-term thinking.
The theme for this blog post is a topic from category (1): deep learning research which is important but all too often goes unnoticed, namely computational efficiency and the problems that come with data. Although this topic is usually ignored, I will analyze trends to outline why it is an important long-term problem that everybody should be concerned about. Indeed, the field of deep learning may stagnate if we do not tackle this problem. After discussing these trends, we look at current research which exposes the core problems of this research direction. Finally, I will discuss four research papers from the past two months which try to address the raised issues.
Long-term trend: Computational Efficiency and the Growth of Data
The key paper of this blog post deals with how more data can improve prediction results. The main problem with this work is that it required 50 GPUs for two months to produce the results. If we ask when GPUs will become fast enough that we could replicate this research on a single computer (4 GPUs) within two months or two weeks, respectively, we would need to wait until the year 2025 or 2029, respectively. This assumes the now outdated computational doubling time of 2 years. With a three-year doubling time, we can expect access to such a system by the year 2029 or 2035, respectively.
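As a back-of-the-envelope sketch (my own arithmetic, not from the paper), these years follow from counting how many hardware doublings are needed to close the speedup gap:

```python
import math

def year_available(speedup_needed, doubling_time_years, start_year=2017):
    """Year when hardware is `speedup_needed` times faster, assuming
    computational power doubles every `doubling_time_years` years."""
    doublings_needed = math.log2(speedup_needed)
    return math.ceil(start_year + doublings_needed * doubling_time_years)

# 50 GPUs for 2 months on 4 GPUs in 2 months: 50/4 = 12.5x speedup needed;
# doing it in 2 weeks instead of ~8.7 weeks needs roughly 4.3x more, ~54x.
print(year_available(12.5, 2))  # 2025 with a 2-year doubling time
print(year_available(54, 2))    # 2029
print(year_available(54, 3))    # 2035
```

The same formula reproduces the single-GPU numbers in the next paragraph if you plug in the correspondingly larger speedup factors.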
Most researchers in developing countries will find it hard to own a single GPU for themselves. Often these researchers also need to share resources with other researchers. If such a researcher could claim at most a single GPU for two months or two weeks, respectively, then given the current growth in computation, this researcher would be able to replicate this research in the year 2029 or 2035, respectively, assuming a doubling period of 2 years. If the doubling period is three years, then these numbers become the year 2035 and 2044, respectively. The problem is that computing growth is expected to slow further [c,d], so these predicted numbers are relatively optimistic.
Which of these examples is more realistic for the future: will the average researcher have access to one or four GPUs? An answer can be found if we look at the growth in researcher demographics and the income of these researchers; in particular, I mean here the growth of deep learning research in China along with its growth in GDP (PPP) per capita.
Growth trends for research quality are unclear [f], but for research quantity, there is rapid growth in China [e]. With current growth rates, that is, 60% per year for the US and 170% per year for China (growth averaged over the period 2012-2016), we can expect 98% of all deep learning research to come from China by 2030. This trend is of course not sustainable. We will probably see a stagnation of growth in China at some point (we see this for the US, but not yet for China), but due to the very supportive government in China, which will pour $150 billion into AI over the next decade, and a science-averse government in the US, it is very likely that China will soon take a very definite lead in AI research. I would not be surprised if more than 80% of deep learning research came from China by 2030. A pessimistic estimate would still be that more than 66% of all deep learning research will be done in China by 2030.
However, with the current GDP growth rate of around 5%, the average Chinese person in 2030 will still have only about 80% of the annual income of an average US person in 2017 [g]. With these numbers, the average researcher will likely not be able to use more than one or two GPUs (or the equivalents of that time). Putting these numbers together, one has to conclude that most research in the future will be severely limited by computational power and thus by algorithmic complexity.
All the numbers above assume the current 2-year or 3-year doubling rates for computing power. While general sentiment remains optimistic, growth in computational power is slowing more and more, and while a doubling rate of 3 years seems more realistic now, even that is not realistic for the future [c, d]. Currently, there exists no technology which will substantially boost computational growth beyond 2019 [c, d]. We have been here before, and we could innovate ourselves out of this slump, but now that we are near the physical limits of what is possible, we can no longer innovate: creativity alone cannot go beyond the atomic level; creativity alone cannot defy quantum effects. Progress in computation will be slow and difficult from here.
Another problem is the sheer growth of data [a,b]. The growth of data is now faster than the growth of computing power. This means we will never again be able to work with as large a percentage of the data as we can now. In the future, the fraction of the data on which we can run our deep learning algorithms will continuously shrink, and there will never be a point where we can utilize all the data that we have. All these trends make computational efficiency critical for long-term progress in deep learning. If our algorithms cannot do more with less computation, the field will stagnate quickly. We need to overcome this problem if we want AI to prosper in the next decades. And the solution has to be algorithms and software; we can no longer rely on hardware for our growth.
While this growth in data is mostly in video [a,b], which has a low knowledge density (the average YouTube video) compared to text data (Wikipedia), one can expect that the overall growth of useful data is also exponential. That means, if 1% of all YouTube videos contain general information useful for a broad audience, then this data will still grow exponentially faster than our computational resources. Over the long term, the growth rate alone is almost everything; the base does not matter much. A base of 1% compared to 100% at a doubling rate of 1.5 years will just lag about 8 years behind in reaching the same amount of information.
To illustrate the sheer size of the data, take this example: Currently, 800 hours of YouTube video are uploaded every minute. If 1% of video is informative, then by 2025 we can expect that for every minute, 800 hours of information-dense video, where every second contains relevant information, will be uploaded. It will be a very difficult problem to extract all the useful information from 800 hours of video within a minute. And remember, never again will the ratio of computation time (1 minute) to useful information (800 hours of video) be smaller than it is now. Data is running away from us, and we will never be able to catch up.
So in short:
- Limitations in computational power will hamper the training of deep learning models on increasingly large datasets.
- By 2030, most deep learning researchers will not have access to much more than the equivalent of 1-4 GPUs per person.
- Data grows faster than computation. Computational problems will continue to worsen from here. We live in the time when we can process the largest fraction of available information, and that fraction will decrease from here bit by bit.
- Growth in computational performance is slowing. It is doubtful this will improve; we are bound by the limitations of physics.
With these insights, we now dive into the core paper of this blog post.
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era; Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta; Google Research + Carnegie Mellon University (CMU); Published: arXiv, 2017-07-10
Train a big convolutional net on the 300 million image JFT-300M dataset as a pre-training step. Then use this pre-trained network to do computer vision tasks such as image classification, object detection, image segmentation, and pose estimation on ImageNet, PASCAL VOC 2012, and Microsoft COCO. Compare the results with a network pretrained on ImageNet (300 times fewer images) or with a network trained on sub-sets of JFT-300M.
Use 50 GPUs and additional parameter servers with Downpour stochastic gradient descent, an asynchronous, parallel gradient descent method, for training. Train standard ResNets (50, 101, 152) (+Faster RCNN for object detection) with hierarchical labels, where the fully connected layers are distributed among parameter servers.
- The number of additional classes in JFT-300M (18k) compared to ImageNet (1k) does not seem to impact performance much. If we use only the ImageNet classes from the JFT-300M dataset for pretraining, the performance appears to be similar to pretraining on all classes.
- The model capacity of smaller networks (especially ResNet-50) is not sufficient to capture all the information in JFT-300M; that is, the performance improvements from pretraining on JFT-300M stagnate with more data if a model with small capacity is used (e.g. ResNet-50 rather than ResNet-152), indicating that high model capacity is needed to utilize all of the data.
- Increases in mean average precision on MS COCO for 10M -> 30M, 30M -> 100M, and 100M -> 300M images are about 3, 4, and 2 points, respectively. In comparison, augmenting Faster RCNN with box refinement, context, and multi-scale testing yields improvements of 2.7, 0.1, and 2.5 points, respectively. This result indicates that specialized enhancements for a deep learning architecture may be as effective as more data.
This paper circulated widely on Twitter. The hype carried a universal message: “The more data we train our deep learning models with, the better our models become, so we need more data!” However, as discussed above, this is not entirely accurate: currently, we do not have enough computational power to make such big datasets practical for the majority of researchers, and this will only worsen in the future (in relative terms; it will improve somewhat in absolute terms). Besides computational power, there is also a problem with the memory requirements of this work. Since the fully connected layers do not fit into GPU memory, they had to be split up in a hierarchical fashion across multiple parameter servers. This is unfortunate since it ensures that nobody but big companies can replicate the results.
The main issue, though, is that the results are not good enough for the effort that went into the work. If one uses specialized techniques for MS COCO (box refinement, context, and multi-scale testing), then one can exceed the results of a network pretrained on 300M images. Another point in favor of such specialized techniques is that their development is bound not by computational resources but by creativity. For most researchers, computational resources are a more limiting factor than creativity: We do not lack ideas, we just lack the GPUs to run all of them. And this will become more and more true as larger fractions of researchers come from developing countries. We would have the most efficient research if we were limited by both our creativity and our computational resources. Our best research should be inclusive and replicable by other researchers, and as such, working with large amounts of data is not the best direction for most researchers to take.
However, even though the sheer size of the data is impractical as of now, it will soon become practical. We as a community should aim to have enough data to keep our standard hardware busy. As our methods and hardware improve, ImageNet-sized datasets will no longer be enough, and we will need to move on to larger datasets similar to JFT-300M. But it is not yet productive to work on these very large datasets, where the challenge is to process the data and run the model.
For most researchers, the problem should be to come up with a good idea and implement its algorithm. The area of “big data problems”, however, has its place in industrial research, which can be a guiding light for questions that can only be answered with the right amount of resources. Academic research should work on methods that can be used by everybody, and not only by the very elite institutions and companies.
The rest of this blog post offers directions and possible solutions to the problems and challenges laid open in this core paper.
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices; Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun; Megvii Inc (Face++); Published: arXiv, 2017-07-04
The authors improve the efficiency of convolutional nets; that is, their model achieves the same classification performance with fewer parameters by using (1) group-wise convolutions and (2) channel shuffle to improve feature sharing between groups.
Group convolution is an operation where you split the channels of the input (Height x Width x Channel: 10x10x9 -> (10x10x3, 10x10x3, 10x10x3)) into groups and compute a separate convolution on each group (one convolution on each 10x10x3 feature map). This means that each of the convolution operations effectively performs convolution on a subset of the input channels. Note that group convolutions operate in different feature spaces since they are split by channel; for example, one group might have specialized features for noses and eyes while the other group has features for ears and hair (a normal convolution would operate on all features: noses, eyes, ears, hair). This reduces predictive performance a bit but saves a lot of computation and memory. Group convolution was first used by Alex Krizhevsky, who used this technique to make it feasible to process ImageNet data with two GPUs.
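To make this concrete, here is a minimal NumPy sketch of a grouped 1x1 convolution (a 1x1 convolution is just a per-pixel matrix multiply, which keeps the example short); the shapes follow the 10x10x9 example above:

```python
import numpy as np

def group_conv2d_1x1(x, weights, groups):
    """Grouped 1x1 convolution on an input of shape (H, W, C_in).

    `weights` has shape (groups, C_in // groups, C_out // groups):
    each group sees only its own slice of the input channels."""
    x_groups = np.split(x, groups, axis=-1)               # 10x10x9 -> 3 x (10x10x3)
    outs = [xg @ wg for xg, wg in zip(x_groups, weights)]  # per-group 1x1 conv
    return np.concatenate(outs, axis=-1)

x = np.random.randn(10, 10, 9)
w = np.random.randn(3, 3, 4)   # 3 groups, 3 in-channels and 4 out-channels each
y = group_conv2d_1x1(x, w, groups=3)
print(y.shape)                 # (10, 10, 12)
```

Note the parameter saving: the three group weight matrices hold 3·(3·4) = 36 values, while a dense 1x1 convolution from 9 to 12 channels would need 9·12 = 108, a 3x reduction. In practice you would use a framework primitive such as PyTorch's `torch.nn.Conv2d(..., groups=3)` rather than writing this by hand.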
Channel shuffle is an operation which seeks to counteract the over-specialization of features in different groups. The biggest problem with successive group convolutions is that a second group convolution will effectively only learn more specialized features from the distribution of the previous layer; for example, group convolution 1 might have eye and nose input features, and the next group convolution will split these further into separate eye and nose features, so that each convolution computes just one feature. Usually, a convolution learns one or more features per channel, but through successive group convolutions, we would eventually learn only one feature per convolution (theoretically at least; practically it is more complicated, but you get the idea).
Channel-shuffle solves this problem by shuffling the channels of all the groups so that each convolution will be exposed to all possible features (noses, ears, eyes, hair) over time.
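The shuffle itself is just a reshape-transpose-reshape, sketched here in NumPy (my own illustration, not the paper's code):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups so the next group convolution
    sees features from every group."""
    h, w, c = x.shape
    x = x.reshape(h, w, groups, c // groups)  # split channels into groups
    x = x.transpose(0, 1, 3, 2)               # interleave the groups
    return x.reshape(h, w, c)

# Channels [0 1 2 3 4 5] with 2 groups: [0 1 2] and [3 4 5].
x = np.arange(6).reshape(1, 1, 6)
print(channel_shuffle(x, groups=2).ravel())   # [0 3 1 4 2 5]
```

After the shuffle, each group of the next group convolution receives one channel from every previous group, which is exactly the feature mixing described above.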
ShuffleNet achieves marginal improvements over other networks designed for mobile devices. It yields a 13x speedup over vanilla AlexNet while keeping performance consistent, and it can achieve the same performance as reference architectures while reducing parameter counts significantly.
While ShuffleNet only yields marginal improvements over existing methods, its main appeal lies in the fact that it relies only on simple, standard operations (group convolution + channel shuffling) which anyone can easily implement in any deep learning framework. This makes this work highly useful for many applications. Good advice would be to exchange regular convolutions for group convolutions by default and fall back to regular convolutions only if group convolutions do not work. This is exactly the type of research that we need: Easy to use for anyone, easy to implement, fast, and with improved performance. Research like this will become immensely important in the future.
The paper was framed as being most useful for mobile devices, which can only run small, fast networks, but this research goes far beyond that. The group convolution + shuffle operations are indeed general, and anyone working with convolution should embrace them.
Natural Language Processing with Small Feed-Forward Networks; Jan A. Botha, Emily Pitler, Ji Ma, … , Slav Petrov; Google; Published: arXiv, 2017-08-01
The task is simple: Predefine a memory budget and try to design multi-layer perceptrons (MLPs) that (1) fit into this budget and (2) achieve the best possible performance on a variety of natural language processing (NLP) tasks. Since the computational expressiveness of MLPs is limited, we need to provide the network with smart, expressive features to do its tasks. If successful, this method will yield architectures which are suitable either for very large amounts of data or for environments where computational resources are limited (mobile phones).
Feature Engineering & Optimizations
1-char gram embeddings, that is, mapping each unique character in a document to a unique vector, have been used successfully in the past, especially with recurrent neural networks. In this work, however, n-char grams are used, specifically 2, 3, and 4 char-grams. If we just take the English alphabet (26 characters), we already have 456,976 combinations of 4-char grams, which breaks the memory constraints set a priori for this work. To prevent this parameter explosion, the authors hash the char grams and group them into buckets via a modulo operation; for example, “hash(char gram) mod 1000” yields an index between 0 and 999, and thus most char grams share parameters with another char gram. This trick has already been applied to out-of-vocabulary words so that different out-of-vocabulary words are mapped to different out-of-vocabulary vectors, but it seems this hashing of n-grams is very efficient and one of the main reasons why the performance of the simple networks is so good.
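A minimal sketch of the hashing trick (the bucket count and hash function are illustrative; I use CRC32 here for determinism, the paper does not specify my exact choices):

```python
import zlib

def chargram_bucket_ids(word, n, num_buckets=1000):
    """Map each n-char gram of `word` to one of `num_buckets` embedding rows
    via hash-mod, so the embedding table size stays fixed regardless of how
    many distinct n-grams occur; colliding n-grams simply share parameters."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    return [zlib.crc32(g.encode()) % num_buckets for g in grams]

print(chargram_bucket_ids("hello", 3))   # three bucket indices: 'hel', 'ell', 'llo'
```

Each returned index selects a row in a fixed-size embedding table, so memory stays within the predefined budget no matter how large the n-gram vocabulary grows.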
Using this method, we hope that the network can differentiate from context which n-char gram was meant. This process is similar to how you differentiate words: “He hit the ball with his bat” vs. “She could see the bat flying in the twilight”. For humans, this works well, and for machines, it seems to work well too.
Interestingly, the authors note that this sharing of parameters also allows the reduction of the embedding dimension to the range of 8-16 dimensions compared to the usual 50-500 dimensions (or even higher for LSTMs).
The embeddings take up a lot of memory, so the authors apply 8-bit quantization to reduce their memory footprint. Quantization is a method where we approximate floating point numbers by integers. A common 8-bit quantization is to normalize the numbers which ought to be compressed into the range of 0 to 1 and then just sort them proportionally into 256 “buckets”; that is, all numbers between 0 and 1/256 are given the label 0, all numbers between 1/256 and 2/256 are given the label 1, and so forth. I developed and reviewed more sophisticated 8-bit approximation methods which are also very useful if you want to accelerate large deep learning networks in GPU clusters. If you are interested in learning more about this topic, you can read my ICLR paper, which summarizes 8-bit approximation methods and how these methods affect the training and parallelization of neural networks.
Just like in my work, the 8-bit embeddings are only used for storage; one de-quantizes them to 32-bit before they are used in computation. Using this compression, the authors demonstrate no loss in predictive performance while reducing the embedding memory footprint by a factor of four.
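The bucketing scheme described above can be sketched in a few lines of NumPy (de-quantizing to bucket midpoints is my choice; the paper may differ in such details):

```python
import numpy as np

def quantize_8bit(x):
    """Normalize to [0, 1] and sort values proportionally into 256 buckets."""
    lo, hi = float(x.min()), float(x.max())
    q = np.floor((x - lo) / (hi - lo) * 256).clip(0, 255).astype(np.uint8)
    return q, lo, hi

def dequantize_8bit(q, lo, hi):
    """Restore 32-bit floats (bucket midpoints) before computation."""
    return ((q.astype(np.float32) + 0.5) / 256) * (hi - lo) + lo

emb = np.random.randn(1000, 16).astype(np.float32)   # a small embedding table
q, lo, hi = quantize_8bit(emb)
restored = dequantize_8bit(q, lo, hi)
print(emb.nbytes / q.nbytes)                # 4.0: the 4x storage reduction
print(float(np.abs(emb - restored).max()))  # error bounded by one bucket width
```

The maximum reconstruction error is at most one bucket width, (max - min) / 256, which is why the embeddings survive this compression with no measurable loss in predictive performance.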
Word clusters have been largely forgotten since the advent of word embeddings, but they contain similar information in that they map similar words to the same cluster. Once these clusters are computed, it is fast and memory efficient to replace all words with their cluster IDs and then work with these cluster IDs instead of the words. In this work, the authors apply the “distributed Exchange algorithm”, a standard algorithm which has been augmented by other Googlers. I have no direct comparison of how this clustering algorithm fares against other clustering algorithms, but I bet that many clustering algorithms would yield similar results.
The authors also apply key-value bloom filter maps to these clusters, that is, a fast and memory-efficient way to find which word maps to which cluster. One could also use standard data structures such as dictionaries, which are faster and easier to use, but the advantage of bloom filters is that they require less memory. Since we have a strict memory budget in this case, bloom filters make sense.
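For intuition, here is a minimal membership-only Bloom filter (the paper uses key-value Bloom maps, which are more involved; this sketch only shows the core memory trade-off: a small bit array instead of a dictionary, at the cost of occasional false positives):

```python
import zlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions set bits in a bit array.
    Membership tests can yield false positives but never false negatives."""
    def __init__(self, num_bits=10000, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)   # ~1.25 KB for 10k bits

    def _positions(self, key):
        # Derive k hash positions by seeding CRC32 with different values.
        for seed in range(self.num_hashes):
            yield zlib.crc32(key.encode(), seed) % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add("cat"); bf.add("dog")
print("cat" in bf, "zebra" in bf)   # True, (almost certainly) False
```

A dictionary must store every key in full; the Bloom filter stores only bits, which is why it fits a strict memory budget.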
The key point to take away from word clusters is that they can enable much faster computation than word embeddings while yielding almost the same or even better predictive performance.
The authors also use other tricks like pre-ordering for part-of-speech (POS) classification, but these tricks are more specific and require more background knowledge which would increase the length of this already very lengthy blog post. Thus I omit these tricks.
They achieve state-of-the-art or near state-of-the-art results on part-of-speech (POS) tagging (which words are nouns, verbs, etc.), language identification (given some text, which language is it?), and segmentation (given a string of characters, what are the boundaries of words? This is important for many Asian languages). The authors could also reduce the required operations and the size of the network considerably while keeping results steady or even exceeding the previous state-of-the-art. On the bottom line, the methods used generally reduce the computational budget and memory footprint by roughly one order of magnitude, that is, about 15x faster computation and 10x less memory.
How can these results be so strong with such a simple, fast, and memory-efficient network? It is because of feature engineering: A large part of the performance is due to word clusters. Hashed n-grams give another good performance boost. Quantization, or lower precision in general, makes things efficient and has nearly no drawback regarding predictive quality. These are the main takeaways from the paper.
The authors demonstrate the strength of ordinary, shallow, feed-forward networks by leveraging new, or rather overlooked, features. Both the predictive results and the computational and memory footprint results are impressive. As data grows, such methods will become more and more important, and this shows us that feature engineering is not dead but may even be required for sharp progress in deep learning or NLP in general.
We have seen word embeddings, which gave rise to deep learning for natural language understanding, but since then we have seen relatively little innovation in features. Sure, we have sentence embeddings now, and other fancy embedding tricks, like the new word2vec that uses sub-words to enrich its representation, which improve performance on most tasks, but, in the end, it seems that it is just more of the same. If we want to progress in the field of NLP, I do not think building more complicated models with the same features is the way to go. We need to innovate the base with which our algorithms work. If we give rocket engineers bad building material, we cannot hope that the rocket they build will reach space. Word embeddings alone are not enough. And we do not need more of the same. If we want to reach space, we need to improve our building material.
Learning Transferable Architectures for Scalable Image Recognition; Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le; Google Brain; Published: arXiv, 2017-07-21
Use neural architecture search (NAS), that is, reinforcement learning, to find the best deep learning architecture. Since neural architecture search is computationally very expensive, multiple simplifications are made:
- Do not optimize the parameters with neural architecture search
- Do not optimize all layers of the network; instead, only search for the general architecture of two different blocks: (1) a feature expansion block and (2) a feature reduction block. Construct the full architecture by repeatedly stacking these blocks.
- Since finding a good architecture on ImageNet data takes too long, perform the search on a smaller dataset (CIFAR10) and then transfer the best block found on CIFAR10 to ImageNet.
- Since doing the search on a few GPUs is still slow, use 450 GPUs instead.
Background: Neural Architecture Search
Use an RNN with a final softmax layer to select each individual layer or operation in the architecture. Algorithm:
1. Predict which hidden layer to use as input, that is, select one hidden layer in either the last or the second-to-last block. Use the output of the chosen hidden layer as input A.
2. Predict which hidden layer to use as input, that is, select one hidden layer in either the last or the second-to-last block. Use the output of the chosen hidden layer as input B.
3. Predict the operation (convolution, max pooling, separable convolution, etc.) for a new hidden layer which processes input A.
4. Predict the operation (convolution, max pooling, separable convolution, etc.) for a new hidden layer which processes input B.
5. Select a merge operation to merge hidden layers A and B.
6. Repeat steps 1-5 K times (K=5 in this paper).
7. Concatenate everything from step 6 into one feature vector.
8. Respect this fixed heuristic rule: Whenever a previous hidden layer operation has a stride of 2 (reduction block), double the convolutional filters in the current hidden layer.
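To make the search space concrete, here is a toy sketch of the block-sampling loop. The paper's controller is an RNN trained with reinforcement learning; I substitute uniform random choices (and illustrative operation names) purely to show the structure of one sampled block:

```python
import random

OPS = ["3x3 sep conv", "5x5 sep conv", "avg pool", "max pool", "identity"]
MERGES = ["add", "concat"]

def sample_block(num_steps=5):
    """Sample one block: at each step, pick two inputs, an operation for
    each, and a merge; the merged result becomes selectable as an input."""
    hidden = ["block_input_A", "block_input_B"]   # outputs of the two previous blocks
    steps = []
    for _ in range(num_steps):                    # K = 5 in the paper
        input_a, input_b = random.choice(hidden), random.choice(hidden)
        op_a, op_b = random.choice(OPS), random.choice(OPS)
        merge = random.choice(MERGES)
        new = f"{merge}({op_a}({input_a}), {op_b}({input_b}))"
        steps.append(new)
        hidden.append(new)                        # new layer becomes a candidate input
    return steps

for step in sample_block():
    print(step)
```

The real controller replaces each `random.choice` with a softmax prediction from the RNN, and the choices are rewarded by the validation accuracy of the resulting network.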
The architecture found by neural architecture search (NAS) is impressive. What stands out is that the network never uses a standard 3×3 convolution, even though it has the possibility to do so. Instead, it always chooses separable convolution over regular convolution. Another interesting bit is the use of average pooling: it is quite a common operation in these cells (or blocks). Another very surprising fact is that the authors tried manually adding residual connections to the final blocks, but these residual connections decreased performance. Since the trained networks are too deep to yield good performance via regular gradient transmission through nonlinearities, this means that the identity connections in the blocks must be better at preserving relevant information than residual connections.
These unusual identity connections hint that the common interpretation of residual and dense (convolutional) connections is somewhat wrong, since relying on inter-block connections alone implies that the gradients do not need to flow to the very end of the network to be useful.
It is unclear why the network chose the identity operations in this way. These connections are critical, so one would expect a pattern that makes sense, but in this case no sensible pattern is apparent. Can you make sense of it?
One thing to note is that the reduction cell (or the feature reduction block) is hierarchical, while the normal cell (or feature expansion block) is not.
Note also that no pooling operation follows another pooling operation, which makes sense since pooling reduces its local neighborhood to its most important features, suppressing the noise from features which are not useful. After this, the density of useful features is high, and another pooling operation would throw away too many good features. So this is exactly the behavior we would expect from a well-trained neural architecture search algorithm.
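A tiny toy example illustrates the point. A first max-pooling pass suppresses noise while keeping the strong activations; a second pass applied right after starts discarding strong activations too (the values below are made up):

```python
def max_pool(xs, size=2):
    """Non-overlapping 1D max pooling: keep the strongest feature per window."""
    return [max(xs[i:i + size]) for i in range(0, len(xs), size)]

# Toy feature map: a few strong activations ("useful features") among noise.
features = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.7, 0.2]

once = max_pool(features)
twice = max_pool(once)

print(once)   # → [0.9, 0.2, 0.8, 0.7]: noise suppressed, strong features kept
print(twice)  # → [0.9, 0.8]: the useful 0.7 activation is now thrown away
```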
The found architecture gives us much to learn about the design of convolutional networks. However, it also raises new questions: Why are residual connections not helpful? Why are the identity connections in this architecture enough?
Another thing which is interesting to me is that these blocks look more and more like a biological neuron. Biological neurons, as studied since 2013, use a hierarchy of convolution and aggregation operations which are then aggregated in the cell body, where they are translated into a combination of neurotransmitters which in turn transmit the features to the next neuron. So this architecture can also be interpreted as a more biological neural network with a few stacked neurons, where each block represents a neuron. If you want to learn more about this modern interpretation of biological neurons, you can read the neuroscience section of my AI vs Neuroscience vs Singularity blog post.
The architecture achieves almost state-of-the-art predictive performance compared to other networks in a slightly restricted parameter setting; only Shake-Shake 26 beats its performance on CIFAR10. For an unconstrained parameter count, the architecture yields state-of-the-art performance on ImageNet while still being more computationally efficient than other networks. Thus the architecture is equally capable of state-of-the-art results and fast training. For small networks, the architecture yields better results than ShuffleNet, but ShuffleNet is even more computationally efficient, especially for very small parameter settings.
This work yields fascinating results and a lot of insight into what makes up a successful architecture. It also defies the common understanding that residual, highway, or dense connections are necessary for good performance on ImageNet. What I find most interesting about this work is that the found block architecture increasingly resembles the architecture of a biological neuron.
This work is a weird mix of impossible for regular labs (450 GPUs required) and enabling for regular labs (the found architecture requires less computation and is more parameter efficient). From an ordinary researcher’s point of view, this is the best possible research that big companies like Google can do. It advances the field and everybody’s knowledge about deep learning in a practical way, while the research itself has impractically high computing resource requirements that can only be satisfied by industry giants in the first place. This type of investigation is probably the optimal research through which industry giants can contribute the most to the field — we need more of this kind of work!
A natural follow-up on this study would be to apply the same methods to natural language understanding and recurrent networks. We have also seen some successful transfer from pre-training on SNLI to many other tasks, and thus training such models on SNLI would be the way to go. The resource requirements, however, are too big for regular labs, so another research direction would revolve around simplifying the method of neural architecture search even further than is done in this paper.
Reading Wikipedia to Answer Open-Domain Questions; Danqi Chen, Adam Fisch, Jason Weston & Antoine Bordes (Facebook AI Research + Stanford); Published: ACL, 2017-08-01
Problems with Traditional Q&A
Current question answering tasks basically consist of search tasks, where one needs to find the relevant passage for a given question within a given paragraph (SQuAD, CNN + DailyMail, bAbI, WikiReading). This is not a very realistic setting, as you usually are not given a paragraph containing the exact answer when you have a question. Thus good models on these datasets do not generalize to other datasets or any realistic setting. Another problem is that these models learn to pick up on the first sentence, which for some datasets contains the answer to 75% of all questions. This means that traditional models fail to answer questions if (1) they are not given the answer, or (2) the answer is not in the first sentence.
When you think about it, this means that these models are quite “stupid”: they cannot generalize and no longer work on data that contains more than a couple of sentences. Recent research comes to the same conclusion: our natural language understanding models do not understand language.
Idea: Use Information Retrieval Methods
We can use information retrieval methods to solve both problems and make our model generalize, that is, we use (a) an information retrieval model to retrieve potential answers, and (b) the traditional model to search for the answer in the paragraphs provided by (a). With this system, we no longer need a paragraph with the answer to resolve a question, and we can also overcome the growing data problem by processing only relevant information which is quickly and accurately found via information retrieval. We indirectly make our model smarter, since now it has to figure out whether a paragraph contains useful information, no information, or even misleading information.
The information retrieval model indexes Wikipedia with bi-gram features, that is, pairs of words are hashed into buckets which form our vocabulary. You can imagine this step in the following way: We assign an (approximately) unique number to each unique pair of words. For each paragraph (or document, or sentence) we then take an empty (all zeros) sparse vector which has the length of our vocabulary, that is, the number of unique word-pairs. Each element in this vector (“v[idx]”) represents a unique word-pair (for example, we might have “Tim Dettmers” -> 67593, so v[67593] represents “Tim Dettmers”). For a given paragraph we can then identify the word-pairs which are present and mark them with a “1” at the index position of that word-pair (v[i] = 1, if the word-pair with index “i” is present). This is called a bag-of-words. Once we have these features, we can calculate the term frequency – inverse document frequency (TF-IDF) value for each word-pair in each paragraph’s vector.
With such features for each paragraph, we can then take the cosine similarity between two vectors, which indicates how similar the two paragraphs are to each other. With this method, we can rank all paragraphs in our database by their similarity to a given paragraph. In our case, we treat the question as a paragraph and look for the paragraphs which are most similar to the content of the question. Once we have indexed our text source (Wikipedia) with this method, we can readily retrieve paragraphs that contain potential answers to any question there is. To find the answer, however, we still need a model that “reads” the paragraph.
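The whole retrieval step can be sketched in a few lines. This is a minimal sketch: plain bigram tuples stand in for the hashed bucket indices, and the toy paragraphs and question are made up for illustration.

```python
import math
from collections import Counter

def bigrams(text):
    """Bag of word-pair (bi-gram) features for one piece of text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def tfidf_vectors(docs):
    """Sparse TF-IDF vector (a dict) for each document."""
    n = len(docs)
    counts = [bigrams(d) for d in docs]
    df = Counter(bg for c in counts for bg in c)  # document frequency per bigram
    return [{bg: tf * math.log(n / df[bg]) for bg, tf in c.items()}
            for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return dot / (norm(u) * norm(v))

paragraphs = [
    "the eiffel tower is in paris and was built in 1889",
    "the great wall of china is visible from low orbit",
    "paris is the capital of france",
]
question = "when was the eiffel tower built"

vecs = tfidf_vectors(paragraphs + [question])  # treat the question as a paragraph
q_vec, doc_vecs = vecs[-1], vecs[:-1]
best = max(range(len(paragraphs)), key=lambda i: cosine(q_vec, doc_vecs[i]))
print(paragraphs[best])  # the Eiffel Tower paragraph ranks highest
```

The real system hashes bigrams into a fixed number of buckets to keep the vocabulary bounded; the ranking logic is otherwise the same.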
The reader model is the model that finds such potential answers in paragraphs, and it is made with a rather standard recipe for such tasks. It consists of a 3-layer bidirectional LSTM with word-by-word attention. This means the word embeddings of the paragraph are fed into an LSTM, which encodes the sentence into a feature representation. The output is then conditioned on the question (embeddings) to determine which outputs (words) are relevant for the question. This conditioned output is normalized and used to re-weight the unconditioned output itself. This is done for every time-step, that is, for every word. This technique is termed word-by-word attention.
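The core re-weighting step of word-by-word attention can be sketched like this. Toy 2-dimensional vectors stand in for the learned LSTM states and question encoding; all values are made up for illustration.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def word_by_word_attention(paragraph_states, question_vec):
    """Score each paragraph word against the question, normalize the scores,
    then re-weight each word's representation by its relevance."""
    scores = [dot(h, question_vec) for h in paragraph_states]
    weights = softmax(scores)
    reweighted = [[w * x for x in h]
                  for w, h in zip(weights, paragraph_states)]
    return weights, reweighted

# Toy "LSTM outputs" for a 3-word paragraph plus a question encoding.
paragraph_states = [[0.1, 0.2], [0.9, 0.8], [0.0, 0.1]]
question_vec = [1.0, 1.0]

weights, _ = word_by_word_attention(paragraph_states, question_vec)
print(weights)  # the second word gets the largest weight
```

In the actual reader, the scoring function is learned rather than a plain dot product, but the condition-normalize-reweight pattern is the same.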
Note that we only need to train the reader model. The information retrieval model does not contain any parameters that we need to train with labels.
Training on Datasets
For traditional datasets like SQuAD, we can just train our reader model on the train data, but the beauty of this method is that you can also train on knowledge graphs via distantly supervised learning.
Such knowledge graphs contain triples like (Donald Trump, president of, the United States), where a relation (president of) links two entities (Donald Trump and the United States). You can also imagine the triple as a graph where the entities (Donald Trump, the United States) are nodes and the relation (president of) is the edge between those nodes. To generate training data for our reader model via knowledge graphs, we can use distantly supervised data.
Distant supervision is a technique where we take knowledge graph triples and use simple string matching to find sentences which contain both entities (Donald Trump and the United States). This assumes that the sentences found in this way actually express the true relation (president of) rather than some other relation (for example born in).
For most entities this assumption holds, but for more common entities it will be broken, since two entities can have multiple relationships at once. Nevertheless, this works really well and allows us to supply our model with training data. In this case, the model needs to complete the triple with only two pieces of information given (X, president of, the United States). The system works on this problem by (1) using information retrieval to find paragraphs similar to the question tuple (president of + the United States) and (2) having the reader model read all sentences to find the right answer (how to complete the triple).
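The string-matching heart of distant supervision is simple enough to sketch directly (the sentences below are invented examples):

```python
def distant_supervision(triple, sentences):
    """Label every sentence that mentions both entities of a knowledge
    graph triple as (noisy) evidence for the triple's relation."""
    head, relation, tail = triple
    return [s for s in sentences
            if head.lower() in s.lower() and tail.lower() in s.lower()]

triple = ("Donald Trump", "president of", "the United States")
sentences = [
    "Donald Trump was elected president of the United States in 2016.",
    "Donald Trump was born in New York City.",
    "The United States has fifty states.",
]

matches = distant_supervision(triple, sentences)
print(matches)  # only the first sentence mentions both entities
```

Note how the second and third sentences are correctly rejected, while a sentence that mentioned both entities with a different relation (say, a visit) would slip through as noise.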
Thus we have a model which can learn to find answers to any question posed in a triple format. This is quite useful since many question answering datasets were created through knowledge graphs. If we can learn to work with knowledge graphs we can learn to work with many question answering datasets.
At the time of the first arXiv submission of this work (March 2017), the reader model achieved state-of-the-art results; by now, however, it is a bit outdated. The information retrieval model is elementary but still performs better than the regular search engine used by the Wikipedia webpage.
The most impressive result from this work is the fact that we can answer SQuAD questions without being provided the answer paragraphs and still get a relatively good accuracy of 27.1%. With the noisy distant supervision data from other tasks (multitask learning), we can do a little better with 29.8%. This is far from the current SQuAD state-of-the-art result of 75%, but here the system of retriever + reader needs to find the answer in the entirety of Wikipedia, rather than in the short paragraph which is usually provided (where the answer is usually in the first sentence). It was also shown that current question answering systems do not really understand questions. This work probably fares better, since the model needs to learn to deal with spurious information and so will likely generalize better than traditional models. In that sense, this 29.8% system might be superior to the 75% model.
The analysis shows that most of the performance is lost due to false positives in the information retrieval step, that is, many paragraphs found by the retriever are not relevant. The authors note that this could be due to the rather ambiguous questions in SQuAD.
The model also generalizes to many other question answering tasks, which is not possible for models trained on SQuAD alone. The achieved generalization hints that the presented approach might also be a good direction for multitask learning, where we learn to solve multiple tasks with one model (or one system).
This is very important work. As we have seen from the discussion in the introduction, we are at a point where more and more data will exist over time, and we will be able to process a smaller and smaller fraction of it. If we want to use computationally expensive models, such as deep learning models, then the only solution is to search for relevant information before we process it. The information that we pass to our deep learning model must be highly condensed and highly relevant. The best and fastest techniques for searching for relevant information are standard information retrieval methods.
This work demonstrates that the combination of deep learning and information retrieval works well and generalizes to open domain questions. The work also shows that this is a viable approach to multitask learning. There are of course many problems, and the predictive performance is not optimal as of now, but this hardly matters since the long-term importance of this research is evident.
From the core paper, we have seen that more data improves predictive performance, but also that these improvements, although general, are only marginal compared to improvements gained through advanced specialized techniques. We also saw that the computational requirements for such large datasets are too high for most researchers right now, and they will be even more constraining for researchers in the future. The results of the core paper are meaningful, but they hint that the most cost- and time-effective way to push the field forward is not yet through very large datasets which require dozens of GPUs and months of computation time.
The time for large datasets will come though, and we need to make sure that our methods scale once we get there. We not only do research for the sake of doing research; many of us do this work in order to develop solutions to significant problems of tomorrow. Thus it is critical that we try to find ways in which we can scale more efficiently and make deep learning practical on large scales. The supplementary papers gave hints of what this could look like.
Big Research Labs: Finding Computationally Efficient Architectures
Some of these solutions, like neural architecture search (NAS), are impractical on their own, since one needs hundreds of GPUs to do this research efficiently. However, once such studies are completed, they provide a huge benefit to all other researchers in their regular work. By leveraging the insights gained from such research, we can figure out even more efficient designs. For example, the takeaway right now is pretty much “stop using regular convolution and switch to separable convolution” — a practical insight which will give a good boost to the entire community.
If we provide NAS with new kinds of operators (group convolution + shuffle?), it might yield new architectures which give even more insight into how to proceed with architectural design. Over time, we can thus find more and more efficient architectures which will help us tackle monster datasets like JFT-300M. It will also enable future deep learning processing on mobile devices and improve our predictive performance further.
I am sure we will see improvements to NAS that make it more feasible for large labs to experiment with it, but even then this should remain a preferred research direction only for large companies and maybe some very large research labs.
In general, beyond NAS, finding new architectures is an important research direction. It is so important because it affects the entire community. However, this direction often requires more computational resources and might not be suitable for small labs.
Startups and Poorly Funded Labs: Finding Computationally Efficient Operations
ShuffleNet reminds us that simple changes in operations ― which in this case can be done easily with any deep learning library ― can have large effects and establish a useful and practical method for the entire deep learning community. This is a very solid contribution, and I think this direction is best suited for AI startups which need to push their performance on mobile platforms. This will be an important step toward making research on large datasets practical. This research direction is also feasible for any research lab which is constrained by computational resources.
This research direction also includes low-bit computation such as XNOR-Net and others. Such work will become more and more important as GPUs allow for 8-bit operations at the hardware level [h] or for specific 16-bit deep learning operations [i].
In general, any lab can do this kind of research, but it has the most synergy in labs which are either constrained by computational resources or whose products are computationally constrained.
Language Researchers: Computational Efficiency Through Features
We also saw that plain, old, boring multi-layer perceptrons can achieve state-of-the-art results if we provide them with the right features. In general, natural language understanding (NLU) has neglected feature engineering for a long time.
In the past, models were not as important as features. One would carefully hand-engineer linguistically motivated features and thus achieve good results with simple models. With the advent of word embeddings, along with GPUs which provided the required computational horsepower, we focused on model architectures instead. And we rightly did so: With the computational power that we have now, we can have the computer do the feature engineering for us through feature learning.
The problem with that is that feature learning depends heavily on the input data itself. If you take a deep learning speech recognition network and give it raw waveform data, it will likely fail to train on that data; if you provide the same network with Fourier-transformed data, it will work well and be easy to train.
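A toy example makes this concrete: in the raw waveform, the pitch of a pure tone is smeared over every sample, but after a Fourier transform it collapses into a single, easy-to-learn feature. A naive DFT is used here to stay self-contained; real systems use FFTs over short windows.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive discrete Fourier transform: one magnitude per frequency bin."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

# Raw "speech": a pure 5-cycle sine wave sampled at 64 points. In the time
# domain the pitch information is spread over every single sample.
n = 64
raw = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]

spectrum = dft_magnitudes(raw)
peak = max(range(len(spectrum)), key=lambda k: spectrum[k])
print(peak)  # → 5: the dominant frequency is now a single, obvious feature
```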
Simple models with the right features will beat big models with wrong features anytime. So instead of designing the next new architecture, it might be much more impactful to try to design the next new feature. I believe that especially in natural language understanding this is an excellent research direction.
One might say, how is this different from old school natural language processing (NLP) research? Is this not a step back?
If it is done in the wrong way, I think it is. The key ingredient is to exploit what we now have: feature learning via deep learning. We are no longer required to have very precise, linguistically motivated features that do all the magic. It is sufficient to come up with a representation which compresses a particular type of information (while throwing away other information); deep learning will do the rest. This was not possible with old-school NLP feature engineering. I think this is how we can leverage the advantages of both worlds and push NLP forward in a way which would not have been possible with old-school NLP research or deep learning research alone.
Language and Video Researchers: Sifting Through Data
We have seen that data grows faster than computation, which means that at no point in the future will we be able to process all useful data. The flood of data means we need to pick some of the data to process — we cannot process all of it. But how do we select the right data? We have seen that information retrieval might be a good answer.
I think this is a very important research direction, not only because the growth of data is inevitable, but also because it lets us let go of the awful question answering tasks that we currently have. It is hard to say if current state-of-the-art algorithms do anything more than “fancy string matching”, and it is not clear why we should care about a task where we are given the answer to a question from the get-go.
This research is mainly relevant for language researchers, since information retrieval systems are primarily built for text, but systems for vision are also imaginable. For example, one could compress a YouTube video by labeling every 5th frame with a small network and indexing the labels with information retrieval methods; one could then search for information by (1) retrieving possible video segments and (2) applying video recognition to those segments. The same approach would work for the subtitle information of videos. Want to answer a question which requires visual information? Search the subtitles, then process the relevant video segments in depth.
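Such a pipeline could be sketched roughly as follows. Everything here is hypothetical: `classify_frame` stands in for a small labeling network, and plain label strings stand in for real frames.

```python
def classify_frame(frame):
    """Placeholder "network": frames here are already strings of labels."""
    return frame

def build_index(frames, stride=5):
    """Map each label to the timestamps of the frames in which it occurs,
    labeling only every `stride`-th frame to keep indexing cheap."""
    index = {}
    for t in range(0, len(frames), stride):
        for label in classify_frame(frames[t]).split():
            index.setdefault(label, []).append(t)
    return index

def retrieve(index, query_label):
    """Step 1 of the pipeline: find candidate segments; a video recognition
    model would then process only these segments in depth."""
    return index.get(query_label, [])

# Toy video: one "frame" per timestamp.
frames = ["street car", "street car", "street", "dog park", "dog park",
          "dog park", "dog", "dog", "street", "street car"]
index = build_index(frames)
print(retrieve(index, "dog"))  # → [5]: the segment to process in depth
```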
 Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. arXiv preprint arXiv:1705.02364.
 Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. ACL 2017.
 Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016, October). Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (pp. 525-542). Springer International Publishing.