Research Guide for Neural Architecture Search

[Nearly] Everything you need to know in 2019

From choosing an architecture to training it and experimenting with different hyperparameters, designing neural networks is labor-intensive, challenging, and often cumbersome. But imagine if it were possible to automate this process. That idea, now a reality, forms the basis of this guide.

We’ll explore a range of research papers that have sought to solve the challenging task of automating neural network design.

In this guide, we assume that the reader has designed neural networks from scratch using a framework such as Keras or TensorFlow.

Neural Architecture Search with Reinforcement Learning

In this paper, the authors use a recurrent neural network (RNN) to generate the model descriptions of neural networks. The RNN is trained with reinforcement learning to maximize the expected validation accuracy of the architectures it generates. The method achieves a test error rate of 3.65% on the CIFAR-10 dataset.

The Neural Architecture Search presented in this paper is gradient-based. The proposal rests on the observation that the structure and connectivity of a neural network can be described by a variable-length string. A neural network referred to as the controller is used to generate such a string. The child network specified by the string is then trained on real data, which yields an accuracy on the validation set. This accuracy is used to compute a policy gradient that updates the controller, so architectures that achieve higher accuracies are assigned higher probabilities over time.

The controller is used to generate the architectural hyperparameters of the network. For a convolutional network, for example, the controller predicts the filter height, filter width, stride height, stride width, and number of filters for each layer. Each prediction is produced by a softmax classifier and fed into the next time step as input. Once the controller finishes generating the architecture, a network with that architecture is built and trained.
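To make the loop concrete, here's a minimal REINFORCE-style sketch. The per-decision logits stand in for the RNN controller, and train_child_network is a hypothetical placeholder for building and training the sampled child network; this is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

# Each time step the controller picks one hyperparameter from a small menu.
SEARCH_SPACE = {
    "filter_height": [1, 3, 5, 7],
    "filter_width":  [1, 3, 5, 7],
    "stride":        [1, 2, 3],
}

rng = np.random.default_rng(0)
# Stand-in for the RNN controller: one independent logit vector per decision.
logits = {k: np.zeros(len(v)) for k, v in SEARCH_SPACE.items()}
baseline, lr = 0.0, 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train_child_network(arch):
    # Hypothetical stand-in: build the CNN described by `arch`, train it,
    # and return its validation accuracy. Here we fake a score.
    return rng.uniform(0.5, 0.9)

for step in range(100):
    # 1. Sample an architecture description (the "variable-length string").
    arch, grads = {}, {}
    for name, choices in SEARCH_SPACE.items():
        p = softmax(logits[name])
        idx = rng.choice(len(choices), p=p)
        arch[name] = choices[idx]
        one_hot = np.eye(len(choices))[idx]
        grads[name] = one_hot - p          # grad of log-prob w.r.t. logits

    # 2. Train the child network and measure validation accuracy (the reward).
    reward = train_child_network(arch)

    # 3. REINFORCE update: raise the probability of good architectures.
    baseline = 0.9 * baseline + 0.1 * reward
    for name in logits:
        logits[name] += lr * (reward - baseline) * grads[name]
```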

Since training of the child network could take hours, the authors use distributed training and asynchronous parameter updates in order to speed up the learning process of the controller.

The error rate for this model compared to other models is shown below:

Learning Transferable Architectures for Scalable Image Recognition

In this paper, the authors search for an architectural building block on a small dataset and then transfer the block to a large dataset, since searching directly on a large dataset would be prohibitively expensive and time-consuming.

The authors search for the best convolutional layer (or cell) on the CIFAR-10 dataset and then apply it to the ImageNet dataset by stacking together more copies of this cell, each with its own parameters, to form the convolutional architecture. The authors refer to the result as the NASNet architecture.

They also introduce a regularization technique called ScheduledDropPath that improves generalization in the NASNet models. The method achieves a 2.4% error rate on CIFAR-10, and the largest NASNet model achieves a 43.1% mean average precision on the COCO object detection task.

Like the previous paper, this one also uses the Neural Architecture Search (NAS) framework. In this paper’s proposal, the overall architectures of the convolutional networks are manually preset. They’re made up of convolution cells repeated several times. Each convolution cell has the same architecture but different weights.

The network has two types of cells: Normal Cells, which are convolutional cells that return a feature map of the same dimensions as their input, and Reduction Cells, which are convolutional cells that return a feature map whose height and width are reduced by a factor of two.

In the search space proposed in this paper, each cell receives as input two initial hidden states, which are the outputs of two cells in the previous two lower layers or the input image. The controller RNN recursively predicts the rest of the structure of the convolutional cell given these two initial hidden states.
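To illustrate the manually preset macro-structure, here's a rough Keras sketch that stacks Normal and Reduction cells into a network. The cell bodies below are grossly simplified stand-ins for illustration only, not the learned NASNet-A cells.

```python
import tensorflow as tf
from tensorflow.keras import layers

def normal_cell(x, filters):
    # Simplified stand-in for a learned Normal Cell: keeps spatial dimensions.
    y = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.add([y, layers.Conv2D(filters, 1, padding="same")(x)])

def reduction_cell(x, filters):
    # Simplified stand-in for a learned Reduction Cell: halves height and width.
    return layers.SeparableConv2D(filters, 3, strides=2, padding="same",
                                  activation="relu")(x)

def build_nasnet_like(input_shape=(32, 32, 3), n=2, filters=32, classes=10):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(filters, 3, padding="same")(inputs)
    for _ in range(3):                  # three stages of N repeated normal cells
        for _ in range(n):
            x = normal_cell(x, filters)
        x = reduction_cell(x, filters)  # reduce resolution between stages
        filters *= 2
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_nasnet_like()
model.summary()
```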

Here’s the performance of the Neural Architecture Search on the CIFAR-10 dataset:

Efficient Neural Architecture Search via Parameter Sharing

The authors here propose a method called Efficient Neural Architecture Search (ENAS). In this method, a controller discovers neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained to select a subgraph that obtains the best accuracy on the validation set.

The model corresponding to the selected subgraph is then trained to minimize a canonical cross-entropy loss. Parameters are shared among the child models, which allows ENAS to deliver strong performance while using far less compute than standard NAS. ENAS achieves a 2.89% error rate on the CIFAR-10 test set, compared to 2.65% for Neural Architecture Search (NAS).

This paper improves the efficiency of NAS by forcing all child models to share weights in order to avoid training each child model from scratch to convergence.

The paper represents NAS's search space as a single directed acyclic graph (DAG). Recurrent cells are designed using a DAG with N nodes, where the nodes represent local computations and the edges represent the flow of information between them.

ENAS’s controller is an RNN that decides which computations are performed at each node in the DAG and which edges are activated. The controller network is an LSTM with 100 hidden units.

In ENAS, two sets of parameters are learned: the parameters of the controller LSTM and the shared parameters of the child models. In the first phase of training, the shared parameters of the child models are trained; in the second phase, the parameters of the controller LSTM are trained. The two phases alternate throughout the training of ENAS.
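The alternating scheme can be sketched as follows. Every function here is a hypothetical placeholder standing in for the controller LSTM, the shared child weights, and their update rules; it is meant to show the two phases, not to reproduce the authors' code.

```python
import random

def sample_subgraph(controller_state):
    # Placeholder: the real controller LSTM decides which ops/edges are active.
    return {"ops": [random.choice(["conv3x3", "conv5x5", "maxpool"])
                    for _ in range(4)]}

def train_shared_on_batch(shared_weights, arch, batch):
    # Placeholder for one SGD step on the shared parameters w, using the
    # cross-entropy loss of the sampled child model.
    return shared_weights

def validation_accuracy(shared_weights, arch, valid_data):
    # Placeholder: evaluate the sampled child using the shared weights.
    return random.uniform(0.5, 0.9)

def update_controller(controller_state, arch, reward):
    # Placeholder for a REINFORCE update of the controller parameters theta.
    return controller_state

controller_state, shared_weights = {}, {}
train_batches, valid_data = [None] * 10, None

for epoch in range(5):
    # Phase 1: train the shared child parameters w, controller held fixed.
    for batch in train_batches:
        arch = sample_subgraph(controller_state)
        shared_weights = train_shared_on_batch(shared_weights, arch, batch)
    # Phase 2: train the controller parameters theta, shared weights held fixed.
    for _ in range(20):
        arch = sample_subgraph(controller_state)
        reward = validation_accuracy(shared_weights, arch, valid_data)
        controller_state = update_controller(controller_state, arch, reward)
```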

Here’s how ENAS performs on the CIFAR-10 dataset:

Hierarchical Representations for Efficient Architecture Search

The algorithm proposed in this paper achieves a top-1 error of 3.6% on CIFAR-10 and 20.3% when transferred to ImageNet. The authors introduce hierarchical representations for describing neural network architectures, show that competitive architectures for image classification can be obtained even with simple random search, and present a scalable variant of evolutionary search.

For the flat architecture representation, they examine a family of neural network architectures formed by a single-source, single-sink computational graph that converts the input at the source to the output at the sink. Every node of the graph corresponds to a feature map, and each directed edge is associated with some operation, such as pooling or convolution. This operation transforms the feature map in the input node and passes it to the output node.

For the hierarchical representation, the idea is to have several motifs at different levels of hierarchy, where lower-level motifs are used as building blocks during the construction of higher-level motifs.
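A toy sketch of the idea: a motif is a small graph whose edges are labelled either with primitive operations or with lower-level motifs, and a higher-level motif is flattened recursively into primitives. The specific primitives and motifs below are illustrative, not the paper's exact genotype.

```python
PRIMITIVES = {"identity", "conv1x1", "conv3x3", "maxpool3x3"}

# Level-1 motifs: edges carry primitive ops. Each motif is a list of
# (from_node, to_node, label) triples over a small ordered node set.
level1_motifs = {
    "m1": [(0, 1, "conv3x3"), (1, 2, "conv1x1"), (0, 2, "identity")],
    "m2": [(0, 1, "maxpool3x3"), (1, 2, "conv3x3")],
}

# Level-2 motif: its edges carry level-1 motifs instead of primitives.
level2_motif = [(0, 1, "m1"), (1, 2, "m2"), (0, 2, "m1")]

def flatten(motif, lower):
    """Expand a higher-level motif into primitive operations, recursively."""
    ops = []
    for src, dst, label in motif:
        if label in PRIMITIVES:
            ops.append((src, dst, label))
        else:                       # an edge labelled with a lower-level motif
            ops.append((src, dst, flatten(lower[label], lower)))
    return ops

print(flatten(level2_motif, level1_motifs))
```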

Here’s the error rate for different models on the CIFAR-10 test set:

Progressive Neural Architecture Search

The approach proposed in this paper uses a sequential model-based optimization (SMBO) strategy to learn the structure of convolutional neural networks (CNNs). It builds on the cell-based search space used by Neural Architecture Search (NAS).

In this paper, the search algorithm is tasked with identifying a good convolutional cell, not a full CNN. Each cell contains B blocks, where a block is a combination operator (such as addition) applied to two inputs, each of which can be transformed (for example, via convolution) before being combined. The cell structure is then stacked a number of times depending on the size of the training set and the required running time of the final CNN.
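Roughly, the SMBO procedure grows cells one block at a time, training only the top-K candidates ranked by a learned surrogate predictor. The sketch below captures that loop; the surrogate, the operation set, and the training function are toy stand-ins rather than the paper's predictor and child training.

```python
import itertools, random

OPS = ["identity", "sep_conv_3x3", "max_pool_3x3"]
K, B_MAX = 4, 3

def expand(cell):
    # Append one more block (a pair of ops combined by addition) to a cell.
    return [cell + [(op1, op2)] for op1, op2 in itertools.product(OPS, OPS)]

def train_and_evaluate(cell):
    # Placeholder for building, training, and validating the cell's CNN.
    return random.uniform(0.5, 0.9)

def surrogate_predict(history, cell):
    # Placeholder surrogate: average the scores of previously seen cells
    # that share a block with this candidate.
    scores = [acc for c, acc in history if set(c) & set(cell)]
    return sum(scores) / len(scores) if scores else 0.5

history = []
candidates = expand([])                      # all 1-block cells
for b in range(1, B_MAX + 1):
    scored = [(cell, train_and_evaluate(cell)) for cell in candidates]
    history.extend(scored)
    if b == B_MAX:
        break
    # Expand every evaluated cell by one block, rank by predicted accuracy,
    # and keep only the top-K for real training at the next step.
    expanded = [bigger for cell, _ in scored for bigger in expand(cell)]
    expanded.sort(key=lambda c: surrogate_predict(history, c), reverse=True)
    candidates = expanded[:K]

best_cell, best_acc = max(history, key=lambda t: t[1])
print(best_cell, best_acc)
```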

A cell is converted to a CNN by stacking a predefined number of copies of the basic cell, using either stride 1 or stride 2. The number of stride-1 cells between stride-2 cells is adjusted, with up to N repeats. Average pooling and a softmax classification layer sit at the top of the network.

The figure below shows the performance of the model on the CIFAR test set:

Auto-Keras: An Efficient Neural Architecture Search System

This paper proposes a framework for enabling Bayesian optimization to guide network morphism for efficient NAS. Based on their method, the authors build an open-source AutoML system known as Auto-Keras.

The major building block of the proposed method is to explore the search space by morphing neural architectures, guided by a Bayesian optimization (BO) algorithm. Since the NAS search space is not a Euclidean space, the authors tackle this challenge by designing a neural network kernel function based on the edit distance for morphing one neural architecture into another.

The second challenge of using Bayesian optimization to guide network morphism is optimizing the acquisition function: traditional acquisition optimizers are designed for Euclidean spaces and are not applicable to the tree-structured search produced by network morphism. The authors address this by optimizing the acquisition function directly on the tree-structured space, with the upper-confidence bound (UCB) as the acquisition function.
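The overall loop can be sketched as follows. Everything here is a stand-in: edit_distance, morph, and the kernel-weighted UCB are simplified illustrations of the ideas, not the Auto-Keras implementation or its acquisition optimizer.

```python
import math, random

def edit_distance(a, b):
    # Placeholder for the architecture edit distance used in the kernel.
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))

def kernel(a, b, rho=0.5):
    return math.exp(-rho * edit_distance(a, b))

def morph(arch):
    # Placeholder network-morphism operations: deepen, widen, add a skip.
    return arch + [random.choice(["deepen", "widen", "skip"])]

def train_and_evaluate(arch):
    # Placeholder for actually training the morphed network.
    return random.uniform(0.5, 0.9)

def ucb(arch, history, beta=2.0):
    # Kernel-weighted mean/uncertainty as a crude stand-in for a GP posterior.
    weights = [kernel(arch, a) for a, _ in history]
    mean = sum(w * acc for w, (_, acc) in zip(weights, history)) / sum(weights)
    uncertainty = 1.0 / (1.0 + sum(weights))
    return mean + beta * uncertainty

history = [(["conv"], train_and_evaluate(["conv"]))]   # initial architecture
for _ in range(10):
    # Generate morphed candidates from everything observed so far, pick the
    # one with the highest acquisition value, then train it for real.
    candidates = [morph(a) for a, _ in history for _ in range(3)]
    best = max(candidates, key=lambda c: ucb(c, history))
    history.append((best, train_and_evaluate(best)))

print(max(history, key=lambda t: t[1]))
```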

The Searcher module of this architecture contains the Bayesian optimizer and the Gaussian process. The search algorithms run on the CPU, while the Model Trainer module is responsible for computation on the GPUs.

The Model Trainer trains neural networks on the training data in a separate process, for parallelism. The Graph module processes the computational graphs of neural networks and is controlled by the Searcher module for the network morphism operations. Model Storage is a pool of trained models; since these models are large, they're kept on storage devices rather than in memory.

Below is the performance of the model on different datasets in comparison to other models:

Neural Architecture Search with Bayesian Optimisation and Optimal Transport

This paper proposes NASBOT, a Gaussian-process-based Bayesian optimisation (BO) framework for neural architecture search. This is achieved by developing a distance metric in the space of neural network architectures that can be computed via an optimal transport program.

The authors develop a (pseudo-) distance for neural network architectures called OTMANN (Optimal Transport Metrics for Architectures of Neural Networks) that can be computed efficiently via an optimal transport program. They also develop a BO framework for optimizing functions on neural network architectures called NASBOT (Neural Architecture Search with Bayesian Optimisation and Optimal Transport).

In order to achieve the BO scheme, a kernel for neural architectures is specified, as well as a method to optimize the acquisition function over these architectures. An evolutionary algorithm is used to optimize the acquisition function.

This is achieved by starting with an initial pool of networks and evaluating the acquisition function on them. A set of Nmut mutations of this pool is then generated: first, Nmut candidates are stochastically selected from the set of networks already evaluated, with those that have a higher acquisition value being more likely to be selected; each candidate is then modified to produce a new architecture.

The modifications might change the architecture by either increasing or decreasing the number of computational units in a layer, by adding or deleting layers, or by changing the connectivity of existing layers.

The final step is to evaluate the acquisition function on these Nmut mutations, add them to the pool, and repeat for a prescribed number of steps. In their experiments, the authors compare NASBOT against baselines that include an evolutionary algorithm applied directly to the architecture search, and conclude that NASBOT performs better.
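This evolutionary routine for optimizing the acquisition can be sketched as follows. The acquisition and mutate functions here are placeholders standing in for NASBOT's GP-based acquisition and its layer-level modifications.

```python
import random

def acquisition(arch):
    # Placeholder for the GP-based acquisition value of an architecture.
    return len(set(arch)) + random.random()

def mutate(arch):
    # Placeholder modifications: add/remove a layer or change a layer's width.
    choice = random.choice(["add", "remove", "widen"])
    if choice == "add" or len(arch) <= 1:
        return arch + [random.choice(["conv3x3", "pool", "dense"])]
    if choice == "remove":
        return arch[:-1]
    return arch[:-1] + [arch[-1] + "_wider"]

def evolve_acquisition(initial_pool, n_mut=8, steps=10):
    pool = [(a, acquisition(a)) for a in initial_pool]
    for _ in range(steps):
        # Stochastically pick N_mut parents, favouring high acquisition values.
        total = sum(v for _, v in pool)
        parents = random.choices([a for a, _ in pool],
                                 weights=[v / total for _, v in pool],
                                 k=n_mut)
        # Modify each parent and add the scored children back to the pool.
        children = [mutate(p) for p in parents]
        pool.extend((c, acquisition(c)) for c in children)
    return max(pool, key=lambda t: t[1])[0]

best = evolve_acquisition([["conv3x3"], ["conv3x3", "pool"]])
print(best)
```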

NASBOT performance compared to other models is shown below:

SNAS: Stochastic Neural Architecture Search (ICLR 2019)

The authors of this paper propose Stochastic Neural Architecture Search (SNAS). SNAS is an end-to-end solution to NAS that trains neural operation parameters and architecture distribution parameters in the same round of backpropagation. It maintains the completeness and differentiability of the NAS pipeline while doing this.

The authors reformulate NAS as an optimization problem over the parameters of a joint distribution in the cell search space. A search gradient is used to leverage the gradient information in the generic differentiable loss for architecture search. This search gradient optimizes the same objective as reinforcement-learning-based NAS, but assigns credit to structural decisions more efficiently.

The search space is represented using a directed acyclic graph (DAG), referred to as the parent graph. Nodes x_i represent latent representations, and edges (i, j) represent information flow and the operations to be selected between the nodes.
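SNAS makes the discrete choice of operation on each edge differentiable by sampling from a concrete (Gumbel-softmax) distribution. Below is a small numpy sketch of that relaxation for a single edge, with toy stand-in operations; it illustrates the sampling only, while the real method propagates gradients through it to the architecture parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
OPS = [
    lambda x: x,                      # identity
    lambda x: np.maximum(x, 0),       # stand-in for a conv op (ReLU here)
    lambda x: np.zeros_like(x),       # zero (no connection)
]

def gumbel_softmax(log_alpha, temperature=1.0):
    # Sample a relaxed one-hot vector over the candidate operations.
    gumbel = -np.log(-np.log(rng.uniform(size=log_alpha.shape)))
    logits = (log_alpha + gumbel) / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

def edge_output(x, log_alpha, temperature=1.0):
    # The edge computes a weighted sum of all candidate ops, with weights
    # drawn from the concrete (Gumbel-softmax) distribution.
    z = gumbel_softmax(log_alpha, temperature)
    return sum(z_k * op(x) for z_k, op in zip(z, OPS))

# Architecture distribution parameters for a single edge of the parent DAG.
log_alpha = np.zeros(len(OPS))
x = rng.normal(size=(4, 8))           # a toy latent representation x_i
print(edge_output(x, log_alpha, temperature=0.5).shape)
```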

Here are the classification errors of SNAS and state-of-the-art image classifiers on CIFAR-10:

DARTS: Differentiable Architecture Search

This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner.

Instead of searching over a discrete set of candidate architectures, the authors relax the search space to be continuous. The architecture can therefore be optimized with respect to its validation set performance via gradient descent. The data efficiency of gradient-based optimization enables DARTS to achieve strong performance using far fewer computational resources. The model also outperforms ENAS, and DARTS is applicable to both convolutional and recurrent networks.

The authors search for a computation cell as the building block of the final architecture. The learned cell can be stacked to form a convolutional network, or recursively connected to form a recurrent network. A cell is a directed acyclic graph consisting of an ordered sequence of N nodes. Each node is a latent representation (for example, a feature map), and each directed edge is associated with some operation that transforms the node. A cell is assumed to have two input nodes and a single output node. For convolutional cells, the input nodes are defined as the cell outputs of the previous two layers; for recurrent cells, they're the input at the current step and the state carried over from the previous step. The output of the cell is obtained by applying a reduction operation (e.g., concatenation) to all the intermediate nodes.
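The core of the continuous relaxation is that every edge computes a softmax-weighted mixture of all candidate operations, so the architecture parameters can be optimized by gradient descent alongside the network weights. Here's a small numpy sketch of that mixed operation and the final discretization step; the candidate operations are toy stand-ins.

```python
import numpy as np

OPS = {
    "identity": lambda x: x,
    "relu":     lambda x: np.maximum(x, 0),   # stand-in for a conv op
    "zero":     lambda x: np.zeros_like(x),
}

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, alpha):
    # Weighted sum over candidate operations; weights = softmax(alpha).
    w = softmax(alpha)
    return sum(w_i * op(x) for w_i, op in zip(w, OPS.values()))

def discretize(alpha):
    # After the search, each edge keeps only its strongest operation.
    return list(OPS)[int(np.argmax(alpha))]

alpha = np.array([0.1, 1.2, -0.3])     # architecture parameters for one edge
x = np.random.randn(4, 8)              # a node's latent representation
print(mixed_op(x, alpha).shape, discretize(alpha))
```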

A comparison of this model with state-of-the-art image classifiers on CIFAR-10 is shown below:

Conclusion

We should now be up to speed on some of the most common — and a couple of very recent — techniques for performing neural architecture search in a variety of contexts.

The papers/abstracts mentioned and linked to above also contain links to their code implementations. We’d be happy to see the results you obtain after testing them.

