A New Perspective on NAS: Graph Neural Networks Drive Universal Architecture Space, Hybrid Convolutional and Transformer Performance Leaps!


The authors propose a universal neural architecture space (UniNAS), a general search space for Neural Architecture Search (NAS) that unifies convolutional networks, transformers, and their hybrid architectures within a single, flexible framework. UniNAS allows for discovering novel architectures and analyzing existing ones within this universal framework.

The authors also propose a new search algorithm to traverse the proposed search space and demonstrate that this space contains interesting architectures that outperform state-of-the-art hand-designed architectures when adopting the same training settings.

Finally, the authors introduce a unified toolkit with standardized training and evaluation protocols to promote reproducibility and enable fair comparisons in NAS research. Overall, this work paves the way for systematically exploring the full spectrum of neural architectures through a unified, graph-based NAS perspective.

1. Introduction

Despite the undeniable success of Neural Architecture Search (NAS) in identifying optimal hyperparameter configurations for predefined architectures [37, 51] or in improving inference latency on edge devices [15, 36], to the best of the authors' knowledge, it has not yet produced a novel network architecture that significantly outperforms popular, hand-designed network architectures based on ResNet [13, 34, 35] or Vision Transformer [7, 10, 39].

As with many areas of computer vision research, progress in NAS has been largely driven by available benchmarks. Consequently, in recent years, most NAS methods [11, 17, 31, 42] have focused on tabular NAS benchmarks from the NAS-Bench family [9, 46]. A tabular benchmark is a fixed dataset of network architectures where the accuracy and other parameters of each network have been pre-computed, so NAS algorithms can obtain their accuracy without training the network. Although tabular datasets facilitate NAS research by reducing the required computational cost, by definition, they also prevent NAS methods from discovering new, better architectures—the best architectures in the search space are already known and can be found simply by consulting pre-computed accuracy tables. Furthermore, the search spaces of these tabular datasets are often relatively small and confined to repeating the same building blocks multiple times, due to the need to train each network at least once during benchmark creation.

Thanks to the emergence of training-free NAS methods [5, 19, 22, 23, 40, 48], it is now possible to estimate the accuracy of network architectures without training them, thereby eliminating computational constraints for NAS and enabling the exploration of larger search spaces and novel architectures. The search spaces introduced for this purpose, namely Zen-NAS [23] and AutoFormer [3], are created by varying the hyperparameters (such as depth, width, or expansion ratio) of selected hand-designed architectural blocks, which are MobileNetV2 [33] and Vision Transformer [10], respectively. Since each search space is created by varying the hyperparameters of a given architectural block, using these search spaces still prevents the discovery of networks with different topologies or building blocks. Consequently, hand-designed architectures (such as CoAtNet [7]) still outperform the best-known architectures in the AutoFormer search space [2].

In this paper, the authors aim to address the aforementioned limitations by introducing a novel search space called Universal Neural Architecture Space (UniNAS). This search space is designed in a general manner, without favoring any specific existing architecture, while ensuring that all state-of-the-art hand-designed architectures are encompassed within it. Therefore, this search space not only allows for the exploration of various combinations of existing architectures and entirely new architectures but also permits a systematic analysis of state-of-the-art networks, their topologies, and design choices within a unified framework.

Furthermore, the authors propose a new architecture search algorithm to traverse the proposed UniNAS search space. When combined with the state-of-the-art training-free NAS Agent [40], after a few steps of traversing the search space, the authors discovered a novel network architecture—UniNAS-A—which outperforms current state-of-the-art architectures under the same training protocol and size constraints, indicating that this search space contains interesting architectures worthy of further exploration.

Last but equally important, the authors provide a toolkit (as an easy-to-install Python package) that allows traversing the proposed search space, creating PyTorch network modules for any point in the search space, and most importantly, providing consistent and explicit training protocols and training code to create and evaluate final models. This is extremely important for reproducibility and fair comparison of NAS methods, because although authors usually report final accuracy on standard datasets (like ImageNet), they use different training schedules, different hyperparameter settings and data augmentation techniques, or they even use an existing larger model pre-trained on significantly more data as a teacher network—making fair comparison between proposed architectures almost impossible.

In summary, the authors make the following contributions to the community:

  1. The authors propose a novel universal NAS search space called UniNAS. This search space is designed in a general manner, thus encompassing many novel network topologies, while also containing known hand-designed architectures, thereby allowing for systematic analysis of network topologies within a unified framework.

  2. The authors introduce a new search algorithm to traverse the proposed design space and demonstrate that this space contains interesting novel architectures—such as UniNAS-A—which outperform state-of-the-art hand-designed architectures on multiple tasks (classification, detection, and segmentation).

  3. The authors provide a unified toolkit, including standardized training and evaluation protocols, to aid in the adoption of the proposed universal search space, enhance reproducibility of NAS methods, and facilitate future NAS research.

2. UniNAS

This section introduces the UniNAS search space, beginning with the UniNAS block as its core component, followed by the overall network architecture, and finally highlighting key differences and improvements through comparison with existing search spaces.

A. UniNAS Block

a) Basic Operations: The authors' goal is to create a search space that spans multiple different architectures. Therefore, the authors adopt a generic setting in the definition of the UniNAS block to achieve maximum flexibility during architecture search, while retaining practical constraints compatible with modern hierarchical networks. Specifically, a UniNAS block is defined as an arbitrary directed acyclic graph (DAG) with one input node and one output node, where each intermediate node represents one basic operation. Basic operations include convolutional layers (depth-wise separable, point-wise, standard, etc.), pooling, masking, normalization elements, various forms of matrix multiplication and dot products, operations with multiple input or multiple output edges, and non-linear functions (see Table 1). This allows the block to represent diverse local computational patterns while maintaining consistent interface properties.
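To make this DAG formulation concrete, the following minimal Python sketch shows one possible way to encode such a block as a graph of named basic operations; the class, field, and operation names are illustrative assumptions, not the authors' toolkit API.

```python
# Minimal sketch of a UniNAS-style block as a DAG (illustrative, not the authors' API).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    op: str                                           # basic operation name, e.g. "conv3x3"
    inputs: List[str] = field(default_factory=list)   # ids of predecessor nodes

@dataclass
class UniNASBlock:
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add_node(self, node_id: str, op: str, inputs: List[str]) -> None:
        self.nodes[node_id] = Node(op=op, inputs=inputs)

# A tiny residual-style block: one input node, one output node, basic ops in between.
block = UniNASBlock()
block.add_node("in", op="identity", inputs=[])
block.add_node("conv", op="conv3x3", inputs=["in"])
block.add_node("norm", op="batchnorm", inputs=["conv"])
block.add_node("act", op="gelu", inputs=["norm"])
block.add_node("out", op="add", inputs=["act", "in"])  # residual sum closes the DAG
```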

While most basic operations in Table 1 are fundamental and thus require no further explanation, two more complex operations deserve specific mention: Matmul1 and Matmul2, which multiply two and three input tensors, respectively. Both represent a multiplicative relationship, but unlike element-wise multiplication they combine information across all channels, and when combined with Softmax they can produce attention mechanisms. The authors again note, however, that the variability goes far beyond this: any graph is possible, only dimensions matter.
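To illustrate how matrix-multiplication nodes combined with Softmax can produce an attention pattern, here is a hedged PyTorch sketch of single-head self-attention; the exact definitions of Matmul1 and Matmul2 from the paper are not reproduced, so the scaling and naming below are our assumptions.

```python
import torch
import torch.nn.functional as F

def attention_from_matmuls(q, k, v):
    # q, k, v: (batch, tokens, channels)
    scores = torch.matmul(q, k.transpose(-2, -1))             # first matmul: mixes channel information
    weights = F.softmax(scores / q.shape[-1] ** 0.5, dim=-1)  # Softmax turns scores into attention weights
    return torch.matmul(weights, v)                           # second matmul: weighted aggregation

x = torch.randn(2, 16, 64)
out = attention_from_matmuls(x, x, x)   # self-attention when q = k = v
print(out.shape)                        # torch.Size([2, 16, 64])
```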

Table 1: UniNAS Block Basic Operations. The authors report the number of parameters (Params) and floating-point operations (FLOPs) for a single forward pass with an input size of X. Some nodes change shape and/or operate in multi-input or multi-output mode, where X denotes X tensors of the same shape.

b) Block Computation Graph: The only constraints the authors impose on this general formulation are: 1) the input and output dimensions of the block remain the same, and 2) dimensions match between adjacent nodes within the DAG to ensure tensors propagate correctly through the graph. This constraint simplifies integration into larger architectures, where consistent feature map shapes across blocks significantly reduce the complexity of dynamic shape handling during forward propagation and enable stable training across various searched blocks. By enforcing consistent dimensions, the authors avoid the need to automatically insert extra projections, which could otherwise interfere with the analysis of the searched topologies.

The authors emphasize that many commonly used modules in modern deep learning architectures, such as the residual blocks of ResNet [13] and its variants, self-attention layers with or without relative positional biases in transformer-based models [7, 10], squeeze-and-excitation modules for adaptive channel recalibration [16], and the inverted mobile bottleneck structure used in EfficientNets [34, 35], can all be represented as instances of the authors' UniNAS block. Examples of how these modules can be crafted into the UniNAS block format, including precise graph structures and node operations, can be seen in Figure 3.

Figure 3: Examples of modules crafted into UniNAS block format

However, the authors encourage readers not to be limited to these classic examples; one can certainly imagine more diverse networks. The search space includes not only simply stacking convolutions and non-linearities in a chain but also covers tree-like structures, parallel paths with selective attention merging, and hybrid combinations of convolutional and self-attention layers, all encapsulated as a UniNAS block. This demonstrates the expressive power of the block design while maintaining a unified and coherent representation across the entire search space.
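As one concrete instance of the modules mentioned above, a squeeze-and-excitation unit can be written purely as a chain of basic operations (global pooling, point-wise convolutions, non-linearities, element-wise multiplication), exactly the kind of graph the UniNAS block admits. The PyTorch sketch below is a standard SE implementation given for illustration, not code from the authors' toolkit.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation expressed as a chain of basic operations."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # global average pooling node
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)  # point-wise convolution
        self.act = nn.ReLU()
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)  # point-wise convolution
        self.gate = nn.Sigmoid()

    def forward(self, x):
        s = self.gate(self.fc2(self.act(self.fc1(self.pool(x)))))
        return x * s                                              # element-wise multiplication node

x = torch.randn(1, 64, 14, 14)
print(SqueezeExcite(64)(x).shape)   # torch.Size([1, 64, 14, 14]); input shape is preserved
```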

B. UniNAS Network

The final network structure sequentially stacks different UniNAS blocks into a single chain (see Figure 2). This follows the general design of current state-of-the-art networks, with a hierarchical backbone similar to [7, 13, 35, 39]. The stem stage (S0) downsamples the input using convolutional layers, followed by multiple stages, each containing several different UniNAS blocks, and finally a classification head composed of global average pooling and fully connected layers. Since UniNAS blocks preserve dimensions, at the beginning of each stage the spatial dimensions are reduced by standard max pooling with stride 2, while the number of channels is increased by a 1x1 channel projection. As mentioned earlier, the structure of UniNAS blocks (i.e., the graph representation) varies throughout the network, enhancing the network's topological variability. The number of stages, the number of blocks, and the spatial and channel dimensions serve as scaling hyperparameters, allowing the exploration of different network scaling patterns.

Figure 2: UniNAS Network Structure
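The hedged PyTorch sketch below mirrors the layout just described: a convolutional stem, stages of shape-preserving blocks separated by stride-2 max pooling and a 1x1 channel projection, and a pooling-plus-linear head. The block internals are simple placeholders for searched UniNAS blocks, and all names and default widths are our assumptions.

```python
import torch
import torch.nn as nn

def make_network(num_classes=1000, widths=(96, 192, 384, 768), depths=(2, 3, 5, 2)):
    stem = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU())   # stem stage S0
    stages, in_ch = [], 64
    for width, depth in zip(widths, depths):
        layers = [nn.MaxPool2d(2),               # spatial downsampling, stride 2
                  nn.Conv2d(in_ch, width, 1)]    # 1x1 channel projection
        for _ in range(depth):
            # placeholder for a searched UniNAS block (input shape == output shape)
            layers.append(nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.GELU()))
        stages.append(nn.Sequential(*layers))
        in_ch = width
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, num_classes))
    return nn.Sequential(stem, *stages, head)

print(make_network()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```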

Table 2: NAS Search Space Comparison

UniNAS covers the widest variety of networks; it does not restrict block topology and allows for the exploration of novel architectures. The authors also report the classification accuracy on ImageNet-1k for the best-known network in each space—when reported results for a comparable training setting were not found, the authors trained the best-known architecture of the given search space using the same training protocol as in Table 3 (indicated by †).


C. Comparison with Existing Search Spaces

Existing search spaces suffer from two main limitations: restricted topological variability and poor scalability. Consequently, despite extensive prior research in this field, the best-performing models found in these spaces still underperform compared to architectures obtained using UniNAS (see Table 2).

DARTS methods, such as those in [26, 49], rely on weight-sharing supernets, which are computationally expensive and produce biased gradient estimates, leading to unreliable architecture rankings. When extended to transformers [4], the cost of self-attention layers forces these search spaces to be severely limited—often simplifying the search to a binary choice of "use attention or not." Even with these limitations, DARTS-based networks remain confined to small-scale settings, with ImageNet-1k accuracy below X (Table 2).

Benchmarks such as NAS-Bench [9, 46] face a fundamental scalability issue: the number of possible networks grows exponentially with the number of operations, making exhaustive training feasible only on toy spaces. Therefore, these benchmarks are limited to simplified convolutional networks evaluated on small datasets like CIFAR [18] or ImageNet16-120 [6]. On ImageNet-1k, either no results exist, or the reported performance is far below state-of-the-art. Since the architecture space is limited and has been fully explored, there is no further room for improvement.

Zen-NAS [23] and AutoFormer (V1/V2, also known as S3) [3, 4] restrict the search space to networks with repeated MobileNetV2 blocks [33] or Vision Transformer blocks [10, 27]. The resulting architectures differ only in hyperparameters, such as expansion ratios, number of channels, or number of heads. However, it is currently impossible to fairly compare the top models reported in these spaces, as they are trained through knowledge distillation from a larger teacher model and use substantially more data [19]. Therefore, it is unclear whether the reported performance improvements are due to the search itself or the distillation process. In fact, when distillation is applied, [12] achieves higher accuracy with a standard EfficientNet-B2 [34] under the same FLOPs and training budget. Similarly, when trained under the same parameter budget, the hand-designed transformer in [43] outperforms AutoFormer's best transformer [2, 3] on ImageNet-1k by X.

In summary, there are two main areas for improvement: network diversity and fair reproducible comparison. Both problems are addressed by UniNAS. 1) The flexibility of UniNAS allows for truly topology-aware architecture search, where attention mechanisms can be inserted at selected locations, combined with convolutions, or completely replaced in certain branches, a flexibility previously unfeasible in frameworks limited to a single operation choice; 2) UniNAS allows for fair comparison of network topologies because it covers all mentioned spaces as well as topologies such as ResNet [13], EfficientNet [34], ViT [10], and CoAtNet [7] (see Figure 3). Therefore, it enables direct comparison of their features and accuracy, both among themselves and with novel architectures, within a single framework.

Unlike previous search spaces, UniNAS makes it feasible to use different blocks in each stage (see [28] for a critique of the NAS-Bench framework, which notes that the plain model parameter count has the highest predictive strength for estimating final network accuracy there, thereby discouraging researchers from exploring new and interesting topological structures).

3. Architecture Search

In this section, the authors propose an algorithm for finding networks within the UniNAS search space based on given criteria. Specifically, the authors assume that design choices such as the number of stages, number of blocks, and base channel dimensions are known, and constraints in the form of Params and FLOPs boundaries are provided. The authors' goal is then to find a network that maximizes a specific objective (e.g., highest accuracy) under the aforementioned constraints.
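Stated compactly, and using the block-graph notation introduced in the next paragraph, the search problem can be written as the following constrained maximization (the formulation is ours, with f denoting the chosen objective, e.g. a training-free accuracy proxy):

\[
\max_{G = \{G_1, \dots, G_n\}} f(G)
\quad \text{s.t.} \quad
\mathrm{Params}_{\min} \le \mathrm{Params}(G) \le \mathrm{Params}_{\max},
\qquad
\mathrm{FLOPs}_{\min} \le \mathrm{FLOPs}(G) \le \mathrm{FLOPs}_{\max}.
\]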

a) Search Steps: More formally, any UniNAS network is identified by a sequence of graphs G = {G1, ..., Gn}, where n is the total number of blocks. This allows the authors to formulate UniNAS search as a graph-based algorithm involving the addition and elimination of nodes. To ensure that the search efficiently traverses 1) only networks within the network size constraints, and 2) only feasible graphs (in terms of node dimensions and parity), the authors adopt the following two-part approach.

1) The authors associate each node v with its number of trainable parameters (Params) and approximate floating-point operations (FLOPs). These values can be easily obtained as a function of the input tensor shape for each basic operation node (see Table 1), allowing the overall network cost to be updated efficiently after adding or removing a given node.

The authors note that the Params value for each basic operation is straightforward to calculate. However, estimating FLOPs is not trivial, and the authors choose to include all multiplications and additions for each operation in its final FLOPs value. This is done to assign a non-zero cost to all operations, preventing uncontrollable complexity divergence. Consequently, this value may differ from values returned by runtime FLOPs estimators (such as PyTorch profiler), and furthermore, this value may also vary depending on the underlying hardware.

2) The authors define feasible node addition and elimination operations for the search space. These operations ensure that a UniNAS block remains a feasible computation graph after a single search step, while still allowing free exploration of the large UniNAS space. Recall that graph feasibility is determined by the input/output node shapes in Table 1. It suffices to ensure that: a) RelPosBias is added only if the input spatial dimension has an integer square root, b) Chunk2 and Chunk3 are added only if the channel dimension is divisible by 2 or 3, respectively, c) ConvRed4 is added only if the channel dimension is divisible by 4, and d) nodes that change dimensions and/or have multiple outputs are added together with their coupled nodes; when eliminating, the branch between these two nodes is eliminated as well. A minimal sketch of this bookkeeping is shown below.
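The following Python sketch illustrates the cost and feasibility bookkeeping just described; the cost formula and the rule set are illustrative stand-ins for the entries in Table 1 and rules a)-d), not the authors' implementation.

```python
def conv3x3_cost(c_in, c_out, h, w):
    """Illustrative per-node cost: parameters and FLOPs of a 3x3 convolution."""
    params = c_out * (c_in * 9 + 1)      # weights + bias
    flops = 2 * params * h * w           # count both multiplications and additions
    return params, flops

def addition_feasible(op, d_params, d_flops, channels, spatial, totals, bounds):
    """totals/bounds: dicts with 'params' and 'flops' (current network vs. search maximum)."""
    # dimension/parity rules analogous to a)-d) above
    if op == "relposbias" and int(spatial ** 0.5) ** 2 != spatial:
        return False
    if op == "chunk2" and channels % 2:
        return False
    if op == "chunk3" and channels % 3:
        return False
    if op == "convred4" and channels % 4:
        return False
    # budget rule: adding the node must keep the network within the search boundaries
    return (totals["params"] + d_params <= bounds["params"]
            and totals["flops"] + d_flops <= bounds["flops"])

d_params, d_flops = conv3x3_cost(96, 96, 14, 14)
print(addition_feasible("conv3x3", d_params, d_flops, channels=96, spatial=196,
                        totals={"params": 20e6, "flops": 15e9},
                        bounds={"params": 27e6, "flops": 20e9}))   # True
```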

Algorithm 1: UniNAS Search Step

Require: a UniNAS network, i.e., a sequence of graphs G; a list of possible node types (see Table 1, including node costs FLOPs and Params); search boundaries FLOPs_min, FLOPs_max, Params_min, Params_max; elimination probability p_eliminate; maximum number of attempts max_attempts.

1: while current_attempts < max_attempts do
2:   Randomly select a graph Gi and a node v
3:   Sample p uniformly from [0, 1]
4:   if p < p_eliminate then
5:     {Node elimination}
6:     Identify the minimal subgraph S between v and its coupled node (empty for nodes with a single output and no dimension change)
7:     Compute the change in FLOPs and Params after the potential elimination of S
8:     if the change is feasible with respect to the FLOPs and Params boundaries then
9:       Eliminate S
10:      break
11:    end if
12:  else
13:    {Node addition after v}
14:    Compute the change in FLOPs and Params after the potential addition of new_node and its coupled node (empty for single-output nodes that do not change dimensions)
15:    if the change is feasible with respect to the FLOPs and Params boundaries then
16:      Add new_node and its coupled node
17:      break
18:    end if
19:  end if
20:  current_attempts++
21: end while

Finally, the authors construct the (single) search step. In each step, a UniNAS block is selected for adjustment, and the algorithm decides whether to add or eliminate a node. If a node is added, it is inserted after a randomly selected node, provided this is feasible according to 1) and 2) above. The parameter p_eliminate, which governs the add/eliminate decision, is usually set below 0.5: the authors prefer to add nodes first and eliminate later, because for some node types elimination also removes the entire branch between that node and its coupled counterpart, which can cut network size significantly. In experiments, the authors choose the value of p_eliminate as shown in Figure 1; with this choice, the random walk avoids drifting toward uncontrolled growth or shrinkage of the network size. See Algorithm 1 for the formal description.
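For readers who prefer code, here is a deliberately simplified, runnable Python sketch of one search step in the spirit of Algorithm 1. Each block is modeled as a plain list of (operation, Params, FLOPs) entries, so only the add/eliminate decision and the budget check are shown; the graph-feasibility rules and coupled-node handling from Section 3 are omitted, and the cost numbers are made up for illustration.

```python
import random

NODE_TYPES = {"conv3x3": (83_000, 32_600_000), "gelu": (0, 200_000),
              "softmax": (0, 400_000), "layernorm": (200, 300_000)}   # illustrative costs

def search_step(blocks, bounds, p_eliminate=0.3, max_attempts=100):
    total = lambda i: sum(entry[i] for blk in blocks for entry in blk)   # i=1: Params, i=2: FLOPs
    for _ in range(max_attempts):
        blk = random.choice(blocks)                                      # pick a block to modify
        if random.random() < p_eliminate and len(blk) > 1:
            entry = random.choice(blk)                                   # candidate node to eliminate
            if (total(1) - entry[1] >= bounds["params_min"]
                    and total(2) - entry[2] >= bounds["flops_min"]):
                blk.remove(entry)
                return blocks
        else:
            op, (p, f) = random.choice(list(NODE_TYPES.items()))         # candidate node to add
            if (total(1) + p <= bounds["params_max"]
                    and total(2) + f <= bounds["flops_max"]):
                blk.insert(random.randrange(len(blk) + 1), (op, p, f))
                return blocks
    return blocks   # no feasible move found within max_attempts

blocks = [[("conv3x3", 83_000, 32_600_000)], [("layernorm", 200, 300_000)]]
bounds = {"params_min": 0, "params_max": 27e6, "flops_min": 0, "flops_max": 20e9}
search_step(blocks, bounds)
```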

Figure 1: Illustration of Random Walk in UniNAS

b) Training and Evaluation Protocols: When a training-free NAS algorithm finishes, its identified most promising candidate architecture still needs to be trained to obtain its final (true) accuracy. Unfortunately, the NAS literature has consistently lacked precise training schemes, with different authors using varying training data, number of epochs, training batch sizes, etc., to report results, making direct comparison of different NAS methods impossible. In the UniNAS search space, the authors therefore also provide detailed training protocols for training the final networks (see Table 3), so that different architectures and NAS methods (including future work) can be compared in a fair and reproducible manner.

c) UniNAS Toolkit: The authors provide a pip-installable package, uninas, enabling researchers to easily access all components needed to use the proposed UniNAS space. Specifically, the authors offer tools for generating any UniNAS network as a PyTorch model, as well as an intuitive interface for constructing custom UniNAS networks. Additionally, the authors include a module for graphical visualization of computation graphs, specifically designed for complex network structures. The authors also provide an implementation of the search algorithm described in Algorithm 1, which accounts for computational budget constraints and is encapsulated in a single function call. This function call can trigger either a random walk in the UniNAS space or an optimization algorithm targeting any objective computable from a PyTorch network. Finally, the authors also include the training protocol described in Table 3, with optional support for distributed training, to foster reproducibility and fair comparison of any future UniNAS networks.

4. Results

a) Random Walk in UniNAS: The authors performed a random walk using their search step (Algorithm 1) under fixed parameter constraints, described below.

Table 3: UniNAS Evaluation Protocol

The authors evaluate NAS methods in the UniNAS space for classification, detection, and segmentation tasks by first training image classification on ImageNet-1k, then fine-tuning on COCO and ADE20K datasets using the aforementioned hyperparameters. The batch size is chosen to fit a single A100 GPU and is reported per GPU; therefore, when using more GPUs, the learning rate needs to be adjusted accordingly.
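The article does not spell out the adjustment rule, but a common convention consistent with this note is the linear scaling rule, sketched below with purely illustrative base values:

```python
# Linear learning-rate scaling (a common convention; whether the authors use exactly
# this rule is an assumption). Base values below are illustrative, not from the paper.
base_lr, base_batch = 1e-3, 256
num_gpus, per_gpu_batch = 8, 256
scaled_lr = base_lr * (num_gpus * per_gpu_batch) / base_batch
print(scaled_lr)   # 0.008
```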

The number of parameters is limited to 22-28M and FLOPs to 6-20G to evaluate how well the authors' architecture search navigates the vast UniNAS space. Figure 1 shows the Params and FLOPs of 500,000 sampled networks, indicating that the search can easily span different network sizes and configurations, covering very different network architectures based on transformers and convolutions. Figure 4 provides a further breakdown by UniNAS basic operation, showing that network size varies considerably across specific operations and that the exploration effectively covers a wide diversity of configurations.

Figure 4: Breakdown of UniNAS Basic Operations

b) Architecture Search: When searching for the best architecture in UniNAS, the authors use the training-free NAS Agent VKDNW [40] to evaluate network performance instead of training each candidate. The VKDNW score (Equation (3)) is defined in terms of the deciles of the eigenvalues of the empirical Fisher Information Matrix (FIM), computed at initialization on random input batches, which serve as a compact representation of the FIM spectrum. Specifically, the authors calculate Equation (3) for each block Gi separately and then average the results across all blocks, yielding a single scalar value that is maximized during the search. The authors chose this training-free Agent because it is orthogonal to network size (see Figure 3 in [40]), and their goal is to search for optimal topologies within a specific network size budget.
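As a rough, heavily hedged illustration of the kind of statistic involved (explicitly not the exact VKDNW score of [40], whose precise definition is Equation (3) in the original paper), the sketch below estimates an empirical Fisher information spectrum for a single block from per-sample gradients and summarizes it by deciles:

```python
import torch

def block_fim_deciles(block, x):
    """Deciles of the nonzero spectrum of a crude empirical Fisher estimate for one block."""
    per_sample = []
    for i in range(x.shape[0]):
        out = block(x[i:i + 1]).sum()                      # scalar readout per sample
        grads = torch.autograd.grad(out, list(block.parameters()))
        per_sample.append(torch.cat([g.flatten() for g in grads]))
    G = torch.stack(per_sample)                            # (batch, num_params)
    gram = G @ G.T / G.shape[0]                            # shares its nonzero spectrum with G^T G / batch
    eigvals = torch.linalg.eigvalsh(gram)
    return torch.quantile(eigvals, torch.linspace(0.1, 0.9, 9))

block = torch.nn.Sequential(torch.nn.Conv2d(8, 8, 3, padding=1), torch.nn.GELU())
deciles = block_fim_deciles(block, torch.randn(16, 8, 16, 16))
score = deciles.mean().item()   # illustrative aggregation; averaged over all blocks in practice
```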

The authors then search iteratively in UniNAS: starting from an initial network, they execute 1024 steps of Algorithm 1, keeping only the top 64 networks (the population size). During the search, network size was limited to 27M parameters and 20G FLOPs. Following the structure in Figure 2, the authors used 4 stages with 2, 3, 5, and 2 UniNAS blocks respectively, an initial output size of 64, and hidden dimensions of 96, 192, 384, and 768, consistent with modern architectures. The search took 12 hours on a single A100 GPU; finally, the authors selected the network with the best VKDNW score, which they denote UniNAS-A (see Figure 5).

Table 4: Classification Accuracy of Models in the UniNAS Search Space on ImageNet-1k. All models have a similar number of parameters and are trained in the same way using the training protocol from Table 3. Training took 2 days on 8 A100 GPUs.
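A minimal sketch of this population-based loop, under stated assumptions (scores come from a training-free proxy such as VKDNW, mutations come from a step function like Algorithm 1, and uniform parent selection is our simplification):

```python
import copy
import random

def evolve(initial, score_fn, step_fn, steps=1024, population_size=64):
    """Keep the top-scoring networks while repeatedly mutating random survivors."""
    population = [(score_fn(initial), initial)]
    for _ in range(steps):
        _, parent = random.choice(population)            # pick a surviving network
        child = step_fn(copy.deepcopy(parent))           # one Algorithm-1-style mutation
        population.append((score_fn(child), child))
        population.sort(key=lambda t: t[0], reverse=True)
        population = population[:population_size]        # retain the top networks by proxy score
    return population[0][1]                              # best network found
```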

Figure 5: UniNAS-A Architecture

c) Image Classification: First, the authors compare UniNAS-A with other networks in the UniNAS search space: EfficientNet [34], ResNet [13], CoAtNet, and ViT with relative positional bias [7]. To compare these networks in a fair setting, the authors scaled them to the same number of stages and blocks, and adjusted the number of channels to keep the network sizes similar. Each network was trained on ImageNet-1k [8] using the same training protocol as in Table 3. Training took 2 days on 8 A100 GPUs. As shown in Table 4, UniNAS-A significantly outperforms standard hand-designed networks.

d) Downstream Tasks

Next, the authors fine-tuned each of the networks from the previous step on two downstream tasks—object detection on MS-COCO [24] and semantic segmentation on ADE20K [50], using the same settings as in Table 3. Training for the segmentation task took about 10 hours and for the detection task 21 hours on 4 A100 GPUs. Again, UniNAS-A significantly outperformed existing networks (see Table 5).

e) Agent Ablation

Finally, the authors performed an ablation study on the choice of VKDNW [40] as the Agent by running Algorithm 1 with different Agents and evaluating the best architecture found by each. In Table 6, the authors show that the search of the UniNAS space remains largely robust to the choice of specific Agent, though VKDNW still found the best network architecture within the same search budget (12 hours).

Table 6: Agent Ablation Study Results

5. Related Work

Since the discussion of NAS search spaces was presented in Section 2.C, here the authors focus on NAS search methods.

a) One-shot NAS: These methods are based on the relaxation of discrete architecture spaces, typically using a supernet that encompasses all possible nodes and edges in a given search space. This method was first proposed in DARTS [26], where the supernet consists of all possible operations, each assigned a weight that is adjusted via gradient descent during supernet training. Once supernet training is complete, the operations with the highest weights are retained, thus constructing the final network architecture. Robust-DARTS [47] improved test-time generalization by introducing data augmentation during supernet training, while SGAS [21] aimed to improve the stability of supernet training by explicitly selecting allowed operations at each stage. The main challenges of One-shot NAS are memory consumption, as the supernet must contain all possible operations across the entire search space; and ranking disorder, i.e., the performance of architectures evaluated within the supernet may differ from their performance as independent networks.

6. Conclusion

The authors propose UniNAS, a universal neural architecture search space designed to systematically explore, analyze, and compare network topologies within a unified framework. Unlike previous work, the authors further decompose computational modules into basic operations, thereby enabling the expression and extension of both hand-designed and NAS-discovered architectures, while supporting systematic study of topological variability in network design.

The authors propose an efficient architecture search algorithm that operates directly within UniNAS, allowing precise control over FLOPs and parameter budgets, while supporting fine-grained modifications to traverse diverse families of architectures.

References

  1. Universal Neural Architecture Space: Covering ConvNets, Transformers and Everything in Between


