Paper Authors | Hanshi Wang et al.
Edited by | Autonomous Driving Heart
Previous work on token pruning for lightweight large models focused primarily on metrics for measuring token importance. However, experiments show that some of the most basic and simple algorithms actually generalize better. This paper therefore approaches the problem from a different dimension: given a fixed budget, how should the pruning ratio be allocated to each layer of the network?
Existing methods typically employ a fixed layer-wise allocation strategy, either pruning aggressively at the very beginning of the decoder or manually fixing the ratios for certain layers. This is clearly suboptimal, because the difficulty of the input question and scene varies, and the speed at which token attention converges also differs.
Addressing these issues, a team from Shanghai Jiao Tong University and the Chinese Academy of Sciences proposed AutoPrune, a training-free complexity-adaptive pruning framework. The proposed algorithm uses the visual-text Mutual Information derived from shallow decoder layers to quantify sample and task difficulty. This measure is then mapped to a Logistic retention curve constrained by the global computation budget (FLOPs budget), thereby generating a layer-wise visual token retention trajectory for each sample. This enables dynamic early or delayed pruning under a fixed computational budget. Taking LLaVA-1.5-7B as an example, AutoPrune removes 89% of visual tokens and reduces FLOPs by 76.8% while still retaining 96.7% of the original accuracy, representing a 9.1% improvement over PDrop (CVPR). The method is also applicable to LLaVA-NeXT and autonomous driving VLA models.
Paper Title: Each Complexity Deserves a Pruning Policy
Authors' Affiliations: SJTU, CAS, Anyverse Intelligence
Paper Link: https://arxiv.org/abs/2509.23931
Code Link: https://github.com/AutoLab-SAI-SJTU/AutoPrune
Background Review
Vision-Language Models (VLMs) have become central to multimodal systems, supporting tasks like image captioning, VQA, and multimodal dialogue. Extensions towards embodied intelligence, such as the VLA (Vision-Language-Action) framework for autonomous driving, couple perception and control for end-to-end reasoning. However, high-resolution images or videos are converted into large numbers of visual tokens, which introduces significant memory and latency bottlenecks. Efficient, concise, and training-free pruning is therefore crucial for real-time scenarios.
Prior work commonly observed that the information contribution of visual tokens decays significantly in the later stages of the decoder. However, the authors found that existing algorithms typically use fixed strategies when setting the pruning ratio for each layer. This approach lacks global computational budget constraints and often requires manual tuning to meet target token or FLOPs budgets, limiting generalization. For tasks requiring multi-step reasoning and dynamic cross-modal interaction, such as VQA, a fixed strategy struggles to adapt to sample and task differences. As shown in the figure, our analysis indicates that the layer-wise variation in token importance changes depending on the difficulty of the input image and the posed question.
Comparing this to human observation and thinking, we find that humans quickly converge on the goal when the problem is clearly expressed and the scene is simple. When the expression is vague and the scene is complex, multiple hypotheses must be maintained in the prefrontal-parietal network, requiring multiple shifts of gaze. Correspondingly, our analysis of VLMs shows that simple samples (where both the question and scene are relatively simple) achieve rapid convergence of cross-modal attention in shallow layers, while complex samples exhibit stronger fluctuations in saliency and more dispersed attention across layers. This suggests that a single fixed layer-wise allocation strategy cannot meet diverse reasoning demands.
To address this, we propose Complexity-Adaptive Pruning, which assigns a personalized pruning policy for each input. We estimate the Mutual Information between visual and text tokens from the attention maps of shallow decoder layers to serve as an indicator of task and scene complexity. High mutual information signifies strong alignment (simple task), suggesting less exploration is needed. Low mutual information signifies weak alignment (complex task), requiring a longer exploration process. Upon obtaining the mutual information scalar, we map it to a layer-wise token retention curve (Logistic curve), which describes the process of token retention from exploration to convergence. The curve's slope and inflection point are linearly mapped from the mutual information. The resulting curve shape dictates the pruning strategy for that specific sample: more aggressive early pruning for simple samples, and a more conservative approach for complex samples. To ensure strict adherence to the given computation budget, we compute the integral and rescale the curve so that the area under the curve equals the specified token or FLOPs budget. The distribution of Logistic curves obtained for different samples is shown in the figure below.
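As a rough illustration (our notation, not necessarily the paper's exact formula), the retention fraction at normalized depth $\hat{l} \in [0, 1]$ can be written as a logistic whose slope $k$ and inflection point $l_0$ are linear in the mutual information $I$:

$$r(\hat{l}) = r_{\min} + \frac{r_{\max} - r_{\min}}{1 + e^{\,k(I)\,(\hat{l} - l_0(I))}}, \qquad k(I) = k_0 + k_1 I, \quad l_0(I) = c_0 - c_1 I,$$

with the whole curve rescaled so that $\int_0^1 r(\hat{l})\,d\hat{l}$ matches the prescribed token or FLOPs budget.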
Key Contributions
Complexity Measurement: Directly calculating the Mutual Information between visual and text tokens from cross-modal attention to characterize sample difficulty and task complexity.
Budget-Constrained Retention Curve: Mapping the Mutual Information to a Logistic retention function, using analytical integration and rescaling to strictly satisfy the token or FLOPs budget.
General and Plug-and-Play: No training required, easily integrates with various VLMs and VLAs, and consistently outperforms existing training-free methods across datasets and pruning ratios.
Algorithm Details
We model the pruning of visual tokens as a constrained optimization problem with a global computational budget. The decision variables include three types of policies: first, the layer-wise token allocation policy, which specifies how many tokens to retain in each layer; second, the token selection policy, which determines which specific tokens to retain; and third, the token restoration policy, which governs how discarded tokens can be restored and remapped if needed. These three policies are jointly optimized under a unified computation constraint to minimize the expected loss.
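In schematic form (our simplified notation, not the paper's exact objective), with $n_\ell$ visual tokens retained at decoder layer $\ell$ under the joint policy $\pi = (\pi_{\text{alloc}}, \pi_{\text{sel}}, \pi_{\text{rest}})$:

$$\min_{\pi}\; \mathbb{E}\big[\mathcal{L}(\pi)\big] \quad \text{s.t.} \quad \sum_{\ell=1}^{L} \mathrm{Cost}(n_\ell) \le B,$$

where $B$ is the global token or FLOPs budget and $\mathcal{L}$ is the task loss of the pruned model.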
We focus on optimizing the layer-wise allocation policy. Previous methods either use a uniform strategy for all tasks, failing to adapt to different visual-text requirements, or adjust layer-by-layer independently, lacking global budget constraints, often resulting in insufficient pruning and limited speedup. Our approach dynamically allocates token budgets across layers globally, strictly adhering to the total computation constraint, thereby achieving both adaptability and stable speedup benefits.
Based on cognitive neuroscience and visualization analysis, we found that cross-modal attention follows two patterns depending on task difficulty. Simple tasks converge quickly in shallow layers, with attention on irrelevant regions collapsing rapidly. Complex tasks exhibit significant attention migration and diffusion across multiple layers, requiring a longer exploration process. Therefore, effective pruning should follow a dynamic and globally consistent trajectory, rather than a single fixed strategy. To achieve dynamic and controllable pruning, we propose AutoPrune, using the mutual information of early visual and text tokens as a complexity indicator. High mutual information indicates strong alignment (simple task), allowing for more aggressive redundancy reduction in shallow layers, saving computation for deeper layers. Low mutual information indicates weak alignment (complex task), requiring a more conservative retention strategy to ensure key evidence is utilized deeper in the network.
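As a minimal sketch of how such a complexity indicator could be computed, the snippet below estimates mutual information from a shallow-layer cross-modal attention map with a plug-in entropy estimator; the tensor layout and the uniform prior over text tokens are our assumptions, not the paper's exact implementation.

```python
import torch

def estimate_visual_text_mi(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Plug-in mutual-information estimate between visual and text tokens.

    attn: [num_text_tokens, num_visual_tokens] attention from a shallow
          decoder layer (already averaged over heads).
    Treats each text token as equally likely and the renormalized attention
    row as p(visual | text), then returns
        I(V; T) = H(p(V)) - E_T[H(p(V | T))]   (in nats).
    """
    p_v_given_t = attn / (attn.sum(dim=-1, keepdim=True) + eps)  # rows sum to ~1
    p_v = p_v_given_t.mean(dim=0)                                # marginal over text tokens
    h_v = -(p_v * (p_v + eps).log()).sum()                       # H(V)
    h_v_given_t = -(p_v_given_t * (p_v_given_t + eps).log()).sum(-1).mean()
    return (h_v - h_v_given_t).clamp(min=0.0)                    # MI >= 0
```

Under this estimator, higher values mean the text tokens are sharply and distinctively aligned to specific visual tokens (strong alignment, simple sample), while near-uniform attention rows give values close to zero (weak alignment, complex sample).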
We map the complexity indicator to a budget-constrained Logistic retention curve. We perform analytical integration and rescaling over the network depth interval to ensure the area under the curve equals the given token budget or FLOPs budget. In practice, for the discrete problem, we round the target retention number for each layer and use binary search to adjust the global scaling factor, ensuring the accumulated cost strictly matches the budget without manual layer-by-layer tuning.
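The budget-matching step could look roughly like the sketch below: `retention_curve(layer)` is any per-layer retention function (for instance the logistic parameterization shown after the next paragraph), and a binary search over a single scaling factor drives the rounded per-layer token counts to the global budget. Function and variable names are illustrative assumptions.

```python
def match_budget(retention_curve, num_layers, total_tokens, budget_tokens,
                 lo=0.05, hi=5.0, iters=30):
    """Rescale a per-layer retention curve so that the rounded token counts,
    summed over all decoder layers, stay within a global token budget.
    `retention_curve(layer)` returns the un-scaled retention fraction."""
    def plan(scale):
        return [round(min(1.0, scale * retention_curve(l)) * total_tokens)
                for l in range(num_layers)]

    for _ in range(iters):                      # binary search on the scale factor
        mid = 0.5 * (lo + hi)
        if sum(plan(mid)) > budget_tokens:
            hi = mid                            # over budget -> shrink the curve
        else:
            lo = mid                            # within budget -> try a larger curve
    return plan(lo)                             # per-layer token counts, within budget
```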
To obtain a truly complexity-adaptive policy, we make the slope and inflection point of the Logistic curve linearly dependent on the mutual information. When mutual information is high, the curve drops quickly in shallow layers, facilitating early redundancy removal and dedicating computation to deeper layers. When mutual information is low, the curve remains flat initially, delaying the rapid drop to deeper layers, preventing premature loss of critical information. This parameterization directly maps the complexity signal to a sample- and task-specific pruning policy.
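A hedged sketch of this parameterization is given below; the coefficients are illustrative placeholders rather than the paper's calibrated values, and the mutual-information score is assumed to be normalized to [0, 1].

```python
import math

def logistic_retention(layer: int, num_layers: int, mi: float,
                       k_base=5.0, k_gain=15.0,    # slope: linear in MI
                       c_base=0.7, c_gain=0.4,     # inflection point: linear in MI
                       r_min=0.05, r_max=1.0):
    """Fraction of visual tokens to retain at a given decoder layer.

    Higher MI -> steeper slope and earlier inflection (aggressive early pruning);
    lower MI -> flatter start and later inflection (conservative, delayed pruning).
    """
    x = layer / max(num_layers - 1, 1)          # normalized depth in [0, 1]
    k = k_base + k_gain * mi                    # slope of the logistic
    c = c_base - c_gain * mi                    # depth at which retention drops
    return r_min + (r_max - r_min) / (1.0 + math.exp(k * (x - c)))
```

A per-sample pipeline would then estimate the mutual information from a shallow layer's attention, build this curve with the resulting score, and pass it (wrapped as a function of the layer index) to the budget-matching search above to obtain the per-layer token counts.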
Regarding implementation overhead, the extra cost mainly comes from mutual information estimation, curve generation, and layer-wise sorting. The overall time complexity is approximately independent of the feature dimension. In common configurations, this overhead is negligible compared to the total inference cost, demonstrating engineering feasibility.
Experimental Results
LLaVA-1.5-7B: Retaining 64 tokens still maintains 96.7% of the original accuracy, with FLOPs reduced to 23.2%. At moderate pruning ratios, there is virtually no accuracy loss.
LLaVA-NeXT-7B: Outperforms comparison methods under 640, 320, and 160 token budgets. At 160 budget, it still retains 94.9% of the performance.
VLA Autonomous Driving Planning: Outperforms baseline methods across different token retention rates in Senna and custom nuScenes tasks, sometimes even exceeding the unpruned model, showing the positive effect of removing noisy tokens.
Conclusion
This paper presents AutoPrune, a novel, training-free framework for complexity-adaptive pruning, designed to mitigate the computational burden caused by long visual sequences in VLMs. Inspired by cognitive neuroscience, AutoPrune quantifies sample and task complexity via the mutual information between early visual and text tokens. It maps this measure to a personalized Logistic retention curve constrained by the budget, thereby determining the token pruning strategy for each decoder layer. Extensive experiments demonstrate that AutoPrune is simple, generalizable, and highly effective, supporting efficient real-time multimodal inference and embodied AI. Our research also reveals subtle differences in attention distribution, a point observed in related work. Although token importance generally decreases with increasing decoder depth, our results (see Figure 1) show that deep layers sometimes retain more critical tokens than shallow layers. While this paper advances sample-specific layer-wise pruning, there remains scope for further research, such as enabling the strategy to dynamically match the distribution of critical tokens as it changes across network depths.