Can Large Models Handle Precision Work Too?! MIT Top Conference Paper Teaches AI to Operate Industrial CAD Software

In their VideoCAD research, released at the top conference NeurIPS 2025, the MIT team used over 41,000 video samples to show that today's top large models cannot handle professional engineering software, and proposed a way to learn complex 3D interactions from video.


Current AI excels at chatting, drawing, and even writing code on 2D screens, but becomes illiterate when facing industrial software requiring precise operations and 3D spatial logic.

Computer-Aided Design (CAD) software is the cornerstone of modern industry, essential for designing everything from phone cases to aircraft engines.

The operation logic of such software differs vastly from habitual web clicks or mobile swipes; it requires users to construct 3D models mentally and implement them on a 2D screen via hundreds of menus, shortcuts, and mouse actions.

This long-horizon, high-precision interaction process is a chasm that current AI agents struggle to cross.

VideoCAD bridges this gap.

Instead of having AI read tedious software manuals, the research team used reverse engineering to let machines watch and learn to operate professional CAD platforms like Onshape just like human engineers.

Interaction Barriers of Precision Engineering Software

To understand VideoCAD's value, first look at how formidable the fortress it sets out to conquer really is.

Ordinary internet apps, whether for ordering food delivery or watching videos, involve short-path UI interactions: each user action corresponds to a clear result, and the fault tolerance is high. Tapped the wrong button? Just go back and select again.

Industrial-grade CAD software is entirely different.

Platforms like SolidWorks, Autodesk Inventor, or PTC Onshape have hundreds to thousands of toolbar options.

A simple task such as drilling a hole in a cube involves selecting the correct plane, sketching, defining the circle's center coordinates, setting a diameter constraint, exiting sketch mode, selecting the extrude-cut tool, setting the depth parameter, and more.

This sequence has strong dependencies; choosing the wrong plane first makes all subsequent fine operations futile.

Trickier still, these operations take place on a WebGL- or OpenGL-based canvas.

For an AI, web buttons come with readable text labels in the Document Object Model (DOM), but the CAD canvas is just pixels.

To operate here, the AI must visually judge model edges and circle centers the way a human eye does, and output precise (x, y) pixel coordinates.

Existing AI training datasets focus on Android phone operations or simple web browsing, never touching deep 3D spatial understanding or pixel-level precision control.

VideoCAD chose the browser-based cloud CAD platform Onshape as its entry point, tackling the problem in a standardized environment.


To teach an AI to use CAD, the most direct way would be to record videos of thousands of engineers at work, which is unrealistic in both cost and time.

MIT researchers used an ingenious reverse generation strategy, building an automated factory to produce data.

The data source is DeepCAD, a dataset of roughly 178,000 parametric CAD models created by human designers.

These models include not just the final 3D shapes but their complete construction histories (construction sequences).

The researchers focused on the most challenging multi-extrusion sequences, which involve several sketches and extrusions and whose complex structures reflect real industrial design logic.

With the blueprints in hand, the next step is to have machines act them out.


The team developed a hybrid automation framework.

For standard UI operations such as menu clicks and dialog inputs, the system uses Selenium to control the browser's DOM elements directly; for sketching on the canvas, it uses PyAutoGUI to simulate the mouse at the pixel level.

Since Onshape offers no public drawing API, the simulation has to be precise to the millisecond and to the pixel.
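
A minimal sketch of this hybrid pattern, assuming a Selenium-driven Chrome session; the document URL, CSS selector, and canvas coordinates below are hypothetical placeholders, not Onshape's actual ones:

```python
# Hybrid automation sketch: Selenium drives DOM widgets, PyAutoGUI draws on the canvas.
# The document URL, CSS selector, and coordinates are hypothetical placeholders.
import time

import pyautogui
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://cad.onshape.com/documents/...")  # open a document (login omitted)

# 1) Standard UI step: click a toolbar button through the DOM.
sketch_button = driver.find_element(By.CSS_SELECTOR, "button[title='Sketch']")
sketch_button.click()

# 2) Canvas step: the WebGL viewport exposes no DOM for the geometry,
#    so drawing is simulated with raw mouse events at pixel coordinates.
canvas = driver.find_element(By.TAG_NAME, "canvas")
origin = canvas.location  # canvas top-left corner in page coordinates
# (a real script must also account for the offset between page and screen coordinates)

def drag(x1, y1, x2, y2):
    """Drag the mouse from (x1, y1) to (x2, y2), measured inside the canvas."""
    pyautogui.moveTo(origin["x"] + x1, origin["y"] + y1, duration=0.2)
    pyautogui.mouseDown()
    pyautogui.moveTo(origin["x"] + x2, origin["y"] + y2, duration=0.3)
    pyautogui.mouseUp()

drag(400, 300, 550, 300)  # one sketch edge, in canvas pixels
time.sleep(0.3)           # give the renderer time to update
```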

To make the generated data more than cold machine instructions, the researchers also injected a human touch into the automation scripts.

Real engineers hesitate and double-check.

So the data generation adds random delays of 0.2 to 0.5 seconds between actions.

When selecting sketch planes, the script randomly samples points on the surface rather than always clicking the center.

For tiny, hard-to-select features, the script zooms in first, mimicking how a human enlarges the view before clicking precisely.
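
In code, these humanizing tricks look roughly like the following sketch (the function names and the bounding-box convention are illustrative, not the team's actual script):

```python
import random
import time

import pyautogui

def human_delay(lo=0.2, hi=0.5):
    """Pause for a random interval, mimicking an engineer's hesitation."""
    time.sleep(random.uniform(lo, hi))

def click_on_face(face_bbox):
    """Click a random point inside a face's bounding box instead of its center.

    face_bbox = (x_min, y_min, x_max, y_max) in screen pixels; the margin keeps
    the sampled point safely inside the face.
    """
    x_min, y_min, x_max, y_max = face_bbox
    x = random.uniform(x_min + 5, x_max - 5)
    y = random.uniform(y_min + 5, y_max - 5)
    human_delay()
    pyautogui.click(x, y)

def zoom_in_for_small_feature(steps=5):
    """Scroll-zoom before selecting a tiny feature, as a human would."""
    for _ in range(steps):
        pyautogui.scroll(120)    # scroll up to zoom in
        human_delay(0.05, 0.15)
```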

The system ran nonstop on 64 cloud virtual machines, recording full-resolution video at 60 fps.

After a week, it had generated more than 118 days' worth of video material.

Then came rigorous quality control.

Each generated video's final CAD model is rendered to an isometric view and compared with the original DeepCAD render using the DINOv2 vision model.

CLIP is good at semantics (recognizing that something is a chair, say) but poor at fine geometry.

The self-supervised DINOv2, by contrast, picks up subtle shape differences sharply.

A sample is retained only if its cosine similarity in DINOv2 feature space exceeds 0.7.
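
In code, that filter boils down to a cosine-similarity threshold over DINOv2 embeddings. A minimal sketch using the Hugging Face facebook/dinov2-base checkpoint, with hypothetical file paths; the paper's exact preprocessing and model size may differ:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return the DINOv2 [CLS] embedding of a rendered isometric view."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    return model(**inputs).last_hidden_state[:, 0]  # shape (1, hidden_dim)

sim = F.cosine_similarity(embed("generated_iso.png"), embed("deepcad_iso.png")).item()
keep_sample = sim > 0.7  # retain the video/action pair only if the shapes match
```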

In the end, VideoCAD was distilled into 41,005 high-quality samples, each consisting of a video, a precisely aligned action sequence, and a target image.


A Dimensional Strike in Data Scale and Complexity

VideoCAD's release makes existing UI interaction datasets look like child's play.

Data scale and task complexity are the two core dimensions on which such datasets are judged.

Before VideoCAD, the largest comparable dataset, WebLinx, averaged 43 actions per task; VideoCAD averages 186, more than four times as many.

That means the AI must maintain memory and logical consistency over far longer spans.

The deeper difference lies in the nature of the tasks.

Most existing datasets (Mind2Web, for example) involve information retrieval or form filling; the AI only needs to recognize text and buttons.

VideoCAD is among the few that require genuine 3D reasoning.

The AI cannot cheat with a DOM parser; it has to truly understand the geometry on screen.

The Onshape UI averages 6,740 elements, roughly six times an ordinary web page.

That information density, combined with the demand for pixel coordinates, forces strong visual perception and decision-making.

Statistics on the action distribution reveal what CAD work is really like.

A large share of operations are mouse moves, clicks, and keyboard input, reflecting the fine adjustments made while drawing.

Unlike predict-the-next-click tasks, CAD modeling constantly switches between 2D and 3D thinking.

This complexity makes VideoCAD a touchstone for genuine, general computer-operation ability.


With the data in place, how do you teach an AI these operations?

Generic video models are suboptimal here because they ignore the strong causal dependencies in CAD.

So the MIT team designed VideoCADFormer, a Transformer-based autoregressive model for long-horizon CAD action prediction.

Its design philosophy: keep visual perception and action prediction cleanly separated, yet fuse them deeply.

At each timestep, the model receives two visual signals: the current UI screenshot and the final target CAD image.

The former tells it where it is; the latter, where it needs to go.

Both are encoded with a ViT, giving the model local progress and global goal context.

Actions are not a simple text sequence; they are structured vectors consisting of a command type plus parameters. Drawing a circle, for example, is the command plus the (x, y) center plus the radius.
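
As a rough picture, one such structured action could be represented like this (the field names and command vocabulary are illustrative, not the paper's exact schema):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Command(Enum):
    """A small subset of UI-level commands; the real vocabulary is much larger."""
    CLICK = 0
    DRAW_LINE = 1
    DRAW_CIRCLE = 2
    TYPE_VALUE = 3
    EXTRUDE = 4

@dataclass
class Action:
    command: Command
    xy: Optional[Tuple[int, int]] = None  # pixel coordinates, when relevant
    radius: Optional[int] = None          # pixels, for DRAW_CIRCLE
    text: Optional[str] = None            # typed input, e.g. an extrusion depth

# "Draw a circle of radius 40 px centered at (512, 384)":
action = Action(Command.DRAW_CIRCLE, xy=(512, 384), radius=40)
```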

The model uses a dual-mask Transformer decoder.


A causal mask stops it from peeking at future steps; a window mask focuses its attention on recent history.

That fits UI interaction well: the current click depends on the last few seconds, not on details from minutes ago.
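
A sketch of how such a combined mask can be built in PyTorch; the window size here is an arbitrary illustration, and the paper's exact masking details may differ:

```python
import torch

def dual_mask(seq_len: int, window: int = 64) -> torch.Tensor:
    """Boolean attention mask where True means "may be attended to".

    Combines a causal mask (no looking at future steps) with a sliding
    window (only the most recent `window` steps are visible).
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i                         # no future
    recent = (i - j) < window               # only recent history
    return causal & recent                  # shape (seq_len, seq_len)

mask = dual_mask(seq_len=300, window=64)
# To use with torch attention APIs, remember their convention is usually the
# inverse (True or -inf marks positions that must NOT be attended to).
```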

The output side has two heads: one predicts the command type, the other the parameter values.

Continuous coordinates are discretized into 1,000 classes and predicted by classification, like filling in the blanks of a complex instruction.
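
The discretization step itself is simple; a minimal sketch, assuming a 1,000-bin vocabulary as described above and an illustrative 1920-pixel-wide screen:

```python
NUM_BINS = 1000  # each continuous coordinate becomes one of 1,000 classes

def discretize(value: float, max_value: float) -> int:
    """Map a continuous coordinate in [0, max_value] to a class index."""
    idx = int(value / max_value * NUM_BINS)
    return min(max(idx, 0), NUM_BINS - 1)

def undiscretize(idx: int, max_value: float) -> float:
    """Map a class index back to the center of its bin, in pixels."""
    return (idx + 0.5) / NUM_BINS * max_value

x_class = discretize(753.0, max_value=1920.0)      # -> class 392
x_pixel = undiscretize(x_class, max_value=1920.0)  # -> 753.6 px, within one ~1.9 px bin
```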

Experiments show its effectiveness against baselines such as VPT.


Command accuracy reaches 98.08%, and parameter accuracy 82.35%.

Most impressive: on long sequences of more than 200 steps, it achieves an 85.46% perfect-prediction rate, while the baselines collapse under accumulated error.

To measure geometric accuracy rather than just pixels, the team also runs the generated models in Onshape and computes the Chamfer Distance to the targets.

The generated models closely match the human originals in 3D space, showing a genuine understanding of how 3D shapes are constructed.
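
Chamfer Distance is a standard way to compare two shapes as point clouds: for every point in one cloud, find its nearest neighbour in the other, and average those distances in both directions. A minimal sketch with SciPy, assuming the point clouds have already been sampled from the generated and target models:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point clouds p (N, 3) and q (M, 3)."""
    d_pq, _ = cKDTree(q).query(p)  # distance from each point of p to its nearest point in q
    d_qp, _ = cKDTree(p).query(q)  # and vice versa
    return float(np.mean(d_pq ** 2) + np.mean(d_qp ** 2))

# generated_pts / target_pts: points sampled from the reconstructed and reference models
# cd = chamfer_distance(generated_pts, target_pts)  # lower = closer geometry
```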

The Collective Failure of Top Large Models

VideoCAD is a textbook for new models and a demon-revealing mirror for existing ones.

The team also built VideoCADQA, a visual question-answering benchmark, to test GPT-4.1, Claude 3.7, Gemini 2.5, and others on 3D reasoning. The results were shocking.

Take the extrusion-depth comparison task: watch a video and judge whether the second extrusion is deeper than the first. A human engineer sees it at a glance; GPT-4.1 scores 18% accuracy, revealing severe hallucinations about relative depth and geometric relations.


On the extrusion-count task (how many extrusions produced the final object), GPT-4.1 manages 47%. On temporal frame ordering, Claude 3.7 manages 23%.

Going further, the team deployed the LLMs as UI agents via BrowserGym on actual Onshape modeling tasks.

The result: a total wipeout.


All of these LLMs, so stunning at text generation, failed to complete a single full CAD task.

The main problem: they cannot convert a semantic command ("draw a circle") into precise screen coordinates.

They know they should click the sketch button but hit the blank space next to it, or try to use a code selector on a canvas that is nothing but pixels.

This shows how far general-purpose models still are from professional embodied and digital interaction.

VideoCAD exposes AI's real bottleneck: getting from talk to actual operation.

AI can generate pretty pictures but not production-ready engineering drawings; it can write pretty code but cannot operate a complex development environment.

VideoCADFormer shows one possibility: from videos of human operation, a model can learn software logic and spatial causality.

Once this matures, AI will be not just a chatbot but an engineer's copilot.

It could observe design intent and auto-complete tedious steps; midway through a design, it could predict the final shape and suggest the next operations.


The work cuts across the boundaries of computer vision, reinforcement learning, and human-computer interaction.

VideoCADFormer is not perfect (it still depends on synthetic data, among other limitations), but it points in a clear direction: AI can learn to use industrial tools.

References:

https://ghadinehme.github.io/videocad.github.io/

https://github.com/ghadinehme/VideoCAD

https://arxiv.org/abs/2505.24838

https://news.mit.edu/2025/new-ai-agent-learns-use-cad-create-3d-objects-sketches-1119
