The Pipeline Is Moving Inside the Model

Software 2.0 replaced hand-written computer-vision algorithms with learned models. A second transition is now becoming visible: the learned models are beginning to replace the interfaces between previously separate pipeline stages.

The claim: computer vision is moving from models inside a pipeline to the pipeline inside a model. The reason is not aesthetic simplicity. Joint optimization can preserve information, uncertainty, and context that explicit intermediate outputs throw away.

For decades, a serious vision system was built as a chain: enhancement, feature extraction, detection, association, tracking, geometry, event logic, and finally an application decision. The design was deliberate. Explicit boundaries made components testable, replaceable, inspectable, and easier to control.

Deep learning first changed the implementation inside those boxes. Hand-engineered features became neural features; classifiers became neural networks; detectors were trained end to end. But the system diagram often remained intact. A learned detector still emitted boxes to a separately engineered tracker.

[Figure: Evolution from classical computer vision to specialized learned models and broader generalist systems]

The first transition replaced algorithms inside blocks. The current transition increasingly absorbs the boundaries between blocks.

Why pipelines lose information

Every explicit interface is also a bottleneck. A detector may convert a rich visual representation into a box, class, and confidence score. The tracker downstream receives those outputs, but not the evidence, ambiguity, or context that produced them. When the detector commits too early, the tracker cannot recover what was discarded.
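The early-commitment problem can be shown with a toy example. This is a minimal sketch, not taken from any system discussed here: the 0.5 cutoff, the 0.45 scores, and the noisy-OR fusion rule (which assumes the two frames are independent) are all illustrative assumptions.

```python
THRESHOLD = 0.5  # hypothetical confidence cutoff at the detector/tracker interface


def hard_interface(scores):
    """Detector commits per frame: only scores at or above the cutoff are passed on."""
    return [s for s in scores if s >= THRESHOLD]


def fuse_noisy_or(scores):
    """Combine per-frame evidence, assuming the frames are independent (noisy-OR)."""
    miss = 1.0
    for s in scores:
        miss *= 1.0 - s
    return 1.0 - miss


# The same faint object scores 0.45 in two consecutive frames (made-up values).
per_frame_scores = [0.45, 0.45]

# Early commitment: both frames are dropped before the tracker ever sees them.
survivors = hard_interface(per_frame_scores)
pipeline_detects = len(survivors) > 0

# Soft interface: raw scores reach the downstream stage, which fuses them first.
fused_score = fuse_noisy_or(per_frame_scores)
fused_detects = fused_score >= THRESHOLD

print(pipeline_detects, round(fused_score, 4), fused_detects)
```

With the hard interface nothing survives to be tracked. With the soft interface, two weak observations add up to a detection. The evidence was present in both cases; only the interface differed.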

Separate stages also optimize separate proxy metrics. Better detection average precision (AP) does not necessarily produce better identity continuity. More accurate depth does not necessarily improve planning. A model trained around the final task can learn which intermediate information actually matters to that task.

Tight fusion therefore has a real advantage when information shared across tasks matters more than the control provided by explicit interfaces. The historical evidence is clearest in tracking, geometry, autonomous driving, and robotics.

Tracking: association moves into attention

Before — SORT, 2016: detector outputs feed a Kalman filter and Hungarian matching. Detection, motion estimation, and association remain explicit and independently inspectable.

Transition — TrackFormer and MOTR, 2021: persistent track queries and attention learn association across frames, removing a separate matching algorithm from the core formulation.

Now: video segmentation models increasingly treat detection, masks, identity, and temporal memory as one streaming task.
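The "Before" stage can be sketched in a few lines to show how explicit its interfaces are. This is a simplified illustration, not the SORT implementation: the motion step is a bare constant-velocity shift with no Kalman covariance, the Hungarian algorithm is replaced by an exhaustive search that finds the same optimum only for a handful of boxes, and the 0.3 IoU gate and all box values are assumed for the example.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def predict(box, velocity):
    """Constant-velocity motion step (a Kalman filter would also track uncertainty)."""
    dx, dy = velocity
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def associate(predicted, detections, min_iou=0.3):
    """Return (track, detection) index pairs maximizing total IoU, gated by min_iou."""
    n_t, n_d = len(predicted), len(detections)
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d))) for p in permutations(range(n_t), n_d))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(iou(predicted[t], detections[d]) for t, d in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return [(t, d) for t, d in best_pairs if iou(predicted[t], detections[d]) >= min_iou]


# Two tracks moving toward each other; detections arrive in the opposite order.
tracks = [(0.0, 0.0, 10.0, 10.0), (20.0, 0.0, 30.0, 10.0)]
velocities = [(2.0, 0.0), (-2.0, 0.0)]
detections = [(17.0, 0.0, 27.0, 10.0), (3.0, 0.0, 13.0, 10.0)]

predicted = [predict(b, v) for b, v in zip(tracks, velocities)]
matches = associate(predicted, detections)
print(matches)
```

Every quantity crossing a boundary here is a box or an index pair. That is what makes each step easy to test in isolation, and also what limits the information the association step can use.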

SORT is a beautiful modular baseline precisely because its interfaces are explicit. Five years later, TrackFormer and MOTR reframed multi-object tracking as learned set prediction over time. Association stopped being a downstream procedure and became behavior represented inside the model.
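To make "association as attention" concrete, the sketch below computes scaled dot-product attention between track queries and per-frame features. It illustrates the general mechanism only: TrackFormer and MOTR use learned projections and multi-head attention inside a transformer decoder, whereas the two-dimensional embeddings here are hand-picked assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(track_queries, frame_features):
    """Each track query softly weights every feature in the new frame."""
    dim = len(track_queries[0])
    weights = []
    for q in track_queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in frame_features
        ]
        weights.append(softmax(logits))
    # Updated queries are attention-weighted averages of the frame features.
    updated = [
        [sum(w * k[j] for w, k in zip(row, frame_features)) for j in range(dim)]
        for row in weights
    ]
    return weights, updated


# Hand-picked embeddings: query 0 resembles feature 1, query 1 resembles feature 0.
track_queries = [[4.0, 0.0], [0.0, 4.0]]
frame_features = [[0.0, 3.0], [3.0, 0.0]]

weights, updated = attend(track_queries, frame_features)
soft_assignment = [row.index(max(row)) for row in weights]
print(soft_assignment)
```

There is no separate matching step. The assignment is read off the attention weights, and because those weights are differentiable, a tracking loss can shape the embeddings that produce them.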

Current work pushes further. Autoregressive Universal Video Segmentation combines prompted segmentation with unprompted detect-and-track-everything, while VidEoMT shows a plain encoder-only ViT performing online segmentation and tracking.

Geometry: the reconstruction pipeline becomes a representation

Before: features → matching → calibration → pose → triangulation → reconstruction.

Transition — DUSt3R, 2024: regress pointmaps directly, then recover depth, correspondence, and camera parameters from a shared representation.

Now: one model exposes depth, pose, correspondence, geometry, and motion through queries over a shared scene state.

Classical multi-view geometry is an especially strong example because every stage has a mathematical purpose and a measurable output. DUSt3R deliberately takes the opposite stance: it relaxes the usual requirement to first recover camera calibration and pose, directly regresses pointmaps, and recovers several traditional outputs from that common representation.
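A small sketch shows why a pointmap is enough to recover traditional outputs. It assumes a pinhole camera, a known principal point, and a pointmap expressed in the camera's own frame. The focal length is fitted by ordinary least squares, which is a stand-in chosen for brevity and not necessarily the solver DUSt3R uses. All function names and values are hypothetical.

```python
def depth_from_pointmap(pointmap):
    """Depth is the z coordinate of each 3D point when points are in the camera frame."""
    return [[p[2] for p in row] for row in pointmap]


def focal_from_pointmap(pointmap, cx, cy):
    """Least-squares focal length f for u - cx = f * x / z and v - cy = f * y / z."""
    num = den = 0.0
    for v, row in enumerate(pointmap):
        for u, (x, y, z) in enumerate(row):
            if z <= 0:
                continue  # skip points behind the camera
            a, b = x / z, y / z
            num += (u - cx) * a + (v - cy) * b
            den += a * a + b * b
    if den == 0.0:
        raise ValueError("pointmap does not constrain the focal length")
    return num / den


def make_pointmap(width, height, f, cx, cy):
    """Synthetic pointmap: back-project every pixel with a made-up depth ramp."""
    pointmap = []
    for v in range(height):
        row = []
        for u in range(width):
            z = 5.0 + 0.1 * u
            row.append(((u - cx) * z / f, (v - cy) * z / f, z))
        pointmap.append(row)
    return pointmap


pointmap = make_pointmap(4, 4, f=2.0, cx=1.5, cy=1.5)
depth = depth_from_pointmap(pointmap)
focal = focal_from_pointmap(pointmap, cx=1.5, cy=1.5)
print(round(focal, 6), round(depth[0][3], 6))
```

In the classical pipeline, calibration is an input to reconstruction. Here the order is reversed: the calibration is read back out of the regressed 3D points.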

More recent systems extend the same move into dynamic scenes. D4RT presents depth, correspondence, pose, geometry, and motion as queries; Any4D jointly predicts metric geometry and motion; and UniCorrn shares weights across 2D–2D, 2D–3D, and 3D–3D correspondence.

[Figure: Multiple visual sensors flowing into one shared correspondence and world model supporting many visual tasks]

A shared representation can preserve correspondence across sensor, viewpoint, space, and time, then expose several task-specific outputs.

Driving: optimize perception for the final task

Before: perception → tracking → prediction → planning.

Transition — UniAD, 2023: incorporate full-stack driving tasks into one network and organize them around the final goal: planning.

The important change: the final task influences what the model learns to perceive.

UniAD states the case directly: sequential modules suffer from accumulated errors and weak task coordination. Its answer is not to erase the intermediate tasks, but to have them communicate through unified query interfaces and to train the full stack in pursuit of planning.

This distinction matters. A fused model is not necessarily an undifferentiated black box. It may still contain recognizable modules and intermediate supervision. What changes is that the boundaries are differentiable and the final objective can shape the complete system.
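The effect of a differentiable boundary can be seen in a deliberately tiny model. This is a toy under stated assumptions, not UniAD: "perception" is a two-weight linear map pretrained to read only the first input feature, "planning" is a single scalar weight, and the made-up target depends only on the second feature. Gradients are written out by hand with the chain rule.

```python
# Each sample is ((feature_0, feature_1), target); the target equals feature_1.
DATA = [((1.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 1.0), ((0.0, 0.0), 0.0)]


def train(update_perception, steps=2000, lr=0.1):
    """Gradient descent on the planning loss, with or without a differentiable boundary."""
    w = [1.0, 0.0]  # perception weights, "pretrained" to read only feature 0
    v = 1.0         # planner weight
    n = len(DATA)
    for _ in range(steps):
        grad_w, grad_v = [0.0, 0.0], 0.0
        for x, y in DATA:
            p = w[0] * x[0] + w[1] * x[1]  # perception output
            err = v * p - y                # planner output minus target
            grad_v += 2 * err * p / n
            for i in range(2):
                # Chain rule: the planning error reaches perception through v.
                grad_w[i] += 2 * err * v * x[i] / n
        v -= lr * grad_v
        if update_perception:
            w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    loss = sum((v * (w[0] * x[0] + w[1] * x[1]) - y) ** 2 for x, y in DATA) / n
    return loss, w, v


frozen_loss, _, _ = train(update_perception=False)
joint_loss, joint_w, joint_v = train(update_perception=True)
print(round(frozen_loss, 4), round(joint_loss, 4))
```

With perception frozen, the planner can only rescale a signal that ignores the relevant feature, so the loss plateaus. With the boundary open to gradients, the final objective teaches perception which feature to read. The two stages remain distinct and inspectable in both cases.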

Robotics: perception and action become one learned loop

The strongest version of boundary collapse appears when perception no longer ends at a description of the world. Vision-language-action models connect instructions, observations, active sensing, and control. A robot can choose another view because the current observation is insufficient for the task.

This is qualitatively different from merely sharing a backbone. Perception becomes task-conditioned: what the system needs to see depends on what it is trying to do. SaPaVe is a current example, jointly learning active perception and manipulation.

Is this Software 3.0?

Strictly speaking, no. The boundary-collapse story is primarily Software 2.0 maturing.

In his 2017 essay Software 2.0, Andrej Karpathy described neural-network weights as software produced by optimization rather than directly written by humans. He also identified this exact advantage: separately trained learned modules can be joined and optimized together. In his words, modules can “meld into an optimal whole.”

Stack         Program representation                                  How behavior changes
Software 1.0  Explicit code and algorithms                            Edit code
Software 2.0  Learned weights                                         Edit data, objectives, and training
Software 3.0  General models controlled through language and context  Edit prompts, examples, memory, and tools

The pipeline moving inside a model is therefore not, by itself, Software 3.0. It becomes Software 3.0 when the fused capability is also programmable through intent: when an operator can specify a new visual task through language, examples, or tools without engineers assembling a new fixed pipeline or retraining a task-specific model.

Software 2.0: optimize the pipeline as one learned system
Software 3.0: program that system through intent

Why the pipeline will not disappear

The old architecture existed for good reasons. Explicit components can be independently tested, replaced, constrained, and run at different rates or on different hardware. Their failures are easier to localize. In safety-critical systems, those properties are not optional.

Tight fusion also creates new problems:

a regression in one shared representation can affect many capabilities;
intermediate behavior becomes harder to inspect and certify;
joint objectives can create task conflicts rather than cooperation;
data and evaluation must cover interactions between capabilities;
latency and cost become harder to allocate to individual tasks.

The likely production architecture is therefore not one model replacing everything. It is a fused learned core inside a modular deterministic shell. The model absorbs boundaries where joint reasoning improves the final task. The surrounding system preserves boundaries where control, guarantees, observability, and exactness matter.
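One way to picture that architecture is a learned core whose output always passes through deterministic checks before anything acts on it. The sketch below is an assumption-laden illustration: `fused_core` is a placeholder for a real model, and the limits, field names, and fallback behavior are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    speed_mps: float
    steering_rad: float


MAX_SPEED_MPS = 15.0     # hypothetical hard limit enforced outside the model
MAX_STEERING_RAD = 0.5   # hypothetical hard limit enforced outside the model
SAFE_STOP = Plan(0.0, 0.0)


def fused_core(observation):
    """Placeholder for a learned model; here it simply echoes its input."""
    return Plan(observation["speed"], observation["steering"])


def shell(observation, core=fused_core):
    """Deterministic wrapper: validate, clamp, and record every intervention."""
    plan = core(observation)
    if not (math.isfinite(plan.speed_mps) and math.isfinite(plan.steering_rad)):
        return SAFE_STOP, ["non-finite output, fell back to safe stop"]
    notes = []
    speed = min(max(plan.speed_mps, 0.0), MAX_SPEED_MPS)
    steering = min(max(plan.steering_rad, -MAX_STEERING_RAD), MAX_STEERING_RAD)
    if speed != plan.speed_mps:
        notes.append("speed clamped")
    if steering != plan.steering_rad:
        notes.append("steering clamped")
    return Plan(speed, steering), notes


clamped_plan, clamped_notes = shell({"speed": 40.0, "steering": 0.1})
fallback_plan, fallback_notes = shell({"speed": float("nan"), "steering": 0.0})
print(clamped_plan, clamped_notes)
print(fallback_plan, fallback_notes)
```

The shell knows nothing about how the core reached its answer. It only guarantees that what leaves the system is finite, within limits, and logged whenever it was altered, and those properties can be tested without touching the model.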

The real engineering shift

End-to-end learning is not new. Multi-task learning is not new. What is changing is the scale of the learned responsibility:

early models replaced one classical algorithm;
later models shared features across related tasks;
current models absorb complete task pipelines and persistent state;
emerging models accept goals instead of only fixed task definitions.

That last step is where computer vision begins to approach Software 3.0. The pipeline has not only moved inside the model; the model itself becomes programmable.

This essay is separate from my production-oriented CVPR 2026 review. The production review asks which current works can ship. This essay uses historical and current works to ask whether the architecture of computer-vision systems is changing.