Latest In AI

15 mins

Five Papers That Redrew the Map: A 2026 AI Architecture Tour for the Working Engineer

This is a tour of five of the most notable papers in the field of AI — a mechanism-level walkthrough.

Published 28th June, 2026

You know what a Transformer is. You've fine-tuned a model, you understand attention, you can read a loss curve. But the field is moving fast, and 2026 quietly delivered a batch of papers that change the default answers to questions like "how do we handle long context?", "do we really need a giant model?", and "how does an agent remember anything?"

This is a tour of five of those papers — not a shallow listicle, but a mechanism-level walkthrough. We'll keep the intuition grounded in analogies, but we won't stop at the analogy. By the end you should understand not just what each paper achieves, but how — enough to explain it at a whiteboard or start prototyping a toy version.

A quick orientation before we dive in. Almost every paper here is a reaction to the same two bottlenecks that have defined Transformer scaling:

  1. Attention is quadratic. Processing a sequence of length T costs O(T²) compute, and decoding needs a KV cache that grows with every token. This is the wall you hit with long context.
  2. Dense models are expensive. If every parameter fires for every token, capacity and compute are chained together — you can't get one without paying for the other.

Keep these two enemies in mind. Each paper below is, in some sense, a different escape route.


1. Nemotron 3 Super — The Hybrid That Escapes Quadratic Attention

NVIDIA Research · arXiv 2604.12374

The headline: A 120B-parameter model with only ~12B active parameters per token, built by interleaving three different layer types — Mamba-2 state-space layers, sparse Mixture-of-Experts feed-forwards, and occasional full attention — and then post-trained heavily with reinforcement learning to make it good at acting, not just answering.

Let's unpack each of the three ingredients, because this is a genuinely good case study in modern architecture design.

Ingredient 1: Mamba-2 state-space layers (the cure for quadratic attention)

Here's the core problem with attention. For a sequence of T tokens, standard attention computes:

Attn(X) = softmax(QKᵀ / √d) · V

That QKᵀ term is a T×T matrix. Double your context length, quadruple your cost. And during generation, you must cache every past key and value — the KV cache grows without bound.

State-space models (SSMs) like Mamba-2 take a completely different approach, borrowed from classical control theory. Instead of letting each token look back at all previous tokens, an SSM maintains a single fixed-size recurrent state s_t and updates it as each token streams in:

s_t = f(s_{t-1}, x_t)      # update the state with the new token
y_t = g(s_t)               # read an output from the state

Think of it as the difference between two ways of summarizing a meeting:

  • Attention is the obsessive note-taker who writes down every word everyone said, and to answer any question re-reads the entire transcript. Perfectly faithful, but the transcript (and the re-reading effort) grows endlessly.
  • An SSM is the person who keeps a single running mental summary, updating it as the meeting goes. The summary is a fixed size no matter how long the meeting runs.

The consequences are dramatic:

Property | Softmax Attention | Mamba-2 SSM
Time complexity | O(T²) | O(T), linear
Memory during decoding | KV cache grows with T | Constant (fixed state size)
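
To make the constant-memory property concrete, here is a minimal sketch in plain Python. It is not Mamba-2: the update rule (a fixed per-channel exponential decay), the function names, and the state size are all illustrative assumptions. It only shows that a recurrent state keeps the same size however long the input is, while an attention-style cache gains one entry per token.

```python
def ssm_scan(tokens, state_size=4):
    """Toy recurrence s_t = f(s_{t-1}, x_t), y_t = g(s_t) with a fixed-size state."""
    # Hypothetical fixed decay rates; real SSM layers learn input-dependent ones.
    decays = [0.5 + 0.1 * i for i in range(state_size)]
    state = [0.0] * state_size
    outputs = []
    for x in tokens:
        # f: fade the old state, then mix in the new token.
        state = [a * s + (1.0 - a) * x for a, s in zip(decays, state)]
        # g: read a single output from the state.
        outputs.append(sum(state) / state_size)
    return outputs, state


def kv_cache_scan(tokens):
    """Attention-style decoding keeps one cache entry per past token."""
    cache = []
    cache_sizes = []
    for x in tokens:
        cache.append((x, x))  # stand-in for a (key, value) pair
        cache_sizes.append(len(cache))
    return cache_sizes


short_input = [1.0] * 10
long_input = [1.0] * 1000
_, short_state = ssm_scan(short_input)
_, long_state = ssm_scan(long_input)
print(len(short_state), len(long_state))                      # 4 4
print(kv_cache_scan(short_input)[-1], kv_cache_scan(long_input)[-1])  # 10 1000
```

The trade-off the note-taker analogy describes shows up directly: the cache can return any past token exactly, while the four-number state can only return whatever its update rule chose to keep.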

The catch is that a fixed-size summary is, in principle, lossy — you can't cram infinite history into a fixed state perfectly. So Nemotron doesn't go all SSM. It uses Mamba-2 for most layers (the cheap "sequence grind") and sprinkles in full attention layers at a few select depths for precise global mixing. You get linear-time efficiency for the bulk of the work, with periodic moments of full, exact, all-to-all communication. This hybrid pattern — mostly SSM, occasionally attention — is one of the defining architectural motifs of 2026.

This is what makes Nemotron's 1M-token context practical rather than theoretical. An agent that needs to hold a long task history, tool outputs, and intermediate plans in context can actually do so.

Ingredient 2: Latent MoE (the cure for dense compute)

Now the second enemy: dense compute. The fix is Mixture-of-Experts (MoE). Instead of one big feed-forward network (FFN) that every token passes through, you have E experts (each its own FFN) and a small router network that, for each token, picks the top-k most relevant experts. Only those run.

The restaurant analogy is apt: rather than one chef cooking every dish, a head chef (the router) reads each order and sends it to the right specialists. Most of the kitchen stays idle for any given order, so you can afford a much bigger kitchen.

Nemotron's twist is Latent MoE. In a vanilla MoE, the router and the experts all operate on the full model dimension d, which is expensive. Latent MoE first compresses each token into a smaller latent dimension ℓ (where ℓ ≪ d), does both the routing and the expert computation in that compressed space, then projects the result back up to d:

z   = W_latent · h        # compress h (dim d) → z (dim ℓ)
z'  = combine(experts run on z)   # route + run experts in cheap latent space
h'  = W_out · z'          # project back up to dim d

Because the experts live in the small latent space, you can afford to consult roughly 4× more experts for the same FLOP budget. More experts means more room for specialization — different experts can encode different tools, domains, or coding idioms.
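
The compress, route, expand flow can be sketched in a few lines of plain Python. Treat this as a toy under stated assumptions rather than Nemotron's layer: the sizes, the single-matrix ReLU "experts", the top-2 routing, and the softmax over the chosen experts are all choices made for illustration, and every name here is hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def latent_moe(h, w_latent, w_router, experts, w_out, top_k=2):
    """Compress, then route and run experts in latent space, then project back up."""
    z = matvec(w_latent, h)                       # dim d -> dim l
    scores = matvec(w_router, z)                  # one routing score per expert
    chosen = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:top_k]
    # Softmax over the chosen experts only, so the mixing weights sum to 1.
    peak = max(scores[e] for e in chosen)
    exps = {e: math.exp(scores[e] - peak) for e in chosen}
    total = sum(exps.values())
    z_out = [0.0] * len(z)
    for e in chosen:
        expert_out = [max(0.0, v) for v in matvec(experts[e], z)]  # tiny ReLU "FFN"
        z_out = [acc + (exps[e] / total) * v for acc, v in zip(z_out, expert_out)]
    return matvec(w_out, z_out), chosen           # dim l -> dim d


rng = random.Random(0)
d, latent, num_experts = 16, 4, 8   # illustrative sizes, with latent << d
h = [rng.uniform(-1, 1) for _ in range(d)]
w_latent = random_matrix(latent, d, rng)
w_router = random_matrix(num_experts, latent, rng)
experts = [random_matrix(latent, latent, rng) for _ in range(num_experts)]
w_out = random_matrix(d, latent, rng)

h_new, chosen = latent_moe(h, w_latent, w_router, experts, w_out)
print(len(h_new), chosen)
```

In this toy each expert multiplies a 4×4 matrix instead of a 16×16 one, which is where the headroom for extra experts comes from. The real layer shapes are not given in the text above.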

Ingredient 3: Multi-Token Prediction + RL post-training (the "agentic" part)

Two more pieces make this an agent model rather than just an efficient LLM:

  • Multi-Token Prediction (MTP): Instead of predicting one next token per forward pass, the model predicts several. Combined with speculative decoding, this cuts latency for the long, multi-step interactions agents live in — and gives the model a mild "look-ahead" that helps planning.
  • Multi-environment RL: Super is post-trained with reinforcement learning across 21 environment configurations and over 1.2 million environment rollouts, using verifiable rewards. This is the difference between a model that can describe how to use a tool and one that has actually practiced calling tools, observing results, and correcting course — millions of times.

Why it matters going forward: Nemotron 3 Super is widely treated as a reference blueprint for agent-native foundation models. The lesson isn't "this specific model"; it's the recipe — mostly-linear sequence processing + sparse high-capacity FFNs + a dash of full attention + heavy RL on real environments. Expect to see this template echoed across the next wave of open models.


2. Step 3.5 Flash — Frontier Performance at 11B Active Parameters

StepFun · arXiv 2602.10604

The headline: A model with 196B total parameters but only ~11B active per token, that performs near the frontier. If Nemotron showed you the hybrid recipe, Step 3.5 Flash is a masterclass in squeezing frontier behavior out of a small active compute budget.

The first thing to clear up is that phrase, "active parameters," because it confuses a lot of people.

What "11B active" actually means

It's not a different kind of parameter — it's MoE sparsity. The full model is a 45-layer sparse MoE Transformer (3 dense layers + 42 MoE layers). Each MoE layer holds:

  • 288 routed experts (the specialists the router chooses among), plus
  • 1 shared expert that is always active (the generalist that handles common patterns).

For each token, the router selects the top-8 routed experts, and the shared expert always fires, so 9 experts run per MoE layer. Each expert is small, so although the model's total parameter count (dominated by the 288×42 routed experts) is 196B, the compute for any single token only touches ~11B of them.

So "196B total / 11B active" reads as: enormous storehouse of specialized knowledge, but only a small, relevant slice is consulted per token. You pay the memory cost of 196B but the compute cost of 11B. That shared expert is a nice design detail — it prevents the model from fragmenting all knowledge across specialists, keeping general "glue" competence always on tap.
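
A small routing sketch makes the "9 experts per layer" count tangible. The expert counts come from the text; the random router logits and the softmax over the selected experts are assumptions for illustration, since the exact gating function is not described here.

```python
import heapq
import math
import random

NUM_ROUTED = 288   # routed experts per MoE layer (from the text)
TOP_K = 8          # routed experts selected per token (from the text)


def route_token(router_scores, top_k=TOP_K):
    """Return the selected routed experts and their normalized mixing weights."""
    chosen = heapq.nlargest(top_k, range(len(router_scores)), key=lambda e: router_scores[e])
    # Assumed gating: softmax over the selected experts only.
    peak = max(router_scores[e] for e in chosen)
    exps = [math.exp(router_scores[e] - peak) for e in chosen]
    total = sum(exps)
    weights = [v / total for v in exps]
    return chosen, weights


rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]  # stand-in router logits
chosen, weights = route_token(scores)

experts_run = len(chosen) + 1            # +1 for the always-on shared expert
experts_total = NUM_ROUTED + 1
print(experts_run, experts_total)        # 9 289
print(round(experts_run / experts_total, 4))  # fraction of this layer's experts used
```

Only about 3% of a layer's experts run for a given token, which is the gap between the memory you pay for and the compute you pay for.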

(A training-time wrinkle: routers love to play favorites, sending most tokens to a handful of experts and letting the rest atrophy. Step 3.5 uses loss-free load balancing. As the name suggests, this family of methods avoids auxiliary loss penalties; the usual formulation instead adjusts a per-expert bias on the routing scores to nudge the router toward even expert utilization, so the full capacity actually gets used.)
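
Here is a hedged sketch of how bias-based balancing can work. It follows the commonly described auxiliary-loss-free recipe (a per-expert bias that affects expert selection and is nudged after each batch), which is an assumption here and not a confirmed detail of Step 3.5's training code. The skewed toy router, step size, and batch sizes are all made up.

```python
import random


def select_top_k(scores, biases, top_k):
    """Pick experts by biased score; the bias affects selection only."""
    biased = [s + b for s, b in zip(scores, biases)]
    return sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]


def update_biases(biases, loads, step=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(loads) / len(loads)
    new_biases = []
    for bias, load in zip(biases, loads):
        if load > mean_load:
            bias -= step
        elif load < mean_load:
            bias += step
        new_biases.append(bias)
    return new_biases


rng = random.Random(0)
num_experts, top_k, tokens_per_batch = 8, 2, 256
# A skewed router: expert 0 has a built-in advantage, so it is picked too often.
skew = [1.0] + [0.0] * (num_experts - 1)
biases = [0.0] * num_experts

history = []
for _ in range(200):
    loads = [0] * num_experts
    for _ in range(tokens_per_batch):
        scores = [rng.gauss(0.0, 1.0) + skew[e] for e in range(num_experts)]
        for e in select_top_k(scores, biases, top_k):
            loads[e] += 1
    history.append(loads)
    biases = update_biases(biases, loads)

print(history[0])    # per-expert loads in the first batch
print(history[-1])   # per-expert loads after the bias updates
```

No gradient flows through the bias and no term is added to the loss; the correction happens entirely in the routing decision.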

Hybrid attention: 3 parts local, 1 part global

Step 3.5 attacks the quadratic-attention enemy differently from Nemotron. Rather than SSMs, it uses a 3:1 ratio of Sliding-Window Attention (SWA) to full attention:

  • Sliding-Window Attention lets each token attend only to a local window of the last W tokens. Cost drops from O(T²) to O(T·W). Great for local coherence.
  • Full attention layers, interspersed every fourth layer, restore global reach — long-range dependencies, cross-document structure.

Three cheap local layers for every one expensive global layer. This is what lets Step 3.5 support a 256K-token context economically. It's the same philosophy as Nemotron's "mostly cheap, occasionally exact," just implemented with windowed attention instead of state-space recurrence.
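
The savings are easy to count. The sketch below tallies how many (query, key) pairs a causal layer scores under full attention versus a sliding window, and lays out one possible 3:1 layer schedule. The sequence length and window size are illustrative (the text does not give Step 3.5's window), and placing the full layer last in each group of four is an assumption.

```python
def attended_pairs_full(seq_len):
    """Causal full attention: token t attends to all t+1 positions up to itself."""
    return seq_len * (seq_len + 1) // 2


def attended_pairs_sliding(seq_len, window):
    """Causal sliding window: token t attends to at most `window` recent positions."""
    return sum(min(t + 1, window) for t in range(seq_len))


def layer_pattern(num_layers, local_per_global=3):
    """Sliding-window layers with one full-attention layer closing each group."""
    period = local_per_global + 1
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


T, W = 8192, 512   # illustrative sizes only
full = attended_pairs_full(T)
local = attended_pairs_sliding(T, W)
print(full, local)
print(layer_pattern(8))
# ['swa', 'swa', 'swa', 'full', 'swa', 'swa', 'swa', 'full']
```

The full-attention count grows with T², while the windowed count grows roughly as T·W, so the gap widens as context gets longer.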

MTP-3: predicting three tokens at a time

Like Nemotron, Step 3.5 uses Multi-Token Prediction — here specifically MTP-3, predicting 3 tokens per forward pass. The model emits tentative logits for positions t+1, t+2, t+3; a speculative-decoding verifier accepts or rejects them. When the guesses are good (which they often are for predictable spans), you skip forward passes and throughput jumps.
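
The accept-or-reject step can be sketched with greedy verification, one simple variant of speculative decoding. Everything here is a stand-in: `target_next_token` is a made-up deterministic rule playing the role of the full model, and a real verifier scores all draft positions in a single batched forward pass instead of looping.

```python
def target_next_token(prefix):
    """Stand-in for the full model's greedy next token (a made-up toy rule)."""
    return (3 * sum(prefix) + len(prefix)) % 10


def verify_draft(prefix, draft):
    """Keep draft tokens until the first disagreement, then substitute the target's token.

    Returns (tokens_to_emit, number_of_draft_tokens_accepted).
    """
    emitted = []
    for token in draft:
        expected = target_next_token(prefix + emitted)
        if token != expected:
            emitted.append(expected)          # correct the first wrong guess and stop
            return emitted, len(emitted) - 1
        emitted.append(token)
    return emitted, len(emitted)


prefix = [1, 2, 3]
print(verify_draft(prefix, [1, 5, 1]))   # ([1, 5, 1], 3): all three guesses accepted
print(verify_draft(prefix, [1, 5, 7]))   # ([1, 5, 1], 2): third guess corrected
print(verify_draft(prefix, [4, 5, 1]))   # ([1], 0): first guess wrong, one token emitted
```

Even the worst case emits one correct token, so the output matches ordinary greedy decoding; the speedup comes from the spans where several guesses land.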

The stability tricks that make small-active reasoning work

Pushing a small active model through long, multi-step reasoning chains is numerically dangerous — activations can explode over many steps. Step 3.5 leans on a couple of stabilizers worth knowing:

  • Head-wise gated attention: each attention head's output is multiplied by a learned scalar gate (a projection followed by a sigmoid). Because the gate lies between 0 and 1, it can only attenuate a head's output, which damps runaway activation amplitude, and it lets the model dynamically dial individual heads up or down based on input.
  • Activation clipping + multi-round RL: the reasoning-focused RL phase is paired with explicit limits on activation amplitude, keeping deep reasoning chains stable.
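
As a rough illustration of the head-wise gate, the sketch below computes one scalar gate per head from the token's hidden state and scales that head's output with it. The weights, sizes, and the choice to gate from the hidden state are assumptions for the example, not details confirmed by the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_heads(head_outputs, token, gate_weights, gate_biases):
    """Scale each head's output vector by its own input-dependent scalar gate."""
    gated = []
    gates = []
    for out, w, b in zip(head_outputs, gate_weights, gate_biases):
        # One scalar per head: a projection of the token followed by a sigmoid.
        g = sigmoid(sum(wi * xi for wi, xi in zip(w, token)) + b)
        gates.append(g)
        gated.append([g * v for v in out])
    return gated, gates


token = [0.5, -1.0, 2.0]                              # hidden state for one token
head_outputs = [[10.0, -20.0], [0.3, 0.1]]            # two heads, one with large activations
gate_weights = [[-1.0, 0.0, -1.0], [1.0, 0.0, 1.0]]   # hypothetical learned weights
gate_biases = [0.0, 0.0]

gated, gates = gate_heads(head_outputs, token, gate_weights, gate_biases)
print([round(g, 3) for g in gates])
print(gated)
```

Since a sigmoid never exceeds 1, the gate can shrink a head's contribution but never amplify it, which is what makes it useful as a stabilizer.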

Why it matters going forward: This paper is exhibit A for the year's biggest shift — capability is decoupling from raw size. Sara Hooker's widely-circulated 2026 essay on "broken scaling assumptions" is essentially a meditation on results like this. The practical upshot: strong reasoning is becoming cheap enough to run locally, privately, and at low latency, which changes who can deploy capable AI and where.


3. Gated DeltaNet-2 — Giving Linear Attention an Eraser

DeltaNet research community (NVIDIA) · arXiv 2605.22791

The headline: A linear-attention layer that compresses all history into a single fixed-size state — but adds separate, independently-controlled gates for erasing old information and writing new information. This is the most mathematically elegant paper in the batch, and worth slowing down for.

We've established that linear-attention / SSM approaches replace the unbounded KV cache with a fixed-size state matrix S_t. The whole game in this family is: how do you update that state well?

The starting point: linear attention as a key→value memory

Picture the state S_t as a little associative memory: a matrix that maps keys to values, roughly approximating Σ vⱼ kⱼᵀ over all tokens seen so far. To read it, you multiply by a query vector q_t; querying with a stored key returns (approximately) the value written under that key:

y_t = S_t · q_t        # read the value stored under the key that q_t matches

The naive update is just to add each new association: S_t = S_{t-1} + v_t k_tᵀ. But this is the notebook-with-no-eraser problem — you only ever accumulate. Old, stale, or contradicted information never leaves. The memory gets crowded and noisy.

The delta rule: correct, don't just accumulate

DeltaNet's key idea is to borrow the delta rule from classical learning theory: before writing, check what the memory currently predicts, and only write the correction. The original DeltaNet update:

S_t = S_{t-1} − β_t (S_{t-1} k_t − v_t) k_tᵀ

Read that middle term carefully — it's intuitive once you see it:

  1. S_{t-1} k_t is what the memory currently predicts for key k_t.
  2. (S_{t-1} k_t − v_t) is the error — how wrong that prediction is versus the true value v_t.
  3. We nudge the state to reduce that error, scaled by β_t.

So instead of blindly piling on, the memory self-corrects at each step: "you thought key k mapped to X, but it actually maps to Y — let me fix just that." Importantly, this is happening in the forward pass — it's a memory mechanism, not backprop.
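
The difference between accumulating and correcting shows up in a two-write example. This is a minimal sketch of the two update rules as written above, using a 2×2 state, a unit-norm key, and β = 1, all chosen so the arithmetic is easy to follow; the helper names are hypothetical.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def add(a, b, scale=1.0):
    """Elementwise a + scale * b for two matrices of the same shape."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def naive_write(state, key, value):
    # S_t = S_{t-1} + v_t k_t^T
    return add(state, outer(value, key))


def delta_write(state, key, value, beta=1.0):
    # S_t = S_{t-1} - beta * (S_{t-1} k_t - v_t) k_t^T
    prediction = matvec(state, key)
    error = [p - v for p, v in zip(prediction, value)]
    return add(state, outer(error, key), scale=-beta)


zeros = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]           # unit-norm key
old_value = [1.0, 1.0]
new_value = [5.0, -2.0]    # the same key is later remapped to a new value

naive = naive_write(naive_write(zeros, key, old_value), key, new_value)
delta = delta_write(delta_write(zeros, key, old_value), key, new_value)
print(matvec(naive, key))  # [6.0, -1.0]: old and new values are blended
print(matvec(delta, key))  # [5.0, -2.0]: the old value was overwritten
```

With a unit-norm key and β = 1 the overwrite is exact; smaller β values move the stored value only part of the way toward the new one.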

The Gated DeltaNet-2 innovation: decouple erase from write

Earlier gated versions added a single scalar gate to control forgetting. But here's the insight the "-2" paper is built on: erasing old information and writing new information are two conceptually different actions, and they shouldn't share one knob.

Gated DeltaNet-2 introduces two separate channel-wise gate vectors:

  • an erase gate b_t — controls which key-dimensions of the old state to wipe, and
  • a write gate w_t — controls which value-dimensions of new info to commit,

plus a channel-wise decay D_t. The full state update (don't panic, we'll read it in plain English):

S_t = (I − k_t (b_t ⊙ k_t)ᵀ) · D_t · S_{t-1}  +  k_t (w_t ⊙ v_t)ᵀ

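In three plain-English steps (note that this equation stores the state transposed relative to the earlier ones, with key dimensions as rows and value dimensions as columns, so a read multiplies by S_tᵀ):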

  1. Decay: gently fade the old state — D_t · S_{t-1}.
  2. Erase: the (I − k_t (b_t ⊙ k_t)ᵀ) factor is a surgical eraser. It removes the components of the old state aligned with the current key k_t, and the b_t vector controls which dimensions get erased and how hard.
  3. Write: add the new association k_t (w_t ⊙ v_t)ᵀ, with w_t controlling which value dimensions actually get stored.

Both gates are produced from the input token (b_t = σ(W_b x_t), w_t = σ(W_w x_t)), so the model learns when and what to erase versus write, per-token, per-channel.
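
The three steps translate almost line for line into code. The sketch below implements a single update of the equation as written above, with the state laid out as key rows by value columns. In the text the gates are sigmoid outputs of the input; the demo passes in hand-picked extreme values (0 and 1) purely to make the effect visible, and all function names are hypothetical.

```python
def gated_delta_step(state, key, value, erase_gate, write_gate, decay):
    """One update of S (key-dim rows, value-dim columns)."""
    d_k, d_v = len(key), len(value)
    # 1. Decay: D_t S_{t-1}, with D_t diagonal (one factor per key channel).
    decayed = [[decay[i] * state[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2. Erase: apply (I - k_t (b_t * k_t)^T) to the decayed state.
    gated_key = [b * k for b, k in zip(erase_gate, key)]
    recalled = [sum(gated_key[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    erased = [[decayed[i][j] - key[i] * recalled[j] for j in range(d_v)] for i in range(d_k)]
    # 3. Write: add k_t (w_t * v_t)^T.
    gated_value = [w * v for w, v in zip(write_gate, value)]
    return [[erased[i][j] + key[i] * gated_value[j] for j in range(d_v)] for i in range(d_k)]


def read(state, key):
    """Read with the transposed state: y = S^T k."""
    return [sum(key[i] * state[i][j] for i in range(len(key))) for j in range(len(state[0]))]


key = [1.0, 0.0]
ones = [1.0, 1.0]
empty = [[0.0, 0.0], [0.0, 0.0]]
state = gated_delta_step(empty, key, [1.0, 1.0], ones, ones, ones)

# Erase fully, but commit only the first value channel of the new item.
overwrite = gated_delta_step(state, key, [5.0, -2.0], [1.0, 1.0], [1.0, 0.0], ones)
# Erase nothing, write everything: behaves like plain accumulation.
accumulate = gated_delta_step(state, key, [5.0, -2.0], [0.0, 0.0], [1.0, 1.0], ones)

print(read(overwrite, key))    # [5.0, 0.0]
print(read(accumulate, key))   # [6.0, -1.0]
```

The two calls differ only in their gate vectors, which is the independence the paper is after: how much to erase and how much to write are set separately.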

The everyday version: it's the difference between a notebook where the eraser and the pen are wired to the same hand (erase a lot → forced to write a lot) versus a notebook where you can erase aggressively in one place while writing gently in another. That independence is the whole point, and it measurably improves how cleanly the model manages long-range memory.

One practical note for engineers: recurrent updates sound sequential and GPU-unfriendly, but the paper provides a chunkwise (WY) algorithm that parallelizes these updates across chunks of tokens, plus a gate-aware backward pass — so training stays efficient.

Why it matters going forward: This is foundational plumbing. Alongside related work ("Delta Attention Residuals," "Deep Delta Learning"), Gated DeltaNet-2 is helping define the post-vanilla-Transformer toolbox — architectures that handle very long context stably and cheaply, with explicit, controllable memory rather than an ever-growing cache. If long-horizon agents are the destination, this is the road being paved.


4. EverMemOS — A Memory Operating System for Agents

arXiv 2601.02163

The headline: A "memory OS" that sits in front of an LLM and manages long-term memory the way an operating system manages files — turning raw conversation into structured, consolidated, retrievable memory objects, and injecting only the minimal sufficient context back into the model at inference.

The previous three papers gave models better internal memory (longer context, better state). EverMemOS attacks a different layer entirely: persistent memory across sessions. The frustrating reality of most assistants is amnesia — close the window, and tomorrow it has no idea who you are. EverMemOS is an architecture for fixing that, and it's notably not just "stuff everything into a vector database."

The three-stage memory lifecycle

EverMemOS structures memory as a pipeline that deliberately mirrors how human memory consolidates: Episodic Trace Formation → Semantic Consolidation → Reconstructive Recollection.

Stage 1 — MemCells (the atomic unit). Raw dialogue isn't stored as a flat log. Each interaction is distilled into a MemCell, a structured record — formally a tuple (E, F, P, M):

  • E — a concise third-person episodic narrative ("the user decided to ship the feature on Friday"),
  • F — a set of atomic facts,
  • P — profile information about the user,
  • M — temporal / foresight metadata (time-bounded info, like a deadline).

Note this is not just an embedding vector — it's a compressed, structured object. Compression happens at write time, which is already a form of pruning.
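
One way to picture a MemCell is as a small typed record. The dataclasses below are a sketch: the four slots mirror the (E, F, P, M) tuple from the text, but the field names, the types, and the `Foresight` helper with its validity window are assumptions, and the example values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Foresight:
    """Time-bounded information such as a deadline (field names are assumptions)."""
    text: str
    valid_from: datetime
    valid_until: Optional[datetime] = None

    def is_valid(self, now):
        """True while `now` falls inside the item's time window."""
        return self.valid_from <= now and (self.valid_until is None or now <= self.valid_until)


@dataclass
class MemCell:
    episode: str                                    # E: third-person episodic narrative
    facts: list = field(default_factory=list)       # F: atomic facts
    profile: dict = field(default_factory=dict)     # P: profile information
    foresight: list = field(default_factory=list)   # M: temporal / foresight metadata


cell = MemCell(
    episode="The user decided to ship the feature on Friday.",
    facts=["The feature ships on Friday."],
    profile={"role": "engineer"},
    foresight=[Foresight("Ship deadline", datetime(2026, 6, 22), datetime(2026, 6, 26))],
)

during = [item.text for item in cell.foresight if item.is_valid(datetime(2026, 6, 25))]
after = [item.text for item in cell.foresight if item.is_valid(datetime(2026, 7, 1))]
print(during)  # ['Ship deadline']
print(after)   # []
```

Keeping the time window on the record itself is what lets retrieval drop expired items without re-reading the original conversation.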

Stage 2 — MemScenes (consolidation). As MemCells accumulate, related ones are clustered (online) into MemScenes — higher-level, thematically-unified groupings. From each scene, a more stable profile/summary is distilled. This is the "repeated events compress into a stable category" behavior of human memory: instead of 50 separate notes about a project, you form one consolidated "scene" of that project that updates as new cells arrive. Redundant information gets merged rather than duplicated.

Stage 3 — Reconstructive Recollection (retrieval). At query time, EverMemOS does not dump all memory into the prompt. It runs a two-stage hybrid retrieval:

  1. Score and rank MemScenes for relevance, using a hybrid of dense embedding similarity and BM25 lexical matching, fused via reciprocal rank fusion (RRF). (Worth noting: it's hybrid dense+lexical, not a graph database, despite what you might assume.)
  2. Select the top MemCells within the top scenes, filter foresight items by whether their time window is still valid, and assemble a reconstructed context.
  3. After those two retrieval stages, an LLM verification step checks that the assembled context is sufficient before answering.
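
The fusion step in stage 1 is simple enough to show in full. Reciprocal rank fusion scores each item by summing 1 / (k + rank) over the input rankings. The constant k = 60 below is the value commonly used with RRF, not a setting confirmed for EverMemOS, and the scene names are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; items ranked high in many lists score best."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


dense_ranking = ["scene_b", "scene_a", "scene_c"]   # from embedding similarity
bm25_ranking = ["scene_a", "scene_d", "scene_b"]    # from lexical matching

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)  # ['scene_a', 'scene_b', 'scene_d', 'scene_c']
```

RRF only looks at ranks, never raw scores, so dense similarities and BM25 scores can be combined without calibrating them against each other.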

That guiding principle — retrieve what is necessary and sufficient, not everything — is the heart of the system. EverMemOS is essentially a memory manager that brokers between raw history and the LLM's limited context window.

The numbers

EverMemOS reports solid gains on standard long-term-memory benchmarks:

  • up to +9.2% overall accuracy on LoCoMo over the best baseline,
  • +19.7% on multi-hop accuracy (LoCoMo) — the hardest, most reasoning-heavy slice,
  • up to +6.7% on LongMemEval, and +20.6% on knowledge updating (LongMemEval) — i.e., correctly revising facts that changed over time.

Those multi-hop and knowledge-updating numbers are the interesting ones: they show structured memory helping reasoning over time, not just raw recall.

Why it matters going forward: Memory is the missing ingredient for lifelong agents, meaning assistants that maintain projects, preferences, and context over weeks and months, getting more useful over time rather than resetting daily. EverMemOS became a canonical 2026 design and sits within a whole family of agent-memory systems (Mem0, SimpleMem, MemRL, Focus). The broader lesson: a lot of "agent intelligence" lives outside the model weights, in how you architect memory around it.


5. Cosmos 3 — One Model That Sees, Reads, Imagines, and Acts

NVIDIA · arXiv 2606.02800

The headline: An omnimodal world model built on a Mixture-of-Transformers (MoT) architecture, that jointly understands and generates across text, image, video, audio, and even action sequences — and is trained to predict future states, making it usable as a vision-language model, a video generator, a simulator, and a robot policy.

This is the most ambitious paper in the set, and the one pointing most directly at physical AI.

"Mixture-of-Transformers" — shared interface, specialized paths

Don't confuse MoT with MoE. MoE routes among many small FFN experts. MoT is about handling many modalities. In Cosmos 3, the architecture combines:

  • an autoregressive (AR) reasoning tower — for understanding and prediction, and
  • a diffusion generation tower — for producing images/video,

with modality-specific parameter sets within each Transformer layer, all interacting through joint attention. So it's not "one undifferentiated Transformer that sees everything," nor is it "separate models bolted together." It's a middle path:

  • Shared: the overall backbone and the joint attention that lets modalities talk to each other.
  • Specialized: each modality gets its own encoder (a ViT-style encoder for visual understanding, a VAE-based representation for generation, domain-aware vectors for actions), and AR vs. diffusion tokens use separate parameters in each layer.
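
The shared-versus-specialized split can be sketched as a single attention layer in which each token is projected with the weights of its own tower, while attention runs over all tokens together. This is a loose toy and not the Cosmos 3 architecture: the two tower names, the dimensions, the random weights, the absence of masking, and the omission of feed-forward blocks are all simplifying assumptions.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mot_layer(tokens, towers, params):
    """Per-tower projections, then one attention pass over all tokens jointly."""
    # Specialized: each token is projected with the weights of its own tower.
    q = [matvec(params[t]["wq"], x) for x, t in zip(tokens, towers)]
    k = [matvec(params[t]["wk"], x) for x, t in zip(tokens, towers)]
    v = [matvec(params[t]["wv"], x) for x, t in zip(tokens, towers)]
    scale = math.sqrt(len(q[0]))
    outputs = []
    for qi in q:
        # Shared: every token attends over every token, regardless of tower.
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        outputs.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return outputs


rng = random.Random(0)
dim = 4


def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


params = {
    tower: {name: random_matrix(dim, dim) for name in ("wq", "wk", "wv")}
    for tower in ("ar", "diffusion")
}
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
towers = ["ar", "ar", "diffusion"]

outputs = mot_layer(tokens, towers, params)
print(len(outputs), len(outputs[0]))  # 3 4
```

Changing the diffusion token changes the outputs of the AR tokens as well, which is the sense in which the towers share a workspace while keeping separate parameters.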

The analogy: most AI today is a company with siloed departments (text dept, vision dept) passing memos. Cosmos 3 is a single mind where the senses live together and share a workspace — the way you fuse sight, sound, and language into one understanding without consciously "switching modules."

Tokenizing everything into a shared space

How do you get pixels and words into the same Transformer? Each modality is first encoded/compressed by its own tokenizer into a learned latent representation, then projected into a shared latent token space:

  • text → standard discrete language tokens,
  • image/video/audio → compressed by modality-specific encoders (video, notably, via a temporally-causal tokenizer) → projected into the common space.

The shared space isn't raw pixels-next-to-words; it's a learned latent space where every modality is mapped to compatible Transformer inputs.

What makes it a "world model"

This is the crucial conceptual leap. A world model doesn't just classify inputs — it learns a latent representation of environment dynamics well enough to predict what happens next. Cosmos 3 is trained to:

  • reason about motion, causality, and spatial relationships,
  • predict future video and future action sequences,
  • and thereby support physical-AI tasks (robotics, autonomous driving).

Training uses a progressive multimodal curriculum: data is curated and tokenized per modality, shared training aligns the latent space, AR losses train the reasoning/prediction tower, diffusion losses train the generation tower, and complexity ramps up across modalities over the curriculum.

On results, Cosmos 3 was reported at release as the top open-source Text-to-Image and Image-to-Video model per Artificial Analysis, and the best policy model per RoboArena — a telling combination, because it spans pure generation and embodied control with one architecture.

Why it matters going forward: Cosmos 3 prototypes the move toward omnimodal world models — systems that build a unified internal model of how the world works across senses, rather than treating each modality as a separate problem. This is the architectural direction underpinning grounded, embodied, interactive AI: robots, AR, autonomous systems — AI that doesn't just describe the world but perceives, predicts, and acts within it.


Zooming out: the through-line of 2026

Step back and the five papers tell one coherent story. Remember our two enemies — quadratic attention and dense compute? Look at how the field is routing around them, and what it's building once it does:

Paper | Primary move | The lesson it teaches
Nemotron 3 Super | Mamba-2 SSM + Latent MoE + occasional full attention + RL | Hybrid architectures beat pure Transformers for efficient, agentic reasoning
Step 3.5 Flash | Sparse MoE (11B active / 196B total) + hybrid attention + MTP | Capability is decoupling from size: frontier reasoning at small active compute
Gated DeltaNet-2 | Linear attention with decoupled erase/write gates | Memory should be explicitly controllable, not an ever-growing cache
EverMemOS | Structured, consolidated memory OS around the LLM | Persistent intelligence lives outside the weights, in memory architecture
Cosmos 3 | Mixture-of-Transformers omnimodal world model | The frontier is unified, grounded, world-modeling AI, not text engines

The unifying shift: the era of "just make it bigger" is giving way to "make it structured, efficient, persistent, and grounded." Three of these papers are about doing more with less active compute (Nemotron, Step 3.5, Gated DeltaNet-2). One is about memory that lasts (EverMemOS). One is about perception and action that's unified and embodied (Cosmos 3). Together they're the scaffolding for the thing everyone is actually building toward: capable, efficient, persistent autonomous agents that operate in the real world.

If you want to go deeper, the natural next steps for an engineer are to (a) implement a toy linear-attention layer and add the delta rule, then the erase/write gates — it's the most self-contained mechanism here; (b) read a minimal MoE implementation to internalize routing and load balancing; and (c) skim a Mamba-2 explainer to get comfortable with the SSM recurrence. Those three concepts unlock most of what's in this list.


What this means for your business: the cost of AI is about to fall off a cliff

Here's the part that should matter to anyone running a company rather than a research lab.

For the last few years, the unspoken tax on adopting AI was compute cost. Capable models lived in giant data centers, and using them well meant renting expensive GPUs or paying per-token to whoever owned them. That tax made a lot of businesses hesitate — "let's wait until it's cheaper."

The five papers above are, collectively, the sound of that tax being repealed. Look at the through-line again: frontier-level capability is decoupling from raw size. An 11B-active model now reasons like yesterday's giants. Linear-attention and state-space architectures are slashing the cost of long context. Smaller, smarter models increasingly run on readily available, affordable infrastructure — the kind of hardware you can already access today, and that gets cheaper and more powerful every single quarter.

The strategic implication is clear: the cost of AI compute is heading down, fast — and the right time to begin your AI transition is now, not after the prices fall. Here's why waiting is the more expensive choice:

  • The technology curve is on your side. Whatever feels expensive today will be markedly cheaper within a few quarters. Building your foundation now means you ride that curve down instead of starting from zero once it bottoms out.
  • The real bottleneck isn't compute — it's readiness. Getting your data, workflows, and team prepared to use AI takes time. That's the work that pays off the moment cheap, capable models become ubiquitous. Businesses that start that groundwork now will be the ones positioned to capitalize immediately.
  • Early movers compound their advantage. AI adoption isn't a switch you flip; it's a capability you build. The organizations that begin experimenting today will have months of accumulated learning, tooling, and institutional know-how that late adopters simply can't buy overnight.

In short: don't let the current price of compute scare you off a transition whose costs are already collapsing. Start small, start now, and let the falling cost curve work in your favor.

Ready to begin your AI journey? Get in touch with wbos — we can help you take the first practical steps toward an AI transition, so your business is ready to capitalize the moment capable AI becomes cheap enough to run anywhere. The future favors those who started early.


A note on sourcing: these five were selected from expert reading lists (Sebastian Raschka's 2026 LLM-papers series, dair-ai's weekly roundups) and Hugging Face trending papers — 2026 has no single official "top papers" ranking yet. Mechanism details are drawn from the papers' technical reports and accompanying analyses; where public materials only partially expose an implementation (notably EverMemOS's exact pruning thresholds and some Cosmos 3 benchmark tables), I've described what's confirmed and flagged what isn't.
