Latest In AI
15 mins
This is a tour of five of the most notable papers in the field of AI — a mechanism-level walkthrough.
Published 28th June, 2026
You know what a Transformer is. You've fine-tuned a model, you understand attention, you can read a loss curve. But the field is moving fast, and 2026 quietly delivered a batch of papers that change the default answers to questions like "how do we handle long context?", "do we really need a giant model?", and "how does an agent remember anything?"
This is a tour of five of those papers — not a shallow listicle, but a mechanism-level walkthrough. We'll keep the intuition grounded in analogies, but we won't stop at the analogy. By the end you should understand not just what each paper achieves, but how — enough to explain it at a whiteboard or start prototyping a toy version.
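A quick orientation before we dive in. Almost every paper here is a reaction to the same two bottlenecks that have defined Transformer scaling: quadratic attention, where cost grows with the square of the context length, and dense compute, where every token passes through every parameter.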
Keep these two enemies in mind. Each paper below is, in some sense, a different escape route.
NVIDIA Research · arXiv 2604.12374
The headline: A 120B-parameter model with only ~12B active parameters per token, built by interleaving three different layer types — Mamba-2 state-space layers, sparse Mixture-of-Experts feed-forwards, and occasional full attention — and then post-trained heavily with reinforcement learning to make it good at acting, not just answering.
Let's unpack each of the three ingredients, because this is a genuinely good case study in modern architecture design.
Here's the core problem with attention. To produce the output for token t, standard attention computes:
Attn(X) = softmax(QKᵀ / √d) · V
That QKᵀ term is a T×T matrix. Double your context length, quadruple your cost. And during generation, you must cache every past key and value, so the KV cache grows linearly with the context and never shrinks.
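To make the quadratic term tangible, here is a minimal NumPy sketch of single-head, non-causal attention. The function name `softmax_attention`, the head dimension, and the sequence lengths are illustrative assumptions, not anything taken from the papers. It prints the score-matrix size and a rough KV-cache size for two context lengths.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Plain (non-causal) scaled dot-product attention for one head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # shape (T, T): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, scores.shape

rng = np.random.default_rng(0)
d = 16                                             # toy head dimension (assumption)
for T in (128, 256):
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    out, score_shape = softmax_attention(Q, K, V)
    kv_cache_floats = 2 * T * d                    # one key and one value per past token
    print(T, score_shape, score_shape[0] * score_shape[1], kv_cache_floats)
```

Doubling T from 128 to 256 quadruples the number of score entries, while the cache only doubles.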
State-space models (SSMs) like Mamba-2 take a completely different approach, borrowed from classical control theory. Instead of letting each token look back at all previous tokens, an SSM maintains a single fixed-size recurrent state s_t and updates it as each token streams in:
s_t = f(s_{t-1}, x_t) # update the state with the new token
y_t = g(s_t) # read an output from the state
Think of it as the difference between two ways of summarizing a meeting:
The consequences are dramatic:
| | Softmax Attention | Mamba-2 SSM |
|---|---|---|
| Time complexity | O(T²) | O(T) — linear |
| Memory during decoding | KV cache grows with T | Constant — fixed state size |
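
The constant-memory row can be illustrated with a toy recurrence. This is a deliberately simplified, time-invariant linear SSM, not Mamba-2 itself (Mamba-2 makes its parameters input-dependent, which is omitted here). The matrices `A`, `B`, `C` and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def linear_ssm_scan(xs, A, B, C):
    """Run s_t = A s_{t-1} + B x_t, y_t = C s_t over a sequence."""
    state = np.zeros(A.shape[0])       # fixed-size state, independent of T
    ys = []
    for x in xs:                       # one constant-memory update per token
        state = A @ state + B @ x
        ys.append(C @ state)
    return np.array(ys), state

rng = np.random.default_rng(0)
d_in, d_state = 4, 8                   # toy sizes (assumption)
A = 0.9 * np.eye(d_state)              # simple decaying memory
B = rng.standard_normal((d_state, d_in))
C = rng.standard_normal((d_in, d_state))

for T in (10, 1000):
    ys, final_state = linear_ssm_scan(rng.standard_normal((T, d_in)), A, B, C)
    print(T, ys.shape, final_state.shape)
```

The state keeps the same shape whether the sequence has 10 tokens or 1000, which is the property the table summarizes.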
The catch is that a fixed-size summary is, in principle, lossy — you can't cram infinite history into a fixed state perfectly. So Nemotron doesn't go all SSM. It uses Mamba-2 for most layers (the cheap "sequence grind") and sprinkles in full attention layers at a few select depths for precise global mixing. You get linear-time efficiency for the bulk of the work, with periodic moments of full, exact, all-to-all communication. This hybrid pattern — mostly SSM, occasionally attention — is one of the defining architectural motifs of 2026.
This is what makes Nemotron's 1M-token context practical rather than theoretical. An agent that needs to hold a long task history, tool outputs, and intermediate plans in context can actually do so.
Now the second enemy: dense compute. The fix is Mixture-of-Experts (MoE). Instead of one big feed-forward network (FFN) that every token passes through, you have E experts (each its own FFN) and a small router network that, for each token, picks the top-k most relevant experts. Only those run.
The restaurant analogy is apt: rather than one chef cooking every dish, a head chef (the router) reads each order and sends it to the right specialists. Most of the kitchen stays idle for any given order, so you can afford a much bigger kitchen.
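The routing step described above can be sketched in a few lines. This is a minimal, single-token illustration under stated assumptions: each "expert" is a single linear map standing in for a full FFN, the gate is a softmax over the selected experts only, and the names `topk_moe` and `W_router` are hypothetical.

```python
import numpy as np

def topk_moe(h, W_router, experts, k):
    """Route one token vector h to its top-k experts and mix their outputs."""
    logits = W_router @ h                          # one score per expert
    top = np.argsort(logits)[-k:]                  # indices of the k highest scores
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                             # softmax over the selected experts
    out = sum(g * experts[i](h) for g, i in zip(gate, top))
    return out, sorted(top.tolist())

rng = np.random.default_rng(0)
d, E, k = 8, 16, 2                                 # toy sizes (assumption)
W_router = rng.standard_normal((E, d))
# Each "expert" is one random linear map standing in for a full FFN.
weights = [rng.standard_normal((d, d)) for _ in range(E)]
experts = [lambda x, W=W: W @ x for W in weights]

out, chosen = topk_moe(rng.standard_normal(d), W_router, experts, k)
print(chosen, out.shape)                           # only k of the E experts ran
```

Only the `k` selected experts are ever called, so compute scales with `k` while capacity scales with `E`.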
Nemotron's twist is Latent MoE. In a vanilla MoE, the router and the experts all operate on the full model dimension d, which is expensive. Latent MoE first compresses each token into a smaller latent dimension ℓ (where ℓ ≪ d), does both the routing and the expert computation in that compressed space, then projects the result back up to d:
z = W_latent · h # compress h (dim d) → z (dim ℓ)
z' = combine(experts run on z) # route + run experts in cheap latent space
h' = W_out · z' # project back up to dim d
Because the experts live in the small latent space, you can afford to consult roughly 4× more experts for the same FLOP budget. More experts means more room for specialization — different experts can encode different tools, domains, or coding idioms.
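The three-line recipe above (compress, route and compute in latent space, project back) can be sketched as follows. Everything concrete here is an assumption for illustration: the sizes, the single-matrix `tanh` experts, and the names `latent_moe`, `W_router`, and `d_latent`. It is not NVIDIA's implementation.

```python
import numpy as np

def latent_moe(h, W_latent, W_router, experts, W_out, k):
    """Compress, route and run experts in latent space, then project back."""
    z = W_latent @ h                               # compress: dim d -> latent dim
    logits = W_router @ z                          # the router also sees only z
    top = np.argsort(logits)[-k:]
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()
    z_new = sum(g * np.tanh(experts[i] @ z) for g, i in zip(gate, top))
    return W_out @ z_new                           # project back: latent dim -> d

rng = np.random.default_rng(0)
d, d_latent, E, k = 64, 16, 32, 4                  # toy sizes with d_latent << d
W_latent = rng.standard_normal((d_latent, d)) / np.sqrt(d)
W_router = rng.standard_normal((E, d_latent))
experts = [rng.standard_normal((d_latent, d_latent)) / np.sqrt(d_latent) for _ in range(E)]
W_out = rng.standard_normal((d, d_latent)) / np.sqrt(d_latent)

h_new = latent_moe(rng.standard_normal(d), W_latent, W_router, experts, W_out, k)
params_latent_expert = d_latent * d_latent         # one toy expert in latent space
params_full_expert = d * d                         # the same toy expert at full width
print(h_new.shape, params_latent_expert, params_full_expert)
```

In this toy setup each latent expert is 16 times smaller than a full-width one; the actual saving in the paper depends on its real dimensions and on the cost of the two projections, which this sketch does not model.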
Two more pieces make this an agent model rather than just an efficient LLM:
Why it matters going forward: Nemotron 3 Super is widely treated as a reference blueprint for agent-native foundation models. The lesson isn't "this specific model"; it's the recipe — mostly-linear sequence processing + sparse high-capacity FFNs + a dash of full attention + heavy RL on real environments. Expect to see this template echoed across the next wave of open models.
StepFun · arXiv 2602.10604
The headline: A model with 196B total parameters but only ~11B active per token, that performs near the frontier. If Nemotron showed you the hybrid recipe, Step 3.5 Flash is a masterclass in squeezing frontier behavior out of a small active compute budget.
The first thing to clear up is that phrase, "active parameters," because it confuses a lot of people.
It's not a different kind of parameter — it's MoE sparsity. The full model is a 45-layer sparse MoE Transformer (3 dense layers + 42 MoE layers). Each MoE layer holds:
For each token, the router selects the top-8 routed experts, and the shared expert always fires — so 9 experts run per layer. Each expert is small, so although the total parameter count across all 288×42 experts is 196B, the compute for any single token only touches ~11B of them.
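A quick back-of-the-envelope check of those figures, using only numbers stated in the text. One assumption is flagged in the code: "288×42" is read as 288 routed experts in each of the 42 MoE layers.

```python
total_params_b = 196      # total parameters, in billions (from the text)
active_params_b = 11      # active parameters per token, in billions (from the text)
routed_per_layer = 288    # assumption: "288x42" means 288 routed experts per MoE layer
routed_selected = 8       # top-8 routing (from the text)
shared_always_on = 1      # the shared expert (from the text)

active_fraction = active_params_b / total_params_b
routed_fraction = routed_selected / routed_per_layer
experts_run_per_layer = routed_selected + shared_always_on

print(f"{active_fraction:.1%} of parameters active per token")
print(f"{routed_fraction:.1%} of routed experts selected per layer")
print(experts_run_per_layer, "experts run per MoE layer")
```

The active share of parameters (about 5.6%) is larger than the share of routed experts selected (about 2.8%), presumably because attention, the dense layers, the shared expert, and embeddings run for every token.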
So "196B total / 11B active" reads as: enormous storehouse of specialized knowledge, but only a small, relevant slice is consulted per token. You pay the memory cost of 196B but the compute cost of 11B. That shared expert is a nice design detail — it prevents the model from fragmenting all knowledge across specialists, keeping general "glue" competence always on tap.
(A training-time wrinkle: routers love to play favorites, sending most tokens to a handful of experts and letting the rest atrophy. Step 3.5 uses loss-free load balancing, which, as the name suggests, nudges the router toward even expert utilization without adding an auxiliary penalty to the training loss, typically by adjusting per-expert routing biases, so the full capacity actually gets used.)
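One common way to balance expert load without an auxiliary loss term is to keep a per-expert bias that affects only expert selection and to nudge it after each batch. The sketch below illustrates that general idea; whether Step 3.5 uses exactly this scheme is an assumption, and the function names, step size, and synthetic router scores are all hypothetical.

```python
import numpy as np

def route_with_bias(scores, bias, k):
    """Pick top-k experts per token using biased scores (bias affects selection only)."""
    return np.argsort(scores + bias, axis=-1)[:, -k:]

def update_bias(bias, chosen, num_experts, step=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = np.bincount(chosen.ravel(), minlength=num_experts)
    return bias - step * np.sign(load - load.mean())

rng = np.random.default_rng(0)
tokens, E, k = 512, 8, 2                   # toy sizes (assumption)
skew = np.linspace(0.0, 2.0, E)            # the router initially favors high-index experts
bias = np.zeros(E)

def load_spread(current_bias):
    """Route a fresh synthetic batch and report how uneven the expert load is."""
    scores = rng.standard_normal((tokens, E)) + skew
    chosen = route_with_bias(scores, current_bias, k)
    return chosen, np.bincount(chosen.ravel(), minlength=E).std()

_, spread_before = load_spread(bias)
for _ in range(300):
    chosen, _ = load_spread(bias)
    bias = update_bias(bias, chosen, E)
_, spread_after = load_spread(bias)
print(spread_before, spread_after)         # per-expert load spread, before vs. after
```

No gradient flows through the bias, so the balancing pressure does not compete with the language-modeling loss.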
Step 3.5 attacks the quadratic-attention enemy differently from Nemotron. Rather than SSMs, it uses a 3:1 ratio of Sliding-Window Attention (SWA) to full attention:
Three cheap local layers for every one expensive global layer. This is what lets Step 3.5 support a 256K-token context economically. It's the same philosophy as Nemotron's "mostly cheap, occasionally exact," just implemented with windowed attention instead of state-space recurrence.
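The saving from windowed attention is easy to see by counting which token pairs are allowed to interact. The sketch below builds a causal mask with an optional sliding window and a 3:1 layer schedule. The window size, sequence length, and the exact placement of the full-attention layer within each group of four are assumptions, not values from the paper.

```python
import numpy as np

def causal_mask(T, window=None):
    """Boolean mask: entry [i, j] is True when token i may attend to token j."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = j <= i                       # causal: no looking ahead
    if window is not None:
        mask &= (i - j) < window        # local: only the last `window` tokens
    return mask

def layer_pattern(num_layers, swa_per_full=3):
    """Repeat `swa_per_full` sliding-window layers followed by one full layer."""
    return ["full" if (n + 1) % (swa_per_full + 1) == 0 else "swa"
            for n in range(num_layers)]

T, window = 1024, 128                   # illustrative sizes, not the paper's
full_pairs = int(causal_mask(T).sum())
swa_pairs = int(causal_mask(T, window).sum())
print(layer_pattern(8))
print(full_pairs, swa_pairs)
```

The sliding-window count grows roughly as T times the window, while the full causal count grows as T squared over two, so the gap widens as the context gets longer.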
Like Nemotron, Step 3.5 uses Multi-Token Prediction — here specifically MTP-3, predicting 3 tokens per forward pass. The model emits tentative logits for positions t+1, t+2, t+3; a speculative-decoding verifier accepts or rejects them. When the guesses are good (which they often are for predictable spans), you skip forward passes and throughput jumps.
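The accept-or-reject step can be illustrated with a simplified greedy rule: keep drafted tokens for as long as they match what the full model would have produced, and replace the first mismatch. This is a toy under stated assumptions. Real systems that sample rather than decode greedily use a probabilistic acceptance test, and the function name `verify_draft` is hypothetical.

```python
def verify_draft(draft_tokens, verifier_tokens):
    """Greedy verification: keep the longest prefix where draft and verifier agree.

    verifier_tokens[i] is what the full model would emit at position i given
    the accepted prefix plus draft_tokens[:i]. At the first disagreement the
    verifier's token is used instead and the rest of the draft is dropped.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            accepted.append(verified)   # correct the mismatch, then stop
            break
        accepted.append(drafted)
    return accepted

print(verify_draft(["the", "cat", "sat"], ["the", "cat", "sat"]))  # all three accepted
print(verify_draft(["the", "cat", "sat"], ["the", "dog", "ran"]))  # stops at position 2
```

In the first call one verification pass yields three tokens; in the second it still yields two, so a wrong guess costs little while a right guess saves forward passes.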
Pushing a small active model through long, multi-step reasoning chains is numerically dangerous — activations can explode over many steps. Step 3.5 leans on a couple of stabilizers worth knowing:
Why it matters going forward: This paper is exhibit A for the year's biggest shift — capability is decoupling from raw size. Sara Hooker's widely-circulated 2026 essay on "broken scaling assumptions" is essentially a meditation on results like this. The practical upshot: strong reasoning is becoming cheap enough to run locally, privately, and at low latency, which changes who can deploy capable AI and where.
DeltaNet research community (NVIDIA) · arXiv 2605.22791
The headline: A linear-attention layer that compresses all history into a single fixed-size state — but adds separate, independently-controlled gates for erasing old information and writing new information. This is the most mathematically elegant paper in the batch, and worth slowing down for.
We've established that linear-attention / SSM approaches replace the unbounded KV cache with a fixed-size state matrix S_t. The whole game in this family is: how do you update that state well?
Picture the state S_t as a little associative memory — a matrix that maps keys to values, roughly approximating Σ vⱼ kⱼᵀ over all tokens seen so far. To read it, you query with k_t:
y_t = S_t · k_t # read the value associated with this key
The naive update is just to add each new association: S_t = S_{t-1} + v_t k_tᵀ. But this is the notebook-with-no-eraser problem — you only ever accumulate. Old, stale, or contradicted information never leaves. The memory gets crowded and noisy.
DeltaNet's key idea is to borrow the delta rule from classical learning theory: before writing, check what the memory currently predicts, and only write the correction. The original DeltaNet update:
S_t = S_{t-1} − β_t (S_{t-1} k_t − v_t) k_tᵀ
Read that middle term carefully — it's intuitive once you see it:
- S_{t-1} k_t is what the memory currently predicts for key k_t.
- (S_{t-1} k_t − v_t) is the error: how wrong that prediction is versus the true value v_t.
- The correction is scaled by the step size β_t.

So instead of blindly piling on, the memory self-corrects at each step: "you thought key k mapped to X, but it actually maps to Y, so let me fix just that." Importantly, this is happening in the forward pass. It's a memory mechanism, not backprop.
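To see the self-correction concretely, here is a minimal NumPy sketch of the delta-rule update written above. The function name `delta_update`, the toy dimensions, and the choice of β = 1 with a unit-norm key are illustrative assumptions that make the arithmetic easy to follow.

```python
import numpy as np

def delta_update(S, k, v, beta):
    """One delta-rule step: S <- S - beta * (S k - v) k^T."""
    error = S @ k - v                   # what memory predicts minus the target
    return S - beta * np.outer(error, k)

d_k, d_v = 4, 3                         # toy sizes (assumption)
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])      # unit-norm key
old_v = np.array([1.0, 2.0, 3.0])
new_v = np.array([9.0, 9.0, 9.0])

S = delta_update(S, k, old_v, beta=1.0)
print(S @ k)                            # memory now returns old_v for this key
S = delta_update(S, k, new_v, beta=1.0)
print(S @ k)                            # overwritten: new_v, not old_v + new_v
```

With the purely additive update, the second read would return the sum of both values; the delta rule replaces the stale one instead.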
Earlier gated versions added a single scalar gate to control forgetting. But here's the insight the "-2" paper is built on: erasing old information and writing new information are two conceptually different actions, and they shouldn't share one knob.
Gated DeltaNet-2 introduces two separate channel-wise gate vectors:
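- b_t controls which key-dimensions of the old state to wipe, and
- w_t controls which value-dimensions of new info to commit,

plus a channel-wise decay D_t. The full state update (don't panic, we'll read it in plain English) is written with the state transposed relative to the equations above, so keys now index rows and values index columns: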
S_t = (I − k_t (b_t ⊙ k_t)ᵀ) · D_t · S_{t-1} + k_t (w_t ⊙ v_t)ᵀ
In three plain-English steps:
1. Decay the old state: D_t · S_{t-1}.
2. Erase: the (I − k_t (b_t ⊙ k_t)ᵀ) factor is a surgical eraser. It removes the components of the old state aligned with the current key k_t, and the b_t vector controls which dimensions get erased and how hard.
3. Write: add the new association k_t (w_t ⊙ v_t)ᵀ, with w_t controlling which value dimensions actually get stored.

Both gates are produced from the input token (b_t = σ(W_b x_t), w_t = σ(W_w x_t)), so the model learns when and what to erase versus write, per-token, per-channel.
The everyday version: it's the difference between a notebook where the eraser and the pen are wired to the same hand (erase a lot → forced to write a lot) versus a notebook where you can erase aggressively in one place while writing gently in another. That independence is the whole point, and it measurably improves how cleanly the model manages long-range memory.
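The independence of the two gates can be checked numerically with a direct transcription of the state update into NumPy. This is a sketch under stated assumptions: the state has shape (key dim, value dim) as the update formula implies, D_t is represented by a vector holding its diagonal, and the gate weights are set to extreme values by hand so that each gate is effectively fully open or fully shut. Names such as `gated_delta2_step` and `gate_weights` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_delta2_step(S, x, k, v, W_b, W_w, decay):
    """One step of S_t = (I - k (b*k)^T) D S_{t-1} + k (w*v)^T, with S of shape (d_k, d_v)."""
    b = sigmoid(W_b @ x)                           # erase gate over key dimensions
    w = sigmoid(W_w @ x)                           # write gate over value dimensions
    decayed = decay[:, None] * S                   # D_t S_{t-1}, D_t given by its diagonal
    erased = decayed - np.outer(k, b * k) @ decayed
    return erased + np.outer(k, w * v)

d_x, d_k, d_v = 2, 4, 3                            # toy sizes (assumption)
x = np.ones(d_x)
k = np.array([0.0, 1.0, 0.0, 0.0])                 # unit-norm key
v = np.array([5.0, 6.0, 7.0])
S = np.outer(k, np.ones(d_v))                      # memory already stores [1, 1, 1] under k
decay = np.ones(d_k)                               # no decay, to isolate the two gates

def gate_weights(rows, value):
    """Constant weights that drive a gate to ~1 (value=20) or ~0 (value=-20)."""
    return value * np.ones((rows, d_x))

erase_only = gated_delta2_step(S, x, k, v, gate_weights(d_k, 20.0), gate_weights(d_v, -20.0), decay)
erase_and_write = gated_delta2_step(S, x, k, v, gate_weights(d_k, 20.0), gate_weights(d_v, 20.0), decay)
print(erase_only.T @ k)        # close to zero: old value wiped, nothing written
print(erase_and_write.T @ k)   # close to v: old value wiped, new value stored
```

The first call erases hard while writing almost nothing, which a single shared gate could not express.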
One practical note for engineers: recurrent updates sound sequential and GPU-unfriendly, but the paper provides a chunkwise (WY) algorithm that parallelizes these updates across chunks of tokens, plus a gate-aware backward pass — so training stays efficient.
Why it matters going forward: This is foundational plumbing. Alongside related work ("Delta Attention Residuals," "Deep Delta Learning"), Gated DeltaNet-2 is helping define the post-vanilla-Transformer toolbox — architectures that handle very long context stably and cheaply, with explicit, controllable memory rather than an ever-growing cache. If long-horizon agents are the destination, this is the road being paved.
arXiv 2601.02163
The headline: A "memory OS" that sits in front of an LLM and manages long-term memory the way an operating system manages files — turning raw conversation into structured, consolidated, retrievable memory objects, and injecting only the minimal sufficient context back into the model at inference.
The previous three papers gave models better internal memory (longer context, better state). EverMemOS attacks a different layer entirely: persistent memory across sessions. The frustrating reality of most assistants is amnesia — close the window, and tomorrow it has no idea who you are. EverMemOS is an architecture for fixing that, and it's notably not just "stuff everything into a vector database."
EverMemOS structures memory as a pipeline that deliberately mirrors how human memory consolidates: Episodic Trace Formation → Semantic Consolidation → Reconstructive Recollection.
Stage 1 — MemCells (the atomic unit). Raw dialogue isn't stored as a flat log. Each interaction is distilled into a MemCell, a structured record — formally a tuple (E, F, P, M):
Note this is not just an embedding vector — it's a compressed, structured object. Compression happens at write time, which is already a form of pruning.
Stage 2 — MemScenes (consolidation). As MemCells accumulate, related ones are clustered (online) into MemScenes — higher-level, thematically-unified groupings. From each scene, a more stable profile/summary is distilled. This is the "repeated events compress into a stable category" behavior of human memory: instead of 50 separate notes about a project, you form one consolidated "scene" of that project that updates as new cells arrive. Redundant information gets merged rather than duplicated.
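The online clustering idea can be sketched with a simple threshold rule: a new cell joins the most similar existing scene, or opens a new one if nothing is similar enough. This is a generic illustration, not EverMemOS's actual algorithm. The embedding vectors, the cosine threshold of 0.8, the running-mean centroid, and the name `assign_to_scene` are all assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def assign_to_scene(scenes, cell_vec, threshold=0.8):
    """Add cell_vec to the most similar scene, or open a new scene.

    Each scene is a dict with a running-mean `centroid` and a member `count`.
    Returns the index of the scene the cell joined.
    """
    best, best_sim = None, threshold
    for idx, scene in enumerate(scenes):
        sim = cosine(scene["centroid"], cell_vec)
        if sim >= best_sim:
            best, best_sim = idx, sim
    if best is None:
        scenes.append({"centroid": list(cell_vec), "count": 1})
        return len(scenes) - 1
    scene = scenes[best]
    n = scene["count"]
    scene["centroid"] = [(c * n + x) / (n + 1) for c, x in zip(scene["centroid"], cell_vec)]
    scene["count"] = n + 1
    return best

scenes = []
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]   # stand-in cell embeddings
print([assign_to_scene(scenes, vec) for vec in stream])      # prints [0, 0, 1, 1]
```

Because each cell is processed once as it arrives, the scenes update incrementally without re-clustering the whole history.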
Stage 3 — Reconstructive Recollection (retrieval). At query time, EverMemOS does not dump all memory into the prompt. It runs a two-stage hybrid retrieval:
That guiding principle — retrieve what is necessary and sufficient, not everything — is the heart of the system. EverMemOS is essentially a memory manager that brokers between raw history and the LLM's limited context window.
EverMemOS reports solid gains on standard long-term-memory benchmarks:
Those multi-hop and knowledge-updating numbers are the interesting ones: they show structured memory helping reasoning over time, not just raw recall.
Why it matters going forward: Memory is the missing ingredient for lifelong agents, meaning assistants that maintain projects, preferences, and context over weeks and months, getting more useful over time rather than resetting daily. EverMemOS became a canonical 2026 design and sits within a fast-growing family of agent-memory systems (Mem0, SimpleMem, MemRL, Focus). The broader lesson: a lot of "agent intelligence" lives outside the model weights, in how you architect memory around it.
NVIDIA · arXiv 2606.02800
The headline: An omnimodal world model built on a Mixture-of-Transformers (MoT) architecture, that jointly understands and generates across text, image, video, audio, and even action sequences — and is trained to predict future states, making it usable as a vision-language model, a video generator, a simulator, and a robot policy.
This is the most ambitious paper in the set, and the one pointing most directly at physical AI.
Don't confuse MoT with MoE. MoE routes among many small FFN experts. MoT is about handling many modalities. In Cosmos 3, the architecture combines:
with modality-specific parameter sets within each Transformer layer, all interacting through joint attention. So it's not "one undifferentiated Transformer that sees everything," nor is it "separate models bolted together." It's a middle path:
The analogy: most AI today is a company with siloed departments (text dept, vision dept) passing memos. Cosmos 3 is a single mind where the senses live together and share a workspace — the way you fuse sight, sound, and language into one understanding without consciously "switching modules."
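The "shared workspace, separate specialists" idea can be sketched as a layer in which every token uses weights belonging to its own modality, while a single attention step mixes all tokens together. This is a minimal illustration under stated assumptions, not the Cosmos 3 architecture: it has two modalities, one head, no normalization, and hypothetical names such as `mot_layer`, `W_qkv`, and `W_ffn`.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def mot_layer(tokens, modality_ids, W_qkv, W_ffn):
    """Joint attention over all tokens, then a per-modality feed-forward map.

    tokens: (T, d) array; modality_ids: length-T list of keys into the dicts
    W_qkv and W_ffn, which hold modality-specific weights.
    """
    T, d = tokens.shape
    q, k, v = np.zeros((T, d)), np.zeros((T, d)), np.zeros((T, d))
    for t, m in enumerate(modality_ids):          # modality-specific projections
        Wq, Wk, Wv = W_qkv[m]
        q[t], k[t], v[t] = Wq @ tokens[t], Wk @ tokens[t], Wv @ tokens[t]
    mixed = softmax(q @ k.T / np.sqrt(d)) @ v     # one attention over every modality
    out = np.stack([np.tanh(W_ffn[m] @ mixed[t])  # modality-specific feed-forward
                    for t, m in enumerate(modality_ids)])
    return tokens + out                           # residual connection

rng = np.random.default_rng(0)
d = 8                                             # toy width (assumption)
modalities = ("text", "image")
W_qkv = {m: [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)] for m in modalities}
W_ffn = {m: rng.standard_normal((d, d)) / np.sqrt(d) for m in modalities}
tokens = rng.standard_normal((5, d))
modality_ids = ["text", "text", "image", "image", "image"]

out = mot_layer(tokens, modality_ids, W_qkv, W_ffn)
print(out.shape)
```

Changing an image token changes the output at the text positions too, because the attention step is shared even though the weights are not.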
How do you get pixels and words into the same Transformer? Each modality is first encoded/compressed by its own tokenizer into a learned latent representation, then projected into a shared latent token space:
The shared space isn't raw pixels-next-to-words; it's a learned latent space where every modality is mapped to compatible Transformer inputs.
This is the crucial conceptual leap. A world model doesn't just classify inputs — it learns a latent representation of environment dynamics well enough to predict what happens next. Cosmos 3 is trained to:
Training uses a progressive multimodal curriculum: data is curated and tokenized per modality, shared training aligns the latent space, AR losses train the reasoning/prediction tower, diffusion losses train the generation tower, and complexity ramps up across modalities over the curriculum.
On results, Cosmos 3 was reported at release as the top open-source Text-to-Image and Image-to-Video model per Artificial Analysis, and the best policy model per RoboArena — a telling combination, because it spans pure generation and embodied control with one architecture.
Why it matters going forward: Cosmos 3 prototypes the move toward omnimodal world models — systems that build a unified internal model of how the world works across senses, rather than treating each modality as a separate problem. This is the architectural direction underpinning grounded, embodied, interactive AI: robots, AR, autonomous systems — AI that doesn't just describe the world but perceives, predicts, and acts within it.
Step back and the five papers tell one coherent story. Remember our two enemies — quadratic attention and dense compute? Look at how the field is routing around them, and what it's building once it does:
| Paper | Primary move | The lesson it teaches |
|---|---|---|
| Nemotron 3 Super | Mamba-2 SSM + Latent MoE + occasional full attention + RL | Hybrid architectures beat pure Transformers for efficient, agentic reasoning |
| Step 3.5 Flash | Sparse MoE (11B active/196B total) + hybrid attention + MTP | Capability is decoupling from size — frontier reasoning at small active compute |
| Gated DeltaNet-2 | Linear attention with decoupled erase/write gates | Memory should be explicitly controllable, not an ever-growing cache |
| EverMemOS | Structured, consolidated memory OS around the LLM | Persistent intelligence lives outside the weights, in memory architecture |
| Cosmos 3 | Mixture-of-Transformers omnimodal world model | The frontier is unified, grounded, world-modeling AI — not text engines |
The unifying shift: the era of "just make it bigger" is giving way to "make it structured, efficient, persistent, and grounded." Three of these papers are about doing more with less compute per token (Nemotron, Step 3.5, Gated DeltaNet-2). One is about memory that lasts (EverMemOS). One is about perception and action that's unified and embodied (Cosmos 3). Together they're the scaffolding for the thing everyone is actually building toward: capable, efficient, persistent autonomous agents that operate in the real world.
If you want to go deeper, the natural next steps for an engineer are to (a) implement a toy linear-attention layer and add the delta rule, then the erase/write gates — it's the most self-contained mechanism here; (b) read a minimal MoE implementation to internalize routing and load balancing; and (c) skim a Mamba-2 explainer to get comfortable with the SSM recurrence. Those three concepts unlock most of what's in this list.
Here's the part that should matter to anyone running a company rather than a research lab.
For the last few years, the unspoken tax on adopting AI was compute cost. Capable models lived in giant data centers, and using them well meant renting expensive GPUs or paying per-token to whoever owned them. That tax made a lot of businesses hesitate — "let's wait until it's cheaper."
The five papers above are, collectively, the sound of that tax being repealed. Look at the through-line again: frontier-level capability is decoupling from raw size. An 11B-active model now reasons like yesterday's giants. Linear-attention and state-space architectures are slashing the cost of long context. Smaller, smarter models increasingly run on readily available, affordable infrastructure — the kind of hardware you can already access today, and that gets cheaper and more powerful every single quarter.
The strategic implication is clear: the cost of AI compute is heading down, fast — and the right time to begin your AI transition is now, not after the prices fall. Here's why waiting is the more expensive choice:
In short: don't let the current price of compute scare you off a transition whose costs are already collapsing. Start small, start now, and let the falling cost curve work in your favor.
Ready to begin your AI journey? Get in touch with wbos — we can help you take the first practical steps toward an AI transition, so your business is ready to capitalize the moment capable AI becomes cheap enough to run anywhere. The future favors those who started early.
A note on sourcing: these five were selected from expert reading lists (Sebastian Raschka's 2026 LLM-papers series, dair-ai's weekly roundups) and Hugging Face trending papers — 2026 has no single official "top papers" ranking yet. Mechanism details are drawn from the papers' technical reports and accompanying analyses; where public materials only partially expose an implementation (notably EverMemOS's exact pruning thresholds and some Cosmos 3 benchmark tables), I've described what's confirmed and flagged what isn't.