The Future of Agentic Computing

Specialized chips and on-device architectures could reshape the future of AI.

Chris Roth
airesearcheconomics

Three things are converging that could fundamentally change the future of AI: companies are building specialized chips purpose-built for LLM inference, frontier models are plateauing, and model knowledge is diverging from model ability. The implications could ripple through hardware, databases, privacy, and the entire software stack.

Specialized Inference Chips

A wave of companies is building custom silicon purpose-built for LLM inference. The approaches vary — hardwired model weights, transformer-specific architectures, analog compute-in-memory, deterministic scheduling — but the goal is the same: strip away the overhead of general-purpose GPUs and deliver faster, cheaper, more power-efficient inference. TrendForce projects custom ASIC shipments growing 44.6% in 2026 versus GPU shipments at 16.1%. The economics are pulling the industry toward specialization.

Etched ($620M raised) hardwires the transformer architecture into silicon but still loads weights from memory, so the same chip can run any transformer model. Groq ($20B licensing deal with NVIDIA) uses deterministic scheduling to eliminate GPU overhead. Cerebras ($10B+ agreement with OpenAI) builds wafer-scale chips with massive on-chip memory. EnCharge AI ($144M) does analog compute-in-memory. All are programmable and model-flexible.

The most radical approach comes from Taalas, which permanently encodes a specific model's weights directly into the chip's wiring. No memory bandwidth bottleneck, no expensive high-bandwidth memory chips, no power-hungry data shuffling — just raw, hardwired inference. Multiple independent journalists have verified speeds of ~15,000–17,000 tokens/sec on Llama 3.1 8B at ~200–250W, a dramatic speed advantage over existing solutions.

Taalas's approach borrows from the structured ASIC playbook of the early 2000s: pre-manufacture a base chip with all the common circuitry, then customize just the final wiring layers for a specific model. A compiler-like system generates the chip design in about a week, then the foundry turns around just those layers. This eliminates the biggest bottleneck in LLM inference — the "memory wall" where the chip spends most of its time waiting for model weights to arrive from memory rather than actually computing. The claimed result: 1000x better performance-per-watt and 20x lower build cost.
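A back-of-envelope calculation shows why the memory wall dominates. The figures below are rough public numbers used purely for illustration, not Taalas's own claims:

```python
# Batch-1 LLM decoding is memory-bandwidth bound: generating each token
# requires streaming every model weight from memory, so
# tokens/sec <= memory_bandwidth / bytes_per_weight_pass.

PARAMS = 8e9  # Llama 3.1 8B

def max_tokens_per_sec(bandwidth_gb_s: float, bits_per_weight: float) -> float:
    """Upper bound on batch-1 tokens/sec set by weight streaming alone."""
    bytes_per_pass = PARAMS * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_pass

# A high-end GPU with ~3,350 GB/s of HBM bandwidth, FP16 weights:
gpu_bound = max_tokens_per_sec(3_350, 16)  # ~209 tok/s ceiling per stream

# Hardwiring the weights removes weight traffic to memory entirely,
# which is how five-digit tok/s figures become plausible.
print(f"GPU batch-1 ceiling: ~{gpu_bound:.0f} tok/s")
```

GPUs escape this ceiling by batching many requests per weight pass, but for a single latency-sensitive stream the bandwidth bound is what you feel.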

The main caveat with Taalas specifically is quality: the speed comes from aggressive 3-bit weight quantization — compressing a model's parameters to fit into hardwired silicon — and no independent quality benchmarks exist yet. Printed chips also run exactly one model, permanently, with a ~2-month turnaround for new silicon. But the base chip can be stockpiled in advance; only the final wiring layers are customized per model. MarkTechPost frames this as a "seasonal hardware cycle" — fine-tune a model in spring, deploy hardwired chips by summer.
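For intuition on what 3-bit quantization gives up, here is a minimal sketch using uniform symmetric quantization; production schemes add group-wise scales and calibration, and the weight data below is synthetic:

```python
import numpy as np

def quantize_3bit(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Uniform symmetric 3-bit quantization: 8 integer levels in [-4, 3]."""
    scale = np.abs(w).max() / 4  # map the largest magnitude onto the range
    q = np.clip(np.round(w / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # a fake weight row
q, scale = quantize_3bit(w)
err = np.abs(w - dequantize(q, scale)).mean()
# Only 8 distinct values survive per weight. Whether a given model
# tolerates that is exactly what no independent benchmark has tested.
```

Every weight collapses to one of eight values, so the open question is not whether quality degrades but by how much on real workloads.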

And the single-model limitation matters less with each passing month. Models are plateauing and converging — the gap between leader and 10th place on Chatbot Arena has compressed dramatically in one year. The open-weight vs. proprietary gap has narrowed from ~8 percentage points to ~1.7 on MMLU. One analysis found frontier models are overkill for ~80% of enterprise workloads. A hardwired Llama 4 that's 6 months old but runs at 17,000 tok/s might outperform a current frontier model in practice for most tasks — because speed enables richer agent loops, more retrieval passes, and better chain-of-thought at negligible marginal cost.

Inference ASICs could become common the way ASICs became common in networking: specialized, high-volume, quietly ubiquitous in the infrastructure layer even if most people never think about them. Not for every workload — but for the high-volume, latency-sensitive inference tasks where the model is stable and context windows are modest, which is a genuinely large slice of production AI.

Where Inference ASICs Might Dominate

The biggest near-term impact would be making agentic workloads dramatically faster, cheaper, and more energy-efficient — the kind of rapid, iterative reasoning loops that tools like Cursor and Claude Code run today. Agents that can run offline, on-device, with no cloud dependency open up entirely new deployment scenarios.

Many devices already use custom silicon for traditional ML — Tesla's HW4/AI5 chips, Axis and Verkada security cameras, Starkey and ReSound hearing aids. What specialized LLM chips add is the possibility of fast reasoning on top of those existing pipelines. A security camera that detects objects is nothing new; one that can reason about what it sees in real time is a different product.

What Remains Unproven

Taalas is the most speculative of these bets. No technical papers have been published at any conference. Current printed chips have limited on-chip memory for tracking conversation context, meaning they work best for shorter interactions — Hacker News commenters noted they're compelling for sub-10K token contexts but less so for long-context reasoning.

The cautionary tales are worth noting. Mythic AI pursued analog compute-in-memory, raised $165M, and ran out of runway before reaching revenue. Wave Computing claimed "1000x performance" for neural networks, raised over $200M, and filed for bankruptcy. Graphcore raised $710M, peaked at a $2.8B valuation, then sold to SoftBank at a steep loss.

Agents Would Get Faster and Cheaper

If specialized inference chips take off, the speed implications are staggering. Taalas's printed chips have demonstrated ~15,000–17,000 tok/s versus ~350 tok/s on current hardware — and even more conservative ASIC approaches deliver order-of-magnitude speedups over general-purpose GPUs. Models deployed on edge devices eliminate network latency entirely. Every "turn" in tools like Cursor or Claude Code that currently takes seconds could compress to sub-second timelines.

The cost collapse is equally dramatic. Inference costs have already dropped ~1,000-fold in three years while demand rose ~10,000-fold — the cost to achieve a benchmark AI task score plunged from $4,500 to $11.64, a 387x improvement. Specialized chips push this further — Taalas claims 20x lower build cost than an H100 and 1000x better performance-per-watt, and even less radical approaches offer substantial efficiency gains over general-purpose GPUs.

What this could enable: richer agent loops with more retrieval passes and better chain-of-thought at negligible marginal cost. Agents running in parallel across cloud, edge, desktop, and mobile. Always-on AI assistants that are economically viable at consumer scale. And if history is any guide, cheaper inference won't reduce total AI spend — it will increase it. DeepSeek R1 delivered comparable performance at ~27x cheaper inference costs, and usage exploded. Enterprise AI spending more than tripled from $11.5B to $37B in a single year. This is Jevons's Paradox in real time.

On-Device, Edge, & Cloud

Faster, cheaper inference doesn't just mean faster agents — it changes where they run. Cloud AI dominates today ($89–122B in 2025, projected to reach $363B–$1.7T by 2030–2033), and that's not going anywhere. But as inference gets cheap enough to run on smaller hardware, compute starts migrating toward the edge. Edge AI ($25–36B in 2025) is growing faster in percentage terms, and the hyperscalers know it — Azure Arc, AWS Outposts, and Google Distributed Cloud are all bets on pushing compute closer to users.

On-device is where it gets really interesting. AI PCs will hit 55% of global PC shipments by 2026 (Gartner). Qualcomm's Snapdragon X2 Elite reaches 80 TOPS. AMD's Ryzen AI Max runs Llama 3 70B on a laptop. Apple's on-device AI strategy offers inference at zero API cost to developers — potentially the most significant play here. Dr. Ben Lee (UPenn/Google) estimates we "could be getting 80% of compute done locally and leaving 20% for the data center cloud."

When specialized inference chips make on-device inference dramatically faster and more power-efficient, the economics shift hard toward running things locally whenever you can — while cloud handles the frontier tasks that require the largest models and the most compute.

Network Latency Becomes the Bottleneck

Here's where it gets interesting for the software stack.

Human browsing is bursty but slow: a handful of requests to load a page, then the human reads and thinks before clicking something. Agentic workloads are fundamentally different — rapid back-and-forth turns, iterative steps, fetching, reasoning loops. The latency from network waterfalls adds up much faster.

With an ultra-fast model on a specialized chip running 50x faster with no network latency and all data on the same device, every "turn" compresses to sub-second timelines. But if the model is fast and the data is far away, you've just moved the bottleneck from compute to network.
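The shift is easy to quantify. A minimal sketch with representative (hypothetical) numbers — 500 output tokens per turn, one remote round-trip per turn:

```python
# Split one agent turn into compute time vs network time.

def turn_time(tok_per_s: float, rtt_ms: float,
              tokens: int = 500, round_trips: int = 1) -> dict:
    """Time budget for a single agent turn, in milliseconds."""
    compute = tokens / tok_per_s * 1000
    network = rtt_ms * round_trips
    return {"compute_ms": compute, "network_ms": network,
            "network_share": network / (compute + network)}

slow = turn_time(tok_per_s=350, rtt_ms=80)      # cloud GPU, remote data
fast = turn_time(tok_per_s=17_000, rtt_ms=80)   # ASIC, remote data
local = turn_time(tok_per_s=17_000, rtt_ms=1)   # ASIC, co-located data

# At 350 tok/s, an 80 ms round-trip is ~5% of the turn.
# At 17,000 tok/s, the same round-trip is ~73% of it.
# Co-locating the data removes it almost entirely.
```

Speed up the model ~50x and the unchanged network hop goes from rounding error to dominant cost — which is the whole argument for moving data next to the agent.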

This is the forcing function for what comes next.

Local-First Databases For Agents

If agents and their data need to be co-located for maximum speed, then multi-master, eventually consistent, local-first databases become essential infrastructure for agentic computing. Agents running in parallel across cloud, edge, and desktop reconcile their state through eventually consistent data.

Local-first development is already gaining momentum independently of AI. There are 26+ major tools in the space — Turso, PowerSync, ElectricSQL, CRDT libraries like Automerge and Yjs — driven by the appeal of offline-capable apps, instant UI, and user-owned data. Adam Wiggins (co-founder of Heroku and Ink & Switch) compares local-first to "React in 2013" — exciting but early. Production deployments exist: Figma's CRDT-inspired multiplayer, Linear's IndexedDB-first architecture, Notion's offline mode. The SQLite renaissance is real — Turso, LiteFS, Cloudflare D1, PGlite.

On-device AI makes all of this doubly relevant: the same infrastructure that gives users instant, offline-capable apps also gives local AI agents instant, co-located data. Turso/libSQL is already positioning for "database-per-agent" architectures.
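The "eventually consistent" part is concrete. Here is a minimal sketch of the simplest CRDT, a grow-only counter — the kind of structure libraries like Automerge and Yjs generalize to rich documents (the class and replica names are illustrative):

```python
# G-counter CRDT: each replica (agent) increments only its own slot;
# merging takes the per-replica max, so merges commute and every replica
# converges to the same value regardless of sync order.

class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Element-wise max: applying merges in any order gives the same state.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

# Three agents (cloud, edge, laptop) work offline, then reconcile:
cloud, edge, laptop = GCounter("cloud"), GCounter("edge"), GCounter("laptop")
cloud.increment(3); edge.increment(2); laptop.increment(5)
cloud.merge(edge); cloud.merge(laptop)   # converges to 3 + 2 + 5 = 10
```

No coordinator, no locks: each agent writes locally at full speed and merges whenever connectivity allows — exactly the access pattern parallel agents need.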

Today, inference is so slow (2–20 seconds per turn) that data access times are a rounding error. But in a world with ultra-fast ASIC inference, data access actually becomes the bottleneck. When the model responds in milliseconds, every network round-trip to a remote database starts to matter. And even at current speeds, agents working with large datasets — codebases, filesystem trees, document repositories — already spend significant time on I/O that has nothing to do with inference. Claude Code spends substantial time reading, searching, and indexing files. For these workloads, co-locating the data with the agent matters regardless of how fast the model runs.

The Privacy Dimension

There's a deeper reason local-first matters beyond latency.

AI makes "inferential privacy violations" possible — deducing sensitive information never explicitly disclosed through patterns in seemingly innocuous data. The Markup investigated Microsoft's advertising platform Xandr and found 650,000 audience segments that advertisers could use to target people — labels like "Heavy Purchasers of Pregnancy Tests," depression-prone, specific medical diagnoses, and "Struggling Elders" — all inferred from browsing history, purchase records, and location data. The FTC brought enforcement actions against data brokers selling smartphone GPS data that revealed visits to reproductive health clinics, addiction treatment centers, and places of worship.

53% of Americans say AI does more to hurt than help people keep their personal information private (Pew). Cisco's 2024 Consumer Privacy Survey found 84% of GenAI users are concerned about data entered into AI tools going public. No comprehensive U.S. federal privacy law exists, and the regulatory landscape remains fragmented.

With AI so accessible, anyone can connect disparate data pieces, create profiles, and fill in missing information. Local-first architectures provide a structural answer: keep data on-device, under user control.

But What Does It All Mean?!

If these trends play out, we're looking at a fundamentally different computing paradigm where AI is fast, cheap, private, and everywhere.

  • Ultra-fast, ultra-cheap inference on specialized chips → agents that operate at sub-second speeds
  • Edge/on-device deployment → low network latency for the model
  • Local-first databases → low network latency for the data
  • Tool-augmented architectures → freshness without model retraining
  • Privacy by architecture → data stays on-device

The terrifying flip side: sub-second agent turns with local data means autonomous AI systems that iterate at inhuman speed. There are legitimate concerns about what happens when you remove all the friction — latency, cost, network dependencies — that currently acts as a natural brake on autonomous AI.

The bigger picture across all of this is that AI is becoming dramatically more distributed, faster, and cheaper — and in all likelihood, more democratizing than many people expect. When inference is nearly free and runs on commodity hardware, AI stops being something controlled by a handful of cloud providers and becomes something that runs everywhere, for everyone.