A discussion I had with Anthropic Claude verbatim.
(Bloomberg: circular support among a small group of players, and the inherent financial risk if current frontier AI models cannot deliver the expected value)
Prompt:
Re the financial risk referred to in this Bloomberg piece: is there any evidence of work on better inference and reasoning models that don't drive exponential growth in data-centre capacity to support them? The current approach seems highly inefficient, as if it requires near-infinite capacity to support.
# There’s substantial work underway — but with an important caveat that’s directly relevant to the Bloomberg systemic risk framing.
The core inefficiency you’ve identified is real and well-documented
Current reasoning models use test-time compute in the form of very long chains of thought, an approach with high inference cost: in transformer-based architectures, attention is quadratic in sequence length and the KV cache grows linearly with it. This is the structural problem: the more a model "thinks," the more compute it burns, and non-linearly so.
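A rough back-of-envelope sketch of those two scaling costs (the model dimensions and token counts below are illustrative, not any particular model's):

```python
# Why long chains of thought are expensive: attention FLOPs per layer grow
# roughly quadratically with sequence length n, while the KV cache grows
# linearly with n. Constants are simplified; the scaling is the point.
def attention_flops(n, d_model):
    # score matrix (n x n) plus the weighted sum: ~2 * n^2 * d_model
    return 2 * n * n * d_model

def kv_cache_bytes(n, d_model, n_layers, bytes_per_val=2):
    # keys + values cached per layer, assuming fp16 (2 bytes per value)
    return 2 * n * d_model * n_layers * bytes_per_val

short, long = 1_000, 10_000   # tokens before/after a long reasoning trace
d, layers = 4_096, 32         # illustrative model dimensions

print(attention_flops(long, d) / attention_flops(short, d))          # 100.0
print(kv_cache_bytes(long, d, layers) / kv_cache_bytes(short, d, layers))  # 10.0
```

A 10x longer "thinking" trace costs 100x the attention FLOPs and 10x the cache memory, which is the non-linearity the paragraph describes.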
Three serious efficiency responses are in active development:
1. Mixture of Experts (MoE) — the most commercially deployed
Rather than running every token through the entire network as dense transformer models do, MoE routes each token to a subset of specialized "expert" modules. The model can maintain a very large total parameter count, and thus broad knowledge and capability, while keeping active compute per token dramatically lower. DeepSeek is the proof case: DeepSeek-V2 reduced training costs by over 40% and compressed KV caches by over 90% through Multi-Head Latent Attention and a mixture-of-experts design, and V3/V3.2 build on the same foundation.
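A minimal sketch of top-k MoE routing, the mechanism described above (a toy illustration, not any specific model's router):

```python
import numpy as np

# Toy top-k MoE layer: a router scores each token against each expert and
# only the k best experts actually run, so active compute per token is a
# small fraction of the total parameter count.
rng = np.random.default_rng(0)

def moe_layer(tokens, experts, router_w, k=2):
    """tokens: (n, d); experts: list of (d, d) matrices; router_w: (d, n_experts)."""
    logits = tokens @ router_w                  # (n, n_experts) routing scores
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top = np.argsort(logits[i])[-k:]        # indices of the k best experts
        gates = np.exp(logits[i][top])
        gates /= gates.sum()                    # softmax over the selected experts
        for g, e in zip(gates, top):
            out[i] += g * (tok @ experts[e])    # only k of n_experts ever run
    return out

d, n_experts = 8, 16
tokens = rng.standard_normal((4, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts))
y = moe_layer(tokens, experts, router_w, k=2)   # each token used 2 of 16 experts
```

With k=2 of 16 experts, each token touches roughly an eighth of the layer's parameters while the full set remains available across tokens.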
2. Adaptive / toggleable reasoning
DeepSeek V3.1 introduced a hybrid inference structure that lets the model operate in both reasoning and non-reasoning modes, togglable via a "deep thinking" button, so users can calibrate computational investment against task complexity. This is meaningful: most queries don't need deep chain-of-thought, and forcing it on everything is exactly the waste your question identifies.
3. Alternative architectures challenging the transformer itself
The M1 model, built on a Mamba (state space model) architecture rather than a transformer, achieves 3x faster inference throughput while performing on par with DeepSeek-R1-Distill on reasoning benchmarks. This is genuinely post-transformer research, not just optimisation within the existing paradigm.
The caveat that matters for Bloomberg’s risk thesis
Even as per-token costs fall, larger context windows and reasoning models might produce more tokens, and thus more compute usage, per task. This is Jevons paradox applied to AI: efficiency gains lower cost, lower cost expands usage, and expanded usage consumes the efficiency savings and then some. The infrastructure response has been to optimize for speed, memory, latency, and energy, not just peak FLOPs, but inference now absorbs the majority of cost and complexity at scale.
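The Jevons dynamic is just arithmetic; with illustrative numbers (not forecasts):

```python
# Toy Jevons-paradox arithmetic: a 10x efficiency gain can still raise
# total compute spend if induced demand grows faster than cost falls.
cost_per_token = 1.0     # arbitrary units before the efficiency gain
efficiency_gain = 10     # per-token cost falls 10x
demand_growth = 25       # cheaper inference expands usage 25x

new_total = (cost_per_token / efficiency_gain) * demand_growth
print(new_total)         # 2.5: total spend rose despite the 10x gain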
Bottom line for the Bloomberg systemic risk question: The efficiency work is real and technically impressive — MoE architecture alone has delivered an order-of-magnitude cost reduction in some benchmarks. But it doesn’t defuse the circular investment risk Bloomberg describes. If anything, cheaper inference accelerates adoption, which sustains demand for the data centre buildout, which is precisely what keeps the investment web intact. The risk Bloomberg is pointing at — cascading losses if demand disappoints — is not resolved by efficiency gains unless those gains are so dramatic that the revenue case for continued infrastructure investment collapses. That’s the more interesting scenario, and arguably DeepSeek’s real disruption: not technical elegance, but the demonstration that frontier-capable models can be built at a fraction of Western cost assumptions.
——
Ok, but if we put the Bloomberg piece aside: there is an inherent inefficiency in LLM models, and DeepSeek touches on improving it, but 40% is small potatoes. You'd have to think someone will develop a radically different information architecture. You can only make a horse run so fast; it took a radically different model, i.e. the motor engine, to do that.
Your analogy is structurally correct, and the research community is increasingly saying the same thing explicitly.
The world has effectively staked trillions of dollars on the assumption that today's AI architecture will keep scaling. History suggests otherwise: each generation of technology eventually encounters limits, whether physical, economic, or conceptual, that demand a new approach. Hardware performance, measured in throughput per watt and per die area, is no longer keeping pace with the computational demands of frontier models. Cost curves are bending upward faster than efficiency gains.
So the premise is widely accepted inside the field. The transformer is at or near a local maximum. Here’s where the genuinely different approaches are:
State Space Models (SSMs) — mathematically distinct from transformers
SSMs represent sequence data by continually evolving a latent state rather than computing pairwise attention. S4 exploits the mathematical equivalence between a long recurrence and a convolution, enabling parallel training via FFT while giving SSMs linear-time inference with no attention matrix. Mamba's 3B model achieves the same accuracy as a Transformer of double the parameter count. This isn't optimization; it's a different mathematical substrate.
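That recurrence/convolution equivalence can be verified in a few lines. A minimal linear SSM sketch (toy dimensions, nothing from S4's actual parameterisation):

```python
import numpy as np

# A linear time-invariant SSM computed two ways.
# Recurrent form: h_t = A h_{t-1} + B x_t,  y_t = C h_t
#   -> O(n) inference with a constant-size state (no growing KV cache).
# Convolutional form: y_t = sum_j (C A^j B) x_{t-j}
#   -> the equivalence S4 exploits to train in parallel (via FFT at scale).
rng = np.random.default_rng(1)
d_state = 4
A = rng.standard_normal((d_state, d_state)) * 0.3   # scaled down for stability
B = rng.standard_normal(d_state)
C = rng.standard_normal(d_state)
x = rng.standard_normal(10)                         # scalar input sequence

# 1) Recurrence: one fixed-size state update per step.
h = np.zeros(d_state)
y_rec = []
for t in range(len(x)):
    h = A @ h + B * x[t]
    y_rec.append(C @ h)

# 2) Convolution with kernel k_j = C A^j B.
K = [C @ np.linalg.matrix_power(A, j) @ B for j in range(len(x))]
y_conv = [sum(K[j] * x[t - j] for j in range(t + 1)) for t in range(len(x))]

print(np.allclose(y_rec, y_conv))                   # True: identical outputs
```

The same model admits a sequential form for cheap inference and a parallel form for training, which is exactly the property attention lacks.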
Neuro-symbolic AI — separating pattern recognition from reasoning
Neuro-symbolic AI combines deep learning's ability to handle large-scale unstructured data with the structured reasoning of symbolic methods. Symbolic AI excels at reasoning and interpretability; neural AI thrives on learning from vast amounts of data. By merging these paradigms, NSAI aims to embody two fundamental aspects of intelligent cognitive behaviour: the ability to learn from experience and the capacity to reason over acquired knowledge. The key insight here is that LLMs currently do both with the same mechanism, which is the inefficiency. Separating them could be the structural break you're describing.
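The division of labour can be caricatured in a few lines: a learned component handles fuzzy surface forms, a symbolic component does exact inference, and they compose. Every name and rule below is an invented toy, not a real NSAI system:

```python
# Toy neuro-symbolic split. The "neural" part is a stand-in function doing
# fuzzy normalisation; the symbolic part is exact forward chaining over
# hand-written rules, with no statistics involved.
FACTS = {("socrates", "human")}
RULES = [("human", "mortal")]                   # human(x) -> mortal(x)

def neural_entity_linker(text: str) -> str:
    # Placeholder for a learned model: map a noisy surface form to an entity.
    return text.strip().lower().rstrip("?!.")

def symbolic_infer(entity: str, prop: str) -> bool:
    # Apply each rule exhaustively over the known facts.
    known = set(FACTS)
    for pre, post in RULES:
        for (e, p) in list(known):
            if p == pre:
                known.add((e, post))
    return (entity, prop) in known

entity = neural_entity_linker("  Socrates? ")
print(symbolic_infer(entity, "mortal"))         # True
```

An LLM does both jobs with one statistical mechanism; here the reasoning step is exact, auditable, and costs almost nothing, which is the efficiency argument for separating them.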
Brain-inspired architecture — the most radical departure
The human brain performs remarkable feats of intelligence on approximately 20 watts of power, less than a standard light bulb, while training large transformer models can consume energy equivalent to hundreds of households for days. Pathway's BDH architecture explicitly claims paradigm-shift status: BDH is pitched not as an incremental improvement on transformer-based architecture but as a paradigm shift, forming a modular structure similar to a network of neurons in the brain and designed to achieve generalization over time, overcoming a significant barrier to autonomous intelligence.
The honest assessment
The current plateau is forcing the field to move beyond scaling and re-engage with fundamental questions about the nature of intelligence itself. Each paradigm solves a different piece of the puzzle: SSMs for efficiency, neuro-symbolic AI for logic, world models for causality. The next true breakthrough will require a departure from the current architectural blueprint. 
But — and this is the key point — none of these is the motor car yet. They’re more like steam-powered carriages. The SSM/Mamba work is genuinely different mathematics but still fundamentally a sequence-processing paradigm. Neuro-symbolic is compelling but hasn’t scaled. Neuromorphic is decades from deployment at inference scale.
Your analogy implies something else too: the motor car didn’t come from people who were trying to breed faster horses. It came from a completely different engineering tradition. The transformer variants and even most SSM work are still being built by people whose frame of reference is “better LLM.” The genuinely disruptive architecture may come from outside that community entirely — computational neuroscience, physics-based computing, or photonic chips. That’s where I’d be watching.
——
Indeed: mathematical theory à la Einstein, space, time, infinity (well beyond my pay grade).
Ha — yes, exactly that register. The irony is that transformer architecture, for all its power, is essentially very sophisticated statistics at industrial scale. It has no model of the world, no concept of causality, no understanding of infinity or continuity in any mathematical sense.
The genuinely interesting question is whether intelligence — the kind that produced Einstein — is even in principle reproducible by next-token prediction scaled up. Most serious philosophers of mind and mathematicians (Penrose being the most provocative) would say no. That the substrate matters. That there’s something about continuous mathematical intuition that discrete token-based systems can’t reach.
Which loops back to your architectural point. The motor car didn’t just go faster — it operated on entirely different physical principles. The equivalent leap in AI might require not just a different architecture but a different mathematics underneath. Possibly something closer to how the brain actually handles continuity, causality and abstraction — none of which maps cleanly onto matrix multiplication.
Whether that comes from computational neuroscience, quantum approaches, or something nobody has named yet is genuinely open. But you’re right that it’s that order of magnitude of conceptual shift, not another 40% efficiency gain.
