
AI shifts to inference as costs rise, memory constraints emerge, researcher says

Sherri Wang, DIGITIMES Asia, Taipei

Winston Hsu speaks at AI Expo Taiwan 2026 on March 25. Credit: Sherri Wang

Artificial intelligence is entering a new phase in which inference, rather than training, is becoming the dominant driver of computing demand, as rising costs and memory constraints begin to reshape AI infrastructure, according to a researcher.

The shift reflects a structural change in how AI systems are deployed, as large-scale usage replaces earlier stages focused on model development.

Inference demand outpaces training growth

While model training remains essential, demand for inference is expanding more rapidly as AI adoption accelerates, researcher Winston Hsu said in an interview with DIGITIMES Asia on the sidelines of AI Expo Taiwan 2026.

"Training is still growing, but inference is growing faster," Hsu said.

Training costs are also rising sharply and are approaching the billion-dollar level, according to Hsu, underscoring the increasing investment required to develop advanced models. As workloads shift toward inference, the primary constraint is moving away from compute performance toward memory bandwidth and data movement. Hsu said inference workloads place greater emphasis on memory efficiency, as systems must process large volumes of tokens continuously.

The shift is driving changes in system design, with memory architecture becoming increasingly important in determining performance and cost efficiency.
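
As a rough illustration of why memory bandwidth, rather than compute, tends to cap inference throughput, consider the back-of-envelope Python sketch below. It relies on a common rule of thumb rather than on figures from Hsu: during autoregressive decoding with a small batch, every generated token requires streaming the full set of model weights from memory, so bandwidth sets an upper bound on tokens per second. The model size, weight precision, and bandwidth figures are hypothetical.

def decode_tokens_per_second(params_billions: float,
                             bytes_per_param: float,
                             mem_bandwidth_gb_s: float) -> float:
    # Upper bound on single-stream decode throughput: generating each token
    # requires streaming all model weights from memory once.
    model_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes = mem_bandwidth_gb_s * 1e9
    return bandwidth_bytes / model_bytes

# Hypothetical example: a 70B-parameter model stored in 8-bit weights on an
# accelerator with roughly 3 TB/s of memory bandwidth.
print(f"{decode_tokens_per_second(70, 1.0, 3000):.0f} tokens/s upper bound")

Under these assumed numbers the ceiling is roughly 43 tokens per second per stream, regardless of how much compute the chip offers, which is why memory architecture increasingly determines both performance and cost.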


Efficiency reshapes AI cost structure

Despite growing concerns over energy consumption, power itself is not the primary cost driver, according to Hsu. Instead, the key factor is how efficiently systems can generate tokens relative to energy usage.

"The question is how much energy you need to generate each token," he said.

The transition toward inference-heavy workloads is accelerating the development of application-specific hardware. Inference-focused chips can be optimized by stripping out training-specific features and concentrating on memory performance and energy efficiency, improving cost-performance ratios. This is expected to drive increased adoption of custom silicon, including application-specific integrated circuits, alongside general-purpose GPUs.

Hybrid deployment model begins to take shape

AI infrastructure is also evolving toward a hybrid model combining centralized data centers with localized systems.

Hsu said enterprises are likely to deploy smaller on-premises systems to support inference workloads and provide redundancy, particularly as reliance on AI services increases.

Open models are also narrowing the performance gap with proprietary systems, making localized deployment more viable.

The rise of inference is reshaping the economics of AI, shifting focus from model training to large-scale deployment and usage. As token demand increases, infrastructure design is increasingly influenced by efficiency, memory performance, and deployment flexibility.

The shift suggests future AI growth will depend not only on advances in model capabilities but also on how efficiently infrastructure can support inference at scale.

Article edited by Jack Wu