Arm raises stakes in mobile AI with Lumex CSS; debuts C1 CPU and Mali G1-Ultra GPU

Arm SVP & GM of Client Line of Business, Chris Bergey. Credit: Arm Unlocked Shanghai

Arm unveiled its Lumex Compute Subsystem (CSS) on September 10 at the Unlocked Summit in Shanghai, signaling its boldest step yet to elevate AI and gaming on flagship smartphones. The platform debuts the C1 CPU cluster, based on Armv9.3 with support for the second-generation Scalable Matrix Extension (SME2), and the Mali G1 GPU lineup, headlined by the Mali G1-Ultra. Together, they bring edge devices closer to data center-class performance.

Chris Bergey, Arm's SVP and GM of the Client Line of Business, said AI has become the "foundation of next-generation mobile and consumer technology." He noted that SME2 will extend across all Arm CPU platforms and projected that by 2030, SME and SME2 will add over 10 billion TOPS of compute power across more than 3 billion devices, according to reports from Sina and ICsmart.
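For a sense of scale, the projection can be turned into a rough per-device average. The short Python sketch below simply divides the two round figures quoted above; because both are stated as lower bounds ("over" and "more than"), the result is indicative only, and the function name is a hypothetical helper rather than anything from Arm.

```python
# Back-of-the-envelope average implied by the projection quoted above.
# Both inputs are the article's round figures, taken at face value.
TOTAL_TOPS = 10e9   # "over 10 billion TOPS" by 2030
DEVICES = 3e9       # "more than 3 billion devices"


def average_tops_per_device(total_tops: float, devices: float) -> float:
    """Return the mean TOPS per device for a fleet-wide total."""
    if devices <= 0:
        raise ValueError("devices must be positive")
    return total_tops / devices


avg = average_tops_per_device(TOTAL_TOPS, DEVICES)
print(f"Average SME/SME2 compute per device: {avg:.1f} TOPS")
```

On those figures, the projection works out to a little over 3 TOPS of CPU-side matrix compute per device.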

C1 CPU cluster: Armv9.3 with SME2 for flagship AI performance

Arm rolled out four C1 CPU variants, giving OEMs flexibility to balance performance, power, and area:

● C1-Ultra: The flagship super-core with the industry's widest microarchitecture, boosting IPC by 12%, delivering 25% higher single-thread performance and 28% lower power than Cortex-X925. Geekbench 6.3 scores show a 26% uplift over Cortex-X925 at lower energy cost.

● C1-Premium: Arm's first "sub-flagship" core, with a core-plus-L2-cache area 35% smaller than C1-Ultra, enabling cost-efficient 8-core designs that maintain comparable performance.

● C1-Pro: A big core optimized for energy efficiency, offering 11% more performance or 26% lower power draw than Cortex-A725, with real-world workloads showing 16% performance gains and 12% lower power use.

● C1-Nano: Built for wearables and compact devices, improving efficiency by 26% and SPECint2017 scores by 5.5%, with only a 2% area increase versus Cortex-A520.

At the cluster level, the new C1-DSU (DynamIQ Shared Unit) manages heterogeneous cores with SME2 support, cutting typical power by 11% and RAM wake-up energy by 7% compared with DSU-120.

A top-end 2× C1-Ultra + 6× C1-Pro cluster delivers 17 times the performance of a basic 2× C1-Nano setup, though with 25 times the die area. Across workloads, the C1 cluster averages 30% better performance, 15% faster gaming and streaming, and 12% lower power use in daily mobile tasks.
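The trade-off in the cluster comparison is easier to see as performance per unit of die area. The Python sketch below assumes both multipliers are measured against the same 2× C1-Nano baseline, as the sentence suggests; the function name is a hypothetical helper.

```python
# Relative performance density of the top-end cluster versus the baseline.
# Assumption: the 17x performance and 25x area figures share one baseline
# (the basic 2x C1-Nano configuration).
PERF_RATIO = 17.0   # 2x C1-Ultra + 6x C1-Pro vs. 2x C1-Nano
AREA_RATIO = 25.0   # die area of the same configuration vs. 2x C1-Nano


def relative_perf_density(perf_ratio: float, area_ratio: float) -> float:
    """Return performance per unit area, normalised to the baseline (1.0)."""
    if area_ratio <= 0:
        raise ValueError("area_ratio must be positive")
    return perf_ratio / area_ratio


density = relative_perf_density(PERF_RATIO, AREA_RATIO)
print(f"Performance per unit area vs. baseline: {density:.2f}x")
```

In other words, the flagship configuration buys absolute performance at roughly two-thirds of the baseline's performance per unit area, which is the usual cost of scaling up to wide, high-performance cores.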

Credit: Arm

SME2: AI on CPU gets real

SME2 turns the CPU into an AI accelerator. By enabling low-precision data formats and efficient matrix operations, it narrows the gap with GPUs for smaller AI tasks. Arm says SME2 delivers up to 5× higher AI performance and 3× better efficiency than the prior generation.
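To illustrate the kind of operation involved, the plain Python sketch below multiplies two small matrices of 8-bit integers by accumulating one outer product per step into a wider accumulator, a pattern commonly used to describe matrix extensions of this type. It is only an illustration of the arithmetic: it is not Arm code, does not model SME2 instructions, and every name in it is hypothetical.

```python
INT8_MIN, INT8_MAX = -128, 127


def matmul_outer_product(a, b):
    """Multiply A (m x k) by B (k x n) as a sum of k outer products.

    Inputs must be int8-range integers; the accumulator holds wider
    Python integers, mirroring an int8-in / wide-accumulate scheme.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a) or any(len(row) != n for row in b):
        raise ValueError("matrix shapes are inconsistent")
    for row in a + b:
        if any(not INT8_MIN <= v <= INT8_MAX for v in row):
            raise ValueError("inputs must fit in int8")

    acc = [[0] * n for _ in range(m)]       # wide accumulator tile
    for p in range(k):                      # one outer product per step
        col = [a[i][p] for i in range(m)]   # p-th column of A
        row = b[p]                          # p-th row of B
        for i in range(m):
            for j in range(n):
                acc[i][j] += col[i] * row[j]
    return acc


a = [[1, -2, 3], [4, 5, -6]]
b = [[7, 8], [9, 10], [11, 12]]
result = matmul_outer_product(a, b)
print(result)  # [[22, 24], [7, 10]]
```

Each step of the outer loop updates every element of the accumulator at once, which is why dedicated matrix hardware can perform far more work per instruction than ordinary vector code on the same data.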

Benchmarks show sharp gains:

● Whisper Base speech recognition latency fell by 4.7×;

● Google Gemma 3 chat inference sped up 4.7×;

● Stable Audio generation ran 2.8× faster;

● Neural camera denoising hit 120 fps at 1080p or 30 fps at 4K on a single core.
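The two denoising figures describe the same underlying throughput. The Python sketch below checks this, assuming "1080p" means 1920×1080 and "4K" means 3840×2160 (UHD); the article does not state the exact resolutions, and the names are hypothetical.

```python
# Assumed resolutions: Full HD for "1080p" and UHD for "4K".
RESOLUTIONS = {"1080p": (1920, 1080), "4K": (3840, 2160)}


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Return the number of pixels processed per second."""
    return width * height * fps


throughput_1080p = pixels_per_second(*RESOLUTIONS["1080p"], fps=120)
throughput_4k = pixels_per_second(*RESOLUTIONS["4K"], fps=30)

print(f"1080p @ 120 fps: {throughput_1080p:,} pixels/s")
print(f"4K    @  30 fps: {throughput_4k:,} pixels/s")
```

Under those assumptions both cases come to 248,832,000 pixels per second: 4K has four times the pixels of 1080p, and the frame rate drops by the same factor.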

SME2 is already being adopted by Alibaba, Ant Group, Samsung System LSI, Tencent, and Vivo, underscoring its role in next-generation AI optimization.

Credit: ICsmart

Mali G1-Ultra GPU: doubling down on ray tracing and AI

The Mali G1-Ultra GPU headlines Arm's fifth-gen GPU architecture, engineered for mobile gaming and AI. With mobile gamers now making up 83% of the global gaming population and logging 390 billion hours annually, the G1-Ultra pushes performance to new heights.

Highlights include:

● RTUv2 ray tracing with 2× performance and 40% higher frame rates;

● 20% graphics uplift;

● 20% faster AI inference with MMUL.FP16, 104% better FP32 ML throughput;

● 9% lower per-frame power use for longer gaming sessions.

Arm also launched Mali G1-Premium and G1-Pro GPUs for mid-tier markets, scalable from 1 to 24 shader cores. Early testing shows games like Genshin Impact, Honkai: Star Rail, and Fortnite all running with significant improvements on the G1-Ultra.

For developers, the GPU adds tile-based hardware counters, Vulkan support, and future RenderDoc integration for deeper profiling. Arm Accuracy Super Resolution (ASR), already in Fortnite Mobile via Unreal Engine 5, enhances visuals while keeping GPU load low.

Lumex CSS platform: system IP, 3nm readiness, and developer tools

Lumex CSS is more than CPUs and GPUs. It integrates system IP and is validated for 3nm process nodes across leading foundries, giving chipmakers production-ready building blocks.

● SI L1 interconnect: Reduces leakage power by 71%, cuts latency 75% versus CI-7000, and integrates system-level cache with Memory Tagging Extension (MTE) support.

● MMU L1 memory management unit: Lowers TBU latency by up to 83%, enabling secure, scalable virtualization for Android and Windows devices.

● NoC S3 interconnect: Tailored for cost-sensitive, non-coherent mobile SoCs.

On the software side, Arm's KleidiAI library, already integrated into PyTorch, llama.cpp, MediaPipe, and more, delivers SME2-optimized AI performance automatically. With Android 16-ready stacks and observability tools like Vulkan counters, Streamline, and Perfetto, developers can tune workloads before hardware launches.

Industry adoption and market impact

No early design wins were announced, though Vivo signaled interest on stage. Analysts suggest MediaTek's Dimensity 9500 may adopt the C1 CPU cluster and Mali G1-Ultra GPU, though not the full Lumex CSS.

Executives clarified that Lumex CSS is a reference subsystem, not a turnkey SoC. OEMs must still integrate NPUs, basebands, and interface IPs to build commercial chips.

Still, as AI shifts to the edge, Lumex CSS offers pre-validated, 3nm-ready IP that helps chipmakers shorten design cycles and lower risk.

By combining C1 CPUs with SME2 and Mali G1 GPUs with RTUv2, Lumex CSS delivers desktop-grade gaming, generative AI, and real-time multimedia on mobile. With 3nm readiness, strong ecosystem support, and a full developer toolchain, Arm positions Lumex CSS as a foundation for next-generation flagship devices in the AI-first era.

Article edited by Jack Wu