Large Language Models (LLMs) are becoming a core workload for emerging mobile applications such as smart glasses, real-time multimodal assistants, and on-device AR guidance. However, today's wireless systems are fundamentally mismatched to the behavior of LLM workloads.
Through 1.6 million cross-layer measurements on real cellular networks, we uncover three key properties that set LLM services apart from traditional mobile workloads: (1) LLM interactions constantly shift between uploading large multimodal inputs (e.g., images) and downloading large generated outputs, causing the wireless bottleneck to flip between uplink and downlink. (2) LLM computation and wireless transmission are tightly coupled: changes in network slice configuration can shift the bottleneck back and forth between computation and communication. (3) LLMs generate tokens in bursty, irregular waves rather than steady streams, creating traffic patterns that existing wireless schedulers are not designed to handle.
To address this mismatch, we present WiLLM, the first cross-layer wireless system built specifically for LLM workloads. WiLLM introduces an LLM-aware slicing architecture, cross-layer scheduling primitives that dynamically adapt to LLM behavior across the User Equipment (UE), RAN, and core network, and a universal UE compatibility mechanism that requires no firmware changes. We implement WiLLM with LLM-centric control paths and deploy it on a full 5G testbed with GPU-accelerated core inference. Our evaluation, including a baseline comparison and a smart-glasses case study, demonstrates that WiLLM enables stable and predictable LLM interactions on commodity mobile devices. We release our complete platform and dataset to spur future research on LLM services for mobile systems at https://openwillm.github.io.