Apple has demonstrated that large language models (LLMs) can accurately identify user activities by combining text representations of audio and motion data, without ever accessing the raw audio. This multimodal approach opens new possibilities for health monitoring and smart fitness applications.
The company's latest research employs a "post-hoc multimodal sensor fusion" process, in which two small models independently process the raw sensor streams. The audio model converts WAV-format sound into text labels or brief captions, while the IMU motion model generates action predictions from accelerometer and gyroscope data.
These two streams of time-aligned text are then combined into a single prompt, which is fed to LLMs such as Google's Gemini 2.5 Pro and Alibaba's Qwen-32B for inference and activity classification.
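To make the idea concrete, here is a minimal Python sketch of that late, text-level fusion step. The data structures, prompt wording, and names (TimedLabel, build_fusion_prompt) are illustrative assumptions rather than Apple's actual implementation; the point is that the LLM only ever sees short text summaries, never raw audio.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TimedLabel:
    """A text label produced by a small per-modality model for one time window."""
    start_s: float
    end_s: float
    text: str

def build_fusion_prompt(audio_captions: List[TimedLabel],
                        imu_predictions: List[TimedLabel],
                        activities: List[str]) -> str:
    """Combine textualized audio and IMU outputs into a single prompt for an LLM.

    Mirrors the late-fusion idea described above: only text descriptions from
    the smaller models are passed on, not the underlying audio or motion data.
    """
    audio_lines = "\n".join(
        f"[{c.start_s:.0f}-{c.end_s:.0f}s] audio: {c.text}" for c in audio_captions)
    imu_lines = "\n".join(
        f"[{p.start_s:.0f}-{p.end_s:.0f}s] motion: {p.text}" for p in imu_predictions)
    return (
        "You are given time-aligned text descriptions of a person's audio "
        "environment and body motion.\n\n"
        f"{audio_lines}\n{imu_lines}\n\n"
        "Which one of the following activities is the person most likely doing? "
        f"Answer with exactly one item: {', '.join(activities)}."
    )

if __name__ == "__main__":
    prompt = build_fusion_prompt(
        audio_captions=[TimedLabel(0, 10, "running water and clinking dishes")],
        imu_predictions=[TimedLabel(0, 10, "standing, repetitive arm movement")],
        activities=["cooking", "dishwashing", "vacuuming", "reading"],
    )
    print(prompt)  # this string would then be sent to an LLM such as Gemini 2.5 Pro
```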
While raw sensor data is highly informative, fusing it across modalities remains challenging. Apple's research team explains that applying LLMs only at the final fusion stage supports multimodal tasks at minimal additional computational cost. The method could benefit future activity-recognition and health-monitoring systems, especially when sensor data is limited or must be fused across modalities.
For benchmarking, Apple selected 12 daily activities from the Ego4D first-person (egocentric) dataset, including cooking, vacuuming, washing dishes, using a computer, watching movies, playing basketball, playing soccer, interacting with pets, and reading. The researchers then evaluated LLM performance under two sets of conditions.
First, they varied how much in-context information the model received, testing zero-shot and one-shot scenarios. In the one-shot setting, the LLM is given a representative example of each activity in the prompt before classification, helping it recognize typical patterns; in the zero-shot setting, no examples are provided and the LLM must classify directly.
Second, they varied the scope of allowed responses, comparing a closed-set format, where the answer must come from the predefined list of activities, with an open-ended format that measures whether the LLM can reason beyond that list.
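For illustration, the following sketch shows how those four conditions (zero-shot vs. one-shot, closed-set vs. open-ended) might be expressed as prompt variants. The article does not describe Apple's exact prompt wording, so the text and the make_prompt helper below are hypothetical.

```python
from typing import List, Optional

def make_prompt(fused_description: str,
                activities: List[str],
                one_shot_example: Optional[str] = None,
                closed_set: bool = True) -> str:
    """Build one of four prompt variants: (zero- vs one-shot) x (closed-set vs open-ended)."""
    parts = []
    if one_shot_example is not None:
        # One-shot: prepend a single worked example, as described in the article.
        parts.append("Example:\n" + one_shot_example)
    parts.append("Sensor summary:\n" + fused_description)
    if closed_set:
        # Closed-set: the answer must be chosen from the predefined activity list.
        parts.append("Answer with exactly one activity from this list: "
                     + ", ".join(activities) + ".")
    else:
        # Open-ended: the model may name any activity, including ones outside the list.
        parts.append("In a few words, what activity is the person most likely doing?")
    return "\n\n".join(parts)

if __name__ == "__main__":
    example = ("Sensor summary:\n"
               "[0-10s] audio: bouncing ball, squeaking shoes\n"
               "[0-10s] motion: running with frequent jumps\n"
               "Answer: playing basketball")
    print(make_prompt("[0-10s] audio: pages turning\n[0-10s] motion: sitting still",
                      ["reading", "computer use", "watching movies"],
                      one_shot_example=example,
                      closed_set=True))
```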
Results showed that one-shot prompting enabled the LLMs to better map the textualized audio and IMU signals to specific activities, significantly improving multimodal activity recognition in both closed-set and open-ended response settings. Accuracy was highest in the closed-set scenarios.
Article edited by Jack Wu



