Apple has released a technical report detailing its 2025 Apple Intelligence language models, providing insight into the architecture of its on-device and cloud-based systems, as well as its approach to data sourcing and multilingual optimization. The company underscores that all data used for training its AI models was obtained legally and in accordance with web crawling protocols.
Smart on-device architecture enhances efficiency
The on-device Apple Intelligence model features a 3-billion-parameter dual-block Transformer design. Block 1, responsible for the bulk of computation, comprises 62.5% of the model's Transformer layers. Block 2, accounting for the remaining 37.5%, reuses the keys and values computed by Block 1 instead of generating its own, reducing key-value (KV) cache memory usage by 37.5%.
This resource-sharing approach shortens the time to first token and allows the model to perform effectively even on hardware-constrained devices.
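The arithmetic behind that savings figure can be sketched in a few lines. The layer count below is an illustrative assumption (Apple's report gives percentages, not this exact number); the point is that when Block 2's layers store no KV entries of their own, cache memory shrinks by exactly Block 2's share of the layers.

```python
# Hypothetical sketch of the dual-block KV-cache savings described above.
# The 32-layer count is an assumption for illustration, not Apple's figure.

def kv_cache_layers(num_layers: int, block1_fraction: float = 0.625) -> tuple[int, int]:
    """Return (baseline, shared) counts of layers that store a KV cache.

    Baseline: every Transformer layer keeps its own keys/values.
    Shared: only Block 1 layers keep KV; Block 2 layers reuse them.
    """
    block1_layers = round(num_layers * block1_fraction)
    baseline = num_layers      # one KV cache per layer
    shared = block1_layers     # Block 2 layers store nothing
    return baseline, shared

baseline, shared = kv_cache_layers(32)   # assumed layer count
savings = 1 - shared / baseline
print(f"KV-cache savings: {savings:.1%}")  # 37.5% with a 62.5/37.5 split
```

Because the cache scales linearly with the number of layers that keep their own keys and values, a 62.5/37.5 split yields the 37.5% reduction regardless of the actual layer count.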
For its Private Cloud Compute (PCC) service, Apple introduced a new architecture known as Parallel-Track Mixture-of-Experts (PT-MoE). Built on the traditional Transformer framework, this system divides the model into multiple parallel tracks. Each track is equipped with a Mixture-of-Experts (MoE) layer that dynamically selects appropriate sub-models ("experts") based on task requirements, improving both response speed and accuracy.
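The core MoE idea, that a gate scores the experts and only the top-scoring few actually run, can be illustrated with a toy example. The expert functions and gate scores below are invented for illustration and are not Apple's PT-MoE internals.

```python
# Toy illustration of Mixture-of-Experts routing: a gate ranks the experts,
# only the top-k are evaluated, and their outputs are mixed by softmax weight.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, gate_scores, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    # Only the chosen experts are evaluated -- the source of MoE's speedup:
    # compute cost scales with top_k, not with the total number of experts.
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

# Hypothetical experts; a real model would use learned sub-networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
out = moe_layer(10.0, experts, gate_scores=[0.1, 2.0, 0.5], top_k=2)
```

The design trade-off is that each token pays only for `top_k` experts, so total model capacity can grow without a proportional increase in per-token latency, which is why the article links MoE to both response speed and accuracy.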
Apple unveils four key sources of training data
Apple disclosed four primary data sources: web-crawled data, licensed corpora, synthetic data, and public datasets. Notably, the share of non-English data has significantly expanded from 8% to 30% to enhance the naturalness and effectiveness of its multilingual Writing Tools.
Apple uses its own web crawler, Applebot, to collect web data. The company emphasizes full compliance with the Robots Exclusion Protocol (robots.txt), respecting site owners' decisions to block crawling. Additionally, Apple clearly states that personal user data and interaction logs are not used for training purposes.
Even when content surfaces in Siri or Spotlight search results, it is not included in model training unless explicitly permitted via robots.txt.
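The opt-out mechanism described here is the standard Robots Exclusion Protocol. A minimal sketch of how a crawler honors it, using Python's standard-library `urllib.robotparser`; the rules and URLs below are hypothetical:

```python
# Minimal robots.txt compliance check (Robots Exclusion Protocol).
# The rules and example.com URLs are hypothetical, for illustration only.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: Applebot
Disallow: /private/

User-agent: *
Disallow: /
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The Applebot group permits everything except /private/:
print(parser.can_fetch("Applebot", "https://example.com/articles/"))   # True
print(parser.can_fetch("Applebot", "https://example.com/private/x"))   # False
# Other crawlers fall through to the catch-all group and are blocked:
print(parser.can_fetch("OtherBot", "https://example.com/articles/"))   # False
```

A site owner who wants to allow Siri and Spotlight indexing while opting out of model training would publish separate directives for the relevant user agents; the protocol itself is simply a per-agent allow/deny list like the one above.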
Licensed corpora and safety taxonomy
Licensed corpora, including long-form content like books, enhance the model's visual understanding and long-text processing capabilities. Although Apple did not identify specific publishing partners, media reports from AppleInsider and 9to5Mac suggest licensing talks have taken place with organizations such as Condé Nast, InterActiveCorp, and NBC News.
To address concerns about bias and harmful content, Apple has implemented a detailed safety taxonomy. This framework categorizes sensitive content into six major groups and 58 subcategories. The taxonomy is routinely reviewed and updated by internal teams and external experts to proactively filter inappropriate material.
Apple's release of the report comes as the tech industry faces increasing scrutiny over AI data practices. Several companies have come under legal fire for unauthorized use of copyrighted material. In one prominent case, AI search startup Perplexity was accused by Forbes in 2024 of improper web crawling.
Timely, strategic, and transparent disclosure
Observers see Apple's transparent disclosure as a strategic move that reinforces its commitment to privacy and compliance, helping it stand apart in a crowded and competitive AI landscape.
This move also aligns with new global regulatory frameworks. The European Union recently introduced a Code of Practice for General Purpose AI (GPAI) models under the AI Act. The code requires companies to disclose key aspects of model development, including data sources, training processes, energy consumption, and computational resources.
While Apple's report offers limited detail on energy use and compute power, it delivers thorough explanations of its data handling practices and model training process.
Article edited by Jack Wu