Apple under fire over allegedly downloading millions of YouTube videos to train AI

The race for artificial intelligence continues to stir heated controversy over copyright and the legitimacy of how training data is collected.

This time Apple is in the eye of the storm, hit by a proposed class action. Three well-known YouTube creators, Ted Entertainment, Matt Fisher and Golfholics, have sued the company, accusing it of illegally downloading millions of videos.

Did Apple download millions of YouTube videos to train AI? The accusations

The alleged purpose of these massive downloads was to train a generative video model, a technology described in detail in a research paper published by the company’s own researchers at the end of 2024.

According to court filings, the plaintiffs argue that their videos appear more than 500 times in the data used. They now aim to represent a broad class of creators who find themselves in a similar position, with their work exploited without compensation.

At the center of the legal dispute is a massive dataset named Panda-70M. Interestingly, this dataset does not contain the video files themselves. Instead, it works like an extremely detailed map that points to third-party content via URLs, video identifiers, and precise timestamps, cataloging millions of clips.
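To make the "map, not archive" idea concrete, here is a minimal sketch of what one row of such an index might look like. The field names, the example.com URLs and the helper function are hypothetical and are not taken from the actual Panda-70M schema. The sketch only shows that each row is a pointer (identifier, link, time range) and that counting how often a video is referenced needs nothing more than the index itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipEntry:
    """One hypothetical index row: a pointer to a clip, not the clip itself."""
    video_id: str   # platform video identifier (assumed field name)
    url: str        # link to the hosted video (assumed field name)
    start_s: float  # clip start, in seconds
    end_s: float    # clip end, in seconds

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def clips_by_video(entries):
    """Count how many index rows reference each video identifier."""
    counts = {}
    for entry in entries:
        counts[entry.video_id] = counts.get(entry.video_id, 0) + 1
    return counts


# Illustrative rows only; no video data is stored anywhere in the index.
index = [
    ClipEntry("vid_a", "https://example.com/watch?v=vid_a", 0.0, 12.5),
    ClipEntry("vid_a", "https://example.com/watch?v=vid_a", 30.0, 41.0),
    ClipEntry("vid_b", "https://example.com/watch?v=vid_b", 5.0, 9.0),
]

print(clips_by_video(index))
```

A tally of this kind is presumably how a creator could check how many times their videos are referenced in such a dataset.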

Although the dataset is purely an index, extracting and processing the frames requires the computing systems to download the underlying material, which means bypassing YouTube’s anti-scraping protections.

It is precisely on this alleged circumvention of security measures that the creators’ main accusation rests. The issue, however, is not limited to the iPhone maker.

Tech giants such as Amazon and OpenAI are facing nearly identical lawsuits over the use of the same dataset. This points to an increasingly widespread business practice in which online material is treated as free fuel for development, on the assumption that its authors will not react.

The training data crisis and the privacy problem

The core of the problem is a structural weakness of today’s AI industry: a severe shortage of legally licensed material. As companies accelerate development to build increasingly sophisticated models, wholesale scraping of publicly available content continues to prevail over proper licensing agreements.

Publishers and large networks have already started blocking web crawlers from their sites, and now individual independent creators are joining this resistance.

For the California company, however, this case is particularly embarrassing. As early as 2024, another investigation had revealed the unauthorized use of YouTube subtitles for training open-source models.

On top of these tensions, management is contending with a technological arms race in the AI sector: Apple Intelligence features have been repeatedly delayed, researchers are leaving for rival companies, and discontent is growing among shareholders.

Seeing a brand that built its commercial identity and advertising campaigns on the inviolability of privacy and users’ rights formally accused of bypassing the protections of third-party platforms creates deep reputational damage, capable of hitting the company much harder than any of its competitors.