News organisations opt out of Common Crawl archive to block AI training

11 articles · Updated · Bloomberg · Apr 30

CNN, NBC and USA Today are among 20 publishers seeking to have content from dozens of websites removed from the nonprofit web repository, according to a letter sent on Wednesday.
The News/Media Alliance, representing newspapers and magazines, asked Common Crawl to honour publishers' requests and stop unauthorised use of their work, including for AI purposes.
The move targets an archive that AI companies rely on widely to train chatbots, escalating publishers' efforts to control how their journalism is stored and reused.
