Meta AI logo is seen in this illustration taken September 28, 2023. REUTERS/Dado Ruvic/Illustration/File photo
AI models are becoming more sophisticated thanks to the quality and quantity of data they are trained on. However, training models on data, especially copyrighted material, can have consequences. Google, Microsoft-backed OpenAI and Facebook parent Meta have all, at some point in the last year, been criticised for ‘stealing’ data. Meta, for one, appears to have run into significant legal trouble for using copyrighted data to train Llama.
Citing a new filing in a copyright infringement case initially brought earlier this year, a report by news agency Reuters says that Meta's own lawyers warned the company about the legal perils of using thousands of pirated books to train its AI models, but Meta did it anyway.
The new filing also consolidates two lawsuits brought against the Facebook and Instagram owner by comedian Sarah Silverman, Pulitzer Prize winner Michael Chabon and other prominent authors. They allege that Meta used their works without permission to train its AI language model, Llama.
The complaint reportedly includes chat logs of a Meta-affiliated researcher discussing procurement of the dataset in a Discord server, suggesting that Meta was aware of the legal questions around using the books.
“At Facebook, there are a lot of people interested in working with [T]he [P]ile, including myself, but in its current form, we are unable to use it for legal reasons,” researcher Tim Dettmers said in one of the chats.
What this means for tech companies
As tech companies face a slew of lawsuits from content creators who accuse them of ripping off copyright-protected works to build generative AI models, they may be forced to compensate artists, authors and other creators for the use of their work.
Furthermore, the provisional rules on AI in Europe may force companies to disclose the data they use to train their models, potentially exposing them to more legal risk.