They accuse the AI giant of illegally using a massive collection of copyrighted books to train its AI models.
The newly unsealed court papers describe how the AI startup deleted datasets used to train GPT-3, which contained a large number of book collections. They also reveal that the researchers who created the datasets no longer work at the company.
The class action lawsuit identifies the datasets as Books 1 and Books 2, which were used to train older versions of GPT such as GPT-3. According to the recent court filings, close to 100,000 published books were unlawfully used for this purpose, despite the company knowing full well that they were protected by copyright.
Lawyers in the case say the Authors Guild tried to obtain more information from OpenAI about the datasets. The firm offered major resistance at the start, citing reasons such as confidentiality, but it eventually came out that copies of the datasets were in fact deleted, as mentioned in the latest report by Business Insider.
The material used for training was of the highest standard and today stands as an integral part of the firm’s AI models, which are revolutionizing the world as we speak. The company, like many others, used plenty of data found online, such as books, to better curate and refine its models.
But many of those who created the material claim it’s not fair to use work owned by others without any form of consent or compensation. Their intellectual work is being used and, rightly so, they believe they should be paid for it. The courts are now handling many such battles, and it appears to be a long legal saga with no end in sight.
Meanwhile, related details come from a white paper released in 2020, in which the AI giant described the datasets as internet-based books collections that made up just 16% of the data used to train models like GPT-3.
The white paper highlighted that 67 billion tokens of data, or close to 50 billion words, were used. That’s a lot of content when you come to think of it.
The now-unsealed letter from OpenAI’s lawyers states that use of the datasets in question for training was discontinued during the latter part of 2021.
After that, they were deleted because they were no longer in use. The letter, which is marked highly confidential, adds that no other data used for training models was deleted. The company did offer to let lawyers for the Authors Guild access that data and other datasets too.
The unsealed documents also reveal that the researchers who created the datasets in question are no longer working at the firm, and that the company has not been revealing their identities publicly.
For now, we can confirm that their identities were given to the Authors Guild for the sake of the investigation, but they have not been revealed publicly. Moreover, the firm continues to petition the court to keep the employees’ names and information about the datasets under seal.
As one can expect, this was not taken well by the Authors Guild, which opposed the request. It argued that the public has every right to know, and the matter is now an ongoing dispute.
For now, OpenAI remains clear and bold in its stance. It says the models that power its GPT and API were not created using these datasets. This was the statement the company released publicly on Tuesday. The datasets were said to have been last used in 2021 and have since been deleted; they were produced by employees who no longer work at the AI giant.
Image: DIW-Aigen