In this blog post, we explore some concerns around AI learning from copyrighted material and touch on the implications for copyleft material. There are many other ethical concerns about AI [1]. To start, we focus on the underlying datasets.

Learning on copyrighted material

Copyright originated in England in the 1700s. It was not a product of any inherent, natural law, but rather a human-made construct designed to regulate the reproduction and distribution of books. Over time, copyright laws expanded to cover other types of creative works, such as music, art, photography and source code. At their outset, these laws were born out of a desire to control and profit from the flourishing market for printed material, at a time when literacy rates were rising sharply and ideas and texts were spreading widely across Europe.
The extension of copyright laws beyond their original purpose has had far-reaching consequences for society, from the increasing cost of academic journals [2] and textbooks to the widespread use of technologies that restrict access to digital content. Despite criticism [3] that its current application is not suited to the digital age, copyright law remains the legal framework it has been for over three centuries. While the rise of the internet and digital technologies has fundamentally altered the landscape of content creation [4] and dissemination, the legal frameworks of copyright have struggled to keep pace with these changes. This is especially true of large language models, which are trained on vast amounts of data generated by humans who never agreed to such use of the content they created. What are the potential consequences of using copyrighted material for AI training? What does it mean when not only humans but also machines learn from it?
Know-how gained through learning was traditionally attached to a human subject. Machines and the resulting models now mimic this concept and become a codex of deterministic output. If an AI system learns from copyrighted material, those who build or operate the resulting model could be sued for infringement or face other legal consequences [5]. The metaphor «Standing on the shoulders of giants» [6] highlights cultural progress and the importance of building on existing knowledge rather than constantly starting from scratch. Does this apply only to humans and their learning?
All this raises serious ethical concerns about using copyrighted material for learning.
It is therefore essential that policymakers, industry stakeholders, and the broader public engage in ongoing discussions about the role of copyright, and work together to ensure that these technologies serve democracy and prosperity for all of us.

Learning on copyleft material

Another ethical concern with AI learning is the presence of copyleft works in the training set. Copyleft is a licensing strategy that requires any modified or derivative work to be released under the same licence, thereby promoting the free sharing and open development of derivative works. For instance, if an AI system is trained on source code that includes copyleft material, any software developed using that system could also be considered subject to the same licence terms. This could limit the commercial use of the resulting models, as they would have to be distributed under an open-source licence. Additionally, using copyleft material for AI training could lead to legal disputes and confusion about the ownership and licensing of the resulting AI system [7]. How copyleft and copyright apply to learning appears to be an unaddressed problem, one that did not arise as long as only humans were learning.

Access to information

AI also raises questions about fairness and access to information. If the training data is proprietary, it may not be open to review, which can allow biases in the resulting AI systems to go unnoticed [8]. Proprietary data, in general, can limit opportunities for innovation and competition. If a few large companies have exclusive access to valuable data sets, it may be challenging for new players to enter the market and develop better solutions. Furthermore, copyright owners may have the power to influence what data is used for AI training, which could limit the scope and potential of the resulting models. The situation is even more problematic than existing copyright enforcement, since there are not only copyright holders and access barriers to the works themselves, but also obstacles to accessing the results of interpreting these materials. This is because access to these services is limited by paywalls, controlled by a capitalist system driven by corporations and their shareholders.

Conclusion

Given the potential ethical concerns around AI learning and its applications [9, 10, 11], it is crucial to develop clear and comprehensive guidelines for using data in AI training, not only inside companies but also at a legal level. This includes guidelines for the use of copyrighted and copyleft material. Such guidelines would help ensure that AI systems are trained ethically and responsibly, while also promoting fairness, access to information, and innovation.

AI is a new and constantly developing technology that will play a major role in shaping our future. The incorporation of copyrighted and copyleft material in AI learning raises serious ethical considerations, such as fairness, information accessibility, and ownership. This emphasises the need for continued discussion and debate on the ethical implications of AI and for the development of explicit guidelines for its application.


This blog post was co-authored and edited with the help of ChatGPT and Janina Kürsteiner. As a free software advocate at Liip, I'm aware of the ethical and licence implications this could have. Nevertheless, this text is licensed under Creative Commons (CC BY-SA 2.0); it's uncertain whether copyright truly belongs to the losers [12].


Footnotes and additional reading suggestions

  1. Ethics in the Digital Domain, Robert S. Fortner, 2021
  2. University of Missouri · Journal Prices Increase More than True Inflation
  3. I think that conversations are the best, biggest thing that Free Software has to offer its user, 2015
  4. Giving What You Don't Have
  5. The Verge · Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement
  6. Isaac Newton letter to Robert Hooke, 1675
  7. Legal Playbook For Natural Language Processing Researchers, 2022
  8. Reuters · Amazon scraps secret AI recruiting tool that showed bias against women
  9. Tages-Anzeiger · Die Verwaltung darf jetzt künstliche Intelligenz nutzen – doch sie muss aufpassen
  10. Position der Digitalen Gesellschaft zur Regulierung von automatisierten Entscheidungssystemen
  11. Our Final Invention, James Barrat, 2013
  12. Wall and Piece, Banksy, 2006