A blog post explaining how GPT AI systems are problematic because they scrape content without recognising other people’s intellectual property and fail to understand the systemic bias embedded in historic datasets.

Dec 7, 2022

DALL·E 2 image from the prompt: a black and white photo of an AI system trying to write a complicated post about the ethics of IP scrapping and toxic bias in historic datasets

GPT (Generative Pretrained Transformer) AI systems, like many other AI systems, are trained on large amounts of existing text data. This data is used to help the AI system learn how to generate text that is similar to the input it was trained on. However, this reliance on existing text data can lead to a number of problems, including the scraping of content without recognizing other people’s intellectual property and the failure to understand the systemic bias embedded in historic datasets.

One of the major problems with GPT AI systems is that they often scrape content from various sources without recognizing the intellectual property of the original creators. This can lead to the theft of content and can be detrimental to the creators who have spent time and effort developing their ideas. In order to avoid this problem, it is important for GPT AI systems to be trained on data that has been properly licensed and to be used in accordance with the terms of that license.
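As a rough illustration of what "trained on data that has been properly licensed" could mean in practice, here is a minimal sketch that keeps only documents whose license appears on an allowlist. The `Document` type, the `ALLOWED_LICENSES` set and the `filter_licensed` function are all hypothetical names invented for this example, and which licenses actually permit training use is a legal question that this code does not answer.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical allowlist, for illustration only. Whether a given
# license really permits use as training data needs legal review.
ALLOWED_LICENSES = {"cc0-1.0", "cc-by-4.0", "mit"}


@dataclass
class Document:
    url: str
    text: str
    license: Optional[str]  # None when the license is unknown


def filter_licensed(docs: List[Document]) -> List[Document]:
    """Keep only documents with a known license that is on the allowlist."""
    kept = []
    for doc in docs:
        # Unknown licenses are treated as "not permitted".
        if doc.license is not None and doc.license.lower() in ALLOWED_LICENSES:
            kept.append(doc)
    return kept
```

The one design choice worth noting is that a missing license is treated as a refusal rather than as permission, which is the opposite of what indiscriminate scraping does.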

Another problem with GPT AI systems is their failure to understand the systemic bias that is present in many historic datasets. These biases can come in many forms, such as gender, race, and socioeconomic status. As a result, GPT AI systems that are trained on biased datasets may generate text that reinforces these biases, leading to the perpetuation of harmful stereotypes. In order to avoid this problem, it is important for GPT AI systems to be trained on diverse and representative datasets, and for their outputs to be carefully monitored for biases.

Overall, while GPT AI systems have the potential to be incredibly powerful tools for generating text, it is important to be aware of their limitations and to take steps to address the problems they can cause. By properly licensing the data used to train GPT AI systems and by monitoring their outputs for biases, we can help ensure that these systems are used responsibly and ethically.

This led to a second prompt

A blog post explaining in 5 points the specific actions to take to prevent GPT AI systems scraping internet content without permission and regurgitating toxic bias embedded in datasets.

1. Use data filtering and data cleansing techniques to remove any toxic or biased content from the training dataset before using it to train a GPT AI system.
2. Implement strict rules and guidelines for the use of GPT AI systems, including a clear policy on what types of content they are allowed to access and scrape from the internet.
3. Regularly monitor and audit the output of GPT AI systems to ensure that they are not regurgitating any toxic or biased content.
4. Implement a system for users to report any instances of toxic or biased content generated by GPT AI systems, and promptly investigate and take appropriate action in response to such reports.
5. Educate users on the potential risks and limitations of GPT AI systems, and encourage them to use the systems responsibly and with care. This may include providing resources and guidance on how to identify and avoid toxic or biased content generated by GPT AI systems.
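To make the first, third and fourth points above a little more concrete, here is a minimal sketch of what filtering, auditing and reporting might look like. Everything in it is an assumption made for illustration: the `BLOCKLIST` terms are placeholders, and the function and class names are invented here. Simple keyword matching of this kind misses most real bias, which tends to sit in context and framing rather than in individual words, so this shows the shape of the workflow, not a working safeguard.

```python
import re
from collections import Counter

# Placeholder terms, for illustration only. A real list would need
# to be curated and reviewed by people.
BLOCKLIST = {"slur_a", "slur_b"}


def flagged_terms(text):
    """Return the blocklisted terms that appear as whole words in text."""
    words = set(re.findall(r"[a-z0-9_']+", text.lower()))
    return words & BLOCKLIST


def filter_training_texts(texts):
    """First point: drop training texts that contain any blocklisted term."""
    return [t for t in texts if not flagged_terms(t)]


def audit_outputs(outputs):
    """Third point: count how many outputs contain each blocklisted term."""
    counts = Counter()
    for out in outputs:
        counts.update(flagged_terms(out))
    return counts


class ReportQueue:
    """Fourth point: collect user reports so that someone can investigate."""

    def __init__(self):
        self.reports = []

    def submit(self, output, reason):
        self.reports.append(
            {"output": output, "reason": reason, "resolved": False}
        )

    def open_reports(self):
        return [r for r in self.reports if not r["resolved"]]
```

The second and fifth points are about policy and education, so they are not represented in the code.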

The text above was generated by ChatGPT from two prompts: the one in the title and the one in bold.

Written by Alastair Somerville