Report says photos of kids posted online, even with privacy settings, are being used to train AI

Human Rights Watch said even just a small portion of the popular dataset known as LAION-5B had nearly 200 links to personal photos of Australian children.

fizkes / Shutterstock

Mother and child taking a selfie.

By: Taylor O'Bier

Personal photos of children posted online are being used to train artificial intelligence tools without their parents' knowledge or consent, an analysis from Human Rights Watch found.

The human rights advocacy group said even photos that were posted to various platforms under strict privacy settings were being scraped from the internet as part of a larger dataset being used to teach popular AI programs.

Human Rights Watch researcher Hye Jung Han examined a small percentage of a dataset called LAION-5B, which is an openly accessible collection of 5.85 billion multilingual image-text pairs. The data is derived from internet archives accumulated by Common Crawl, a nonprofit organization based in San Francisco that publicly provides copies of data scraped from the internet for research and analysis.

LAION-5B is a popular dataset used by AI developers to train their models.

Han said she found 190 identifiable photos of Australian children, including some from the country’s Indigenous tribes, while analyzing less than 0.0001% of the LAION-5B dataset. Last month, she found 170 photos of Brazilian children in the data catalog.

LAION-5B does not contain the actual images, just links to where the images are stored and accompanying captions. But Han said some of the URLs in the dataset had children’s names and information that made it easy to trace their identities.

“One such photo features two boys, ages 3 and 4, grinning from ear to ear as they hold paintbrushes in front of a colorful mural. The accompanying caption reveals both children’s full names and ages, and the name of the preschool they attend in Perth, in Western Australia,” said Human Rights Watch in a press release regarding Han’s research.

Some photo links found in the LAION-5B dataset came from sources like personal blogs, posts by schools, and family photographers hired to capture personal portraits. The report said some of the images had been uploaded a decade before LAION-5B was even created.

Many of the photos did not appear to be discoverable through an online search or on the publicly accessible versions of the websites they came from, meaning the dataset bypasses privacy measures taken by the people who posted them, the report said.

For example, one photo found in the analysis showed two boys making funny faces. It was taken from a video of a school celebration following final exams that was posted to YouTube, even though the video’s privacy settings had been set to “unlisted,” meaning it does not show up on the creator’s page or in a YouTube search.

The report noted that YouTube’s terms of service prohibit scraping or harvesting information that might identify a person, including an image of their face.

In addition to the concerns about the privacy of minors and their families, AI tools could use what they absorbed from the LAION-5B dataset to manipulate the children’s likenesses or generate deepfake images of them, something that has already been done.

“Current AI tools create lifelike outputs in seconds, are often free, and are easy to use, risking the proliferation of nonconsensual deepfakes that could recirculate online forever and inflict lasting harm,” stated Human Rights Watch.

After Han’s first report detailing the photos found in the dataset of children in Brazil, the German nonprofit organization that manages LAION-5B confirmed her findings and pledged to remove the image data. However, current AI models can’t unlearn the data they’re trained on, Human Rights Watch said.

These recent reports from Human Rights Watch aren’t the first to flag the vulnerabilities of LAION-5B as an AI-training tool. At the end of last year, a report published by Stanford University highlighted a concerning amount of child sexual abuse material in the dataset among thousands of other illegal image links.

LAION-5B was temporarily taken down to remove the harmful content from its data collection following Stanford’s report.

The organization behind LAION-5B has said the most effective protection against the misuse of personal photos of children is for parents or guardians to remove those photos from the internet themselves, according to Human Rights Watch.