I've been looking at how we could use Open Source software to develop Generative AI applications for education. Of course, one of the issues is data for training the AI, and it's interesting that reports say the quality of training data is getting worse, probably because so much poor-quality data is now being produced by AI itself. So I was interested in an article, The Making of PD12M: Image Acquisition, published on the Spawning blog.
It reports that, in the evolving landscape of AI data collection, the Spawning team has introduced Public Domain 12M (PD12M), an innovative dataset of 12.4 million image-text pairs that addresses critical challenges in AI training data acquisition. Unlike traditional web scraping, PD12M focuses on ethically sourced images from reputable cultural institutions such as Europeana, Wikimedia, and the Smithsonian.
The dataset tackles several persistent issues in AI training data: copyright concerns, image quality, and consent. By exclusively using images with Public Domain Marks or CC0 licenses, PD12M minimizes legal and ethical complications. The team carefully curated images from OpenGLAM institutions, ensuring high-quality, professionally photographed artworks with verified metadata.
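To make that licensing rule concrete, here is a minimal sketch (not the Spawning team's actual code) of how a filter restricted to Public Domain Mark and CC0 works might look; the record structure and the `license` field name are my own assumptions for illustration:

```python
from typing import Iterable

# Licenses PD12M restricts itself to, per the Spawning write-up.
ALLOWED_LICENSES = {"Public Domain Mark", "CC0"}

def filter_by_license(records: Iterable[dict]) -> list[dict]:
    """Keep only records whose license matches an allowed public domain license.

    The 'license' metadata key is a hypothetical field name; real
    institutional APIs expose licensing information in different ways.
    """
    return [r for r in records if r.get("license") in ALLOWED_LICENSES]

# Example: only the CC0 record survives the filter.
sample = [
    {"id": "img-001", "license": "CC0"},
    {"id": "img-002", "license": "CC BY-SA 4.0"},
]
print(filter_by_license(sample))  # [{'id': 'img-001', 'license': 'CC0'}]
```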
Key innovations include a 14-day delay for Wikimedia uploads to allow community flagging, restrictive license selection, and a unique image hosting approach. Rather than the download burden falling on the original institutions, the images are hosted through AWS Open Data, representing approximately 30TB of high-quality image data.
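The 14-day waiting period is easy to picture as a simple age check on upload timestamps. The sketch below is my own illustration of the idea, not Spawning's implementation, and assumes each record carries a UTC upload time:

```python
from datetime import datetime, timedelta, timezone

# Minimum age before a Wikimedia upload is considered for inclusion,
# mirroring the 14-day community-flagging window described above.
MIN_AGE = timedelta(days=14)

def old_enough(upload_time: datetime, now: datetime | None = None) -> bool:
    """Return True if the upload has been public for at least MIN_AGE."""
    now = now or datetime.now(timezone.utc)
    return (now - upload_time) >= MIN_AGE

# Example: an image uploaded 20 days ago passes; one uploaded yesterday does not.
older = datetime.now(timezone.utc) - timedelta(days=20)
recent = datetime.now(timezone.utc) - timedelta(days=1)
print(old_enough(older), old_enough(recent))  # True False
```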
For education professionals, this approach represents a model of responsible AI development: transparent, ethical, and focused on quality over quantity. It demonstrates how careful data curation can create more reliable and trustworthy AI training resources.