Public Domain Data

Jamillah Knowles / Better Images of AI / Data People / CC-BY 4.0

I've been looking at how we could use Open Source software to develop Generative AI applications for education. Of course, one of the issues is data for training the AI. And it's interesting that reports say the quality of training data is getting worse, probably because so much poor-quality data is now being produced by AI itself. So I was interested in an article, The Making of PD12M: Image Acquisition, published on the Spawning blog.

It reports that, in the evolving landscape of AI data collection, the Spawning team has introduced Public Domain 12M (PD12M), an innovative dataset of 12.4 million image-text pairs that addresses critical challenges in AI training data acquisition. Unlike datasets built by traditional web scraping, PD12M focuses on ethically sourced images from reputable cultural institutions such as Europeana, Wikimedia, and the Smithsonian.

The dataset tackles several persistent issues in AI training data: copyright concerns, image quality, and consent. By exclusively using images with Public Domain Marks or CC0 licenses, PD12M minimizes legal and ethical complications. The team carefully curated images from OpenGLAM institutions, ensuring high-quality, professionally photographed artworks with verified metadata.

Key innovations include a 14-day delay on Wikimedia uploads to allow time for community flagging, restrictive license selection, and a distinctive approach to image hosting. Rather than placing the download burden on the original institutions, the images, approximately 30TB of high-quality image data, are hosted on AWS Open Data.
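To make the idea of restrictive license selection a little more concrete, here is a minimal sketch of the kind of filtering step a curation pipeline might apply to harvested metadata, keeping only records that carry a CC0 license or Public Domain Mark and that have a usable caption and source institution. The field names, license strings and sample records are illustrative assumptions, not the actual PD12M schema or tooling.

```python
# Minimal sketch of license-based filtering for an openly licensed image-text
# dataset. The record fields and license labels are assumptions for
# illustration, not the real PD12M metadata schema.

ALLOWED_LICENSES = {"CC0", "Public Domain Mark"}  # the only licenses accepted


def keep_record(record: dict) -> bool:
    """Keep a record only if it is openly licensed and has the metadata
    needed to form an image-text training pair."""
    license_ok = record.get("license") in ALLOWED_LICENSES
    has_caption = bool(record.get("caption", "").strip())
    has_source = bool(record.get("source_institution"))
    return license_ok and has_caption and has_source


if __name__ == "__main__":
    # Hypothetical records standing in for harvested OpenGLAM metadata.
    records = [
        {"license": "CC0", "caption": "Oil painting of a harbour",
         "source_institution": "Europeana"},
        {"license": "CC BY-NC 4.0", "caption": "Photograph of a museum gallery",
         "source_institution": "Smithsonian"},
        {"license": "Public Domain Mark", "caption": "",
         "source_institution": "Wikimedia Commons"},
    ]
    curated = [r for r in records if keep_record(r)]
    print(f"Kept {len(curated)} of {len(records)} records")  # Kept 1 of 3 records
```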

For education professionals, this approach represents a model of responsible AI development: transparent, ethical, and focused on quality over quantity. It demonstrates how careful data curation can create more reliable and trustworthy AI training resources.

Who owns your data?

Photo by Markus Spiske on Unsplash

Arguments over what data should be allowed to be used for training Large Language Models rumble on. Ironically, it is LinkedIn, which hosts hundreds of discussions on AI, that is the latest villain.

The platform updated its policies to clarify data collection practices, but this led to user backlash and increased scrutiny over privacy violations. The lack of transparency regarding data usage and the automatic enrollment of users in AI training have resulted in a significant loss of trust. Users have said they feel blindsided by LinkedIn's practices.

In response to user concerns, LinkedIn has committed to updating its user agreements and improving its data practices, although skepticism remains among users about how effective these measures will be. LinkedIn has given users the option to opt out of AI training features through their account settings, but this does not remove previously collected data, leaving users uneasy about how their data is handled.

It is worth noting that accounts from Europe are not affected at present. It seems that LinkedIn would be breaking European data protection law if it tried to do the same within the European Union.

More generally, the UK Open Data Institute says "there is very little transparency about the data used in AI systems - a fact that is causing growing concern as these systems are increasingly deployed with real-world consequences. Key transparency information about data sources, copyright, and inclusion of personal information and more is rarely included by systems flagged within the Partnership on AI’s AI Incidents Database.

While transparency cannot be considered a ‘silver bullet’ for addressing the ethical challenges associated with AI systems, or building trust, it is a prerequisite for informed decision-making and other forms of intervention like regulation."

#AIinEd – Pontydysgu EU

Photo by Markus Winkler on Unsplash

Empowering Learners for the Age of AI is a free, international online conference, organised by a national team from Australia of leading researchers in the role of data, analytics and AI in learning, with the aim of empowering both learners and teachers. They are seeking to open a public conversation about how to engage productively with societal infrastructure powered by data, analytics and AI. The questions for the conference include:

  • What’s actually happening with AI and how is it changing classrooms, teaching, and learning?
  • How can data, analytics and AI be used not to disempower or automate work, but to empower learners and professionals?
  • How must modern knowledge systems (such as schools, universities, corporate training and development, government agencies) change to prepare people for an AI society?
  • How to track and assess the qualities that equip people for this future?
  • What will the learning ecosystem look like by 2030, and how might humans and AI collaborate in solving complex problems?

They say:

The conference will be of interest to individuals with all levels of AI expertise, from beginner to advanced. World-leading researchers and experts will deliver keynote addresses, while discussion panels will explore implications in a range of sectors.

If your interests involve how data, analytics and AI will shape the future of learning, this open and free conference is for you!

The conference is on 7 and 8 December 2021, and you can register for free on the Empowering Learners AI website.

Data governance, management and infrastructure

Photo by Brooke Cagle on Unsplash

The big ed-tech news this week is the merger of Anthology, an educational management company, with Blackboard, which produces learning technology. But as Stephen Downes said: "It's funny, though - the more these companies grow and the wider their enterprise capabilities become, the less relevant they feel, to me at least, to educational technology and online learning."

And there is a revealing quote in an Inside Higher Ed article about the merger. They quote Bill Ballhaus, Blackboard's chairman, CEO and president, as saying the power of the combined company will flow from its ability to bring data from across the student life cycle to bear on student and institutional performance. "We're on the cusp of breaking down the data silos that often exist between administrative and academic departments on campuses," Ballhaus said.

So is the new company really about educational technology, or is it in reality a data company? This raises many questions about who owns student data, data privacy and how institutions manage data. A new UK Open Data Institute (ODI) Fellow report, Data governance for online learning by Janis Wong, explores the data governance considerations when working with online learning data, looking at how educational institutions can better manage, protect and govern online learning data and personal data.

In a summary of the report, the ODI say:

The Covid-19 pandemic has increased the adoption of technology in education by higher education institutions in the UK. Although students are expected to return to in-person classes, online learning and the digitisation of the academic experience are here to stay. This includes the increased gathering, use and processing of digital data.

They go on to conclude:

Within online and hybrid learning, university management needs to consider how different forms of online learning data should be governed, from research data to teaching data to administration and the data processed by external platforms.

Online and hybrid learning needs to be inclusive and institutions have to address the benefits to, and concerns of, students and staff as the largest groups of stakeholders in delivering secure and safe academic experiences. This includes deciding what education technology platforms should be used to deliver, record and store online learning content, by comparing the merits of improving user experience against the potential risks of vast data collection by third parties.

Online learning data governance needs to be considered holistically, with an understanding of how different stakeholders interact with each other’s data to create innovative, digital means of learning. When innovating for better online learning practices, institutions need to balance education innovation with the protection of student and staff personal data through data governance, management and infrastructure strategies.

The full report is available from the ODI web site.

Apprenticeship for Artificial Intelligence Data Specialists


geralt (CC0), Pixabay

This may be of interest to some readers. One of the issues with AI and data science is that they are creating pressure for change in vocational education and training. In the UK there is a particular shortage of data specialists, and in response a new apprenticeship standard for Artificial Intelligence Data Specialists was released last year. The standard, which has a 24-month "typical duration to gateway" (I think this means the typical length of the apprenticeship), was developed by the following employers: British Broadcasting Corporation, Public Health England, Bank of England, Royal Mail Group, Unilever, TUI, Aviva, Shop Direct, Defence Science Technology Laboratory – MOD, Ericsson, First Response Finance LTD, GlaxoSmithKline, AstraZeneca, EasyJet, BT, Barclays, Machinable, Office for National Statistics, UBS.

The overview of the role says it is to "Discover new artificial intelligence solutions that use data to improve and automate business processes."

The Institute for Apprenticeships and Technical Education web page goes on to say:

The broad purpose of the occupation is to discover and devise new data-driven AI solutions to automate and optimise business processes and to support, augment and enhance human decision-making. AI Data Specialists carry out applied research in order to create innovative data-driven artificial intelligence (AI) solutions to business problems within the constraints of a specific business context. They work with datasets that are too large, too complex, too varied or too fast-moving for traditional approaches and techniques to be suitable or feasible.

AI Data Specialists champion AI and its applications within their organisation and promote adoption of novel tools and technologies, informed by current data governance frameworks and ethical best practices.

They deliver better value products and processes to the business by advancing the use of data, machine learning and artificial intelligence; using novel research to increase the quality and value of data within the organisation and across the industry. They communicate, internally and externally, with technology leaders and third parties.