
Definition of Open Source AI

27th August 2024, Graham Attwell

Clarote & AI4Media / Better Images of AI / Power/Profit / CC-BY 4.0

There is growing interest in using and developing Open Source Software approaches to Generative AI for teaching and learning in education. And there is an explosion of models claiming to be Open Source (see, for example, Hugging Face). But Gen AI is a new form of software, and there have been difficulties in agreeing on a definition of what Open Source means for it. This week the Open Source Initiative released a draft definition.

In the preamble they explain why it is important.

Open Source has demonstrated that massive benefits accrue to everyone when you remove the barriers to learning, using, sharing and improving software systems. These benefits are the result of using licenses that adhere to the Open Source Definition. The benefits can be summarized as autonomy, transparency, frictionless reuse, and collaborative improvement.

Everyone needs these benefits in AI. We need essential freedoms to enable users to build and deploy AI systems that are reliable and transparent.

The following text is taken from their website.

What is Open Source AI

When we refer to a “system,” we are speaking both broadly about a fully functional structure and its discrete structural elements. To be considered Open Source, the requirements are the same, whether applied to a system, a model, weights and parameters, or other structural elements.

An Open Source AI is an AI system made available under terms and in a way that grant the freedoms to:

Use the system for any purpose and without having to ask for permission.

Study how the system works and inspect its components.

Modify the system for any purpose, including to change its output.

Share the system for others to use with or without modifications, for any purpose.

These freedoms apply both to a fully functional system and to discrete elements of a system. A precondition to exercising these freedoms is to have access to the preferred form to make modifications to the system.

Preferred form to make modifications to machine-learning systems

The preferred form of making modifications to a machine-learning system is:

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

For example, if used, this would include the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics, how the data was obtained and selected, the labeling procedures and data cleaning methodologies.

Code: The source code used to train and run the system, made available with OSI-approved licenses.

For example, if used, this would include code used for pre-processing data, code used for training, validation and testing, supporting libraries like tokenizers and hyperparameters search code, inference code, and model architecture.

Weights: The model weights and parameters, made available under OSI-approved terms.

For example, this might include checkpoints from key intermediate stages of training as well as the final optimizer state.
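As a rough illustration of how the three parts of the preferred form fit together, the sketch below records which of them a project has published and lists what is still missing. The `Release` class and `missing_components` function are hypothetical names invented for this example. They are not part of the OSI draft or of any official tool, and a real assessment would also need to check the licences attached to each part.

```python
from dataclasses import dataclass, fields


@dataclass
class Release:
    """Hypothetical record of what a project has published."""
    data_information: bool = False  # description of the training data and its provenance
    code: bool = False              # source code used to train and run the system
    weights: bool = False           # model weights and parameters


def missing_components(release: Release) -> list[str]:
    """Return the names of the components that have not been published."""
    return [f.name for f in fields(release) if not getattr(release, f.name)]


# A project that publishes only its weights (an invented example).
weights_only = Release(weights=True)
print(missing_components(weights_only))  # ['data_information', 'code']
```

Under this reading of the draft, a release of weights alone would not provide the preferred form to make modifications, because the data information and code are absent.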

Open Source models and Open Source weights

For machine learning systems:

An AI model consists of the model architecture, model parameters (including weights) and inference code for running the model.

AI weights are the set of learned parameters that overlay the model architecture to produce an output from a given input.
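To make the distinction between these terms concrete, here is a deliberately tiny sketch, assuming a model that is nothing more than a weighted sum. The function names and the weight values are invented for illustration. Real models have far larger architectures, and their weights are learned from data rather than typed in by hand.

```python
def linear_architecture(weights, inputs):
    """Architecture: a fixed computation, here a weighted sum plus a bias."""
    return sum(w * x for w, x in zip(weights["w"], inputs)) + weights["b"]


# Weights: the learned parameters. These values are made up for the example.
weights = {"w": [0.5, -1.0], "b": 2.0}


def predict(inputs):
    """Inference code: runs the architecture with the given weights."""
    return linear_architecture(weights, inputs)


print(predict([4.0, 1.0]))  # 0.5*4.0 - 1.0*1.0 + 2.0 = 3.0
```

Changing the numbers in `weights` changes the output without touching the architecture, which is the sense in which the weights "overlay" it.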

The preferred form to make modifications to machine learning systems also applies to these individual components. “Open Source models” and “Open Source weights” must include the data information and code used to derive those parameters.

Of course, this is only a draft and there will be disagreements. A particularly tricky issue is whether Large Language Models should be allowed to be trained on data scraped from the web without permission or attribution.