
Why Digital Public Goods, including AI, Should Depend on Open Data


Acknowledging that some data should not be shared (for moral, ethical and/or privacy reasons) and some cannot be shared (for legal or other reasons), Creative Commons (CC) thinks there is value in incentivizing the creation, sharing, and use of open data to advance knowledge production. As open communities continue to imagine, design, and build digital public goods and public infrastructure services for education, science, and culture, these goods and services – whenever possible and appropriate – should produce, share, and/or build upon open data.

Open Data by Auregann is licensed under CC BY-SA 3.0.

Open Data and Digital Public Goods (DPGs)

CC is a member of the Digital Public Goods Alliance (DPGA) and CC’s legal tools have been recognized as digital public goods (DPGs). DPGs are “open-source software, open standards, open data, open AI systems, and open content collections that adhere to privacy and other applicable best practices, do no harm, and are of high relevance for attainment of the United Nations 2030 Sustainable Development Goals (SDGs).” If we want to solve the world’s greatest challenges, governments and other funders will need to invest in, develop, openly license, share, and use DPGs.

Open data is important to DPGs because data is a key driver of economic vitality with demonstrated potential to serve the public good. In the public sector, data informs policy making and public service delivery by helping to channel scarce resources to those most in need, providing the means to hold governments accountable, and fostering social innovation. In short, data has the potential to improve people’s lives. When data is closed or otherwise unavailable, the public does not accrue these benefits.

CC was recently part of a DPGA sub-committee working to preserve the integrity of open data as part of the DPG Standard. The resulting update to the DPG Standard was introduced to ensure that only datasets and content collections with open licenses are eligible for recognition as DPGs. Under this new requirement, open datasets and content collections must meet the following criteria to be recognized as a digital public good:

  1. Comprehensive Open Licensing: The entire data set/content collection must be under an acceptable open licence. Mixed-licensed collections will no longer be accepted.
  2. Accessible and Discoverable: All data sets and content collection DPGs must be openly licensed and easily accessible from a distinct, single location, such as a unique URL.
  3. Permitted Access Restrictions: Certain access restrictions – such as logins, registrations, API keys, and throttling – are permitted as long as they do not discriminate against users or restrict usage based on geography or any other factors.
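
As a rough illustration, the three criteria above can be read as a simple checklist. The sketch below is not part of the DPG Standard or any DPGA tooling: the `Collection` type, the `eligibility_problems` function, the licence identifiers, and the example URL are all hypothetical. It also assumes that "mixed-licensed" means at least one item falls outside the set of acceptable open licences.

```python
from dataclasses import dataclass, field

# Hypothetical licence identifiers, for illustration only; the DPG Standard
# keeps its own list of acceptable open licences.
ACCEPTABLE_LICENCES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}

# Access restrictions the criteria above name as permitted.
PERMITTED_RESTRICTIONS = {"login", "registration", "api_key", "throttling"}


@dataclass
class Collection:
    """Hypothetical description of a dataset or content collection."""
    item_licences: list[str]            # licence of each item in the collection
    access_urls: list[str]              # where the collection can be obtained
    restrictions: set[str] = field(default_factory=set)
    discriminates: bool = False         # e.g. usage limited by geography


def eligibility_problems(c: Collection) -> list[str]:
    """Return the criteria the collection fails; an empty list means none."""
    problems = []
    # 1. Comprehensive open licensing: every item must carry an acceptable licence.
    if not c.item_licences or any(l not in ACCEPTABLE_LICENCES for l in c.item_licences):
        problems.append("licensing")
    # 2. Accessible and discoverable: one distinct location.
    if len(set(c.access_urls)) != 1:
        problems.append("single location")
    # 3. Only permitted access restrictions, applied without discrimination.
    if c.discriminates or not c.restrictions <= PERMITTED_RESTRICTIONS:
        problems.append("access restrictions")
    return problems


open_set = Collection(["CC-BY-4.0", "CC0-1.0"], ["https://example.org/data"], {"api_key"})
mixed_set = Collection(["CC-BY-4.0", "All rights reserved"], ["https://example.org/data"])
print(eligibility_problems(open_set))   # []
print(eligibility_problems(mixed_set))  # ['licensing']
```

In this sketch, a collection that requires an API key still passes, while a collection containing a single item without an open licence fails on licensing alone.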

The DPGA writes: “This new requirement is designed to increase trust and confidence in all DPGs by ensuring that users can fully engage with solutions without concerns over intellectual property infringement. Simplifying access and usage aligns with the DPGA’s goal of making DPGs truly open and accessible for widespread adoption… it helps foster an environment and ecosystem where innovation can thrive without legal uncertainties.”

AI and Open Data

As CC examines AI and its potential to be a public good that helps solve global challenges, we believe open data will play a similarly important role.

CC recognizes AI is a rapidly developing space, and we appreciate everyone’s diligent work to create definitions, recommendations, and guidance for and warnings about AI. After two years of community consultation, the Open Source Initiative released version 1.0 of the Open Source AI Definition (OSAID) on October 28, 2024. This definition is an important step in starting the conversation about what open means for AI systems. However, the OSAID’s data sharing requirements remain contentious, particularly around whether and how training data for AI models should be shared.

CC believes that the difficulty of building and releasing open datasets is no reason to stop encouraging them. In cases where training data should not or cannot be shared, we encourage detailed summaries that explain the contents of the dataset and give instructions for reproducibility, but that data should nonetheless be defined as closed. When data can be made open and shared, it should be.

We agree with Liv Marte Nordhaug, CEO of the Digital Public Goods Alliance, who said in a recent post: “With regards to AI systems, there is a need to ensure that we don’t inadvertently undermine the open data movement and open data as a category of DPGs by advancing an approach to AI systems that is more permissive than for other categories of DPGs. Maintaining a high bar on training data could potentially result in fewer AI systems meeting the DPG Standard criteria. However, SDG relevance, platform independence, and do-no-harm by design are features that set DPGs apart from other open source solutions—and for those reasons, the inclusion of [AI] training data is needed.”

Next Steps

CC will continue to work with the DPGA and other partners as it develops a standard for what qualifies an AI model as a digital public good. In that arena, we will advocate for open datasets and for consideration of a tiered approach, so that components of an AI model can be considered digital public goods without the entire model needing to have every component openly shared. Updated recommendations and guidelines that recognize the value of fully open AI systems that use and share open datasets will be an important part of ensuring AI serves the public good.


¹Digital Public Goods Standard
²Data for Better Lives. World Bank (2021). CC BY 3.0 IGO
Posted 27 January 2025
