The MLCommons initiative has unveiled Croissant, a metadata format designed to facilitate how machine learning (ML) practitioners interact with datasets.
The challenges in ML development are manifold: datasets come in disparate representations such as text, structured data, images, audio, and video, each with its own arrangements and formats.
While existing metadata formats like schema.org and DCAT cater to general datasets, they fall short of meeting the specific needs of ML practitioners.
Croissant, a collaborative work within the MLCommons initiative, offers a standardised method to describe and organise ML-ready datasets.
Building upon the foundation of schema.org, Croissant introduces layers for ML-specific metadata, data resources, organisation, and default ML semantics.
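To make the layering concrete, the following is a minimal sketch of what such a description might look like when built as JSON-LD in Python. It is deliberately simplified and is not a complete, validated Croissant file: the `@context` is abbreviated, and the function name `build_croissant_metadata`, the file identifier `data.csv`, and the example URL and column names are all hypothetical. The general shape (a schema.org `Dataset` carrying a `distribution` of file resources and a `recordSet` of typed fields) follows the layers described above.

```python
import json


def build_croissant_metadata(name, csv_url, columns):
    """Return a simplified, Croissant-style JSON-LD dict for a single CSV file.

    `columns` maps a column name to a schema.org data type such as "sc:Text".
    This is an illustrative sketch, not a spec-complete description.
    """
    file_id = "data.csv"  # hypothetical identifier for the one resource
    return {
        # Abbreviated context; a real Croissant file declares many more terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        # Layer 1: dataset-level metadata inherited from schema.org.
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        # Layer 2: the data resources (files) that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Layer 3: how the raw resources are organised into records and fields.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "examples",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"examples/{column}",
                        "dataType": data_type,
                        "source": {
                            "fileObject": {"@id": file_id},
                            "extract": {"column": column},
                        },
                    }
                    for column, data_type in columns.items()
                ],
            }
        ],
    }


metadata = build_croissant_metadata(
    name="toy_reviews",
    csv_url="https://example.org/toy_reviews.csv",  # placeholder URL
    columns={"text": "sc:Text", "label": "sc:Integer"},
)
print(json.dumps(metadata, indent=2))
```

In practice the open-source Python library and the visual editor mentioned in the release are the intended ways to generate and validate this metadata; hand-building the dictionary is shown only to expose the structure.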
Major ML platforms, including Kaggle, Hugging Face, and OpenML, along with frameworks like TensorFlow, PyTorch, and JAX, have announced their support for the Croissant format.
The 1.0 release of Croissant includes a comprehensive specification, example datasets, an open-source Python library for validation and generation of Croissant metadata, and a user-friendly visual editor for creating intuitive dataset descriptions.
In the realm of ML, where the majority of work revolves around data, the absence of a common format imposes a substantial data development burden.
Croissant aims to alleviate this burden by streamlining the ML development process, facilitating dataset discoverability, simplifying data cleaning and analysis, and enabling model training with minimal code.
Croissant datasets are already available on prominent platforms like Google Dataset Search, Hugging Face, Kaggle, and OpenML.
Integration with TensorFlow Datasets allows for data ingestion, while the Croissant editor UI enables users to inspect and modify metadata.
To publish a Croissant dataset, creators can use the editor UI to generate metadata automatically, publish it on their dataset webpage, or leverage supported repositories.
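As a rough illustration of the "publish it on their dataset webpage" route, the sketch below wraps a metadata dictionary in a JSON-LD `<script>` tag, which is the usual way schema.org-style metadata is embedded in an HTML page. The helper name `render_jsonld_script` and the toy metadata are hypothetical, and the exact embedding a given repository expects may differ.

```python
import json


def render_jsonld_script(metadata):
    """Serialise metadata as a JSON-LD <script> element for an HTML page."""
    payload = json.dumps(metadata, indent=2)
    # Escape "</" so a string value cannot close the script element early;
    # "\/" is a valid JSON escape for "/", so the payload still parses.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = render_jsonld_script({"@type": "sc:Dataset", "name": "toy_reviews"})
print(snippet)
```

The resulting snippet would be pasted into the dataset page's HTML; the supported repositories mentioned above handle this step on the creator's behalf.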