Databricks:

Today, we’re releasing Dolly 2.0, the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset, crowdsourced among Databricks employees.

[…]

databricks-dolly-15k contains 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. Under the licensing terms for databricks-dolly-15k, anyone can use, modify, or extend this dataset for any purpose, including commercial applications.

To the best of our knowledge, this dataset is the first open source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT.
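For a sense of how the released model itself is meant to be consumed, here's a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub as databricks/dolly-v2-12b and loads through the standard transformers pipeline API (the repo ships a custom instruction-following pipeline, hence `trust_remote_code=True`):

```python
# Minimal sketch: load Dolly 2.0 through the Hugging Face transformers pipeline API.
# Assumptions: the checkpoint is published as "databricks/dolly-v2-12b" and ships a
# custom instruction-following pipeline (hence trust_remote_code=True); at 12B
# parameters it needs a large GPU, so treat this as illustrative rather than
# something to run casually.
import torch
from transformers import pipeline

generate = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,   # halve memory relative to fp32
    device_map="auto",            # spread layers across available accelerators
    trust_remote_code=True,
)

print(generate("Explain instruction tuning in two sentences."))
```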

The release of the “databricks-dolly-15k” instruction tuning dataset under a permissive license is a much bigger deal than the trained model itself.
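Working with the dataset is correspondingly simple. This is a rough sketch, assuming it is mirrored on the Hugging Face Hub as databricks/databricks-dolly-15k with instruction/context/response/category fields:

```python
# Minimal sketch: load and inspect databricks-dolly-15k.
# Assumption: the dataset is mirrored on the Hugging Face Hub as
# "databricks/databricks-dolly-15k" with instruction/context/response/category fields.
from collections import Counter
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

print(len(dolly))               # roughly 15,000 prompt/response pairs
print(dolly[0]["instruction"])  # the human-written prompt
print(dolly[0]["response"])     # the human-written answer

# How the examples break down by task category (open QA, summarization, etc.)
print(Counter(dolly["category"]).most_common())
```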

Language models will no doubt continue to face questions regarding training data provenance. Any and all datasets that are open, high quality, and free of copyright and ethics concerns will only improve the perceived legitimacy of future models.

RedPajama, the open source 1.2 trillion token pre-training dataset, is a big deal for the same reason.

The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.

[…]

We aim to create a fully open-source reproduction of LLaMA, which would be available for commercial applications, and provide a more transparent pipeline for research.
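At 1.2 trillion tokens, nobody is going to casually download this; streaming a few records is the practical way to look at it. A minimal sketch, assuming the data is published on the Hugging Face Hub as togethercomputer/RedPajama-Data-1T, split into per-source configurations (the "arxiv" name below is one such assumption), with each record carrying a "text" field plus source metadata:

```python
# Minimal sketch: peek at the RedPajama base dataset without downloading it.
# Assumptions: published as "togethercomputer/RedPajama-Data-1T" with per-source
# configurations (the "arxiv" config name is an assumption) and a "text" field
# on each record. Streaming avoids pulling ~1.2T tokens to disk.
from datasets import load_dataset

redpajama = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,
)

for i, doc in enumerate(redpajama):
    print(doc["text"][:200])  # first 200 characters of each document
    if i == 2:
        break
```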

Without a doubt, someone will soon train an open source language model on RedPajama’s base data and then fine-tune it for instruction following using databricks-dolly-15k. This would be the first instruction-tuned language model that is fully unencumbered by copyright concerns.
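If and when that happens, the supervised step is conceptually simple: each databricks-dolly-15k record gets rendered into a single training string. The template below is a hypothetical illustration of that step, not the format Dolly itself uses:

```python
# Hypothetical sketch: render one databricks-dolly-15k record into a training string
# for supervised instruction fine-tuning. The template is illustrative only; any
# real training run would choose (and document) its own prompt format.

def format_example(record: dict) -> str:
    """Join instruction, optional context, and response into one training string."""
    text = f"### Instruction:\n{record['instruction']}\n"
    context = record.get("context", "").strip()
    if context:
        text += f"\n### Context:\n{context}\n"
    text += f"\n### Response:\n{record['response']}"
    return text

example = {
    "instruction": "Summarize why open training data matters.",
    "context": "",
    "response": "Open data lets anyone audit, reproduce, and commercially reuse a model.",
}
print(format_example(example))
```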