Relicense the finetuned checkpoints under CC BY-SA #33

Open
koute opened this issue Apr 20, 2023 · 4 comments

Comments

koute commented Apr 20, 2023

The license of the finetuned checkpoints currently makes no sense.

The base model was almost certainly trained on a ton of unlicensed all-rights-reserved data. In particular, the README says that it was trained on a dataset derived from the Pile, which includes ~100GB of commercial (some might say "pirated") ebooks (the Books3 dataset). And yet this model is licensed under CC BY-SA.

The finetuned model was trained on data which is under a less restrictive license (CC BY-NC, which is less restrictive than "all rights reserved") and yet suddenly the model has to follow the license of the data that was used for training?

This makes no sense. If training on unlicensed/all-rights-reserved data and releasing that model under an arbitrary license is OK, then training it on less restrictive CC BY-NC data and releasing it under an arbitrary license is OK too. Alternatively, if the model has to follow the license of the data on which it was trained, then the base model has to be taken down, as it was trained on all-rights-reserved data for which you had no license.

waleed177 commented Apr 20, 2023

it also makes no sense for me, but i am no lawyer. but hey, i am happy with libre base model c: thank you stabilityAI

@mcmonkey4eva

(I am not a lawyer, this is not legal advice, consult a real lawyer before making decisions, this is just my personal thought)

Scale matters a lot when considering dataset usage rights. The base model is trained on a massive-scale mix of content such that it doesn't really directly retain much content from any one individual source (i.e. in theory the license doesn't matter much). The finetune is directly on top of a small section of entirely-license-restricted content (i.e. the model will directly retain information from the licensed content, thus the license must be matched appropriately).

As another way of thinking about it: Imagine an artist/author/whatever human creative. If that person looks at some copyrighted work and copies from it directly, they're violating that copyright. However, that same person has also been through a lifetime of looking at copyrighted works that have undoubtedly influenced their creative thought, but when they sit down and make something original (a work derivative of the mix of ideas in their head, many of which originate from copyright-restricted works), their new work is not subject to prior copyrights; it is considered their own work.

twmmason reopened this Apr 25, 2023

zoobab commented Apr 25, 2023

CC Non-Commercial means it cannot be packaged in Debian, due to the non-commercial restriction:

celery/celery#2890

Could you re-release it under a copyleft license if you want users that modify it to republish their changes?

And what is the dataset used for the training?

mcmonkey4eva commented Apr 25, 2023

@zoobab View the readme @ https://github.com/Stability-AI/StableLM#models for dataset info. More detail will be published soon.

I don't think a ten gig+ model file is fit to be packaged natively into Debian anyway? The actual relevant source code to run LLMs is separately maintained and separately licensed. It's just the models that have license info in this repository, and it's only the Instruct-finetune that's non-commercial, which has to be licensed that way due to the dataset used for the Instruct finetuning.

Future revisions of the instruct-finetune might use a different dataset and thus have a different license.


5 participants