Motivation
As a person interested in learning more about language models, I would prefer to use models whose training data was gathered ethically, with the consent of those who generated it.
For example: Whisper was "trained on 680,000 hours of multilingual and multitask supervised data collected from the web". Collected how? Did the speakers agree to this collection? Does Whisper claim that the legitimacy of its data collection stems from a clause buried in a clickthrough End User License Agreement that most users did not actually read? Was copyright infringed?
In contrast, Common Voice is a dataset containing data that people volunteered to contribute. I'd be interested in using a language model that trained on that dataset.
If Lumigator's evaluations could include information on how consensually and ethically the training data was gathered for a specific model, that would help me evaluate language models for suitability.
Alternatives
I have sought guidance elsewhere! So far I've found an absence of clear, canonical, and thoughtful guidance from trusted charities, governments, religious and spiritual leaders, et alia.
We have some guidance from ethicists and philosophers on whether to make particular kinds of tools (along the lines of the Declaration of Digital Autonomy); I am still looking for frameworks for deciding which tools to use (along the lines of the Franklin Street Statement on Freedom and Network Services). Software Freedom Conservancy has shared [its thinking regarding GitHub Copilot](https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/); their focus is particularly on "AI-assisted authorship of software using copylefted training sets", and on that basis they advise against using Copilot.
Contribution
I could beta test implementations of this feature, and further discuss the ethical considerations I would want surfaced in proposed user interfaces.
Have you searched for similar issues before submitting this one?