Motivation
As a person interested in learning more about language models, I would prefer to use models whose training data was gathered ethically, with the consent of those who generated it.
For example: Whisper was "trained on 680,000 hours of multilingual and multitask supervised data collected from the web". Collected how? Did the speakers agree to this collection? Does Whisper claim that the legitimacy of its data collection stems from a clause buried in a clickthrough End User License Agreement that most users did not actually read? Was copyright infringed?
In contrast, Common Voice is a dataset containing data that people volunteered to contribute. I'd be interested in using a language model that trained on that dataset.
If Lumigator's evaluations could include information on how consensually and ethically the training data was gathered for a specific model, that would help me evaluate language models for suitability.
Alternatives
I have sought guidance elsewhere! So far I've found an absence of clear, canonical, and thoughtful guidance from trusted charities, governments, religious and spiritual leaders, et alia.
We have some guidance from ethicists and philosophers on whether to make particular kinds of tools (along the lines of the Declaration of Digital Autonomy); I am still looking for frameworks for deciding which tools to use (along the lines of the Franklin Street Statement on Freedom and Network Services). Software Freedom Conservancy has shared [its thinking regarding GitHub Copilot](https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/); their focus is particularly on "AI-assisted authorship of software using copylefted training sets", and on that basis they advise against using Copilot.
Contribution
I could beta test implementations of this feature, and further discuss the ethical considerations I would want surfaced in proposed user interfaces.
Have you searched for similar issues before submitting this one?