[FEATURE]: include ethical considerations regarding models' training data #1338

@brainwane

Motivation

As a person interested in learning more about language models, I would prefer to use models whose training data was gathered ethically, with the consent of those who generated it.

For example: Whisper was "trained on 680,000 hours of multilingual and multitask supervised data collected from the web". Collected how? Did the speakers agree to this collection? Does Whisper claim that the legitimacy of its data collection stems from a clause buried in a clickthrough End User License Agreement that most users did not actually read? Was copyright infringed?

In contrast, Common Voice is a dataset containing data that people volunteered to contribute. I'd be interested in using a language model that trained on that dataset.

If Lumigator's evaluations could include information on how consensually and ethically the training data was gathered for a specific model, that would help me evaluate language models for suitability.

Alternatives

I have sought guidance elsewhere! So far I've found an absence of clear, canonical, and thoughtful guidance from trusted charities, governments, religious and spiritual leaders, et alia.

We have some guidance from ethicists and philosophers on whether to make particular kinds of tools (along the lines of the Declaration of Digital Autonomy); I am still looking for frameworks for deciding which tools to use (along the lines of the Franklin Street Statement on Freedom and Network Services). Software Freedom Conservancy has shared its thinking regarding GitHub Copilot (https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/). Their focus is particularly on "AI-assisted authorship of software using copylefted training sets"; as such, they advise against using Copilot.

Contribution

I could beta test implementations of this feature, and further discuss the ethical considerations I would want surfaced in proposed user interfaces.

Have you searched for similar issues before submitting this one?

  • Yes, I have searched for similar issues

Labels: enhancement (New feature or request)
