GitHub - Heterod0x/oto-open-full-duplex-spontaneous-conversation-datasets

OTO Open Full‑Duplex Spontaneous Conversation Datasets

This repository hosts open discussions and documentation for designing, collecting, evaluating, and (where possible) releasing large‑scale, full‑duplex spontaneous conversation datasets in English. The project is led by oto (our startup), with a long‑term goal of 10 million hours of natural, open‑domain small‑talk.

Overview

Purpose: Build shared understanding and community consensus around dataset requirements, collection design, evaluation, and release policies for full‑duplex conversations.
Focus: English, spontaneous and natural small‑talk audio conversations.
Where discussions happen: Primarily on Hugging Face Discussions (link TBA).

What is full‑duplex?

Full‑duplex means both parties can speak at the same time without waiting for the other to finish—allowing overlaps, backchannels, interruptions, and collaborative repairs, closer to real human talk than half‑duplex, turn‑by‑turn systems.
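To make the idea of overlapping talk concrete, here is a minimal sketch that finds the spans where two speakers talk at the same time. It assumes each speaker's speech is already available as a list of `(start_sec, end_sec)` segments, for example from per-channel voice activity detection, and that segments within one speaker do not overlap each other. The function name, the segment format, and the sample timings are illustrative assumptions, not part of any OTO specification.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical format


def overlap_intervals(a: List[Segment], b: List[Segment]) -> List[Segment]:
    """Return the time spans where both speakers are talking at once.

    Assumes segments within each list do not overlap one another.
    """
    a, b = sorted(a), sorted(b)
    i = j = 0
    out: List[Segment] = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever segment ends first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Made-up timings for illustration only.
speaker_a = [(0.0, 2.5), (4.0, 6.0)]
speaker_b = [(2.0, 3.0), (4.5, 5.0), (5.8, 7.0)]

overlaps = overlap_intervals(speaker_a, speaker_b)
total_overlap_sec = sum(end - start for start, end in overlaps)
print(overlaps)
```

In a half-duplex, turn-by-turn recording this list would be empty by construction. In full-duplex data, short overlaps like these are where backchannels and interruptions typically show up.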

Goals and milestones

Short term: Define requirements, collection design, and ethics/privacy standards.
Mid term: Establish prototype-scale collection, evaluation procedures, and metadata specs.
Long term: Grow toward 10M hours and release datasets in staged, policy‑compliant ways.

Language roadmap

We will begin with English as the first step, then expand to multiple languages in phases. The expansion pace and language priority will be guided by community discussion, ethical considerations, and data availability. Our long‑term goal is a diverse, multilingual, and representative corpus while maintaining strong privacy and consent standards.

Scope

Spontaneous small‑talk (task‑free or loose‑goal)
Conversation phenomena: overlap, backchannels, repairs, silences
Audio plus metadata (recording conditions, anonymized speaker attributes, etc.)
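As a starting point for discussion, the sketch below shows what a per-conversation metadata record covering the scope items above might look like. Every field name and value here is a hypothetical placeholder. The actual metadata spec has not been defined yet and is expected to come out of community discussion.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SpeakerInfo:
    # All fields are hypothetical examples of anonymized speaker attributes.
    speaker_id: str   # pseudonymous ID, never a real name
    age_band: str     # coarse band rather than an exact age
    channel: int      # audio channel carrying this speaker


@dataclass
class ConversationRecord:
    # Hypothetical record layout; not an agreed OTO schema.
    conversation_id: str
    sample_rate_hz: int
    environment: str              # recording condition, e.g. "quiet_room"
    consent_version: str          # which consent terms the speakers agreed to
    speakers: List[SpeakerInfo] = field(default_factory=list)
    phenomena: List[str] = field(default_factory=list)  # annotated phenomena


record = ConversationRecord(
    conversation_id="conv_000001",
    sample_rate_hz=16000,
    environment="quiet_room",
    consent_version="draft-0",
    speakers=[
        SpeakerInfo(speaker_id="spk_0001", age_band="30-39", channel=0),
        SpeakerInfo(speaker_id="spk_0002", age_band="20-29", channel=1),
    ],
    phenomena=["overlap", "backchannel", "repair", "silence"],
)

# asdict() recurses into nested dataclasses, so the result is JSON-ready.
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```

Storing one speaker per channel, as assumed here, is one way to keep overlapping speech separable. Whether that is a requirement is an open design question.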

Out of scope (for now)

Unrestricted distribution of raw, personally identifying data
Heavily scripted read‑aloud data as the only source

Ethics and privacy

We prioritize participant privacy and ethics. Collection and releases will follow applicable laws and platform policies, with clear consent, anonymization, and redistribution terms. Detailed policies will be codified through community discussion.

How to participate

Join the discussions on Hugging Face (proposals, questions, debates)
Use GitHub Issues for problem statements and requests
Send Pull Requests to improve documentation and integrate proposals

Links (will be updated)

Hugging Face Discussions: TBA
oto (service): TBA

Contribution guidelines

When opening a new Issue or Discussion, briefly state background, purpose, and expected impact.
Keep PRs minimal and link them to related discussions or issues.
Contributions in either English or Japanese are welcome.

About OTO (for Hugging Face Organization Card)

OTO is a project focused on end‑to‑end, full‑duplex conversational AI and large‑scale spontaneous dialogue datasets. We explore speech processing across overlapping talk, backchannels, interruptions, and collaborative repair—phenomena essential to human‑like interaction. Our work emphasizes:

privacy‑first data collection and release policies,
reproducible evaluation and metadata specs,
open community discussions to shape standards and best practices.

We build on modern deep learning tooling and follow rigorous data processing, feature design, and recipe‑driven workflows to enable comprehensive experimentation in full‑duplex conversation research. If you are interested in contributing datasets, evaluation ideas, or tooling, please join the discussions.

License

Licensing for discussions and design docs is being finalized. Dataset releases may carry source‑specific terms. Once decided, this README and LICENSE will be updated.

Maintainers

oto (startup) / Maintainers: TBA

FAQ

Q. When will data be released? A. Releases will be staged based on ethics review, anonymization, and rights clearance. Timing will be updated as discussions progress.

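Q. Are non‑English languages included? A. Not initially. Our first focus is English. We plan to expand to other languages in phases, with the pace and language priority guided by community discussion, ethical considerations, and data availability.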
