This repository hosts open discussions and documentation for designing, collecting, evaluating, and (where possible) releasing large‑scale, full‑duplex spontaneous conversation datasets in English. The project is led by oto (our startup), with a long‑term goal of collecting 10 million hours of natural, open‑domain small‑talk.
- Purpose: Build shared understanding and community consensus around dataset requirements, collection design, evaluation, and release policies for full‑duplex conversations.
- Focus: English, spontaneous and natural small‑talk audio conversations.
- Where discussions happen: Primarily on Hugging Face Discussions (link TBA).
Full‑duplex means both parties can speak at the same time without waiting for the other to finish. This allows overlaps, backchannels, interruptions, and collaborative repairs, which is closer to real human talk than half‑duplex, turn‑by‑turn systems.
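As a rough illustration of what "speaking at the same time" means in data terms, the sketch below finds the time spans where two speakers overlap. It assumes each speaker's speech is already available as a sorted list of `(start_sec, end_sec)` segments, for example from per-channel voice activity detection. The function names and the segment format are hypothetical and are not part of any agreed project spec.

```python
from typing import List, Tuple

# (start_sec, end_sec) of one stretch of speech; format is an assumption.
Segment = Tuple[float, float]


def overlap_intervals(a: List[Segment], b: List[Segment]) -> List[Segment]:
    """Return the time spans where both speakers talk at once.

    Assumes each list is sorted by start time and that segments within
    one list do not overlap each other.
    """
    i = j = 0
    out: List[Segment] = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # non-empty intersection
            out.append((start, end))
        # Advance whichever segment ends first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return out


def total_overlap_sec(a: List[Segment], b: List[Segment]) -> float:
    """Total duration (seconds) of simultaneous speech."""
    return sum(end - start for start, end in overlap_intervals(a, b))


speaker_a = [(0.0, 2.0), (3.0, 5.0)]
speaker_b = [(1.5, 3.2), (4.8, 6.0)]
print(overlap_intervals(speaker_a, speaker_b))
# [(1.5, 2.0), (3.0, 3.2), (4.8, 5.0)]
```

In a half‑duplex, turn‑by‑turn recording this list would be empty. In spontaneous conversation, short overlaps like these often correspond to backchannels or interruptions.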
- Short term: Define requirements, collection design, and ethics/privacy standards.
- Mid term: Establish prototype-scale collection, evaluation procedures, and metadata specs.
- Long term: Grow toward 10M hours and release datasets in staged, policy‑compliant ways.
We will begin with English as the first step, then expand to multiple languages in phases. The expansion pace and language priority will be guided by community discussion, ethical considerations, and data availability. Our long‑term goal is a diverse, multilingual, and representative corpus while maintaining strong privacy and consent standards.
- Spontaneous small‑talk (task‑free or loose‑goal)
- Conversation phenomena: overlap, backchannels, repairs, silences
- Audio plus metadata (recording conditions, anonymized speaker attributes, etc.)
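To make the "audio plus metadata" item more concrete, here is a minimal sketch of what a per-conversation metadata record could look like. The metadata spec has not been decided, so every field name and value below is a hypothetical placeholder offered as a starting point for discussion.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SpeakerInfo:
    speaker_id: str  # pseudonymous ID, never a real name (hypothetical field)
    age_band: str    # coarse band such as "30-39", not an exact age
    channel: int     # audio channel that carries this speaker


@dataclass
class ConversationRecord:
    conversation_id: str
    language: str             # e.g. "en"
    duration_sec: float
    sample_rate_hz: int
    recording_condition: str  # e.g. "headset, quiet room"
    speakers: List[SpeakerInfo] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() also converts the nested SpeakerInfo entries to dicts.
        return json.dumps(asdict(self), sort_keys=True)


record = ConversationRecord(
    conversation_id="conv_000001",
    language="en",
    duration_sec=612.4,
    sample_rate_hz=16000,
    recording_condition="headset, quiet room",
    speakers=[
        SpeakerInfo(speaker_id="spk_0001", age_band="30-39", channel=0),
        SpeakerInfo(speaker_id="spk_0002", age_band="20-29", channel=1),
    ],
)
print(record.to_json())
```

Keeping one speaker per channel, as assumed here, is one way to preserve overlapping speech without having to separate voices afterwards.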
Out of scope (for now)
- Unrestricted distribution of raw, personally identifying data
- Heavily scripted read‑aloud data as the only source
We prioritize participant privacy and ethics. Collection and releases will follow applicable laws and platform policies, with clear consent, anonymization, and redistribution terms. Detailed policies will be codified through community discussion.
- Join the discussions on Hugging Face (proposals, questions, debates)
- Use GitHub Issues for problem statements and requests
- Send Pull Requests to improve documentation and integrate proposals
Links (will be updated)
- Hugging Face Discussions: TBA
- oto (service): TBA
- When opening a new Issue or Discussion, briefly state background, purpose, and expected impact.
- Keep PRs minimal and link them to related discussions or issues.
- Contributions in English or Japanese are welcome.
oto is a project focused on end‑to‑end, full‑duplex conversational AI and large‑scale spontaneous dialogue datasets. We explore speech processing across overlapping talk, backchannels, interruptions, and collaborative repair, all of which are essential to human‑like interaction. Our work emphasizes:
- privacy‑first data collection and release policies,
- reproducible evaluation and metadata specs,
- open community discussions to shape standards and best practices.
We build on modern deep learning tooling and use recipe‑driven workflows for data processing and feature design, so that experiments in full‑duplex conversation research can be reproduced. If you are interested in contributing datasets, evaluation ideas, or tooling, please join the discussions.
Licensing for discussions and design docs is being finalized. Dataset releases may carry source‑specific terms. Once decided, this README and LICENSE will be updated.
- oto (startup) / Maintainers: TBA
Q. When will data be released? A. Releases will be staged based on ethics review, anonymization, and rights clearance. Timing will be updated as discussions progress.
Q. Are non‑English languages included? A. Not initially. We start with English and plan to expand to other languages in phases, guided by community discussion, ethical considerations, and data availability.