New Source Connector: 🤗 "Hugging Face Datasets" (optionally via DuckDB 🦆 ) #30

Open
@aaronsteers

Overview

The blog post linked below came out two weeks ago. It announces a new feature: DuckDB can now read Hugging Face datasets directly using the hf:// URI prefix.

We think this would make an awesome connector for users in our community.

https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb.html
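For anyone picking this up, here is a minimal sketch of what reading through the `hf://` prefix might look like from Python. It assumes a recent `duckdb` package that can load the `httpfs` extension on demand, plus network access. The helper names `hf_uri` and `read_hf_records` and the `my_username/my_dataset` repository are hypothetical placeholders, not part of any existing connector.

```python
from typing import Any, Dict, List


def hf_uri(repo_id: str, path: str) -> str:
    """Build a URI of the form hf://datasets/<user>/<dataset>/<path>."""
    return f"hf://datasets/{repo_id.strip('/')}/{path.lstrip('/')}"


def read_hf_records(repo_id: str, path: str, limit: int = 5) -> List[Dict[str, Any]]:
    """Fetch up to `limit` rows from a Hugging Face dataset file as dicts.

    Assumption: DuckDB can resolve hf:// paths in this environment
    (needs network access and the httpfs extension).
    """
    import duckdb  # imported lazily so hf_uri() works without DuckDB installed

    # Escape single quotes so the URI is a valid SQL string literal.
    uri = hf_uri(repo_id, path).replace("'", "''")
    con = duckdb.connect()
    try:
        con.execute(f"SELECT * FROM '{uri}' LIMIT {int(limit)}")
        columns = [col[0] for col in con.description]
        return [dict(zip(columns, row)) for row in con.fetchall()]
    finally:
        con.close()


if __name__ == "__main__":
    # Hypothetical dataset; replace with a real repository and file path.
    print(hf_uri("my_username/my_dataset", "data.csv"))
```

Calling `read_hf_records("my_username/my_dataset", "data.csv")` would return a list of row dictionaries if that dataset existed.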

Technical spec

You would write a new source connector which can connect to Hugging Face source datasets and emit records from them, allowing Airbyte users to send these to any Airbyte destination.
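As a rough illustration of "emitting records", the sketch below wraps rows in Airbyte-protocol-style `RECORD` messages using plain `json`. A real connector would normally let the CDK build these messages; the function name `to_record_messages` and the stream name are hypothetical.

```python
import json
import time
from typing import Any, Dict, Iterable, Iterator


def to_record_messages(stream_name: str, rows: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield one JSON-encoded RECORD message per input row."""
    for row in rows:
        message = {
            "type": "RECORD",
            "record": {
                "stream": stream_name,
                "data": row,
                # Emission time in milliseconds since the epoch.
                "emitted_at": int(time.time() * 1000),
            },
        }
        yield json.dumps(message)


if __name__ == "__main__":
    # Hypothetical rows standing in for data read from a dataset.
    for line in to_record_messages("my_dataset", [{"id": 1, "text": "hello"}]):
        print(line)
```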

Notes:

  • It is not strictly required to use the DuckDB source implementation - although this is desirable for us since it could be leveraged for similar use cases in the future.
  • We do not yet have a source connector for DuckDB, although we do have a Destination and a PyAirbyte Cache and SQLProcessor.
  • While we normally only assign one hackathon task at a time, we would reserve this particular issue for someone who wanted to build this on top of DuckDB and also pick up the related item:

Definition of Done

  • You would build a new "Hugging Face Datasets" source in Python (reusing code if helpful).
  • The source should accept configuration inputs that specify which Hugging Face dataset(s) to stream.
  • If primary keys exist, they should be registered in the catalog.
  • If incremental keys exist, they should also be declared in the catalog.
  • You should use the CDK as much as possible.
  • The connector should pass integration tests and acceptance tests.
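To make the configuration and catalog items above more concrete, here is one possible shape, written with plain dataclasses and dicts rather than the CDK. The config fields (`repo_id`, `path`, `primary_key`, `cursor_field`) and the stream naming scheme are assumptions for illustration only. The catalog keys follow the Airbyte protocol's stream fields as I understand them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DatasetConfig:
    """Hypothetical per-dataset configuration for the source."""
    repo_id: str                     # e.g. "my_username/my_dataset"
    path: str                        # file or glob inside the dataset repo
    primary_key: List[str] = field(default_factory=list)
    cursor_field: Optional[str] = None


def to_catalog_stream(cfg: DatasetConfig, columns: Dict[str, str]) -> Dict[str, Any]:
    """Describe one dataset as a catalog stream entry.

    `columns` maps column names to JSON Schema type names.
    """
    stream: Dict[str, Any] = {
        "name": cfg.repo_id.replace("/", "__"),
        "json_schema": {
            "type": "object",
            "properties": {name: {"type": t} for name, t in columns.items()},
        },
        "supported_sync_modes": ["full_refresh"],
    }
    if cfg.primary_key:
        # Each key is a path into the record, hence the nested lists.
        stream["source_defined_primary_key"] = [[key] for key in cfg.primary_key]
    if cfg.cursor_field:
        stream["supported_sync_modes"].append("incremental")
        stream["default_cursor_field"] = [cfg.cursor_field]
    return stream


if __name__ == "__main__":
    cfg = DatasetConfig("my_username/my_dataset", "data.csv", ["id"], "updated_at")
    print(to_catalog_stream(cfg, {"id": "integer", "updated_at": "string"}))
```

Datasets without a declared key or cursor would simply be exposed as full-refresh streams in this sketch.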
