New Source Connector: 🤗 "Hugging Face Datasets" (optionally via DuckDB 🦆 ) #30

Open
@aaronsteers

Overview

A DuckDB blog post (linked below) came out two weeks ago, announcing a new feature: DuckDB can now read Hugging Face datasets directly using the hf:// URI prefix.

We think this would make an awesome connector for users in our community.

https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb.html
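
For reference, here is a minimal sketch of what reading through the hf:// prefix might look like from Python. The URI shape (hf://datasets/<user>/<dataset>/<path>) follows the linked blog post; the helper names `hf_uri`, `select_all_sql`, and `fetch_rows` are hypothetical, and `fetch_rows` assumes the `duckdb` package is installed (a version recent enough to support hf://) and that network access is available.

```python
from typing import Optional


def hf_uri(repo_id: str, file_path: str) -> str:
    """Build an hf:// URI of the form hf://datasets/<user>/<dataset>/<path>."""
    return f"hf://datasets/{repo_id.strip('/')}/{file_path.lstrip('/')}"


def select_all_sql(uri: str, limit: Optional[int] = None) -> str:
    """Build a SELECT over a remote file, escaping single quotes in the URI."""
    escaped = uri.replace("'", "''")
    sql = f"SELECT * FROM '{escaped}'"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


def fetch_rows(repo_id: str, file_path: str, limit: Optional[int] = None):
    """Run the query with DuckDB and return rows as dicts (needs network)."""
    import duckdb  # imported lazily so the helpers above work without it

    con = duckdb.connect()
    cursor = con.execute(select_all_sql(hf_uri(repo_id, file_path), limit))
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Something like `fetch_rows("datasets-examples/doc-formats-csv-1", "data.csv", limit=5)` would then pull a few rows, assuming that dataset path is still published.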

Technical spec

You would write a new source connector which can connect to Hugging Face source datasets and emit records from them, allowing Airbyte users to send these to any Airbyte destination.
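
As a rough illustration of "emit records", the sketch below wraps rows in Airbyte RECORD messages and serializes them as JSON lines. It uses plain dicts rather than the CDK's model classes, and the function names `record_message` and `serialize_records` are hypothetical; a real connector would most likely let the CDK produce these messages.

```python
import json
import time
from typing import Iterable, Iterator


def record_message(stream: str, data: dict) -> dict:
    """Wrap one row in the shape of an Airbyte RECORD message."""
    return {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": data,
            # Emission time in milliseconds since the Unix epoch
            "emitted_at": int(time.time() * 1000),
        },
    }


def serialize_records(stream: str, rows: Iterable[dict]) -> Iterator[str]:
    """Yield one JSON line per row, ready to be written to stdout."""
    for row in rows:
        yield json.dumps(record_message(stream, row))


if __name__ == "__main__":
    for line in serialize_records("train", [{"id": 1, "text": "hello"}]):
        print(line)
```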

Notes:

  • Using DuckDB for the source implementation is not strictly required, although we would prefer it, since it could be reused for similar use cases in the future.
  • We do not yet have a source connector for DuckDB, although we do have a Destination and a PyAirbyte Cache and SQLProcessor.
  • While we normally only assign one hackathon task at a time, we would reserve this particular issue for someone who wanted to build this on top of DuckDB and also pick up the related item:

Definition of Done

  • You would build a new "Hugging Face Datasets" source in Python (reusing code if helpful).
  • The source should accept configuration inputs that specify which Hugging Face dataset(s) to stream.
  • If primary keys exist, they should be registered in the catalog.
  • If incremental keys exist, they should also be described in the catalog.
  • You should use the Airbyte CDK as much as possible.
  • The connector should pass integration tests and acceptance tests.
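
To make the configuration and catalog items above more concrete, here is one possible sketch using plain dicts. The config field names (`datasets`, `repo_id`, `file_path`) and the helper `build_stream` are assumptions made for illustration, not an agreed spec; the stream keys (`source_defined_primary_key`, `default_cursor_field`, `supported_sync_modes`) follow the Airbyte protocol's stream definition.

```python
from typing import List, Optional

# Hypothetical connector config: which dataset files to stream
EXAMPLE_CONFIG = {
    "datasets": [
        {"repo_id": "datasets-examples/doc-formats-csv-1", "file_path": "data.csv"},
    ]
}


def build_stream(
    name: str,
    json_schema: dict,
    primary_key: Optional[List[str]] = None,
    cursor_field: Optional[str] = None,
) -> dict:
    """Describe one stream for the catalog, registering keys when present."""
    stream = {
        "name": name,
        "json_schema": json_schema,
        "supported_sync_modes": ["full_refresh"],
    }
    if primary_key:
        # Each key is a path into the record, hence the list of lists
        stream["source_defined_primary_key"] = [[col] for col in primary_key]
    if cursor_field:
        # An incremental key makes incremental syncs possible
        stream["supported_sync_modes"].append("incremental")
        stream["default_cursor_field"] = [cursor_field]
    return stream
```

Datasets without a natural primary key or cursor column would simply get a full-refresh stream with neither field set.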
