Overview
A DuckDB blog post published two weeks ago (linked below) announced a new feature: DuckDB can now read Hugging Face datasets directly using the hf:// URI prefix.
We think this would make an awesome connector for users in our community.
https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb.html
Technical spec
You would write a new source connector which can connect to Hugging Face source datasets and emit records from them, allowing Airbyte users to send these to any Airbyte destination.
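As a rough illustration of the DuckDB-based approach, the sketch below builds an hf:// URL and yields each row as a dictionary, which is the shape a source would wrap into record messages. The hf://datasets/<repo>/<file> path layout follows the blog post linked above. The function names (build_hf_url, build_query, read_records) and the batch size are hypothetical, not part of any existing connector, and the sketch assumes the duckdb Python package is installed and that the dataset file is publicly readable.

```python
from typing import Any, Dict, Iterator


def build_hf_url(repo_id: str, file_path: str) -> str:
    """Build an hf:// URL for one file in a Hugging Face dataset repository."""
    return f"hf://datasets/{repo_id.strip('/')}/{file_path.lstrip('/')}"


def build_query(url: str) -> str:
    """Return a SELECT over the file, doubling single quotes in the URL."""
    escaped = url.replace("'", "''")
    return f"SELECT * FROM '{escaped}'"


def read_records(url: str, batch_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield each row of the file at `url` as a dict, fetched in batches."""
    # Imported lazily so the helpers above work without duckdb installed.
    import duckdb

    con = duckdb.connect()  # in-memory database, nothing is persisted
    try:
        con.execute(build_query(url))
        columns = [col[0] for col in con.description]
        while True:
            rows = con.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                yield dict(zip(columns, row))
    finally:
        con.close()
```

Calling read_records(build_hf_url("datasets-examples/doc-formats-csv-1", "data.csv")) would stream the example CSV used in the blog post. A real connector would still need to handle authentication for private datasets, multi-file datasets, and type mapping, none of which is covered here.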
Notes:
- It is not strictly required to use the DuckDB source implementation - although this is desirable for us since it could be leveraged for similar use cases in the future.
- We do not yet have a source connector for DuckDB, although we do have a Destination and a PyAirbyte Cache and SQLProcessor.
- While we normally only assign one hackathon task at a time, we would reserve this particular issue for someone who wanted to build this on top of DuckDB and also pick up the related item:
Definition of Done
- You would build a new "Hugging Face Datasets" source in Python (reusing code if helpful).
- The source should accept configuration inputs that specify which Hugging Face dataset(s) to stream.
- If primary keys exist, they should be registered in the catalog.
- If incremental keys (cursor fields) exist, they should also be declared in the catalog.
- You should use the CDK as much as possible.
- The connector should pass integration tests and acceptance tests.
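To make the configuration and catalog requirements above more concrete, here is a minimal sketch using plain dictionaries rather than CDK classes. The config field names (datasets, repo_id, file_path, primary_key, cursor_field) and the describe_stream helper are assumptions made for illustration, not an agreed design. It also assumes the user supplies any primary key or cursor field in the config, since it is not established here whether Hugging Face datasets expose them. The stream keys follow the Airbyte protocol's AirbyteStream fields.

```python
from typing import Any, Dict

# Hypothetical connector spec: field names are illustrative only.
SPEC: Dict[str, Any] = {
    "type": "object",
    "required": ["datasets"],
    "properties": {
        "datasets": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["repo_id", "file_path"],
                "properties": {
                    "repo_id": {"type": "string"},
                    "file_path": {"type": "string"},
                    "primary_key": {"type": "array", "items": {"type": "string"}},
                    "cursor_field": {"type": "string"},
                },
            },
        }
    },
}


def describe_stream(dataset: Dict[str, Any], columns: Dict[str, Any]) -> Dict[str, Any]:
    """Build a catalog entry (as a plain dict) for one configured dataset."""
    primary_key = dataset.get("primary_key") or []
    cursor_field = dataset.get("cursor_field")
    sync_modes = ["full_refresh"]
    if cursor_field:
        sync_modes.append("incremental")
    stream: Dict[str, Any] = {
        # Stream names should not contain "/", so replace the repo separator.
        "name": dataset["repo_id"].replace("/", "__"),
        "json_schema": {"type": "object", "properties": columns},
        "supported_sync_modes": sync_modes,
        # Each primary key column is a field path, i.e. a list of strings.
        "source_defined_primary_key": [[col] for col in primary_key],
    }
    if cursor_field:
        stream["source_defined_cursor"] = True
        stream["default_cursor_field"] = [cursor_field]
    return stream
```

With the CDK, the same information would normally be expressed through the stream class's primary key and cursor field settings instead of hand-built dictionaries.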