Description
In discussion with @ParamThakkar123, we realized that distributing data sources from HuggingFace is quite important! Here is an issue describing how we should build this out:
Issue Description
Difficulty: Intermediate
Time: 12 - 15 hours
Description:
This issue aims to extend HealthSampleData.jl with automatic dataset fetching and management capabilities using HuggingFaceHub.jl and DataDeps.jl. Currently, users must manually download datasets (e.g., synthea_1M_3YR.duckdb) from external sources.
With this enhancement, users will be able to run:
using HealthSampleData
path = HealthSampleData.load("synthea_1M_3YR") # Or something like this
and have the dataset automatically downloaded, cached, and reproducibly managed using Hugging Face and DataDeps.
Requirements
- Add dependencies
  - Add HuggingFaceHub.jl and DataDeps.jl to Project.toml.
  - Ensure both packages are available and compatible with at least Julia 1.10.
- Create dataset registration helpers for HuggingFaceHub.jl
  - Implement a function _huggingface_dataset_register(name::String, repo::String, filename::String).
  - Use HF.info(HF.Dataset, repo) to locate dataset metadata and HF.file_download() to retrieve files.
  - Register the dataset using DataDeps.jl (you'll need to consult the documentation here):
    register(DataDep(
        name,
        """
        JuliaHealth synthetic dataset (1M patients, 3 years of data).
        Source: https://huggingface.co/JuliaHealthOrg/JuliaHealthDatasets
        """,
        "https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/resolve/main/synthea_1M_3YR.duckdb";
        post_fetch_method = somethingsomething
    ))
- Register JuliaHealthDatasets as DataDeps
- Documentation
  - Update README.md with:
    - Installation instructions for HuggingFaceHub.jl and DataDeps.jl.
    - Examples of dataset loading and caching.
    - Instructions for setting Hugging Face tokens.
Expected Outcomes
The implemented functionality should:
- Automatically download datasets from Hugging Face Hub using HuggingFaceHub.jl.
- Cache and manage datasets locally using DataDeps.jl.
- Provide a reproducible and Julia-native dataset management workflow.
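The caching behavior described above could be sketched with DataDeps.jl alone, as below. This is a minimal sketch, not the planned implementation: the names register_synthea and load_synthea are hypothetical, the file name and URL are the ones quoted in this issue, and no checksum is supplied (DataDeps.jl will warn about that and print the hash it computed).

```julia
using DataDeps

# File name and URL are taken from the issue text; everything else is illustrative.
const SYNTHEA_FILE = "synthea_1M_3YR.duckdb"
const SYNTHEA_URL =
    "https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/resolve/main/" *
    SYNTHEA_FILE

# Hypothetical helper: registering only records the DataDep, nothing is downloaded yet.
# In a package this would typically be called from __init__().
function register_synthea()
    register(DataDep(
        "synthea_1M_3YR",
        "JuliaHealth synthetic dataset (1M patients, 3 years of data).",
        SYNTHEA_URL,
    ))
end

# Hypothetical loader: resolving the datadep string downloads the file on first use
# (after the user accepts the DataDeps prompt) and returns the cached path afterwards.
load_synthea() = datadep"synthea_1M_3YR/synthea_1M_3YR.duckdb"
```

In non-interactive settings such as CI, DataDeps.jl skips the confirmation prompt when ENV["DATADEPS_ALWAYS_ACCEPT"] = "true" is set before the first access.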
Example Implementation
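using DataDeps
import HuggingFaceHub as HF  # alias used by the calls below

function _huggingface_dataset_register(name::String, repo::String, filename::String)
    # Look up the dataset repository metadata on the Hub
    dataset = HF.info(HF.Dataset, repo)
    # Download the requested file from that repository
    HF.file_download(dataset, filename)
end

#=
Register the DataDep later using this information.
`name`, `repo`, and `filename` are the same values passed to the function above
and must be in scope here (for example, by moving this call into the function).
=#
register(DataDep(
    name,
    "Dataset from Hugging Face repository $(repo).",
    "https://huggingface.co/datasets/$(repo)/resolve/main/$(filename)";
    post_fetch_method = identity
))
Example user workflow: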
julia> using HealthSampleData
julia> path = HealthSampleData.load("synthea_1M_3YR")
Downloading dataset from Hugging Face...
100% complete!
@info "Dataset available at /home/datadeps/synthea_1M_3YR.duckdb"
You can then open the dataset as:
using DuckDB
con = DBInterface.connect(DuckDB.DB, path)
Future Extensions
- Data versioning using Hugging Face revision tags.
- Command-line interface (healthdata list, healthdata download) for dataset management.
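For the data versioning idea, one possible starting point is to make the revision part of the download URL, since Hugging Face "resolve" URLs take a branch, tag, or commit in place of main. The sketch below assumes that approach; hf_dataset_url and datadep_name are hypothetical helpers, and the tag "v1.0" is only an illustration, not an existing tag of the repository.

```julia
# Hypothetical helper: build the "resolve" URL for a file in a Hugging Face dataset repo.
# `revision` may be a branch, tag, or commit hash; it defaults to the main branch.
function hf_dataset_url(repo::AbstractString, filename::AbstractString;
                        revision::AbstractString = "main")
    return "https://huggingface.co/datasets/$(repo)/resolve/$(revision)/$(filename)"
end

# Hypothetical naming scheme: give each pinned revision its own DataDep name,
# so that different versions are cached in separate directories.
datadep_name(name::AbstractString, revision::AbstractString = "main") =
    revision == "main" ? String(name) : "$(name)_$(revision)"

# Example values: the repository and file come from this issue, the tag is made up.
url = hf_dataset_url("JuliaHealthOrg/JuliaHealthDatasets", "synthea_1M_3YR.duckdb";
                     revision = "v1.0")
dep = datadep_name("synthea_1M_3YR", "v1.0")
```

The resulting url and dep could then be passed as the remote path and name of a DataDep, keeping the default main revision backward compatible with the unversioned name.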