[FEATURE] Download Data Sources from HuggingFace #12

@TheCedarPrince

In discussion with @ParamThakkar123, we realized that distributing data sources from HuggingFace is quite important! Here is an issue describing how we should build this out:

Issue Description

Difficulty: Intermediate
Time: 12 - 15 hours

Description:
This issue aims to extend HealthSampleData.jl with automatic dataset fetching and management capabilities using HuggingFaceHub.jl and DataDeps.jl. Currently, users must manually download datasets (e.g., synthea_1M_3YR.duckdb) from external sources.
With this enhancement, users will be able to run:

using HealthSampleData
path = HealthSampleData.load("synthea_1M_3YR") # Or something like this

and have the dataset automatically downloaded, cached, and reproducibly managed using Hugging Face and DataDeps.
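
A minimal sketch of what such a `load` function might look like, assuming the dataset has already been registered with DataDeps.jl under `name` and that the file inside the DataDep directory is called `name * ".duckdb"`. Both the function name and that file-naming convention are assumptions taken from the example above, not a settled API.

```julia
using DataDeps

# Hypothetical entry point; the final name and signature are still open.
function load(name::AbstractString)
    # Resolving a DataDep triggers the download on first use and
    # returns the local directory where the files are cached.
    dir = @datadep_str name
    # Assumed convention: the file is named after the dataset.
    path = joinpath(dir, name * ".duckdb")
    isfile(path) || error("expected $(path) after fetching DataDep $(name)")
    return path
end
```

On later calls DataDeps.jl finds the cached copy and returns the same path without downloading again.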


Requirements

  • Add dependencies

    • Add HuggingFaceHub.jl and DataDeps.jl to Project.toml.
    • Ensure both packages install and work on Julia 1.10 and later.
  • Create dataset registration helpers for HuggingFaceHub.jl

    • Implement a function _huggingface_dataset_register(name::String, repo::String, filename::String).

    • Use HF.info(HF.Dataset, repo) to locate dataset metadata and HF.file_download() to retrieve files.

    • Register the dataset using DataDeps.jl (you'll need to consult the documentation here):

      register(DataDep(
          name,
          """
          JuliaHealth synthetic dataset (1M patients, 3 years of data).
          Source: https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets
          """,
          "https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/resolve/main/synthea_1M_3YR.duckdb";
          post_fetch_method = identity  # placeholder; swap in unpacking or validation if needed
      ))
  • Register JuliaHealthDatasets as DataDeps

  • Documentation

    • Update README.md with:

      • Installation instructions for HuggingFaceHub.jl and DataDeps.jl.
      • Examples of dataset loading and caching.
      • Instructions for setting Hugging Face tokens.
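
One way the registration requirements above could fit together is a small catalog that maps dataset names to their Hugging Face location, plus a loop that registers each entry. This is only a sketch: `CATALOG`, `hf_resolve_url`, and `register_catalog` are hypothetical names, and the URL pattern is copied from the example in this issue.

```julia
using DataDeps

# Hypothetical catalog: dataset name => (Hugging Face repo, file name).
const CATALOG = Dict(
    "synthea_1M_3YR" => (repo = "JuliaHealthOrg/JuliaHealthDatasets",
                         filename = "synthea_1M_3YR.duckdb"),
)

# Build the direct-download URL, following the pattern used in this issue.
# `revision` defaults to "main"; a tag or commit could be passed instead.
hf_resolve_url(repo, filename; revision = "main") =
    "https://huggingface.co/datasets/$(repo)/resolve/$(revision)/$(filename)"

# Register every catalog entry; nothing is downloaded until first access.
function register_catalog()
    for (name, spec) in CATALOG
        register(DataDep(
            name,
            "Dataset from Hugging Face repository $(spec.repo).",
            hf_resolve_url(spec.repo, spec.filename),
        ))
    end
end
```

In a package, `register_catalog()` would typically be called from the module's `__init__` function, which is where DataDeps.jl recommends registering.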

Expected Outcomes

The implemented functionality should:

  1. Automatically download datasets from Hugging Face Hub using HuggingFaceHub.jl.
  2. Cache and manage datasets locally using DataDeps.jl.
  3. Provide a reproducible and Julia-native dataset management workflow.
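
For the reproducibility goal, one common approach is to pin a SHA-256 digest for each file. `DataDep` accepts a checksum as an optional fourth positional argument; the sketch below only shows how such a digest could be computed and checked with the `SHA` standard library. `file_sha256` and `verify_download` are hypothetical helper names.

```julia
using SHA

# Return the SHA-256 digest of a file as a lowercase hex string.
file_sha256(path::AbstractString) = bytes2hex(open(sha256, path))

# Compare a downloaded file against a pinned digest (hypothetical helper).
function verify_download(path::AbstractString, expected::AbstractString)
    actual = file_sha256(path)
    actual == lowercase(expected) ||
        error("checksum mismatch for $(path): got $(actual)")
    return path
end
```

The digest printed by `file_sha256` for a known-good copy of the dataset could then be stored alongside the registration.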

Example Implementation

import HuggingFaceHub as HF
using DataDeps

function _huggingface_dataset_register(name::String, repo::String, filename::String)
    # Look up the dataset repository metadata on the Hugging Face Hub.
    dataset = HF.info(HF.Dataset, repo)
    # Fetch the file through the Hugging Face client.
    HF.file_download(dataset, filename)

    # Register the DataDep using the same information, so that
    # `name`, `repo`, and `filename` are in scope.
    register(DataDep(
        name,
        "Dataset from Hugging Face repository $(repo).",
        "https://huggingface.co/datasets/$(repo)/resolve/main/$(filename)";
        post_fetch_method = identity
    ))
end

Example user workflow:

julia> using HealthSampleData
julia> path = HealthSampleData.load("synthea_1M_3YR")
Downloading dataset from Hugging Face...
100% complete!
@info "Dataset available at /home/datadeps/synthea_1M_3YR.duckdb"

You can then open the dataset as:

using DuckDB
con = DBInterface.connect(DuckDB.DB, path)
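
Once connected, queries go through `DBInterface.execute`. The sketch below uses an in-memory database and a made-up `person` table so that it runs without the real file; the actual table names depend on the dataset, and DataFrames.jl is an extra dependency assumed here for convenience.

```julia
using DuckDB, DataFrames

# An in-memory database stands in for the downloaded file here;
# pass `path` instead of ":memory:" to open the real dataset.
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Hypothetical table, created only so the query below has something to read.
DBInterface.execute(con, "CREATE TABLE person (person_id INTEGER)")
DBInterface.execute(con, "INSERT INTO person VALUES (1), (2), (3)")

# Materialize the query result as a DataFrame and print the row count.
df = DataFrame(DBInterface.execute(con, "SELECT count(*) AS n FROM person"))
println(df.n[1])

DBInterface.close!(con)
```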

Future Extensions

  • Data versioning using Hugging Face revision tags.
  • Command-line interface (healthdata list, healthdata download) for dataset management.
