docs(examples): create demos on large data volumes #126

Open
@deepyaman

Description

Objective

Create two real-world reference projects that showcase Ibis and IbisML at scale.

Outcomes

  • Documented end-to-end ML projects, including:

    • data ingestion
    • data exploration (using Ibis; stretch: produce visualizations using existing Ibis integrations)
    • data processing (including feature engineering using Ibis)
    • train-test split (manually using Ibis)
    • last-mile feature preprocessing (using IbisML)
    • handoff to model (approach TBD)
    • modeling (one using Dask-XGBoost on GPU, another using PyTorch)
    • stretch: real-time inference

    Ideally, these can be written up as a blog post (or a series of posts) in the future.
    They can also be submitted to conferences.
    It could be useful to track approximate time needed for each stage of the project (e.g. to confirm whether most time really is spent on feature engineering).

  • Lessons learned on model handoff that can inform future work in that area for IbisML (if any is necessary)

  • We also expect feedback across the rest of the pipeline, but model handoff is where we have the most uncertainty
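
To make two of the items above concrete (tracking approximate time per stage, and a manual train-test split), here is a minimal sketch in plain Python using only the standard library. It is not Ibis or IbisML code. The names `timed_stage`, `stage_seconds`, `split_bucket`, and `is_train`, the 80/20 ratio, and the `game-<n>` keys are all hypothetical and chosen only for illustration. The split hashes a record key into a stable bucket, so the same record always lands on the same side regardless of row order or how many times the split is recomputed.

```python
import hashlib
import time
from contextlib import contextmanager

# Hypothetical registry: wall-clock seconds accumulated per pipeline stage.
stage_seconds = {}


@contextmanager
def timed_stage(name):
    """Accumulate the elapsed wall-clock time of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_seconds[name] = stage_seconds.get(name, 0.0) + elapsed


def split_bucket(key, num_buckets=100):
    """Map a record key to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def is_train(key, train_pct=80):
    """Assign roughly train_pct percent of keys to the training set."""
    return split_bucket(key) < train_pct


# Hypothetical keys standing in for real record identifiers.
game_ids = [f"game-{i}" for i in range(1000)]

with timed_stage("train-test split"):
    train = [g for g in game_ids if is_train(g)]
    test = [g for g in game_ids if not is_train(g)]

for name, seconds in stage_seconds.items():
    print(f"{name}: {seconds:.4f}s")
```

In the actual projects, the same predicate would presumably be written as an expression on a key column of an Ibis table so that it runs in the backend. The exact expression is not shown here and would need to be checked against the Ibis documentation. Wrapping each stage (ingestion, exploration, feature engineering, and so on) in its own `timed_stage` block would give the per-stage comparison mentioned above.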

Projects

  • Lichess live win probability using distributed XGBoost
    • Full dataset size: >12 TB
  • TBD using PyTorch
  • (Backup option) NYC taxi dataset
  • (Backup option) Bureau of Transportation Statistics full airline dataset

Metadata

Labels

documentation: Improvements or additions to documentation
