docs(examples): create demos on large data volumes #126
Open
Description
Objective
Create two real-world reference projects that showcase Ibis and IbisML at scale.
Outcomes
- Documented end-to-end ML projects, including:
  - data ingestion
  - data exploration (using Ibis; stretch: produce visualizations using existing Ibis integrations)
  - data processing (including feature engineering using Ibis)
  - train-test split (manually using Ibis)
  - last-mile feature preprocessing (using IbisML)
  - handoff to model (approach TBD)
  - modeling (one using Dask-XGBoost on GPU, another using PyTorch)
  - stretch: real-time inference
  Ideally, these can be written up as (a series of) blog posts in the future.
  They can also be submitted to conferences.
  It could be useful to track the approximate time needed for each stage of the project (e.g. to confirm whether most time really is spent on feature engineering).
- Lessons learned on model handoff that can inform future work (if any is necessary) in that area for IbisML
  - We also expect feedback across the rest of the pipeline, but this is where we have the most uncertainty
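One lightweight way to record the approximate time spent in each stage is a small timing context manager. The sketch below uses only the Python standard library; the names `timed_stage`, `summarize`, and `stage_seconds` are hypothetical, and the `time.sleep` calls are placeholders for the real ingestion and feature engineering code.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per pipeline stage (hypothetical helper state).
stage_seconds = {}


@contextmanager
def timed_stage(name):
    """Add the wall-clock time spent inside the `with` block to `stage_seconds[name]`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_seconds[name] = stage_seconds.get(name, 0.0) + elapsed


def summarize(timings):
    """Return one line per stage, slowest first, with its share of the total time."""
    total = sum(timings.values())
    lines = []
    for name, secs in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
        share = secs / total if total else 0.0
        lines.append(f"{name}: {secs:.3f}s ({share:.0%})")
    return lines


# Placeholder workloads; in a real project these blocks would wrap the actual stages.
with timed_stage("data ingestion"):
    time.sleep(0.01)

with timed_stage("feature engineering"):
    time.sleep(0.02)

print("\n".join(summarize(stage_seconds)))
```

Because the timings accumulate per stage name, the same stage can be wrapped in several places (for example, across notebook cells) and still be reported as one total.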
Projects
- Lichess live win probability using distributed XGBoost
  - Full dataset size: >12TB
- TBD using PyTorch
  - (Backup option) NYC taxi dataset
  - (Backup option) Bureau of Transportation Statistics full airline dataset
Metadata
Type
Projects
Status
backlog