Convert huge CSVs to Parquet parts to bypass Databricks upload limits and speed up ingestion.
Motivation
- The Databricks workspace upload limit (~200 MB) is far below the size of the source CSVs (>7 GB).
- This repo contains only the code to split a large CSV into Parquet parts (no data committed).
Key points
- Streaming read with pyarrow keeps memory usage low (see the sketch after this list).
- Splits by target compressed size (~100 MB per file by default).
- Defaults: delimiter=';', encoding='latin1', compression='snappy'.
- Output files named: parte_001.parquet, parte_002.parquet, …
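The core idea, sketched below under the assumptions described in this README (paths and variable names are placeholders, not the exact code in notebooks/01_csv_to_parquet.py): read the CSV in record batches with pyarrow, accumulate batches until the estimated compressed size reaches the target, then flush a snappy-compressed Parquet part.

```python
# Minimal sketch of the streaming split (illustrative, not the actual script).
import pyarrow as pa
import pyarrow.csv as pv
import pyarrow.parquet as pq
from pathlib import Path

csv_path = "big_file.csv"           # placeholder: full path to your CSV
output_dir = Path("parquet_parts")  # placeholder: folder for the Parquet parts
max_file_size_mb = 100
compression_ratio = 0.20            # assumed compressed-size / in-memory-size ratio
buffer_limit = int(max_file_size_mb * 1024 * 1024 / compression_ratio)

output_dir.mkdir(parents=True, exist_ok=True)

# Stream the CSV with the defaults described above (delimiter ';', latin1).
reader = pv.open_csv(
    csv_path,
    read_options=pv.ReadOptions(encoding="latin1"),
    parse_options=pv.ParseOptions(delimiter=";"),
)

batches, buffered, part = [], 0, 0
for batch in reader:
    batches.append(batch)
    buffered += batch.nbytes
    if buffered >= buffer_limit:
        # Flush one part once the buffer should compress to ~max_file_size_mb.
        part += 1
        table = pa.Table.from_batches(batches)
        pq.write_table(table, output_dir / f"parte_{part:03d}.parquet", compression="snappy")
        batches, buffered = [], 0

if batches:  # flush the remaining rows as the last part
    part += 1
    table = pa.Table.from_batches(batches)
    pq.write_table(table, output_dir / f"parte_{part:03d}.parquet", compression="snappy")
```

Buffering by estimated compressed size avoids loading the full 7+ GB file into memory while keeping each output file near the ~100 MB target.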
Requirements
- Python 3.13.3
- pip install -r requirements.txt, which installs:
  - pandas
  - pyarrow
How to run (local)
- Edit the parameters at the top of the script (see the example below):
  - csv_path: full path to your CSV
  - output_dir: folder for the Parquet parts
  - max_file_size_mb: target size per file (compressed), e.g., 100
- Execute:
  - python notebooks/01_csv_to_parquet.py
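Concretely, the parameter block you edit at the top of the script looks roughly like this (the values shown are placeholders, not real paths):

```python
# Parameters to edit before running (placeholder values).
csv_path = r"C:\data\source_file.csv"  # hypothetical path: full path to your CSV
output_dir = r"C:\data\parquet_parts"  # folder where the Parquet parts are written
max_file_size_mb = 100                 # target compressed size per part, in MB
```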
Outputs
- Files are written to the folder defined in output_dir, e.g.:
  - parte_001.parquet, parte_002.parquet, …
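If you want a quick sanity check after the run, a small sketch like the one below (hypothetical, not part of the repo) lists each part with its row count and size on disk:

```python
# Optional check: list the generated parts with row counts and sizes.
from pathlib import Path
import pyarrow.parquet as pq

output_dir = Path("parquet_parts")  # assumption: same folder configured in the script
for part in sorted(output_dir.glob("parte_*.parquet")):
    meta = pq.read_metadata(part)
    size_mb = part.stat().st_size / (1024 * 1024)
    print(f"{part.name}: {meta.num_rows} rows, {size_mb:.1f} MB")
```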
Notes
- No data is versioned in this repo (code only).
- The script estimates the in-memory buffer size using an assumed compression ratio of ~0.20; final file sizes may therefore vary slightly from the target.
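For intuition, assuming the ~0.20 ratio means compressed Parquet size over in-memory size (the same interpretation used in the sketch above), the buffer target works out like this:

```python
# Illustrative arithmetic for the buffer estimate (assumed interpretation:
# the Parquet output is ~20% of the in-memory size of the buffered rows).
max_file_size_mb = 100
compression_ratio = 0.20
buffer_mb = max_file_size_mb / compression_ratio
print(buffer_mb)  # 500.0 -> buffer roughly 500 MB of rows per ~100 MB Parquet part
```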