Automatically infers and recommends Spark schemas for CSV and JSON datasets, with column-level confidence metrics.
A lightweight PySpark utility that infers a recommended Spark schema for CSV or JSON datasets from a configurable sample of the data. It analyzes data types and field patterns, then outputs a recommended StructType along with column-level correlation and confidence metrics so you can validate inference accuracy before committing to the schema.
Key Features
- Adaptive schema inference for CSV and JSON files
- Configurable sampling for large datasets
- Outputs recommended StructType for direct Spark use
- Provides correlation scores and confidence indicators per column
- Easily integrates into existing ETL or data-validation pipelines
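To illustrate the core idea behind sampled inference with per-column confidence, here is a simplified, Spark-free sketch. The function name `infer_column_types` and the type set are illustrative assumptions, not the utility's actual API: it samples rows, attempts progressively broader type casts per column, and reports the fraction of sampled values that matched the winning type as a confidence score.

```python
import csv
import io

def infer_column_types(rows, sample_size=100):
    """Infer a type for each column from a sample of rows.

    Returns {column: (type_name, confidence)}, where confidence is the
    fraction of sampled values that matched the winning type.
    Note: this is an illustrative sketch, not the utility's real API.
    """
    sample = rows[:sample_size]
    if not sample:
        return {}
    result = {}
    for col in sample[0].keys():
        counts = {"int": 0, "float": 0, "string": 0}
        for row in sample:
            value = row[col]
            try:
                int(value)          # most specific type first
                counts["int"] += 1
                continue
            except ValueError:
                pass
            try:
                float(value)
                counts["float"] += 1
                continue
            except ValueError:
                pass
            counts["string"] += 1   # fallback: everything parses as string
        best = max(counts, key=counts.get)
        result[col] = (best, counts[best] / len(sample))
    return result

# Example: 'price' has one unparseable value, so its confidence drops below 1.0.
data = "id,price,label\n1,9.99,a\n2,19.50,b\n3,bad,c\n"
rows = list(csv.DictReader(io.StringIO(data)))
print(infer_column_types(rows))
# → {'id': ('int', 1.0), 'price': ('float', 0.6666666666666666), 'label': ('string', 1.0)}
```

The real utility applies the same principle at Spark scale, mapping the winning types onto Spark's `StructType`/`StructField` objects so the result can be passed directly to `spark.read.csv(..., schema=...)`.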
Use Case
Ideal for data engineers and analysts who need a fast, automated way to generate reliable Spark schemas without manually inspecting data samples.
**Developed with assistance from AI (GPT-5) and thoroughly tested by a human.**