damianshub/PySpark_Automations

PySpark Smart Schema Automation

Automatically infers and recommends Spark schemas for CSV and JSON datasets with column-level confidence metrics. Ideal for data engineers and analysts who want fast, reliable schema generation.


A lightweight PySpark utility that infers and recommends a Spark schema for CSV or JSON datasets from a configurable sample of the data. It analyzes data types and field patterns, then outputs a recommended schema along with column-level correlation and confidence metrics so you can check how reliable the inference is.

Key Features

  • Adaptive schema inference for CSV and JSON files
  • Configurable sampling for large datasets
  • Outputs recommended StructType for direct Spark use
  • Provides correlation scores and confidence indicators per column
  • Easily integrates into existing ETL or data-validation pipelines
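
To make the idea of sample-based inference with per-column confidence concrete, here is a minimal pure-Python sketch. It is not this repository's implementation: the function names (`recommend_schema`, `to_ddl`), the `threshold` parameter, and the definition of confidence as "share of non-empty sampled values that parse as the chosen type" are all assumptions made for illustration.

```python
import csv
import io


def _to_bool(value):
    """Parse 'true'/'false' (case-insensitive); raise ValueError otherwise."""
    text = value.strip().lower()
    if text not in ("true", "false"):
        raise ValueError(value)
    return text == "true"


def _parses(value, caster):
    """Return True if caster(value) succeeds without a ValueError."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


# Candidate types, tried from narrowest to widest (hypothetical ordering).
CANDIDATES = [("INT", int), ("DOUBLE", float), ("BOOLEAN", _to_bool)]


def recommend_schema(rows, sample_size=1000, threshold=0.9):
    """Map each column to (type_name, confidence) using the first sample_size rows.

    Assumption: confidence is the share of non-empty sampled values that
    parse as the chosen type; columns that match no candidate (or have no
    values) fall back to STRING with confidence 1.0.
    """
    sample = rows[:sample_size]
    columns = list(sample[0].keys()) if sample else []
    recommendation = {}
    for col in columns:
        values = [r.get(col) for r in sample if r.get(col) not in (None, "")]
        chosen, confidence = "STRING", 1.0
        if values:
            for type_name, caster in CANDIDATES:
                share = sum(_parses(v, caster) for v in values) / len(values)
                if share >= threshold:
                    chosen, confidence = type_name, share
                    break
        recommendation[col] = (chosen, confidence)
    return recommendation


def to_ddl(recommendation):
    """Render the recommendation as a DDL-style schema string."""
    return ", ".join(f"{col} {typ}" for col, (typ, _) in recommendation.items())


# Small in-memory CSV standing in for a real file.
SAMPLE_CSV = "id,price,active,name\n1,9.5,true,a\n2,10,false,b\n3,abc,true,c\n4,7.25,true,d\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
recommendation = recommend_schema(rows, sample_size=100, threshold=0.75)
print(recommendation)
print(to_ddl(recommendation))
```

A DDL-style string like the one produced by `to_ddl` can be passed to `spark.read.schema(...)`, which accepts either a `StructType` or a DDL-formatted string. A confidence below 1.0, as for the `price` column here, signals that some sampled values did not fit the recommended type and the column is worth inspecting before the schema is used in a pipeline.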

Use Case

Ideal for data engineers and analysts who need a fast, automated way to generate reliable Spark schemas without manually inspecting data samples.

Developed with assistance from AI (GPT-5) and thoroughly tested by a human.
