A Python MapReduce Framework for parsing and validating large XML datasets.

chris-relaxing/mapreduce-framework

mapreduce-framework

This is a Python MapReduce framework that runs on Hadoop (HDFS) and validates a large XML dataset. It is written for a specific product schema, but the underlying framework can be adapted to any XML schema. It works best with the paramiko-scp MapReduce automation script that I wrote:
https://github.com/chris-relaxing/paramiko-scp

[Figure: overview of how the framework works]

[Figures: notes about the mapper and reducer]
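
To give a rough idea of what a validating mapper and reducer can look like, here is a minimal sketch. It is not the code from this repository. It assumes each input line holds one complete XML record, and the names `REQUIRED_FIELDS`, `map_record`, `reduce_counts`, and `run_local` are hypothetical, as are the `id` and `name` fields (the real product schema is not shown here).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical required child elements; stand-ins for the real product schema.
REQUIRED_FIELDS = ("id", "name")


def map_record(line):
    """Mapper step: yield (status, 1) pairs for one XML record held on one line."""
    line = line.strip()
    if not line:
        return  # skip blank lines
    try:
        elem = ET.fromstring(line)
    except ET.ParseError:
        yield ("malformed_xml", 1)
        return
    # Use "is None": an element with no children is falsy in ElementTree.
    missing = [field for field in REQUIRED_FIELDS if elem.find(field) is None]
    if missing:
        for field in missing:
            yield ("missing_" + field, 1)
    else:
        yield ("valid", 1)


def reduce_counts(pairs):
    """Reducer step: sum the counts emitted for each status key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_local(lines):
    """Run map, sort (standing in for the shuffle), and reduce in one process."""
    pairs = [pair for line in lines for pair in map_record(line)]
    return reduce_counts(sorted(pairs))


if __name__ == "__main__":
    sample = [
        "<product><id>1</id><name>a</name></product>",
        "<product><id>2</id></product>",
        "<product><id>3</product>",
    ]
    for status, total in sorted(run_local(sample).items()):
        print(status + "\t" + str(total))
```

On a cluster, the same two steps would typically live in separate scripts that read from standard input and write tab-separated key/value lines to standard output, which is the convention Hadoop Streaming expects. `run_local` only simulates that flow for quick testing on a small sample.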
