Putting CSV headers in alphabetical order because we don't know what else to do with them seems like a recipe for disaster. It makes the data hard to read and analyze, e.g., the most important columns in my dataset are ... the second earliest, and the latest.
The CSV data can also include multiple types of tables, such as a summary and detailed data sets, that make it hard or impossible to interpret. As much as a regular CSV has someone else's judgment imposed upon the user, these have ... no sense at all.
Here's an ordering idea, by kind of spidering through. This assumes each header name matches exactly wherever it appears, with no differences in whitespace.
Secondary tables may have fields with the same names, but sorting order will be determined by the most popular table they appear in.
Where we have a list of dicts or list of lists where each represents an ostensible complete CSV header:
Build a tracking dictionary.
Traverse the entire list-ofs. For each header entity, see if the first entry exists in the tracker.
If not, create an entry in the tracking dictionary keyed to the first entry's name, holding "tally" as an integer 0 and "children" as an empty dictionary.
For that first entry, increment the "tally".
For each subsequent header item in that row, begin appending as children the same way we built the first level except "handled" is not needed.
We should end up with a functional tree.
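A minimal sketch of the tree-building steps above, in Python. The function name build_header_tree and the sample rows are made up for illustration. It assumes each subsequent header item nests under the one before it, and that children carry their own tally; the notes don't spell out either point.

```python
def build_header_tree(header_rows):
    """Build a nested tally tree from a list of header rows.

    Each row may be a list of header names, or a dict whose keys are the
    header names (dicts keep insertion order in Python 3.7+).
    """
    tracker = {}
    for row in header_rows:
        level = tracker  # start every row at the root of the tree
        for name in list(row):
            # Create the node on first sight, then count this appearance.
            node = level.setdefault(name, {"tally": 0, "children": {}})
            node["tally"] += 1
            # Assumption: the next header in the row is a child of this one.
            level = node["children"]
    return tracker


# Hypothetical sample: two "detail" rows and one "summary" row.
rows = [
    ["id", "name", "total"],
    ["id", "name", "date"],
    ["region", "total"],
]
tree = build_header_tree(rows)
print(tree["id"]["tally"])  # 2
```

Rows that start the same way share a branch, so the tally at each node counts how many headers pass through it.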
Once we have a populated tree we need to fill it in.
Use recursion to go through each branch, and each branch's branch. Recursion function should include the name of the master branch/family tree at the root level.
Build a new list called rootheaders.
For each branch or root in the
For each branch in a list sorted by tally in descending order -- most popular first -- we're going to go through and process each child, adding them to the list. Except the recursion function will first call the first child's first child, and the first child's first child's first child, before calling the second child, where there is one. The recursion function will check whether each child name has already been added to rootheaders. If not, it gets added.
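Here's a rough sketch of that depth-first walk in Python. The function name flatten_header_tree and the hand-written sample tree are hypothetical. It assumes the tree is shaped as described above (each node holding "tally" and "children"). It does not pass the root branch's name down through the recursion, because a shared "seen" set turned out to be enough for de-duplication here.

```python
def flatten_header_tree(tree):
    """Flatten a tally tree into one ordered list of unique header names."""
    rootheaders = []
    seen = set()

    def walk(level):
        # Most popular branch first. sorted() is stable, even with
        # reverse=True, so tied branches keep their first-seen order.
        ordered = sorted(
            level.items(), key=lambda item: item[1]["tally"], reverse=True
        )
        for name, node in ordered:
            if name not in seen:
                seen.add(name)
                rootheaders.append(name)
            # Go all the way down this child before touching its sibling.
            walk(node["children"])

    walk(tree)
    return rootheaders


# Hypothetical tree: "id" heads two headers, "region" heads one.
sample_tree = {
    "region": {"tally": 1, "children": {
        "total": {"tally": 1, "children": {}},
    }},
    "id": {"tally": 2, "children": {
        "name": {"tally": 2, "children": {
            "total": {"tally": 1, "children": {}},
            "date": {"tally": 1, "children": {}},
        }},
    }},
}
print(flatten_header_tree(sample_tree))
# ['id', 'name', 'total', 'date', 'region']
```

"total" shows up in both tables but lands where the more popular table puts it, which is the behavior described earlier for fields shared with secondary tables.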
Saving because airplane Wifi is sketch.