Description
Issue
When using Metadata Wizard to create metadata from a large CSV file (> 1 GB, or in the case of fort-pymdwizard > 1000000 rows) or a large GeoTIFF, the program crashes or stalls because the file is too large. Some of our data sets have millions of rows, and we see the same problem with large rasters (GeoTIFF). A reader that handles big data more efficiently could be a solution.
Usually we just want the min/max values of each column in a CSV, so we probably don't need to read the entire file in at once.
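To illustrate the idea, here is a minimal sketch that streams a CSV once and keeps only a running min/max per numeric column, so memory use does not grow with the row count. It uses only the standard library. The function name `column_min_max` is hypothetical and not part of pymdwizard, and treating NaN as a no-data value is an assumption based on our own data. With pandas, `read_csv(..., chunksize=...)` could play the same role.

```python
import csv
import math


def column_min_max(path):
    """Stream a CSV once, keeping only a running (min, max) per numeric column."""
    stats = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            for name, text in row.items():
                try:
                    value = float(text)
                except (TypeError, ValueError):
                    continue  # skip blanks, missing cells, and non-numeric text
                if math.isnan(value):
                    continue  # assumption: NaN marks "no data" and is ignored
                lo, hi = stats.get(name, (value, value))
                stats[name] = (min(lo, value), max(hi, value))
    return stats
```

Columns that never contain a numeric value are simply absent from the returned dictionary, which could be one way to tell text attributes apart from range-domain ones.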
Another issue is that some of our data has many columns (> 25), and writing each description in the GUI is tedious. If a user could supply a CSV containing the descriptors for each attribute block (like the one below), it could be read in to populate those fields, which would be great.
<attr>
<attrlabl>MagneticTotalField(nT)</attrlabl>
<attrdef>Magnetic total field value in nanoteslas</attrdef>
<attrdefs>Producer defined</attrdefs>
<attrdomv>
<edom>
<edomv>NaN</edomv>
<edomvd>No data value is NaN (Not a Number)</edomvd>
<edomvds>Producer defined</edomvds>
</edom>
</attrdomv>
<attrdomv>
<rdom>
<rdommin>20000.0</rdommin>
<rdommax>71679.0</rdommax>
<attrunit>nanotesla</attrunit>
</rdom>
</attrdomv>
</attr>
Attribute description CSV headers
attrlabl, attrdef, attrdefs, edom, edomv, edomvd, edomvds, ...
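As a rough sketch of how such a CSV could be turned into attribute blocks, the snippet below builds one `<attr>` element per row with the standard library. The function names are hypothetical, and the column set is an assumption: it takes one row per attribute with at most one enumerated domain, and the `rdommin`, `rdommax`, and `attrunit` columns are our guess at what would follow in the header list. This is not how Metadata Wizard currently populates its fields.

```python
import csv
import xml.etree.ElementTree as ET


def attr_from_row(row):
    """Build one <attr> element from a CSV row (a dict keyed by column name)."""
    attr = ET.Element("attr")
    for tag in ("attrlabl", "attrdef", "attrdefs"):
        ET.SubElement(attr, tag).text = row.get(tag) or ""
    if row.get("edomv"):
        # Enumerated domain, e.g. a no-data value and its description
        edom = ET.SubElement(ET.SubElement(attr, "attrdomv"), "edom")
        for tag in ("edomv", "edomvd", "edomvds"):
            ET.SubElement(edom, tag).text = row.get(tag) or ""
    if row.get("rdommin") and row.get("rdommax"):
        # Range domain; these column names are assumed, not from the issue
        rdom = ET.SubElement(ET.SubElement(attr, "attrdomv"), "rdom")
        for tag in ("rdommin", "rdommax", "attrunit"):
            if row.get(tag):
                ET.SubElement(rdom, tag).text = row[tag]
    return attr


def attrs_from_csv(path):
    """Read the attribute description CSV and return a list of <attr> elements."""
    with open(path, newline="") as handle:
        # skipinitialspace tolerates headers written as "attrlabl, attrdef, ..."
        reader = csv.DictReader(handle, skipinitialspace=True)
        return [attr_from_row(row) for row in reader]
```

Each returned element can be serialized with `ET.tostring(attr, encoding="unicode")` to check it against the block above.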
Suggestion
We are suggesting a feature that would include:
- Logic to identify large files (> 100000 rows)
- If the file is large, read it in chunks and store only the min/max values for each column
- Adding an option to input a CSV of attribute descriptions for programmatic population of the fields.
- Chunk read rasters, maybe use rasterio
We are happy to help by creating a PR, or if this is something the team can take on, we can be early testers.