Crash when reading large CSV or Rasters #188

@kujaku11

Issue

When using Metadata Wizard to create metadata from a large CSV file (> 1 GB, or in the case of fort-pymdwizard > 1000000 rows) or a large GeoTIFF, the program crashes or stalls because the file is too large. Some of our data sets have millions of rows, and the same happens with large rasters (GeoTIFF). A reader that handles big data more efficiently could be a solution.

Usually we just want the min/max values of each column in a CSV, so we probably don't need to read the entire file in at once.
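
To illustrate the idea, here is a minimal sketch that streams a CSV one row at a time and keeps only a running min/max per numeric column, so memory use does not grow with file size. It uses only the standard library; the function name `column_min_max` is hypothetical and is not part of pymdwizard. Non-numeric cells and NaN values are skipped, which is an assumption about how no-data values should be treated.

```python
import csv
import math


def column_min_max(path, encoding="utf-8"):
    """Stream a CSV and return {column: (min, max)} for numeric values.

    Hypothetical helper: only one row is held in memory at a time.
    """
    stats = {}
    with open(path, newline="", encoding=encoding) as fh:
        for row in csv.DictReader(fh):
            for name, text in row.items():
                try:
                    value = float(text)
                except (TypeError, ValueError):
                    continue  # non-numeric, empty, or missing cell
                if math.isnan(value):
                    continue  # assumption: NaN marks no data
                lo, hi = stats.get(name, (value, value))
                stats[name] = (min(lo, value), max(hi, value))
    return stats
```

If the reader is built on pandas, the same running reduction could probably be applied to the chunks yielded by `pandas.read_csv(path, chunksize=...)` instead of to individual rows.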

One other issue is that some of our data has many columns (> 25), and writing each description in the GUI is tedious. It would be great if a user could input a CSV containing the descriptors for each attribute block (like the one below) that could be read in to populate those fields.

      <attr>
        <attrlabl>MagneticTotalField(nT)</attrlabl>
        <attrdef>Magnetic total field value in nanoteslas</attrdef>
        <attrdefs>Producer defined</attrdefs>
        <attrdomv>
          <edom>
            <edomv>NaN</edomv>
            <edomvd>No data value is NaN (Not a Number)</edomvd>
            <edomvds>Producer defined</edomvds>
          </edom>
        </attrdomv>
        <attrdomv>
          <rdom>
            <rdommin>20000.0</rdommin>
            <rdommax>71679.0</rdommax>
            <attrunit>nanotesla</attrunit>
          </rdom>
        </attrdomv>
      </attr>

Attribute description CSV headers

attrlabl, attrdef, attrdefs, edom, edomv, edomvd, edomvds, ...
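
As a rough sketch of how such a CSV could be turned into attribute blocks, the code below builds one `<attr>` element per row with the standard library. The function names are hypothetical, and the exact column set is an assumption: it uses one column per leaf tag from the XML example above (`edomv`, `edomvd`, `edomvds`, `rdommin`, `rdommax`, `attrunit`) and allows at most one enumerated domain and one range domain per attribute. How the result would be fed into the GUI is not covered here.

```python
import csv
import xml.etree.ElementTree as ET


def _cell(row, key):
    """Return a stripped cell value, or '' when the column is absent."""
    return (row.get(key) or "").strip()


def attr_from_row(row):
    """Build one <attr> element from a dict of CSV cells (hypothetical layout)."""
    attr = ET.Element("attr")
    for tag in ("attrlabl", "attrdef", "attrdefs"):
        ET.SubElement(attr, tag).text = _cell(row, tag)
    if _cell(row, "edomv"):  # enumerated domain, e.g. a no-data value
        edom = ET.SubElement(ET.SubElement(attr, "attrdomv"), "edom")
        for tag in ("edomv", "edomvd", "edomvds"):
            ET.SubElement(edom, tag).text = _cell(row, tag)
    if _cell(row, "rdommin") and _cell(row, "rdommax"):  # range domain
        rdom = ET.SubElement(ET.SubElement(attr, "attrdomv"), "rdom")
        for tag in ("rdommin", "rdommax", "attrunit"):
            if _cell(row, tag):
                ET.SubElement(rdom, tag).text = _cell(row, tag)
    return attr


def attrs_from_csv(fh):
    """Read an open CSV file object and return a list of <attr> elements."""
    reader = csv.DictReader(fh, skipinitialspace=True)
    return [attr_from_row(row) for row in reader]
```

Each returned element can be serialized with `ET.tostring(attr, encoding="unicode")` for inspection.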

Suggestion

We are suggesting a feature that would include:

  • Logic to identify large files (> 100000 rows)
    • If the file is large, read it in chunks and only store the min/max values for each column
  • Adding an option to input a CSV of attribute descriptions for programmatic population of the fields.
  • Reading rasters in chunks, possibly using rasterio
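
For the raster case, a possible sketch is shown below. It assumes rasterio and numpy are installed and walks the file's internal blocks with `block_windows`, reading one window at a time and keeping only per-block min/max values. The function names are hypothetical, and treating NaN and declared nodata as missing is an assumption on our part.

```python
import math


def merge_min_max(pairs):
    """Combine per-block (min, max) pairs into one global (min, max)."""
    lo, hi = math.inf, -math.inf
    for block_min, block_max in pairs:
        if math.isnan(block_min) or math.isnan(block_max):
            continue  # ignore blocks without usable statistics
        lo, hi = min(lo, block_min), max(hi, block_max)
    return (lo, hi) if lo <= hi else (None, None)


def raster_min_max(path, band=1):
    """Min/max of one band, read block by block (hypothetical helper)."""
    import numpy.ma as ma  # third-party, imported lazily
    import rasterio  # third-party, imported lazily

    def block_stats():
        with rasterio.open(path) as src:
            for _, window in src.block_windows(band):
                # masked=True hides declared nodata; masked_invalid hides NaN/inf
                block = ma.masked_invalid(src.read(band, window=window, masked=True))
                if block.count() == 0:
                    continue  # block holds no valid pixels
                yield float(block.min()), float(block.max())

    return merge_min_max(block_stats())
```

Only one block is in memory at a time, so peak memory depends on the file's block size and not on the raster's overall dimensions.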

We are happy to help by creating a PR, or if this is something the team can take on, we can be early testers.
