
Problem parsing a simple CSV file #133

Open
@olekscode

(from the email by Peter Odehnal)

I will be manipulating several large (8,000+ rows) CSV files to analyse and clean the data. First, I'm getting familiar with DataFrame using small CSV files -- for performance, development and testing purposes.

I am getting some unexpected behaviour. After spending many hours on this, I'm hoping you could give me some suggestions on how to resolve the problem.

| df workDir VvDir inFileRef outFileRef  cellVal cellValNew |  
workDir := FileSystem disk workingDirectory.
((VvDir := workDir / 'Vv') isDirectory) 
    ifFalse: [ self halt. ].
inFileRef := (workDir / 'Vv' / 'test.csv') asFileReference.

df := DataFrame readFromCsv: inFileRef withSeparator: (Character tab).
Transcript open; clear.
"Evaluate aBlock on the column with columnName and replace column with the result."
df column: #pw transform: [ :x | 
    x keys do:[ :eachkey | "| cellVal cellValNew |"
        cellVal := x at: eachkey.
        Transcript cr;
          show: ' key: ', (eachkey asString);
          show: ' cellVal: ', (cellVal class asString), 
            ' ', (cellVal asString), ' | '.
          cellValNew := cellVal asString.
        Transcript show: ' to: ', (cellValNew class asString);
          show: ' ', (cellValNew).
        " ((eachkey) > 8) ifTrue:[ self halt ]." "--- HALT_2 ---"
        x at: eachkey put: (cellValNew ).
        ]
    ].
" self halt."   "--- HALT_3 ---"
outFileRef := (workDir / 'Vv' / 'testOut.csv') asFileReference.
df writeToCsv: outFileRef.

My small tab-delimited test.csv file contains a header row plus 9 data rows:

id	pw	Name	phone	balance
1a	111	1a Company	111-1111	0.00
2b	222	2b Company	222-2222	50.22
3c	333	3c Company	333-3333	33.33
4d	444	4d Company	444-4444	0.00
5e	555	5e Company	555-5555	500.00
6f	666	6f Company	666-6666	600
7g	777	7g Company	777-7777	7.00
8h	888	8h Company	888-8888	8.88
9i	999	9i Company	999-9999	9.99

Transcript output is as follows:

 key: 1 cellVal: SmallInteger 111 |  to: ByteString 111
 key: 2 cellVal: SmallInteger 222 |  to: ByteString 222
 key: 3 cellVal: SmallInteger 333 |  to: ByteString 333
 key: 4 cellVal: SmallInteger 444 |  to: ByteString 444
 key: 5 cellVal: SmallInteger 555 |  to: ByteString 555
 key: 6 cellVal: SmallInteger 666 |  to: ByteString 666
 key: 7 cellVal: SmallInteger 777 |  to: ByteString 777
 key: 8 cellVal: SmallInteger 888 |  to: ByteString 888
 key: 9 cellVal: SmallInteger 999 |  to: ByteString 999

At the line marked "--- HALT_2 ---", all of the #pw values (the column data) have been converted to Strings.
But by the line marked "--- HALT_3 ---", all of the values in the #pw column have reverted to Integers.
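
One possible explanation, offered as an assumption rather than a confirmed diagnosis: according to the method comment quoted in the code above, column:transform: replaces the column with the result of the block. In Smalltalk, do: answers its receiver, so the block above answers x keys rather than the converted values. A minimal sketch that answers the converted values instead, assuming the column passed to the block understands collect: like other Pharo collections:

```smalltalk
"Sketch only: df is the DataFrame read from test.csv above.
 Assumes the column passed to the block understands collect:."
df column: #pw transform: [ :column |
    "A block answers the value of its last expression, so the
     column would be replaced by the collected Strings."
    column collect: [ :each | each asString ] ].
```

With this shape, the Transcript logging could stay inside the inner block, provided the converted value remains its last expression.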

I'm hoping you are able to provide some insights, suggestions or a solution, as I've spent many hours on this problem.

When the data is written out to testOut.csv, other fields are converted to what look like time values. Once the problem described above is resolved, I assume I can apply a similar #column:transform: to the other columns to convert all my data to String objects.

I will be manipulating all data as String objects, sorting, identifying duplicate key values, finding and fixing invalid field data... I hope DataFrame can be a foundation for this project.
