Problem parsing a simple CSV file #133

*(from the email by Peter Odehnal)*

I will be manipulating several large (8,000+ rows) CSV files to analyse and clean the data. To start, I'm getting familiar with DataFrame on small CSV files, which keeps development and testing fast.

I am getting some unexpected behaviour. After spending many hours on this, I'm hoping you could give me some suggestions on how to resolve the problem.

```Smalltalk
| df workDir VvDir inFileRef outFileRef  cellVal cellValNew |  
workDir := FileSystem disk workingDirectory.
((VvDir := workDir / 'Vv') isDirectory) 
    ifFalse: [ self halt. ].
inFileRef := (workDir / 'Vv' / 'test.csv') asFileReference.

df := DataFrame readFromCsv: inFileRef withSeparator: (Character tab).
Transcript open; clear.
"Evaluate aBlock on the column with columnName and replace column with the result."
df column: #pw transform: [ :x | 
    x keys do:[ :eachkey | "| cellVal cellValNew |"
        cellVal := x at: eachkey.
        Transcript cr;
          show: ' key: ', (eachkey asString);
          show: ' cellVal: ', (cellVal class asString), 
            ' ', (cellVal asString), ' | '.
          cellValNew := cellVal asString.
        Transcript show: ' to: ', (cellValNew class asString);
          show: ' ', (cellValNew).
        " ((eachkey) > 8) ifTrue:[ self halt ]." "--- HALT_2 ---"
        x at: eachkey put: (cellValNew ).
        ]
    ].
" self halt."   "--- HALT_3 ---"
outFileRef := (workDir / 'Vv' / 'testOut.csv') asFileReference.
df writeToCsv: outFileRef.
```

My small tab-delimited `test.csv` text file contains a header row plus 9 data rows:

```
id	pw	Name	phone	balance
1a	111	1a Company	111-1111	0.00
2b	222	2b Company	222-2222	50.22
3c	333	3c Company	333-3333	33.33
4d	444	4d Company	444-4444	0.00
5e	555	5e Company	555-5555	500.00
6f	666	6f Company	666-6666	600
7g	777	7g Company	777-7777	7.00
8h	888	8h Company	888-8888	8.88
9i	999	9i Company	999-9999	9.99
```

Transcript output is as follows:

```
 key: 1 cellVal: SmallInteger 111 |  to: ByteString 111
 key: 2 cellVal: SmallInteger 222 |  to: ByteString 222
 key: 3 cellVal: SmallInteger 333 |  to: ByteString 333
 key: 4 cellVal: SmallInteger 444 |  to: ByteString 444
 key: 5 cellVal: SmallInteger 555 |  to: ByteString 555
 key: 6 cellVal: SmallInteger 666 |  to: ByteString 666
 key: 7 cellVal: SmallInteger 777 |  to: ByteString 777
 key: 8 cellVal: SmallInteger 888 |  to: ByteString 888
 key: 9 cellVal: SmallInteger 999 |  to: ByteString 999
```

At the line marked `"--- HALT_2 ---"`, all of the `#pw` values (column data) have been converted to String.
But by the line marked `"--- HALT_3 ---"`, all of the data in the `#pw` column has reverted to Integer.
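
One detail of the block above may be worth checking, though it is not confirmed to be the cause: the method comment quoted in the code says the column is replaced with the block's result, and in Pharo `do:` answers its receiver, so the block as written answers the `keys` collection rather than the converted values. The following is a minimal sketch of that difference using core collections only; `series` is a hypothetical stand-in for the `#pw` column, not a real `DataSeries`.

```Smalltalk
| series blockResult convertedValues |
"Hypothetical stand-in for the #pw column, so the snippet runs
 in a Playground without DataFrame loaded."
series := OrderedDictionary new.
series at: 1 put: 111; at: 2 put: 222; at: 3 put: 333.

"Same shape as the transform block above: the values are updated in place,
 but do: answers its receiver, so the expression's value is the keys."
blockResult := series keys do: [ :eachKey |
    series at: eachKey put: (series at: eachKey) asString ].

"Alternative shape: collect: answers a new collection of converted values,
 which can then be the last expression of a block."
convertedValues := #(111 222 333) collect: [ :each | each asString ].
```

If `DataSeries` responds to `collect:` in the same way (an assumption, not verified here), a block of the form `[ :x | x collect: [ :each | each asString ] ]` would answer the converted column. Whether that changes what ends up in `testOut.csv` is untested.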

I'm hoping you can provide some insights, suggestions or a solution.

There is a second problem: when the data is written out to `testOut.csv`, other fields are converted to what look like time values. Once the problem described above is resolved, I'm assuming I can apply a similar `#transform: [ aBlock ]` to the other columns to convert all my data to String objects.

I will be manipulating all data as String objects, sorting, identifying duplicate key values, finding and fixing invalid field data...  I hope DataFrame can be a foundation for this project.
