-
Notifications
You must be signed in to change notification settings - Fork 6
Expand file tree
/
Copy pathfix_files_that_load_with_errors.Rmd
More file actions
79 lines (54 loc) · 2.7 KB
/
Copy pathfix_files_that_load_with_errors.Rmd
File metadata and controls
79 lines (54 loc) · 2.7 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
# Fix files that load with errors
Some records in the raw data feed files have errors when loaded into postgres. Some records have too many fields. Problem records should be deleted then reloaded into postgres.
These files have problematic records:
./data/yellow_tripdata_2010-02.csv #1127 errors
./data/yellow_tripdata_2010-03.csv #5684 errors
## Hand inspection in bash
There are too many problem records to be handled efficiently in bash. Nevertheless, here are some commands for doing inspections by hand.
```{bash, eval=FALSE}
# Problem files
filename=./data/yellow_tripdata_2010-02.csv
filename=./data/yellow_tripdata_2010-03.csv
# Identify problem records in bash:
sed -n '2957,2959p;2960q' $filename
# Remove problem records in bash:
sed -i '2958d' $filename
```
## Remove problem records with R
Use `readr` to automatically remove problem records and write clean records to file.
```{r, eval=FALSE}
require(readr)
dir.create('./data/fix')
f02 <- tempfile()
download.file('https://storage.googleapis.com/tlc-trip-data/2010/yellow_tripdata_2010-02.csv', f02)
d02 <- read_csv(f02)
length(unique(problems(d02)$row)) # 1127 errors
write_csv(d02[-1, ], './data/fix/yellow_tripdata_2010-02.csv', na = "")
f03 <- tempfile()
download.file('https://storage.googleapis.com/tlc-trip-data/2010/yellow_tripdata_2010-03.csv', f03)
d03 <- read_csv(f03)
length(unique(problems(d03)$row)) # 5684 errors
write_csv(d03[-1, ], './data/fix/yellow_tripdata_2010-03.csv', na = "")
```
## Reload files into postgres
Attempt to reload clean files into the postgres db.
```{bash, eval=FALSE}
# Set database host name
dbname=mydb
# Set schema
schema="(vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code_id,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,total_amount)"
# Load yellow_tripdata_2010-02.csv
filename=./data/fix/yellow_tripdata_2010-02.csv
echo "`date`: beginning load for ${filename}"
sed $'s/\r$//' $filename | sed '/^$/d' | psql nyc-taxi-data -c "COPY yellow_tripdata_staging ${schema} FROM stdin CSV HEADER;" -h $dbname
echo "`date`: finished raw load for ${filename}"
psql nyc-taxi-data -f populate_yellow_trips.sql -h $dbname
echo "`date`: loaded trips for ${filename}"
# Load yellow_tripdata_2010-03.csv
filename=./data/fix/yellow_tripdata_2010-03.csv
echo "`date`: beginning load for ${filename}"
sed $'s/\r$//' $filename | sed '/^$/d' | psql nyc-taxi-data -c "COPY yellow_tripdata_staging ${schema} FROM stdin CSV HEADER;" -h $dbname
echo "`date`: finished raw load for ${filename}"
psql nyc-taxi-data -f populate_yellow_trips.sql -h $dbname
echo "`date`: loaded trips for ${filename}"
```