Commit 8a9adf4

Fix baseball databank schemas, retrosheet data, add latest tags to build (#53)
* Fix schema issues
* Fix PK/charlen schema issues
* workin
* style
* tab
* add build time window
* bump
* separate build from up
* fix parallelism
* just build for now
1 parent d85f1ed commit 8a9adf4

6 files changed: +178 -110 lines changed
.circleci/config.yml

Lines changed: 1 addition & 6 deletions
@@ -35,12 +35,7 @@ jobs:
       environment:
         BUILD_ENV: test
       command: |
-        docker-compose up -d
-        sleep 60s
-        total_exits=$(docker-compose ps -q | wc -l)
-        clean_exits=$(docker-compose ps -q | xargs docker inspect -f '{{ .State.ExitCode }}' | grep -c 0)
-        echo "Total containers: ${total_exits} Clean exits: ${clean_exits}"
-        ((${clean_exits}==${total_exits})) || (echo "At least one container had a non-zero exit" && exit 1);
+        docker-compose build


 # Orchestrate or schedule a set of jobs, see https://circleci.com/docs/2.0/workflows/
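
Review note: the removed check counted clean exits with `grep -c 0`, which counts every line containing the character `0`, so an exit code such as `10` would also be counted as clean. If a smoke test like this is reinstated later, the comparison probably needs to be exact. The following is a minimal Python sketch of the difference, assuming the exit codes have already been collected as strings; both function names and the sample values are hypothetical and not part of this repository.

```python
def count_clean_exits(exit_codes):
    """Count containers whose exit code is exactly 0."""
    return sum(1 for code in exit_codes if code.strip() == "0")


def grep_c_zero(exit_codes):
    """Mimic `grep -c 0`: count lines that contain the character '0' anywhere."""
    return sum(1 for code in exit_codes if "0" in code)


# Hypothetical sample: one clean exit, one failure with code 10, one with code 1.
sample = ["0", "10", "1"]
print(count_clean_exits(sample))  # exact match on "0"
print(grep_c_zero(sample))        # substring match, as in the removed shell step
```

With this sample the exact count is 1 while the `grep`-style count is 2, which is the case where the removed check would have passed despite a failed container.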

.env

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 CHADWICK_VERSION=v0.7.2
 BASEBALLDATABANK_VERSION=bb19ecb78e6be5764da32497ec165eb0aaab66a9
-RETROSHEET_VERSION=0cd5f717c1ea5979a6eca43dff6ac543e155111c
+RETROSHEET_VERSION=75fe03a53e2add11441d1f012401e1aef299cf03

 EXTRACT_DIR=extract
 REPO=doublewick/boxball
-VERSION=2020.0.0
+VERSION=2020.0.1

docker-compose.yml

Lines changed: 158 additions & 90 deletions
@@ -1,111 +1,179 @@
 version: '3.7'
+x-extract:
+  &extract
+  build:
+    context: extract
+    args:
+      - CHADWICK_VERSION
+      - RETROSHEET_VERSION
+      - BASEBALLDATABANK_VERSION
+      - BUILD_ENV
+  image: ${REPO}:extract-${VERSION}
+x-ddl:
+  &ddl
+  build:
+    context: transform
+    dockerfile: ddl.Dockerfile
+    args:
+      - VERSION
+  image: ${REPO}:ddl-${VERSION}
+x-parquet:
+  &parquet
+  build:
+    context: transform
+    dockerfile: parquet.Dockerfile
+    args:
+      - VERSION
+  image: ${REPO}:parquet-${VERSION}
+  depends_on:
+    - extract
+x-csv:
+  &csv
+  build:
+    context: transform
+    dockerfile: csv.Dockerfile
+    args:
+      - VERSION
+  image: ${REPO}:csv-${VERSION}
+  depends_on:
+    - extract
+
+x-clickhouse:
+  &clickhouse
+  build:
+    context: load/clickhouse
+    args:
+      - VERSION
+  image: ${REPO}:clickhouse-${VERSION}
+  volumes:
+    - ~/boxball/clickhouse:/var/lib/clickhouse
+  depends_on:
+    - parquet
+    - ddl
+
+x-drill:
+  &drill
+  build:
+    context: load/drill
+    args:
+      - VERSION
+  image: ${REPO}:drill-${VERSION}
+  volumes:
+    - ~/boxball/drill:/data
+  depends_on:
+    - parquet
+    - ddl
+
+x-postgres:
+  &postgres
+  build:
+    context: load/postgres
+    args:
+      - VERSION
+  image: ${REPO}:postgres-${VERSION}
+  volumes:
+    - ~/boxball/postgres:/var/lib/postgresql/data
+  depends_on:
+    - csv
+    - ddl
+
+x-postgres-cstore-fdw:
+  &postgres-cstore-fdw
+  build:
+    context: load/postgres_cstore_fdw
+    args:
+      - VERSION
+  image: ${REPO}:postgres-cstore-fdw-${VERSION}
+  volumes:
+    - ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data
+  depends_on:
+    - csv
+    - ddl
+
+x-mysql:
+  &mysql
+  build:
+    context: load/mysql
+    args:
+      - VERSION
+  image: ${REPO}:mysql-${VERSION}
+  volumes:
+    - ~/boxball/mysql:/var/lib/mysql
+  depends_on:
+    - csv
+    - ddl
+
+x-sqlite:
+  &sqlite
+  build:
+    context: load/sqlite
+    args:
+      - VERSION
+  image: ${REPO}:sqlite-${VERSION}
+  volumes:
+    - ~/boxball/sqlite:/db
+  depends_on:
+    - csv
+    - ddl
+
+
 services:
   extract:
-    build:
-      context: extract
-      args:
-        - CHADWICK_VERSION
-        - RETROSHEET_VERSION
-        - BASEBALLDATABANK_VERSION
-        - BUILD_ENV
-    image: ${REPO}:extract-${VERSION}
+    << : *extract
+  extract-latest:
+    << : *extract
+    image: ${REPO}:extract-latest

   ddl:
-    build:
-      context: transform
-      dockerfile: ddl.Dockerfile
-      args:
-        - VERSION
-    image: ${REPO}:ddl-${VERSION}
+    << : *ddl
+  ddl-latest:
+    << : *ddl
+    image: ${REPO}:ddl-latest

   parquet:
-    build:
-      context: transform
-      dockerfile: parquet.Dockerfile
-      args:
-        - VERSION
-    image: ${REPO}:parquet-${VERSION}
-    depends_on:
-      - extract
+    << : *parquet
+  parquet-latest:
+    << : *parquet
+    image: ${REPO}:parquet-latest

   csv:
-    build:
-      context: transform
-      dockerfile: csv.Dockerfile
-      args:
-        - VERSION
-    image: ${REPO}:csv-${VERSION}
-    depends_on:
-      - extract
+    << : *csv
+  csv-latest:
+    << : *csv
+    image: ${REPO}:csv-latest

   clickhouse:
-    build:
-      context: load/clickhouse
-      args:
-        - VERSION
-    image: ${REPO}:clickhouse-${VERSION}
-    volumes:
-      - ~/boxball/clickhouse:/var/lib/clickhouse
-    depends_on:
-      - parquet
-      - ddl
+    << : *clickhouse
+  clickhouse-latest:
+    << : *clickhouse
+    image: ${REPO}:clickhouse-latest

   drill:
-    build:
-      context: load/drill
-      args:
-        - VERSION
-    image: ${REPO}:drill-${VERSION}
-    volumes:
-      - ~/boxball/drill:/data
-    depends_on:
-      - parquet
-      - ddl
+    << : *drill
+  drill-latest:
+    << : *drill
+    image: ${REPO}:drill-latest

   postgres:
-    build:
-      context: load/postgres
-      args:
-        - VERSION
-    image: ${REPO}:postgres-${VERSION}
-    volumes:
-      - ~/boxball/postgres:/var/lib/postgresql/data
-    depends_on:
-      - csv
-      - ddl
+    << : *postgres
+  postgres-latest:
+    << : *postgres
+    image: ${REPO}:postgres-latest

   postgres-cstore-fdw:
-    build:
-      context: load/postgres_cstore_fdw
-      args:
-        - VERSION
-    image: ${REPO}:postgres-cstore-fdw-${VERSION}
-    volumes:
-      - ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data
-    depends_on:
-      - csv
-      - ddl
+    << : *postgres-cstore-fdw
+  postgres-cstore-fdw-latest:
+    << : *postgres-cstore-fdw
+    image: ${REPO}:postgres-cstore-fdw-latest

   mysql:
-    build:
-      context: load/mysql
-      args:
-        - VERSION
-    image: ${REPO}:mysql-${VERSION}
-    volumes:
-      - ~/boxball/mysql:/var/lib/mysql
-    depends_on:
-      - csv
-      - ddl
+    << : *mysql
+  mysql-latest:
+    << : *mysql
+    image: ${REPO}:mysql-latest

   sqlite:
-    build:
-      context: load/sqlite
-      args:
-        - VERSION
-    image: ${REPO}:sqlite-${VERSION}
-    volumes:
-      - ~/boxball/sqlite:/db
-    depends_on:
-      - csv
-      - ddl
+    << : *sqlite
+  sqlite-latest:
+    << : *sqlite
+    image: ${REPO}:sqlite-latest
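
Review note: the `<< : *name` lines rely on YAML merge keys. Each service starts from the anchored `x-` mapping, and any key written next to the merge key (here, `image`) overrides the merged value. The sketch below imitates that override rule with plain Python dicts so the resulting image tags are easy to see; it does not parse YAML, `merge_service` is a hypothetical helper, and the `REPO` and `VERSION` values are copied from the `.env` diff.

```python
def merge_service(base, overrides=None):
    """Emulate a YAML merge key: start from `base`, then let explicit keys win."""
    merged = dict(base)              # shallow copy, like `<< : *anchor`
    merged.update(overrides or {})   # keys written beside the merge key override it
    return merged


repo, version = "doublewick/boxball", "2020.0.1"

# Stand-in for the `x-ddl` extension field.
ddl = {
    "build": {"context": "transform", "dockerfile": "ddl.Dockerfile", "args": ["VERSION"]},
    "image": f"{repo}:ddl-{version}",
}

services = {
    "ddl": merge_service(ddl),
    "ddl-latest": merge_service(ddl, {"image": f"{repo}:ddl-latest"}),
}

for name, svc in services.items():
    print(name, svc["image"])
```

Both services share the same `build` settings, so `docker-compose build` should produce one versioned tag and one `latest` tag per stage from the same build context.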

extract/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ ENV PYTHONPATH="/"
 # `prod` gets the full datasets, while `test` provides fixtures with small sample data for each file
 FROM build-common as get-retrosheet-prod
 ARG RETROSHEET_VERSION
-RUN wget https://github.com/chadwickbureau/retrosheet/archive/${RETROSHEET_VERSION}.zip -O retrosheet.zip
+RUN wget https://github.com/droher/retrosheet/archive/${RETROSHEET_VERSION}.zip -O retrosheet.zip

 FROM build-common as get-retrosheet-test
 COPY fixtures/raw/retrosheet.zip .

transform/src/schemas/baseballdatabank.py

Lines changed: 13 additions & 8 deletions
@@ -8,14 +8,17 @@
 class AllstarFull(Base):
     __tablename__ = 'allstar_full'

-    player_id = Column(String(9), primary_key=True, nullable=False)
-    year_id = Column(SmallInteger, primary_key=True, nullable=False)
-    game_num = Column(SmallInteger, primary_key=True, nullable=False)
+    player_id = Column(String(9), nullable=False)
+    # Should be non-nullable, see https://github.com/chadwickbureau/baseballdatabank/issues/105
+    year_id = Column(SmallInteger)
+    game_num = Column(SmallInteger)
     game_id = Column(String(12))
     team_id = Column(String(3))
     lg_id = Column(String(2))
     gp = Column(SmallInteger)
     starting_pos = Column(SmallInteger)
+    # Note -- Billy Herman's 1934 record prevents us from using the true PK, player-year-gamenum
+    dummy_id = Column(Integer, autoincrement=True, primary_key=True)


 class Appearance(Base):
@@ -47,12 +50,14 @@ class Appearance(Base):
 class AwardsManager(Base):
     __tablename__ = 'awards_managers'

-    player_id = Column(String(10), primary_key=True, nullable=False)
-    award_id = Column(String(75), primary_key=True, nullable=False)
-    year_id = Column(SmallInteger, primary_key=True, nullable=False)
-    lg_id = Column(String(2), primary_key=True, nullable=False)
+    player_id = Column(String(10), nullable=False)
+    award_id = Column(String(75), nullable=False)
+    year_id = Column(SmallInteger, nullable=False)
+    lg_id = Column(String(2))
     tie = Column(String(1))
     notes = Column(String(100))
+    # PK should be player/award/year/lg, see https://github.com/chadwickbureau/baseballdatabank/issues/105
+    dummy_id = Column(Integer, autoincrement=True, primary_key=True)


 class AwardsPlayer(Base):
@@ -298,7 +303,7 @@ class Parks(Base):

     park_id = Column(String(5), primary_key=True, nullable=False)
     park_name = Column(String(40))
-    park_alias = Column(String(45))
+    park_alias = Column(String(55))
     city = Column(String(25))
     state = Column(String(16))
     country = Column(String(2))
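
Review note: the schema change swaps a composite natural key for an auto-incrementing `dummy_id` because some upstream rows do not satisfy the natural key. The sketch below shows one way that can happen, using the standard-library `sqlite3` module rather than the project's SQLAlchemy models. The table names and the sample row are hypothetical, and the assumption that the offending value is missing (rather than duplicated) is mine, not something the diff states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Old shape: the natural key is the primary key, so every part must be present.
conn.execute("""
    CREATE TABLE allstar_natural (
        player_id TEXT NOT NULL,
        year_id INTEGER NOT NULL,
        game_num INTEGER NOT NULL,
        PRIMARY KEY (player_id, year_id, game_num)
    )
""")

# New shape: nullable natural-key columns plus an auto-incrementing surrogate key.
conn.execute("""
    CREATE TABLE allstar_surrogate (
        player_id TEXT NOT NULL,
        year_id INTEGER,
        game_num INTEGER,
        dummy_id INTEGER PRIMARY KEY AUTOINCREMENT
    )
""")

# Hypothetical row with a missing game number (not real data).
row = ("player01", 1934, None)

try:
    conn.execute("INSERT INTO allstar_natural VALUES (?, ?, ?)", row)
    natural_ok = True
except sqlite3.IntegrityError:
    natural_ok = False  # NOT NULL constraint on game_num rejects the row

conn.execute(
    "INSERT INTO allstar_surrogate (player_id, year_id, game_num) VALUES (?, ?, ?)", row
)
surrogate_rows = conn.execute("SELECT COUNT(*) FROM allstar_surrogate").fetchone()[0]

print(natural_ok, surrogate_rows)
```

The trade-off is that the surrogate key no longer enforces uniqueness of player, year, and game number, which is presumably why the diff comments point at the upstream issue as the longer-term fix.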

transform/src/schemas/retrosheet.py

Lines changed: 3 additions & 3 deletions
@@ -854,7 +854,7 @@ class Park(Base):

     park_id = Column(CHAR(5), primary_key=True, doc="Park ID")
     name = Column(String(41), doc="Park name")
-    aka = Column(String(40), doc="Common park alias")
+    aka = Column(String(55), doc="Common park alias")
     city = Column(String(17), doc="City")
     state = Column(String(9), doc="State")
     # TODO: Handle this MySQL edge case so these can be dates again
@@ -898,7 +898,7 @@ class Schedule(Base):
     home_team_league = Column(CHAR(2), doc="Home team league ID")
     home_team_game_number = Column(Integer, primary_key=True, doc="Home team game number")
     day_night = Column(CHAR(1), doc="D - day, N - night")
-    postponement_indicator = Column(String(30), doc="""
+    postponement_indicator = Column(String(120), doc="""
     This field will contain one or more phrases related to the game if it was
     not played as scheduled. If there is more than one phrase, they are separated
     by a semi-colon (";"). There are three possible outcomes for games not played
@@ -907,7 +907,7 @@ class Schedule(Base):
     -- The game was played on the original date but at another site
     -- The game was not played
     """)
-    makeup_dates = Column(String(20), doc="""
+    makeup_dates = Column(String(120), doc="""
     This field will contain a makeup date if the postponed game was played at
     another time or place. If an attempt was known to have been made on a date but
     postponed again, that date will be listed. In that case, there will be a second
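
Review note: the `String(n)` widths in both schema files are bumped by hand whenever upstream data outgrows them. A small pre-load check could flag that earlier. The sketch below is one possible approach using only the standard library; the helper names, the sample rows, and the idea of passing declared limits as a dict are all assumptions for illustration, not code from this repository.

```python
import csv
import io


def max_field_lengths(rows):
    """Return the longest observed string length for each column name."""
    longest = {}
    for row in rows:
        for name, value in row.items():
            longest[name] = max(longest.get(name, 0), len(value or ""))
    return longest


def columns_over_limit(longest, limits):
    """List columns whose longest value exceeds the declared String(n) length."""
    return sorted(name for name, n in limits.items() if longest.get(name, 0) > n)


# Hypothetical sample rows; real values would come from the Retrosheet schedule files.
sample = io.StringIO(
    "postponement_indicator,makeup_dates\n"
    "Rain,19340505\n"
    "Rain; wet grounds; rescheduled again later,19340505; 19340612; 19340801\n"
)
longest = max_field_lengths(csv.DictReader(sample))

# Old limits from the removed lines, then the new limits from the added lines.
print(columns_over_limit(longest, {"postponement_indicator": 30, "makeup_dates": 20}))
print(columns_over_limit(longest, {"postponement_indicator": 120, "makeup_dates": 120}))
```

With these sample rows both columns exceed the old limits and neither exceeds the new ones.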
