
Commit 6405d35

Merge pull request #99 from bxparks/develop
merge 1.6.1 into master
2 parents: d8fb050 + a016ce4

File tree: 13 files changed (+252, -14 lines)


CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -1,6 +1,25 @@
 # Changelog
 
 * Unreleased
+* 1.6.1 (2024-01-12)
+    * **Bug Fix**: Prevent amnesia that causes multiple type-mismatch warnings
+      for the same column.
+        * If a data set contains multiple records with a column whose types do
+          not match each other, the old code would *remove* the corresponding
+          internal `schema_entry` for that column and print a warning message.
+        * This meant that subsequent records would recreate the `schema_entry`,
+          and a subsequent mismatch would print another warning message.
+        * It also meant that if there was another record after the most recent
+          mismatch, the script would output a schema entry for the mismatching
+          column, corresponding to the type of the last record that was not
+          marked as a mismatch.
+        * The fix is to use a tombstone entry for the offending column instead
+          of deleting the `schema_entry` completely. Only a single warning
+          message is printed, and the column is ignored for all subsequent
+          records in the input data set.
+        * See
+          [Issue #98](https://github.com/bxparks/bigquery-schema-generator/issues/98),
+          which identified this problem; it seems to have existed from the
+          very beginning.
 * 1.6.0 (2023-04-01)
     * Allow `null` fields to convert to `REPEATED` because `bq load` seems
       to interpret null fields to be equivalent to an empty array `[]`.
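
To illustrate the tombstone approach described in this changelog entry, here is a minimal, self-contained sketch. The helper name `merge_column()` and the `log_error` callback are illustrative stand-ins, not the project's actual `merge_schema_entry()` API; the point is only why marking the entry as 'ignore' silences repeated warnings, whereas deleting it would let later records recreate the entry.

```python
# Minimal sketch of the tombstone idea; merge_column() and log_error are
# illustrative stand-ins, not the library's actual functions.
def merge_column(schema_map, key, new_entry, log_error):
    old_entry = schema_map.get(key)
    if old_entry is None:
        schema_map[key] = new_entry      # first time this column is seen
        return
    if old_entry['status'] == 'ignore':  # tombstone: already flagged, stay silent
        return
    if old_entry['info']['type'] != new_entry['info']['type']:
        log_error(f"Ignoring field with mismatched type: {key}")
        old_entry['status'] = 'ignore'   # tombstone instead of del schema_map[key]
        return
    # ...otherwise merge modes/types as the real code does...
```

With the old delete-based behavior, later records would re-add the column and warn again on the next mismatch; with the tombstone, only the first mismatch is reported and the column stays excluded.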

README.md

Lines changed: 13 additions & 1 deletion
@@ -15,7 +15,7 @@ $ generate-schema < file.data.json > file.schema.json
 $ generate-schema --input_format csv < file.data.csv > file.schema.json
 ```
 
-**Version**: 1.6.0 (2023-04-01)
+**Version**: 1.6.1 (2024-01-12)
 
 **Changelog**: [CHANGELOG.md](CHANGELOG.md)
 
@@ -267,6 +267,18 @@ csv` flag. The support is not as robust as JSON file. For example, CSV format
 supports only the comma-separator, and does not support the pipe (`|`) or tab
 (`\t`) character.
 
+**Side Note**: The `input_format` parameter now supports (since v1.6.0) the
+`csvdictreader` option, which allows using the
+[csv.DictReader](https://docs.python.org/3/library/csv.html) class that can be
+customized to handle different delimiters such as tabs. But this requires
+creating a custom Python script using `bigquery_schema_generator` as a library.
+See the [SchemaGenerator.deduce_schema() from
+csv.DictReader](#SchemaGeneratorDeduceSchemaFromCsvDictReader) section below. It
+is probably possible to enable this functionality through the command-line
+script, but it was not obvious how to expose the various options of
+`csv.DictReader` through the command-line flags. I didn't spend any time on
+this because it is not a feature that I use personally.
+
 Unlike `bq load`, the `generate_schema.py` script reads every record in the
 input data file to deduce the table's schema. It prints the JSON formatted
 schema file on the STDOUT.
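
As a rough sketch of what the side note above describes (an illustration, not documented usage), one could build a tab-delimited `csv.DictReader` and hand it to the library directly. This assumes the `SchemaGenerator`, `deduce_schema()`, and `flatten_schema()` names referenced elsewhere in this commit, the `csvdictreader` input format mentioned above, and a `(schema_map, errors)` return value; the filename is hypothetical.

```python
# Sketch only: assumes SchemaGenerator(input_format='csvdictreader') accepts a
# pre-built csv.DictReader and that deduce_schema() returns (schema_map, errors).
import csv
import json

from bigquery_schema_generator.generate_schema import SchemaGenerator

generator = SchemaGenerator(input_format='csvdictreader')

with open('file.data.tsv', newline='') as f:      # hypothetical tab-delimited file
    reader = csv.DictReader(f, delimiter='\t')    # tab instead of comma
    schema_map, errors = generator.deduce_schema(reader)

for error in errors:
    print(error)  # each entry describes a line number and a warning message

schema = generator.flatten_schema(schema_map)
print(json.dumps(schema, indent=2))
```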

bigquery_schema_generator/generate_schema.py

Lines changed: 41 additions & 12 deletions
@@ -136,13 +136,15 @@ def deduce_schema(self, input_data, *, schema_map=None):
         key: schema_entry
     }
 
-    The 'key' is the name of the table column.
+    The 'key' is the canonical column name, which is set to be the
+    lower-cased version of the sanitized key because BigQuery is
+    case-insensitive to its column name.
 
     schema_entry := {
-        'status': 'hard | soft',
+        'status': 'hard | soft | ignore',
         'filled': True | False,
         'info': {
-            'name': key,
+            'name': column_name,
             'type': 'STRING | TIMESTAMP | DATE | TIME
                     | FLOAT | INTEGER | BOOLEAN | RECORD'
             'mode': 'NULLABLE | REQUIRED | REPEATED',
@@ -160,6 +162,13 @@ def deduce_schema(self, input_data, *, schema_map=None):
     'hard'. The status can transition from 'soft' to 'hard' but not the
     reverse.
 
+    The status of 'ignore' identifies a column where the type of one record
+    conflicts with the type of another record. The column will be ignored
+    in the final JSON schema. (Early versions of this script *removed* the
+    offending column entry completely upon the first mismatch. But that
+    caused subsequent records to recreate the schema entry, which would be
+    incorrect.)
+
     The 'filled' entry indicates whether all input data records contained
     the given field. If the --infer_mode flag is given, the 'filled' entry
     is used to convert a NULLABLE schema entry to a REQUIRED schema entry or
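
To make the docstring concrete, a `schema_map` built from a few records might look roughly like the following. The values are illustrative; only the structure follows the documentation above.

```python
# Illustrative contents of a schema_map, following the structure documented
# above; the 'ts' entry is a tombstone left behind by a type mismatch.
schema_map = {
    'name': {
        'status': 'hard',      # a real (non-null) value has been seen
        'filled': True,        # present in every record so far
        'info': {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
    },
    'comment': {
        'status': 'soft',      # only null/empty values seen so far
        'filled': False,
        'info': {'name': 'comment', 'type': 'STRING', 'mode': 'NULLABLE'},
    },
    'ts': {
        'status': 'ignore',    # conflicting types across records
        'filled': True,
        'info': {'name': 'ts', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'},
    },
}
```
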
@@ -277,8 +286,7 @@ def merge_schema_entry(
 
     Returns the merged schema_entry. This method assumes that both
     'old_schema_entry' and 'new_schema_entry' can be modified in place and
-    returned as the new schema_entry. Returns None if the field should
-    be removed from the schema due to internal consistency errors.
+    returned as the new schema_entry.
 
     'base_path' is the string representing the current path within the
     nested record that leads to this specific entry. This is used during
@@ -302,13 +310,19 @@ def merge_schema_entry(
     old_status = old_schema_entry['status']
     new_status = new_schema_entry['status']
 
-    # new 'soft' does not clobber old 'hard'
+    # If the field was previously determined to be inconsistent, hence set
+    # to 'ignore', do nothing and return immediately.
+    if old_status == 'ignore':
+        return old_schema_entry
+
+    # new 'soft' retains the old 'hard'
     if old_status == 'hard' and new_status == 'soft':
         mode = self.merge_mode(old_schema_entry,
                                new_schema_entry,
                                base_path)
         if mode is None:
-            return None
+            old_schema_entry['status'] = 'ignore'
+            return old_schema_entry
         old_schema_entry['info']['mode'] = mode
         return old_schema_entry
 
@@ -318,7 +332,8 @@ def merge_schema_entry(
                                new_schema_entry,
                                base_path)
         if mode is None:
-            return None
+            old_schema_entry['status'] = 'ignore'
+            return old_schema_entry
         new_schema_entry['info']['mode'] = mode
         return new_schema_entry
 
@@ -389,7 +404,8 @@ def merge_schema_entry(
                                        new_schema_entry,
                                        base_path)
         if new_mode is None:
-            return None
+            old_schema_entry['status'] = 'ignore'
+            return old_schema_entry
         new_schema_entry['info']['mode'] = new_mode
 
     # For all other types...
@@ -402,7 +418,8 @@ def merge_schema_entry(
             f'old=({old_status},{full_old_name},{old_mode},{old_type});'
             f' new=({new_status},{full_new_name},{new_mode},{new_type})'
         )
-        return None
+        old_schema_entry['status'] = 'ignore'
+        return old_schema_entry
 
     new_info['type'] = candidate_type
     return new_schema_entry
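
For the Issue #98 example, the warning built from the f-strings above comes out as the message expected in tests/testdata.txt later in this commit. A quick way to see the formatting, with the variable values filled in by hand (the message prefix is taken from that expected test error, not from this hunk):

```python
# Reproduces the message format only; values correspond to the Issue #98 case.
old_status, full_old_name, old_mode, old_type = 'hard', 'ts', 'NULLABLE', 'TIMESTAMP'
new_status, full_new_name, new_mode, new_type = 'hard', 'ts', 'NULLABLE', 'FLOAT'

msg = (
    'Ignoring field with mismatched type: '
    f'old=({old_status},{full_old_name},{old_mode},{old_type});'
    f' new=({new_status},{full_new_name},{new_mode},{new_type})'
)
print(msg)
# Ignoring field with mismatched type: old=(hard,ts,NULLABLE,TIMESTAMP); new=(hard,ts,NULLABLE,FLOAT)
```
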
@@ -414,6 +431,11 @@ def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
     flag), because REQUIRED is created only in the flatten_schema()
     method. Therefore, a NULLABLE->REQUIRED transition cannot occur.
 
+    Returns the merged mode.
+
+    Returning None means that the modes of the old_schema and new_schema are
+    not compatible.
+
     We have the following sub cases for the REQUIRED -> NULLABLE
     transition:
 
@@ -425,8 +447,6 @@ def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
        REQUIRED -> NULLABLE transition.
     b) If --infer_mode is not given, then we log an error and ignore
        this field from the schema.
-
-    Returning a 'None' causes the field to be dropped from the schema.
     """
     old_info = old_schema_entry['info']
     new_info = new_schema_entry['info']
@@ -778,6 +798,10 @@ def convert_type(atype, btype):
     * [Q]FLOAT + [Q]INTEGER => FLOAT (except QFLOAT + QINTEGER)
     * (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) +
       (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) => STRING
+
+    The "Q" refers to the quoted (i.e. string) versions of the various types,
+    which are needed to emulate the type inference inside quoted strings
+    performed by BigQuery.
     """
     # type + type => type
     if atype == btype:
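
The two rules quoted in this docstring can be mirrored with a tiny stand-in function. This is illustrative only; the library's real `convert_type()` also handles `BOOLEAN`/`QBOOLEAN`, `RECORD`, integer-to-integer merges, and the `QFLOAT + QINTEGER` exception, all of which are omitted here.

```python
# Illustrative stand-in for the two documented rules above; not the library's
# convert_type(). Pairs outside those two rules are simply not covered.
STRING_LIKE = {'DATE', 'TIME', 'TIMESTAMP', 'QBOOLEAN', 'QINTEGER', 'QFLOAT', 'STRING'}
FLOATS = {'FLOAT', 'QFLOAT'}
INTEGERS = {'INTEGER', 'QINTEGER'}

def widen(atype, btype):
    if atype == btype:
        return atype                 # type + type => type
    pair = {atype, btype}
    if pair == {'QFLOAT', 'QINTEGER'}:
        return None                  # special-cased by the real code; omitted here
    if pair & FLOATS and pair & INTEGERS:
        return 'FLOAT'               # [Q]FLOAT + [Q]INTEGER => FLOAT
    if pair <= STRING_LIKE:
        return 'STRING'              # mixed date/time/quoted/string types => STRING
    return None                      # not covered by the two rules sketched here

assert widen('FLOAT', 'INTEGER') == 'FLOAT'
assert widen('DATE', 'TIMESTAMP') == 'STRING'
assert widen('QINTEGER', 'STRING') == 'STRING'
```
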
@@ -884,6 +908,11 @@ def flatten_schema_map(
         filled = meta['filled']
         info = meta['info']
 
+        # An 'ignore' status means different records had different types for
+        # this field, so should be ignored.
+        if status == 'ignore':
+            continue
+
         # Schema entries with a status of 'soft' are caused by 'null' or
         # empty fields. Don't print those out if the 'keep_nulls' flag is
         # False.
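
Continuing the `schema_map` sketched earlier (one 'hard' column, one 'soft' column, one 'ignore' tombstone), flattening keeps only the surviving columns. Roughly, and assuming `keep_nulls` is enabled so the 'soft' entry is retained; field order here is illustrative:

```python
# Roughly what flattening produces for the schema_map sketched earlier; the
# 'ts' column is absent because its status is 'ignore'.
expected_schema = [
    {'mode': 'NULLABLE', 'name': 'comment', 'type': 'STRING'},
    {'mode': 'NULLABLE', 'name': 'name', 'type': 'STRING'},
]
```
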
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '1.6.0'
+__version__ = '1.6.1'

tests/testdata.txt

Lines changed: 20 additions & 0 deletions
@@ -167,6 +167,26 @@ SCHEMA
 []
 END
 
+# If there are multiple records with a column with a mismatched type, only the
+# first mismatch should trigger a warning message. All subsequent records
+# should have that column ignored, even if the mismatch occurs multiple times.
+# See Issue 98.
+#
+# Before that fix, the following would generate a warning each time there is a
+# transition between matching and mismatching type for the problematic column.
+# So for the following 4 records, the previous version would print 2 warnings,
+# one for line #2, and one for line #4.
+DATA
+{ "ts": "2017-05-22T17:10:00-07:00" }
+{ "ts": 1.0 }
+{ "ts": 2.0 }
+{ "ts": "2017-05-22T17:10:00-07:00" }
+ERRORS
+2: Ignoring field with mismatched type: old=(hard,ts,NULLABLE,TIMESTAMP); new=(hard,ts,NULLABLE,FLOAT)
+SCHEMA
+[]
+END
+
 # DATE cannot change into a non-String type.
 DATA
 { "d": "2017-01-01" }
5 files renamed without changes.
