
Commit 6405d35

Merge pull request #99 from bxparks/develop
merge 1.6.1 into master
2 parents: d8fb050 + a016ce4

File tree: 13 files changed (+252, -14 lines)


CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -1,6 +1,25 @@
 # Changelog
 
 * Unreleased
+* 1.6.1 (2024-01-12)
+    * **Bug Fix**: Prevent amnesia that causes multiple type-mismatch warnings
+      for the same column.
+        * If a data set contains multiple records with a column whose types do
+          not match each other, the old code would *remove* the corresponding
+          internal `schema_entry` for that column and print a warning message.
+        * This meant that subsequent records would recreate the `schema_entry`,
+          and a subsequent mismatch would print another warning message.
+        * It also meant that if there was another record after the most recent
+          mismatch, the script would output a schema entry for the mismatching
+          column, corresponding to the type of the last record that was not
+          marked as a mismatch.
+        * The fix is to use a tombstone entry for the offending column instead
+          of deleting the `schema_entry` completely. Only a single warning
+          message is printed, and the column is ignored for all subsequent
+          records in the input data set.
+        * See
+          [Issue #98](https://github.com/bxparks/bigquery-schema-generator/issues/98),
+          which identified this problem; it seems to have existed from the
+          very beginning.
 * 1.6.0 (2023-04-01)
     * Allow `null` fields to convert to `REPEATED` because `bq load` seems
       to interpret null fields to be equivalent to an empty array `[]`.
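
To illustrate the tombstone approach described in this changelog entry, here is a minimal, self-contained sketch. The helper name `merge_column()` and the `log_error` callback are illustrative stand-ins, not the project's actual `merge_schema_entry()` API; the point is only why marking the entry as 'ignore' silences repeated warnings, whereas deleting it would let later records recreate the entry.

```python
# Minimal sketch of the tombstone idea; merge_column() and log_error are
# illustrative stand-ins, not the library's actual functions.
def merge_column(schema_map, key, new_entry, log_error):
    old_entry = schema_map.get(key)
    if old_entry is None:
        schema_map[key] = new_entry      # first time this column is seen
        return
    if old_entry['status'] == 'ignore':  # tombstone: already flagged, stay silent
        return
    if old_entry['info']['type'] != new_entry['info']['type']:
        log_error(f"Ignoring field with mismatched type: {key}")
        old_entry['status'] = 'ignore'   # tombstone instead of del schema_map[key]
        return
    # ...otherwise merge modes/types as the real code does...
```

With the old delete-based behavior, later records would re-add the column and warn again on the next mismatch; with the tombstone, only the first mismatch is reported and the column stays excluded.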

README.md

Lines changed: 13 additions & 1 deletion
@@ -15,7 +15,7 @@ $ generate-schema < file.data.json > file.schema.json
 $ generate-schema --input_format csv < file.data.csv > file.schema.json
 ```
 
-**Version**: 1.6.0 (2023-04-01)
+**Version**: 1.6.1 (2024-01-12)
 
 **Changelog**: [CHANGELOG.md](CHANGELOG.md)
 
@@ -267,6 +267,18 @@ csv` flag. The support is not as robust as JSON file. For example, CSV format
 supports only the comma-separator, and does not support the pipe (`|`) or tab
 (`\t`) character.
 
+**Side Note**: The `input_format` parameter now supports (since v1.6.0) the
+`csvdictreader` option, which allows using the
+[csv.DictReader](https://docs.python.org/3/library/csv.html) class that can be
+customized to handle different delimiters such as tabs. But this requires
+creating a custom Python script using `bigquery_schema_generator` as a library.
+See the [SchemaGenerator.deduce_schema() from
+csv.DictReader](#SchemaGeneratorDeduceSchemaFromCsvDictReader) section below. It
+is probably possible to enable this functionality through the command-line
+script, but it was not obvious how to expose the various options of
+`csv.DictReader` through the command-line flags. I didn't spend any time on
+this because it is not a feature that I use personally.
+
 Unlike `bq load`, the `generate_schema.py` script reads every record in the
 input data file to deduce the table's schema. It prints the JSON formatted
 schema file on the STDOUT.
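
As a rough sketch of what the side note above describes (an illustration, not documented usage), one could build a tab-delimited `csv.DictReader` and hand it to the library directly. This assumes the `SchemaGenerator`, `deduce_schema()`, and `flatten_schema()` names referenced elsewhere in this commit, the `csvdictreader` input format mentioned above, and a `(schema_map, errors)` return value; the filename is hypothetical.

```python
# Sketch only: assumes SchemaGenerator(input_format='csvdictreader') accepts a
# pre-built csv.DictReader and that deduce_schema() returns (schema_map, errors).
import csv
import json

from bigquery_schema_generator.generate_schema import SchemaGenerator

generator = SchemaGenerator(input_format='csvdictreader')

with open('file.data.tsv', newline='') as f:      # hypothetical tab-delimited file
    reader = csv.DictReader(f, delimiter='\t')    # tab instead of comma
    schema_map, errors = generator.deduce_schema(reader)

for error in errors:
    print(error)  # each entry describes a line number and a warning message

schema = generator.flatten_schema(schema_map)
print(json.dumps(schema, indent=2))
```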

bigquery_schema_generator/generate_schema.py

Lines changed: 41 additions & 12 deletions
@@ -136,13 +136,15 @@ def deduce_schema(self, input_data, *, schema_map=None):
         key: schema_entry
     }
 
-    The 'key' is the name of the table column.
+    The 'key' is the canonical column name, which is set to be the
+    lower-cased version of the sanitized key because BigQuery is
+    case-insensitive to its column name.
 
     schema_entry := {
-        'status': 'hard | soft',
+        'status': 'hard | soft | ignore',
         'filled': True | False,
         'info': {
-            'name': key,
+            'name': column_name,
             'type': 'STRING | TIMESTAMP | DATE | TIME
                     | FLOAT | INTEGER | BOOLEAN | RECORD'
             'mode': 'NULLABLE | REQUIRED | REPEATED',
@@ -160,6 +162,13 @@ def deduce_schema(self, input_data, *, schema_map=None):
     'hard'. The status can transition from 'soft' to 'hard' but not the
     reverse.
 
+    The status of 'ignore' identifies a column where the type of one record
+    conflicts with the type of another record. The column will be ignored
+    in the final JSON schema. (Early versions of this script *removed* the
+    offending column entry completely upon the first mismatch. But that
+    caused subsequent records to recreate the schema entry, which would be
+    incorrect.)
+
     The 'filled' entry indicates whether all input data records contained
     the given field. If the --infer_mode flag is given, the 'filled' entry
     is used to convert a NULLABLE schema entry to a REQUIRED schema entry or
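
To make the docstring concrete, a `schema_map` built from a few records might look roughly like the following. The values are illustrative; only the structure follows the documentation above.

```python
# Illustrative contents of a schema_map, following the structure documented
# above; the 'ts' entry is a tombstone left behind by a type mismatch.
schema_map = {
    'name': {
        'status': 'hard',      # a real (non-null) value has been seen
        'filled': True,        # present in every record so far
        'info': {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
    },
    'comment': {
        'status': 'soft',      # only null/empty values seen so far
        'filled': False,
        'info': {'name': 'comment', 'type': 'STRING', 'mode': 'NULLABLE'},
    },
    'ts': {
        'status': 'ignore',    # conflicting types across records
        'filled': True,
        'info': {'name': 'ts', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'},
    },
}
```
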
@@ -277,8 +286,7 @@ def merge_schema_entry(
 
     Returns the merged schema_entry. This method assumes that both
     'old_schema_entry' and 'new_schema_entry' can be modified in place and
-    returned as the new schema_entry. Returns None if the field should
-    be removed from the schema due to internal consistency errors.
+    returned as the new schema_entry.
 
     'base_path' is the string representing the current path within the
     nested record that leads to this specific entry. This is used during
@@ -302,13 +310,19 @@ def merge_schema_entry(
     old_status = old_schema_entry['status']
     new_status = new_schema_entry['status']
 
-    # new 'soft' does not clobber old 'hard'
+    # If the field was previously determined to be inconsistent, hence set
+    # to 'ignore', do nothing and return immediately.
+    if old_status == 'ignore':
+        return old_schema_entry
+
+    # new 'soft' retains the old 'hard'
     if old_status == 'hard' and new_status == 'soft':
         mode = self.merge_mode(old_schema_entry,
                                new_schema_entry,
                                base_path)
         if mode is None:
-            return None
+            old_schema_entry['status'] = 'ignore'
+            return old_schema_entry
         old_schema_entry['info']['mode'] = mode
         return old_schema_entry
 
@@ -318,7 +332,8 @@ def merge_schema_entry(
                                new_schema_entry,
                                base_path)
         if mode is None:
-            return None
+            old_schema_entry['status'] = 'ignore'
+            return old_schema_entry
         new_schema_entry['info']['mode'] = mode
         return new_schema_entry
 
@@ -389,7 +404,8 @@ def merge_schema_entry(
                                        new_schema_entry,
                                        base_path)
         if new_mode is None:
-            return None
+            old_schema_entry['status'] = 'ignore'
+            return old_schema_entry
         new_schema_entry['info']['mode'] = new_mode
 
     # For all other types...
@@ -402,7 +418,8 @@ def merge_schema_entry(
             f'old=({old_status},{full_old_name},{old_mode},{old_type});'
             f' new=({new_status},{full_new_name},{new_mode},{new_type})'
         )
-        return None
+        old_schema_entry['status'] = 'ignore'
+        return old_schema_entry
 
     new_info['type'] = candidate_type
     return new_schema_entry
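
For the Issue #98 example, the warning built from the f-strings above comes out as the message expected in tests/testdata.txt later in this commit. A quick way to see the formatting, with the variable values filled in by hand (the message prefix is taken from that expected test error, not from this hunk):

```python
# Reproduces the message format only; values correspond to the Issue #98 case.
old_status, full_old_name, old_mode, old_type = 'hard', 'ts', 'NULLABLE', 'TIMESTAMP'
new_status, full_new_name, new_mode, new_type = 'hard', 'ts', 'NULLABLE', 'FLOAT'

msg = (
    'Ignoring field with mismatched type: '
    f'old=({old_status},{full_old_name},{old_mode},{old_type});'
    f' new=({new_status},{full_new_name},{new_mode},{new_type})'
)
print(msg)
# Ignoring field with mismatched type: old=(hard,ts,NULLABLE,TIMESTAMP); new=(hard,ts,NULLABLE,FLOAT)
```
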
@@ -414,6 +431,11 @@ def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
     flag), because REQUIRED is created only in the flatten_schema()
     method. Therefore, a NULLABLE->REQUIRED transition cannot occur.
 
+    Returns the merged mode.
+
+    Returning None means that the modes of the old_schema and new_schema are
+    not compatible.
+
     We have the following sub cases for the REQUIRED -> NULLABLE
     transition:
 
@@ -425,8 +447,6 @@ def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
        REQUIRED -> NULLABLE transition.
     b) If --infer_mode is not given, then we log an error and ignore
        this field from the schema.
-
-    Returning a 'None' causes the field to be dropped from the schema.
     """
     old_info = old_schema_entry['info']
     new_info = new_schema_entry['info']
@@ -778,6 +798,10 @@ def convert_type(atype, btype):
     * [Q]FLOAT + [Q]INTEGER => FLOAT (except QFLOAT + QINTEGER)
     * (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) +
       (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) => STRING
+
+    The "Q" refers to the quoted (i.e. string) versions of the various types,
+    which are needed to emulate the type inference inside quoted strings
+    performed by BigQuery.
     """
     # type + type => type
     if atype == btype:
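
The two rules quoted in this docstring can be mirrored with a tiny stand-in function. This is illustrative only; the library's real `convert_type()` also handles `BOOLEAN`/`QBOOLEAN`, `RECORD`, integer-to-integer merges, and the `QFLOAT + QINTEGER` exception, all of which are omitted here.

```python
# Illustrative stand-in for the two documented rules above; not the library's
# convert_type(). Pairs outside those two rules are simply not covered.
STRING_LIKE = {'DATE', 'TIME', 'TIMESTAMP', 'QBOOLEAN', 'QINTEGER', 'QFLOAT', 'STRING'}
FLOATS = {'FLOAT', 'QFLOAT'}
INTEGERS = {'INTEGER', 'QINTEGER'}

def widen(atype, btype):
    if atype == btype:
        return atype                 # type + type => type
    pair = {atype, btype}
    if pair == {'QFLOAT', 'QINTEGER'}:
        return None                  # special-cased by the real code; omitted here
    if pair & FLOATS and pair & INTEGERS:
        return 'FLOAT'               # [Q]FLOAT + [Q]INTEGER => FLOAT
    if pair <= STRING_LIKE:
        return 'STRING'              # mixed date/time/quoted/string types => STRING
    return None                      # not covered by the two rules sketched here

assert widen('FLOAT', 'INTEGER') == 'FLOAT'
assert widen('DATE', 'TIMESTAMP') == 'STRING'
assert widen('QINTEGER', 'STRING') == 'STRING'
```
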
@@ -884,6 +908,11 @@ def flatten_schema_map(
         filled = meta['filled']
         info = meta['info']
 
+        # An 'ignore' status means different records had different types for
+        # this field, so should be ignored.
+        if status == 'ignore':
+            continue
+
         # Schema entries with a status of 'soft' are caused by 'null' or
         # empty fields. Don't print those out if the 'keep_nulls' flag is
         # False.
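
Continuing the `schema_map` sketched earlier (one 'hard' column, one 'soft' column, one 'ignore' tombstone), flattening keeps only the surviving columns. Roughly, and assuming `keep_nulls` is enabled so the 'soft' entry is retained; field order here is illustrative:

```python
# Roughly what flattening produces for the schema_map sketched earlier; the
# 'ts' column is absent because its status is 'ignore'.
expected_schema = [
    {'mode': 'NULLABLE', 'name': 'comment', 'type': 'STRING'},
    {'mode': 'NULLABLE', 'name': 'name', 'type': 'STRING'},
]
```
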
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '1.6.0'
+__version__ = '1.6.1'

tests/testdata.txt

Lines changed: 20 additions & 0 deletions
@@ -167,6 +167,26 @@ SCHEMA
 []
 END
 
+# If there are multiple records with a column with a mismatched type, only the
+# first mismatch should trigger a warning message. All subsequent records
+# should have that column ignored, even if the mismatch occurs multiple times.
+# See Issue 98.
+#
+# Before that fix, the following would generate a warning each time there is a
+# transition between matching and mismatching type for the problematic column.
+# So for the following 4 records, the previous version would print 2 warnings,
+# one for line #2, and one for line #4.
+DATA
+{ "ts": "2017-05-22T17:10:00-07:00" }
+{ "ts": 1.0 }
+{ "ts": 2.0 }
+{ "ts": "2017-05-22T17:10:00-07:00" }
+ERRORS
+2: Ignoring field with mismatched type: old=(hard,ts,NULLABLE,TIMESTAMP); new=(hard,ts,NULLABLE,FLOAT)
+SCHEMA
+[]
+END
+
 # DATE cannot change into a non-String type.
 DATA
 { "d": "2017-01-01" }
5 files renamed without changes.
