Merge pull request #76 from bxparks/develop

bxparks · web-flow · commit 2830dd0b4368 · 2021-11-14T08:30:42.000-08:00
merge v1.5 into master
diff --git a/.github/workflows/pythonpackage.yml b/.github/workflows/pythonpackage.yml
@@ -16,13 +16,13 @@ jobs:
     strategy: 
       matrix:
         # 3.5 does not support f-strings
-        python-version: [3.6, 3.7, 3.8]
+        python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
 
     steps:
     - uses: actions/checkout@v2
 
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v1
+      uses: actions/setup-python@v2
       with:
         python-version: ${{ matrix.python-version }}
 
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,9 +1,14 @@
 # Changelog
 
 * Unreleased
+* 1.5 (2021-11-14)
+    * Make the column order in the BQ schema file match the order of appearance
+      in the JSON data file using the `--preserve_input_sort_order` flag.
+      Thanks to kdeggelman@ in
+      [PR#75](https://github.com/bxparks/bigquery-schema-generator/pull/75).
 * 1.4.1 (2021-08-23)
     * Add documentation for the `input_format='dict'` option.
-    * Add additional inpout format 'json' and 'dict' test cases.
+    * Add additional input format 'json' and 'dict' test cases.
     * Maintenance release, no functional change in core code.
 * 1.4 (2020-12-09)
     * Add 'dict' as a third `input_format` when `SchemaGenerator` is used as a
@@ -13,7 +18,7 @@
     * Expand the pattern matchers for quoted integers and quoted floating point
       numbers to be more compatible with the patterns recognized by `bq load
       --autodetect`.
-    * Add Table of Contents to READMD.md. Add usage info for the
+    * Add Table of Contents to README.md. Add usage info for the
       `schema_map=existing_schema_map` and the `input_format='dict'` parameters
       in the `SchemaGenerator()` constructor.
 * 1.3 (2020-12-05)
@@ -92,8 +97,8 @@
 * 0.1.3 (2018-01-23)
     * Attempt #2 to fix exception during pip3 install.
 * 0.1.2 (2018-01-23)
-    * Attemp to fix exception during pip3 install. Didn't work. Pulled.
+    * Attempt to fix exception during pip3 install. Didn't work. Pulled.
 * 0.1.1 (2018-01-03)
     * Install `generate-schema` script in `/usr/local/bin`
 * 0.1 (2018-01-02)
-    * Iniitial release to PyPI.
+    * Initial release to PyPI.
diff --git a/README.md b/README.md
@@ -14,7 +14,7 @@ $ generate-schema < file.data.json > file.schema.json
 $ generate-schema --input_format csv < file.data.csv > file.schema.json
 ```
 
-**Version**: 1.4.1 (2021-08-23)
+**Version**: 1.5 (2021-11-14)
 
 **Changelog**: [CHANGELOG.md](CHANGELOG.md)
 
@@ -38,6 +38,8 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
         * [Sanitize Names (`--sanitize_names`)](#SanitizedNames)
         * [Ignore Invalid Lines (`--ignore_invalid_lines`)](#IgnoreInvalidLines)
         * [Existing Schema Path (`--existing_schema_path`)](#ExistingSchemaPath)
+        * [Preserve Input Sort Order
+          (`--preserve_input_sort_order`)](#PreserveInputSortOrder)
     * [Using as a Library](#UsingAsLibrary)
         * [`SchemaGenerator.run()`](#SchemaGeneratorRun)
         * [`SchemaGenerator.deduce_schema()`](#SchemaGeneratorDeduceSchema)
@@ -289,6 +291,7 @@ usage: generate-schema [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
                        [--debugging_map] [--sanitize_names]
                        [--ignore_invalid_lines]
                        [--existing_schema_path EXISTING_SCHEMA_PATH]
+                       [--preserve_input_sort_order]
 
 Generate BigQuery schema from JSON or CSV file.
 
@@ -313,6 +316,11 @@ optional arguments:
                         File that contains the existing BigQuery schema for a
                         table. This can be fetched with: `bq show --schema
                         <project_id>:<dataset>:<table_name>
+  --preserve_input_sort_order
+                        Preserve the original ordering of columns from input
+                        instead of sorting alphabetically. This only impacts
+                        `input_format` of json or dict
+
 ```
 
 <a name="InputFormat"></a>
@@ -547,6 +555,89 @@ See discussion in
 [PR #57](https://github.com/bxparks/bigquery-schema-generator/pull/57) for
 more details.
 
+<a name="PreserveInputSortOrder"></a>
+#### Preserve Input Sort Order (`--preserve_input_sort_order`)
+
+By default, the columns in the BQ schema file are sorted
+lexicographically, which matched the original behavior of `bq load
+--autodetect`. If the `--preserve_input_sort_order` flag is given, the columns
+in the resulting schema file are not sorted, but preserve the order of
+appearance in the input JSON data. For example, the following JSON data with
+the `--preserve_input_sort_order` flag will produce:
+
+```bash
+$ generate-schema --preserve_input_sort_order
+{ "s": "string", "i": 3, "x": 3.2, "b": true }
+^D
+[
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  }
+]
+```
+
+Each JSON record line may contain only a subset of the total possible
+columns in the data set. The order of the columns in the BQ schema will
+then be the order in which each column was first *seen* by the
+script:
+
+```bash
+$ generate-schema --preserve_input_sort_order
+{ "s": "string", "i": 3 }
+{ "x": 3.2, "s": "string", "i": 3 }
+{ "b": true, "x": 3.2, "s": "string", "i": 3 }
+^D
+[
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  }
+]
+```
+
+**Note**: In Python 3.6 (the earliest version of Python supported by this
+project), the keys of a `dict` were kept in insertion order, but this
+ordering was an implementation detail and not guaranteed. In Python 3.7, that
+ordering became a guaranteed part of the language. So the
+`--preserve_input_sort_order` flag **should** work in Python 3.6 but is not
+guaranteed to.
+
+See discussion in
+[PR #75](https://github.com/bxparks/bigquery-schema-generator/pull/75) for
+more details.
+
 <a name="UsingAsLibrary"></a>
 ### Using As a Library
 
@@ -572,6 +663,7 @@ generator = SchemaGenerator(
     debugging_map=debugging_map,
     sanitize_names=sanitize_names,
     ignore_invalid_lines=ignore_invalid_lines,
+    preserve_input_sort_order=preserve_input_sort_order,
 )
 generator.run(input_file=input_file, output_file=output_file)
 ```
@@ -903,14 +995,14 @@ Apache License 2.0
 <a name="Feedback"></a>
 ## Feedback and Support
 
-If you have any questions, comments and other support questions about how to
-use this library, use the
-[GitHub Discussions](https://github.com/bxparks/bigquery-schema-generator/discussions)
-for this project. If you have bug reports or feature requests, file a ticket in
-[GitHub Issues](https://github.com/bxparks/bigquery-schema-generator/issues).
-I'd love to hear about how this software and its documentation can be improved.
-I can't promise that I will incorporate everything, but I will give your ideas
-serious consideration.
+If you have any questions, comments, or feature requests for this library,
+please use the [GitHub
+Discussions](https://github.com/bxparks/bigquery-schema-generator/discussions)
+for this project. If you have bug reports, please file a ticket in [GitHub
+Issues](https://github.com/bxparks/bigquery-schema-generator/issues). Feature
+requests should go into Discussions first because they often have alternative
+solutions that are useful to keep visible, instead of disappearing from the
+default view of the Issue tracker after the ticket is closed.
 
 Please refrain from emailing me directly unless the content is sensitive. The
 problem with email is that I cannot reference the email conversation when other
@@ -936,4 +1028,6 @@ people ask similar questions later.
   by Austin Brogle (abroglesc@) and Bozo Dragojevic (bozzzzo@).
 * Allow `SchemaGenerator.deduce_schema()` to accept a list of native Python
   `dict` objects, by Zigfrid Zvezdin (ZiggerZZ@).
-
+* Make the column order in the BQ schema file match the order of appearance in
+  the JSON data file using the `--preserve_input_sort_order` flag. By Kevin
+  Deggelman (kdeggelman@).
diff --git a/bigquery_schema_generator/generate_schema.py b/bigquery_schema_generator/generate_schema.py
@@ -86,6 +86,7 @@ def __init__(
         debugging_map=False,
         sanitize_names=False,
         ignore_invalid_lines=False,
+        preserve_input_sort_order=False,
     ):
         self.input_format = input_format
         self.infer_mode = infer_mode
@@ -113,7 +114,10 @@ def __init__(
         # If CSV, preserve the original ordering because `bq load` matches the
         # CSV column with the respective schema entry using the position of the
         # column in the schema.
-        self.sorted_schema = (input_format in {'json', 'dict'})
+        self.sorted_schema = (
+            (input_format in {'json', 'dict'})
+            and not preserve_input_sort_order
+        )
 
         self.line_number = 0
         self.error_logs = []
@@ -1042,6 +1046,13 @@ def main():
         ' This can be fetched with:'
         ' `bq show --schema <project_id>:<dataset>:<table_name>',
         default=None)
+    parser.add_argument(
+        '--preserve_input_sort_order',
+        help='Preserve the original ordering of columns from input instead of'
+        ' sorting alphabetically.'
+        ' This only impacts `input_format` of json or dict',
+        action='store_true'
+    )
     args = parser.parse_args()
 
     # Configure logging.
@@ -1056,6 +1067,7 @@ def main():
         debugging_map=args.debugging_map,
         sanitize_names=args.sanitize_names,
         ignore_invalid_lines=args.ignore_invalid_lines,
+        preserve_input_sort_order=args.preserve_input_sort_order
     )
     existing_schema_map = read_existing_schema_from_file(
         args.existing_schema_path)
diff --git a/bigquery_schema_generator/version.py b/bigquery_schema_generator/version.py
@@ -1 +1 @@
-__version__ = '1.4.1'
+__version__ = '1.5'
diff --git a/tests/test_generate_schema.py b/tests/test_generate_schema.py
@@ -608,6 +608,7 @@ def verify_data_chunk_as_csv_json_dict(self, *, chunk, as_dict):
         quoted_values_are_strings = ('quoted_values_are_strings' in data_flags)
         sanitize_names = ('sanitize_names' in data_flags)
         ignore_invalid_lines = ('ignore_invalid_lines' in data_flags)
+        preserve_input_sort_order = ('preserve_input_sort_order' in data_flags)
         records = chunk['records']
         expected_errors = chunk['errors']
         expected_error_map = chunk['error_map']
@@ -638,7 +639,8 @@ def verify_data_chunk_as_csv_json_dict(self, *, chunk, as_dict):
             keep_nulls=keep_nulls,
             quoted_values_are_strings=quoted_values_are_strings,
             sanitize_names=sanitize_names,
-            ignore_invalid_lines=ignore_invalid_lines)
+            ignore_invalid_lines=ignore_invalid_lines,
+            preserve_input_sort_order=preserve_input_sort_order)
         existing_schema_map = None
         if existing_schema:
             existing_schema_map = bq_schema_to_map(json.loads(existing_schema))
diff --git a/tests/testdata.txt b/tests/testdata.txt
@@ -2158,3 +2158,97 @@ SCHEMA
   }
 ]
 END
+
+# Test --preserve_input_sort_order flag. Without the flag, the
+# keys are in sorted order, for compatibility with `bq load --autodetect`,
+# at least what 'bq load' used to do.
+# See https://github.com/bxparks/bigquery-schema-generator/pull/75
+DATA
+{ "s": "string", "i": 3, "x": 3.2, "b": true }
+SCHEMA
+[
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  }
+]
+END
+
+# Test --preserve_input_sort_order flag. With the flag, the column keys should
+# be in the order they appear in the JSON data.
+# See https://github.com/bxparks/bigquery-schema-generator/pull/75
+DATA preserve_input_sort_order
+{ "s": "string", "i": 3, "x": 3.2, "b": true }
+SCHEMA
+[
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  }
+]
+END
+
+# Test --preserve_input_sort_order flag. Each JSON data record can contain a
+# partial list of keys. So the order of columns in the schema will be the order
+# in which they are first *seen* by the bigquery_schema_generator.
+# See https://github.com/bxparks/bigquery-schema-generator/pull/75
+DATA preserve_input_sort_order
+{ "s": "string", "i": 3 }
+{ "x": 3.2, "s": "string", "i": 3 }
+{ "b": true, "x": 3.2, "s": "string", "i": 3 }
+SCHEMA
+[
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  }
+]
+END
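
The last test case above relies on columns being emitted in the order they are first *seen* across records. A minimal sketch of that idea follows. It is not the code in `generate_schema.py`: the helper name `first_seen_columns` is hypothetical, it tracks column names only (no type deduction), and it assumes Python 3.7+ so that `dict` insertion order is guaranteed.

```python
import json


def first_seen_columns(lines, preserve_input_sort_order=False):
    """Return column names from newline-delimited JSON records.

    Hypothetical helper for illustration only; it is not part of
    bigquery_schema_generator and ignores types and modes.
    """
    seen = {}  # dict keys keep insertion order (guaranteed in Python 3.7+)
    for line in lines:
        record = json.loads(line)
        for key in record:
            # setdefault leaves the position of an already-seen key unchanged.
            seen.setdefault(key, None)
    names = list(seen)
    # Without the flag, mimic the default lexicographic ordering.
    return names if preserve_input_sort_order else sorted(names)


records = [
    '{ "s": "string", "i": 3 }',
    '{ "x": 3.2, "s": "string", "i": 3 }',
    '{ "b": true, "x": 3.2, "s": "string", "i": 3 }',
]
print(first_seen_columns(records, preserve_input_sort_order=True))
print(first_seen_columns(records))
```

With the flag set, the three records from the test data should yield `s`, `i`, `x`, `b`; without it, the names come back sorted.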

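The change to `self.sorted_schema` combines the input format with the new flag. The sketch below restates that boolean as a standalone function wired to an `argparse` `store_true` option, so the interaction can be checked in isolation. The names `build_parser` and `should_sort_schema` are hypothetical and exist only for this illustration; the real logic lives inside `SchemaGenerator.__init__()` and `main()`.

```python
import argparse


def build_parser():
    """Build a reduced parser with only the two options relevant here."""
    parser = argparse.ArgumentParser(description='Sketch of the flag handling.')
    parser.add_argument('--input_format', default='json')
    # store_true: the attribute is False unless the flag is present.
    parser.add_argument('--preserve_input_sort_order', action='store_true')
    return parser


def should_sort_schema(input_format, preserve_input_sort_order):
    """Mirror the expression assigned to self.sorted_schema in the diff."""
    return input_format in {'json', 'dict'} and not preserve_input_sort_order


args = build_parser().parse_args(['--preserve_input_sort_order'])
print(should_sort_schema(args.input_format, args.preserve_input_sort_order))
```

Because CSV input is never sorted in the first place, the flag changes the result only for the `json` and `dict` formats, which matches the wording of the help text.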