Commit 2830dd0

Merge pull request #76 from bxparks/develop
merge v1.5 into master
2 parents da3609f + b3cee1a commit 2830dd0

7 files changed: +226 -19 lines changed

.github/workflows/pythonpackage.yml

Lines changed: 2 additions & 2 deletions
@@ -16,13 +16,13 @@ jobs:
     strategy:
       matrix:
         # 3.5 does not support f-strings
-        python-version: [3.6, 3.7, 3.8]
+        python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
 
     steps:
     - uses: actions/checkout@v2
 
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v1
+      uses: actions/setup-python@v2
       with:
         python-version: ${{ matrix.python-version }}

CHANGELOG.md

Lines changed: 9 additions & 4 deletions
@@ -1,9 +1,14 @@
 # Changelog
 
 * Unreleased
+* 1.5 (2021-11-14)
+    * Make the column order in the BQ schema file match the order of appearance
+      in the JSON data file using the `--preserve_input_sort_order` flag.
+      Thanks to kdeggelman@ in
+      [PR#75](https://github.com/bxparks/bigquery-schema-generator/pull/75).
 * 1.4.1 (2021-08-23)
     * Add documentation for the `input_format='dict'` option.
-    * Add additional inpout format 'json' and 'dict' test cases.
+    * Add additional input format 'json' and 'dict' test cases.
     * Maintenance release, no functional change in core code.
 * 1.4 (2020-12-09)
     * Add 'dict' as a third `input_format` when `SchemaGenerator` is used as a
@@ -13,7 +18,7 @@
     * Expand the pattern matchers for quoted integers and quoted floating point
       numbers to be more compatible with the patterns recognized by `bq load
       --autodetect`.
-    * Add Table of Contents to READMD.md. Add usage info for the
+    * Add Table of Contents to README.md. Add usage info for the
       `schema_map=existing_schema_map` and the `input_format='dict'` parameters
       in the `SchemaGenerator()` constructor.
 * 1.3 (2020-12-05)
@@ -92,8 +97,8 @@
 * 0.1.3 (2018-01-23)
     * Attempt #2 to fix exception during pip3 install.
 * 0.1.2 (2018-01-23)
-    * Attemp to fix exception during pip3 install. Didn't work. Pulled.
+    * Attempt to fix exception during pip3 install. Didn't work. Pulled.
 * 0.1.1 (2018-01-03)
     * Install `generate-schema` script in `/usr/local/bin`
 * 0.1 (2018-01-02)
-    * Iniitial release to PyPI.
+    * Initial release to PyPI.

README.md

Lines changed: 104 additions & 10 deletions
@@ -14,7 +14,7 @@ $ generate-schema < file.data.json > file.schema.json
 $ generate-schema --input_format csv < file.data.csv > file.schema.json
 ```
 
-**Version**: 1.4.1 (2021-08-23)
+**Version**: 1.5 (2021-11-14)
 
 **Changelog**: [CHANGELOG.md](CHANGELOG.md)
 
@@ -38,6 +38,8 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
         * [Sanitize Names (`--sanitize_names`)](#SanitizedNames)
         * [Ignore Invalid Lines (`--ignore_invalid_lines`)](#IgnoreInvalidLines)
         * [Existing Schema Path (`--existing_schema_path`)](#ExistingSchemaPath)
+        * [Preserve Input Sort Order
+          (`--preserve_input_sort_order`)](#PreserveInputSortOrder)
     * [Using as a Library](#UsingAsLibrary)
         * [`SchemaGenerator.run()`](#SchemaGeneratorRun)
         * [`SchemaGenerator.deduce_schema()`](#SchemaGeneratorDeduceSchema)
@@ -289,6 +291,7 @@ usage: generate-schema [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
                        [--debugging_map] [--sanitize_names]
                        [--ignore_invalid_lines]
                        [--existing_schema_path EXISTING_SCHEMA_PATH]
+                       [--preserve_input_sort_order]
 
 Generate BigQuery schema from JSON or CSV file.
 
@@ -313,6 +316,11 @@ optional arguments:
                         File that contains the existing BigQuery schema for a
                         table. This can be fetched with: `bq show --schema
                         <project_id>:<dataset>:<table_name>
+  --preserve_input_sort_order
+                        Preserve the original ordering of columns from input
+                        instead of sorting alphabetically. This only impacts
+                        `input_format` of json or dict
+
 ```
 
 <a name="InputFormat"></a>
@@ -547,6 +555,89 @@ See discussion in
 [PR #57](https://github.com/bxparks/bigquery-schema-generator/pull/57) for
 more details.
 
+<a name="PreserveInputSortOrder"></a>
+#### Preserve Input Sort Order (`--preserve_input_sort_order`)
+
+By default, the order of columns in the BQ schema file is sorted
+lexicographically, which matched the original behavior of `bq load
+--autodetect`. If the `--preserve_input_sort_order` flag is given, the columns
+in the resulting schema file is not sorted, but preserves the order of
+appearance in the input JSON data. For example, the following JSON data with
+the `--preserve_input_sort_order` flag will produce:
+
+```bash
+$ generate-schema --preserve_input_sort_order
+{ "s": "string", "i": 3, "x": 3.2, "b": true }
+^D
+[
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  }
+]
+```
+
+It is possible that each JSON record line contains only a partial subset of the
+total possible columns in the data set. The order of the columns in the BQ
+schema will then be the order that each column was first *seen* by the
+script:
+
+```bash
+$ generate-schema --preserve_input_sort_order
+{ "s": "string", "i": 3 }
+{ "x": 3.2, "s": "string", "i": 3 }
+{ "b": true, "x": 3.2, "s": "string", "i": 3 }
+^D
+[
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  }
+]
+```
+
+**Note**: In Python 3.6 (the earliest version of Python supported by this
+project), the order of keys in a `dict` was the insertion-order, but this
+ordering was an implementation detail, and not guaranteed. In Python 3.7, that
+ordering was made permanent. So the `--preserve_input_sort_order` flag
+**should** work in Python 3.6 but is not guaranteed.
+
+See discussion in
+[PR #75](https://github.com/bxparks/bigquery-schema-generator/pull/75) for
+more details.
+
 <a name="UsingAsLibrary"></a>
 ### Using As a Library
 
@@ -572,6 +663,7 @@ generator = SchemaGenerator(
     debugging_map=debugging_map,
     sanitize_names=sanitize_names,
     ignore_invalid_lines=ignore_invalid_lines,
+    preserve_input_sort_order=preserve_input_sort_order,
 )
 generator.run(input_file=input_file, output_file=output_file)
 ```
@@ -903,14 +995,14 @@ Apache License 2.0
 <a name="Feedback"></a>
 ## Feedback and Support
 
-If you have any questions, comments and other support questions about how to
-use this library, use the
-[GitHub Discussions](https://github.com/bxparks/bigquery-schema-generator/discussions)
-for this project. If you have bug reports or feature requests, file a ticket in
-[GitHub Issues](https://github.com/bxparks/bigquery-schema-generator/issues).
-I'd love to hear about how this software and its documentation can be improved.
-I can't promise that I will incorporate everything, but I will give your ideas
-serious consideration.
+If you have any questions, comments, or feature requests for this library,
+please use the [GitHub
+Discussions](https://github.com/bxparks/bigquery-schema-generator/discussions)
+for this project. If you have bug reports, please file a ticket in [GitHub
+Issues](https://github.com/bxparks/bigquery-schema-generator/issues). Feature
+requests should go into Discussions first because they often have alternative
+solutions which are useful to remain visible, instead of disappearing from the
+default view of the Issue tracker after the ticket is closed.
 
 Please refrain from emailing me directly unless the content is sensitive. The
 problem with email is that I cannot reference the email conversation when other
@@ -936,4 +1028,6 @@ people ask similar questions later.
   by Austin Brogle (abroglesc@) and Bozo Dragojevic (bozzzzo@).
 * Allow `SchemaGenerator.deduce_schema()` to accept a list of native Python
   `dict` objects, by Zigfrid Zvezdin (ZiggerZZ@).
-
+* Make the column order in the BQ schema file match the order of appearance in
+  the JSON data file using the `--preserve_input_sort_order` flag. By Kevin
+  Deggelman (kdeggelman@).
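
The "first seen" column ordering that the README changes describe can be illustrated with a minimal sketch. The helper `column_order()` below is hypothetical and is not part of bigquery-schema-generator; it only assumes that `json.loads()` returns a `dict` whose keys follow the order of appearance in the input, which holds on Python 3.7 and later.

```python
import json


def column_order(lines, preserve_input_sort_order=False):
    """Return column names in the order a schema would list them.

    Hypothetical helper for illustration only; it is not part of
    bigquery-schema-generator.
    """
    seen = {}  # a dict keeps insertion order (guaranteed since Python 3.7)
    for line in lines:
        record = json.loads(line)
        for key in record:
            # setdefault() leaves the position of an already-seen key unchanged.
            seen.setdefault(key, None)
    names = list(seen)
    return names if preserve_input_sort_order else sorted(names)


records = [
    '{ "s": "string", "i": 3 }',
    '{ "x": 3.2, "s": "string", "i": 3 }',
    '{ "b": true, "x": 3.2, "s": "string", "i": 3 }',
]
first_seen = column_order(records, preserve_input_sort_order=True)
alphabetical = column_order(records)
print(first_seen)    # ['s', 'i', 'x', 'b']
print(alphabetical)  # ['b', 'i', 's', 'x']
```

The first list matches the column order of the second README example; the second list is the default lexicographic order.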

bigquery_schema_generator/generate_schema.py

Lines changed: 13 additions & 1 deletion
@@ -86,6 +86,7 @@ def __init__(
         debugging_map=False,
         sanitize_names=False,
         ignore_invalid_lines=False,
+        preserve_input_sort_order=False,
     ):
         self.input_format = input_format
         self.infer_mode = infer_mode
@@ -113,7 +114,10 @@ def __init__(
         # If CSV, preserve the original ordering because 'bq load` matches the
         # CSV column with the respective schema entry using the position of the
         # column in the schema.
-        self.sorted_schema = (input_format in {'json', 'dict'})
+        self.sorted_schema = (
+            (input_format in {'json', 'dict'})
+            and not preserve_input_sort_order
+        )
 
         self.line_number = 0
         self.error_logs = []
@@ -1042,6 +1046,13 @@ def main():
         ' This can be fetched with:'
         ' `bq show --schema <project_id>:<dataset>:<table_name>',
         default=None)
+    parser.add_argument(
+        '--preserve_input_sort_order',
+        help='Preserve the original ordering of columns from input instead of'
+        ' sorting alphabetically.'
+        ' This only impacts `input_format` of json or dict',
+        action='store_true'
+    )
     args = parser.parse_args()
 
     # Configure logging.
@@ -1056,6 +1067,7 @@ def main():
         debugging_map=args.debugging_map,
         sanitize_names=args.sanitize_names,
         ignore_invalid_lines=args.ignore_invalid_lines,
+        preserve_input_sort_order=args.preserve_input_sort_order
     )
     existing_schema_map = read_existing_schema_from_file(
         args.existing_schema_path)
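
How the new `store_true` flag feeds the `sorted_schema` decision can be sketched in isolation. This is a standalone illustration, not the project's `main()`: `build_parser()` and `is_sorted_schema()` are hypothetical names, and the `'json'` default for `--input_format` is an assumption made for the sketch. Only the boolean expression is taken from the diff above.

```python
import argparse


def build_parser():
    """Hypothetical, reduced parser with only the two relevant options."""
    parser = argparse.ArgumentParser(description='Sketch of the new flag.')
    # The 'json' default is an assumption for this sketch.
    parser.add_argument('--input_format', default='json')
    # store_true: the attribute is False unless the flag is present.
    parser.add_argument('--preserve_input_sort_order', action='store_true')
    return parser


def is_sorted_schema(input_format, preserve_input_sort_order):
    # Same boolean expression as the new `self.sorted_schema` assignment.
    return (input_format in {'json', 'dict'}) and not preserve_input_sort_order


args = build_parser().parse_args(['--preserve_input_sort_order'])
flag_on = is_sorted_schema(args.input_format, args.preserve_input_sort_order)
flag_off = is_sorted_schema('json', False)
csv_case = is_sorted_schema('csv', False)
print(flag_on, flag_off, csv_case)  # False True False
```

For CSV input the expression is already false, so the flag changes nothing there, which agrees with the help text saying it only affects the `json` and `dict` input formats.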
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '1.4.1'
+__version__ = '1.5'

tests/test_generate_schema.py

Lines changed: 3 additions & 1 deletion
@@ -608,6 +608,7 @@ def verify_data_chunk_as_csv_json_dict(self, *, chunk, as_dict):
         quoted_values_are_strings = ('quoted_values_are_strings' in data_flags)
         sanitize_names = ('sanitize_names' in data_flags)
         ignore_invalid_lines = ('ignore_invalid_lines' in data_flags)
+        preserve_input_sort_order = ('preserve_input_sort_order' in data_flags)
         records = chunk['records']
         expected_errors = chunk['errors']
         expected_error_map = chunk['error_map']
@@ -638,7 +639,8 @@ def verify_data_chunk_as_csv_json_dict(self, *, chunk, as_dict):
             keep_nulls=keep_nulls,
             quoted_values_are_strings=quoted_values_are_strings,
             sanitize_names=sanitize_names,
-            ignore_invalid_lines=ignore_invalid_lines)
+            ignore_invalid_lines=ignore_invalid_lines,
+            preserve_input_sort_order=preserve_input_sort_order)
         existing_schema_map = None
         if existing_schema:
             existing_schema_map = bq_schema_to_map(json.loads(existing_schema))

tests/testdata.txt

Lines changed: 94 additions & 0 deletions
@@ -2158,3 +2158,97 @@ SCHEMA
   }
 ]
 END
+
+# Test --preserve_input_sort_order flag. Without the flag, the
+# keys are in sorted order, for compatibility with 'bq load --autodetect`,
+# at least what 'bq load' used to do.
+# See https://github.com/bxparks/bigquery-schema-generator/pull/75
+DATA
+{ "s": "string", "i": 3, "x": 3.2, "b": true }
+SCHEMA
+[
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  }
+]
+END
+
+# Test --preserve_input_sort_order flag. With the flag, the column keys should
+# be in the order they appear in the JSON data.
+# See https://github.com/bxparks/bigquery-schema-generator/pull/75
+DATA preserve_input_sort_order
+{ "s": "string", "i": 3, "x": 3.2, "b": true }
+SCHEMA
+[
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  }
+]
+END
+
+# Test --preserve_input_sort_order flag. Each JSON data record can contain a
+# partial list of keys. So the order of columns in the schema will be the order
+# in which they are first *seen* by the bigquery_schema_generator.
+# See https://github.com/bxparks/bigquery-schema-generator/pull/75
+DATA preserve_input_sort_order
+{ "s": "string", "i": 3 }
+{ "x": 3.2, "s": "string", "i": 3 }
+{ "b": true, "x": 3.2, "s": "string", "i": 3 }
+SCHEMA
+[
+  {
+    "mode": "NULLABLE",
+    "name": "s",
+    "type": "STRING"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "i",
+    "type": "INTEGER"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "x",
+    "type": "FLOAT"
+  },
+  {
+    "mode": "NULLABLE",
+    "name": "b",
+    "type": "BOOLEAN"
+  }
+]
+END
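
The `DATA <flags>` / `SCHEMA` / `END` layout of the test chunks above, and the way the test derives `preserve_input_sort_order` from the flags on the `DATA` line, can be sketched with a small reader. `read_chunks()` is a hypothetical, simplified stand-in written for this illustration, not the project's actual test data reader; in particular it ignores the expected-error sections that the real test evidently consumes through `chunk['errors']` and `chunk['error_map']`.

```python
import json

SAMPLE = """\
# Comment lines outside a chunk are skipped.
DATA preserve_input_sort_order
{ "s": "string", "i": 3 }
SCHEMA
[
  {"mode": "NULLABLE", "name": "s", "type": "STRING"},
  {"mode": "NULLABLE", "name": "i", "type": "INTEGER"}
]
END
"""


def read_chunks(text):
    """Yield one dict per DATA ... SCHEMA ... END chunk (hypothetical reader)."""
    chunk = None
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if section is None:
            # Outside a chunk, only a DATA line is meaningful.
            if stripped.startswith('DATA'):
                flags = set(stripped.split()[1:])
                chunk = {'data_flags': flags, 'records': [], 'schema': []}
                section = 'records'
        elif stripped == 'SCHEMA':
            section = 'schema'
        elif stripped == 'END':
            # The schema lines together form one JSON document.
            chunk['schema'] = json.loads('\n'.join(chunk['schema']))
            yield chunk
            chunk, section = None, None
        else:
            chunk[section].append(line)


chunks = list(read_chunks(SAMPLE))
data_flags = chunks[0]['data_flags']
# Same membership test as the line added to the unit test.
preserve_input_sort_order = ('preserve_input_sort_order' in data_flags)
schema_names = [column['name'] for column in chunks[0]['schema']]
print(preserve_input_sort_order, schema_names)
```

A chunk whose `DATA` line carries no flags yields an empty `data_flags` set, so the membership test is false and the default sorted behavior is exercised.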
