44
55This script generates the BigQuery schema from the newline-delimited data
66records on the STDIN. The records can be in JSON format or CSV format. The
7- BigQuery data importer (`bq load`) uses only the first 100 lines when the schema
8- auto-detection feature is enabled. In contrast, this script uses all data
9- records to generate the schema.
7+ BigQuery data importer (`bq load`) uses only the
8+ [first 500 records](https://cloud.google.com/bigquery/docs/schema-detect)
9+ when the schema auto-detection feature is enabled. In contrast, this script uses
10+ all data records to generate the schema.
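
To illustrate why scanning every record can matter, here is a minimal sketch in
plain Python. It does not use `bigquery_schema_generator` or `bq load`; the
helper names (`infer_type`, `merge_types`, `infer_schema`) and the simplified
type rules are assumptions made up for this illustration only.

```python
import json


def infer_type(value):
    """Map a JSON value to a simplified type name (toy rules, not bq load's)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"


def merge_types(old, new):
    """Combine the type seen so far with the type of a new value."""
    if old is None or old == new:
        return new
    if {old, new} == {"INTEGER", "FLOAT"}:
        return "FLOAT"  # an integer column widens to float
    return "STRING"  # toy fallback for any other mismatch


def infer_schema(lines, limit=None):
    """Infer {field: type} from newline-delimited JSON, optionally sampling."""
    schema = {}
    for index, line in enumerate(lines):
        if limit is not None and index >= limit:
            break
        for key, value in json.loads(line).items():
            schema[key] = merge_types(schema.get(key), infer_type(value))
    return schema


# 500 records with an integer "x", then one late record that changes the picture.
lines = [json.dumps({"x": 1})] * 500 + [json.dumps({"x": 1.5, "y": "late"})]

sampled = infer_schema(lines, limit=500)  # sees only the first 500 records
full = infer_schema(lines)                # sees every record

print(sampled)
print(full)
```

In this toy data set the sampled pass never sees the float value or the `y`
field, while the full pass does.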
1011
1112Usage:
1213```
1314$ generate-schema < file.data.json > file.schema.json
1415$ generate-schema --input_format csv < file.data.csv > file.schema.json
1516```
1617
17- **Version**: 1.5 (2021-11-14)
18+ **Version**: 1.5.1 (2022-12-04)
1819
1920**Changelog**: [CHANGELOG.md](CHANGELOG.md)
2021
@@ -24,6 +25,8 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
2425* [Installation](#Installation)
2526    * [Ubuntu Linux](#UbuntuLinux)
2627    * [MacOS](#MacOS)
28+         * [MacOS 11 (Big Sur)](#MacOS11)
29+         * [MacOS 10.14 (Mojave)](#MacOS1014)
2730* [Usage](#Usage)
2831    * [Command Line](#CommandLine)
2932    * [Schema Output](#SchemaOutput)
@@ -42,10 +45,11 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
4245 (`--preserve_input_sort_order`)](#PreserveInputSortOrder)
4346    * [Using as a Library](#UsingAsLibrary)
4447        * [`SchemaGenerator.run()`](#SchemaGeneratorRun)
45-         * [`SchemaGenerator.deduce_schema()`](#SchemaGeneratorDeduceSchema)
48+         * [`SchemaGenerator.deduce_schema()` with File](#SchemaGeneratorDeduceSchemaFromFile)
49+         * [`SchemaGenerator.deduce_schema()` with Dict](#SchemaGeneratorDeduceSchemaFromDict)
4650* [Schema Types](#SchemaTypes)
4751    * [Supported Types](#SupportedTypes)
48-     * [Type Inferrence](#TypeInferrence)
52+     * [Type Inference](#TypeInference)
4953* [Examples](#Examples)
5054* [Benchmarks](#Benchmarks)
5155* [System Requirements](#SystemRequirements)
@@ -66,7 +70,7 @@ schema can be defined manually or the schema can be
6670[auto-detected](https://cloud.google.com/bigquery/docs/schema-detect#auto-detect).
6771
6872When the auto-detect feature is used, the BigQuery data importer examines only
69- the [first 100 records](https://cloud.google.com/bigquery/docs/schema-detect)
73+ the [first 500 records](https://cloud.google.com/bigquery/docs/schema-detect)
7074of the input data. In many cases, this is sufficient
7175because the data records were dumped from another database and the exact schema
7276of the source table was known. However, for data extracted from a service
@@ -127,7 +131,7 @@ depending on how your Python environment is configured. See below for
127131some notes for Ubuntu Linux and MacOS.
128132
129133<a name="UbuntuLinux"></a>
130- ### Ubuntu Linux (18.04, 20.04)
134+ ### Ubuntu Linux (18.04, 20.04, 22.04)
131135
132136After running `pip3 install bigquery_schema_generator`, the `generate-schema`
133137script may be installed in one of the following locations:
@@ -138,27 +142,59 @@ script may be installed in one the following locations:
138142* `$HOME/.virtualenvs/{your_virtual_env}/bin/generate-schema`
139143
140144<a name="MacOS"></a>
141- ### MacOS (10.14 Mojave)
145+ ### MacOS
142146
143- I don't use my Mac for software development these days, and I won't upgrade to
144- Catalina (10.15) or later, but here are some notes if they help.
147+ I don't have any Macs which are able to run the latest macOS, and I don't use
148+ them much for software development these days, but here are some notes if they
149+ help.
145150
146- If you installed Python from
147- [Python Releases for Mac OS X](https://www.python.org/downloads/mac-osx/),
148- then `/usr/local/bin/pip3` is a symlink to
149- `/Library/Frameworks/Python.framework/Versions/3.6/bin/pip3`. So
150- `generate-schema` is installed at
151+ <a name="MacOS11"></a>
152+ #### MacOS 11 (Big Sur)
153+
154+ I believe Big Sur comes preinstalled with Python 3.8. If you install
155+ `bigquery_schema_generator` using:
156+
157+ ```
158+ $ pip3 install --user bigquery_schema_generator
159+ ```
160+
161+ then the `generate-schema` wrapper script will be installed at:
162+
163+ ```
164+ /Users/{your-login}/Library/Python/3.8/bin/generate-schema
165+ ```
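
Because the exact directory depends on the Python version and on how Python was
installed, it can help to ask the interpreter itself. The following is a
minimal sketch, assuming a POSIX-style layout (Linux or macOS) where
`pip3 install --user` places scripts in the `bin` subdirectory of the per-user
base directory; the helper name `user_script_path` is hypothetical.

```python
import os
import site


def user_script_path(script_name="generate-schema"):
    """Return where a `pip3 install --user` script is expected to live.

    Assumes a POSIX-style layout: <user base>/bin/<script_name>.
    """
    return os.path.join(site.getuserbase(), "bin", script_name)


path = user_script_path()
status = "found" if os.path.exists(path) else "not found"
print(f"{path} ({status})")
```

Run it with the same `python3` that owns the `pip3` you used for the install,
since each interpreter has its own per-user base directory.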
166+
167+ <a name="MacOS1014"></a>
168+ #### MacOS 10.14 (Mojave)
169+
170+ This MacOS version comes with Python 2.7 only. You can install Python 3 in one
171+ of two ways:
172+
173+ 1) Downloading the [macOS installer directly from
174+ Python.org](https://www.python.org/downloads/macos/).
175+
176+ The python3 binary will be located at `/usr/local/bin/python3`, and
177+ `/usr/local/bin/pip3` is a symlink to
178+ `/Library/Frameworks/Python.framework/Versions/3.6/bin/pip3`.
179+
180+ So running
181+
182+ ```
183+ $ pip3 install --user bigquery_schema_generator
184+ ```
185+
186+ will install `generate-schema` at
151187`/Library/Frameworks/Python.framework/Versions/3.6/bin/generate-schema`.
152188
153189The Python installer updates `$HOME/.bash_profile` to add
154190`/Library/Frameworks/Python.framework/Versions/3.6/bin` to the `$PATH`
155191environment variable. So you should be able to run the `generate-schema`
156192command without typing in the full path.
157193
158- You can install Python3 using
159- [Homebrew](https://docs.brew.sh/Homebrew-and-Python). In this environment, the
160- `generate-schema` script will probably be installed in `/usr/local/bin` but I'm
161- not completely certain.
194+ 2) Using [Homebrew](https://docs.brew.sh/Homebrew-and-Python).
195+
196+ In this environment, the `generate-schema` script will probably be installed in
197+ `/usr/local/bin` but I'm not completely certain.
162198
163199<a name="Usage"></a>
164200## Usage
@@ -665,42 +701,56 @@ generator = SchemaGenerator(
665701 ignore_invalid_lines=ignore_invalid_lines,
666702 preserve_input_sort_order=preserve_input_sort_order,
667703)
668- generator.run(input_file=input_file, output_file=output_file)
704+
705+ FILENAME = "..."
706+
707+ with open(FILENAME) as input_file:
708+     generator.run(input_file=input_file, output_file=output_file)
669709```
670710
671711The `input_format` is one of `json`, `csv`, and `dict` as described in the
672712[Input Format](#InputFormat) section above. The `input_file` must match the
673713format given by this parameter.
674714
675- See the `TestSchemaGeneratorDeduce.test_run_with_input_and_output()` test
676- case in [examples/test_generate_schema.py](examples/test_generate_schema.py) for
677- an example of an `input_file` of type `json`.
715+ See [generatorrun.py](examples/generatorrun.py) for an example.
678716
679- <a name="SchemaGeneratorDeduceSchema"></a>
680- #### `SchemaGenerator.deduce_schema()`
717+ <a name="SchemaGeneratorDeduceSchemaFromFile"></a>
718+ #### `SchemaGenerator.deduce_schema()` from File
681719
682720If you need to process the generated schema programmatically, use the
683721`deduce_schema()` method and process the resulting `schema_map` and `error_log`
684722data structures like this:
685723
686724```python
725+ import json
726+ import logging
727+ import sys
687728from bigquery_schema_generator.generate_schema import SchemaGenerator
688- ...
729+
730+ FILENAME = "jsonfile.json"
731+
689732generator = SchemaGenerator(
690- ...(same as above)...
733+     input_format='json',
734+     quoted_values_are_strings=True,
691735)
692736
737+ with open(FILENAME) as file:
738+     schema_map, error_logs = generator.deduce_schema(file)
739+
695- # Print errors if desired.
696742for error in error_logs:
697743    logging.info("Problem on line %s: %s", error['line_number'], error['msg'])
698744
699745schema = generator.flatten_schema(schema_map)
700- json.dump(schema, output_file, indent=2)
746+ json.dump(schema, sys.stdout, indent=2)
747+ print()
701748```
702749
703- The `deduce_schema()` now supports starting from an existing `schema_map`
750+ See [csvreader.py](examples/csvreader.py) and
751+ [jsoneader.py](examples/jsoneader.py) for 2 examples.
752+
753+ The `deduce_schema()` also supports starting from an existing `schema_map`
704754instead of starting from scratch. This is the internal version of the
705755`--existing_schema_path` functionality.
706756
@@ -714,9 +764,36 @@ schema_map2, error_logs = generator.deduce_schema(
714764The `input_data` must match the `input_format` given in the constructor. The
715765format is described in the [Input Format](#InputFormat) section above.
716766
717- See the `TestSchemaGeneratorDeduce.test_deduce_schema_with_dict_input()` test
718- case in [examples/test_generate_schema.py](examples/test_generate_schema.py) for
719- an example of an `input_data` of type `dict`.
767+ <a name="SchemaGeneratorDeduceSchemaFromDict"></a>
768+ #### `SchemaGenerator.deduce_schema()` from Dict
769+
770+ If the JSON data set has already been read into memory into a Python `dict`
771+ object, the `SchemaGenerator` can process that too like this:
772+
773+ ```python
774+ import json
775+ import logging
776+ import sys
777+ from bigquery_schema_generator.generate_schema import SchemaGenerator
778+
779+ generator = SchemaGenerator(input_format='dict')
780+ input_data = [
781+     {
782+         's': 'string',
783+         'b': True,
784+     },
785+     {
786+         'd': '2021-08-18',
787+         'x': 3.1
788+     },
789+ ]
790+ schema_map, error_logs = generator.deduce_schema(input_data)
791+ schema = generator.flatten_schema(schema_map)
792+ json.dump(schema, sys.stdout, indent=2)
793+ print()
794+ ```
795+
796+ See [dictreader.py](examples/dictreader.py) for an example.
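
Once a flattened schema is in memory, it can be post-processed like any other
Python list. The sketch below assumes the schema follows the BigQuery JSON
schema layout: a list of dicts with `name`, `type`, and `mode` keys, plus a
nested `fields` list for `RECORD` columns. The helper `field_paths` and the
hand-written `sample_schema` are hypothetical and are not taken from the
library or from real output.

```python
def field_paths(schema, prefix=""):
    """Return (dotted_name, type, mode) for every leaf field in a schema.

    Assumes each entry is a dict with 'name', 'type', and 'mode', and that
    RECORD entries carry their children in a 'fields' list.
    """
    paths = []
    for field in schema:
        name = prefix + field["name"]
        if field["type"] == "RECORD":
            # Recurse into nested records, extending the dotted prefix.
            paths.extend(field_paths(field.get("fields", []), name + "."))
        else:
            paths.append((name, field["type"], field["mode"]))
    return paths


# Hand-written sample in the assumed layout (not generated by the library).
sample_schema = [
    {"mode": "NULLABLE", "name": "b", "type": "BOOLEAN"},
    {"mode": "NULLABLE", "name": "d", "type": "DATE"},
    {
        "fields": [{"mode": "NULLABLE", "name": "x", "type": "FLOAT"}],
        "mode": "REPEATED",
        "name": "r",
        "type": "RECORD",
    },
]

for name, type_name, mode in field_paths(sample_schema):
    print(f"{name}: {type_name} ({mode})")
```

A flat list of dotted paths like this is convenient for comparing two schemas
or for spotting unexpected fields.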
720797
721798<a name="SchemaTypes"></a>
722799## Schema Types
@@ -773,8 +850,8 @@ The following types are _not_ supported at all:
773850* `BYTES`
774851* `DATETIME` (unable to distinguish from `TIMESTAMP`)
775852
776- <a name="TypeInferrence"></a>
777- ### Type Inferrence Rules
853+ <a name="TypeInference"></a>
854+ ### Type Inference Rules
778855
779856The `generate-schema` script attempts to emulate the various type conversion and
780857compatibility rules implemented by **bq load**:
@@ -977,16 +1054,24 @@ now requires Python 3.6 or higher, I think mostly due to the use of f-strings.
9771054
9781055I have tested it on:
9791056
1057+ * Ubuntu 22.04, Python 3.10.6
9801058* Ubuntu 20.04, Python 3.8.5
9811059* Ubuntu 18.04, Python 3.7.7
9821060* Ubuntu 18.04, Python 3.6.7
9831061* Ubuntu 17.10, Python 3.6.3
984- * MacOS 10.14.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
985- * MacOS 10.13.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
1062+ * MacOS 11.7.1 (Big Sur), Python 3.8.9
1063+ * MacOS 10.14.2 (Mojave), Python 3.6.4
1064+ * MacOS 10.13.2 (High Sierra), Python 3.6.4
9861065
9871066The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7
9881067and 3.8.
9891068
1069+ The unit tests are invoked with the `$ make tests` target and depend only on the
1070+ built-in Python `unittest` package.
1071+
1072+ The coding style check is invoked using `$ make flake8` and depends on the
1073+ `flake8` package, which can be installed using `$ pip3 install --user flake8`.
1074+
9901075<a name="License"></a>
9911076## License
9921077