@@ -15,7 +15,7 @@ $ generate-schema < file.data.json > file.schema.json
1515$ generate-schema --input_format csv < file.data.csv > file.schema.json
1616```
1717
18- **Version**: 1.5.1 (2022-12-04)
18+ **Version**: 1.5.2 (2023-04-01)
1919
2020**Changelog**: [CHANGELOG.md](CHANGELOG.md)
2121
@@ -25,6 +25,7 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
2525* [Installation](#Installation)
2626 * [Ubuntu Linux](#UbuntuLinux)
2727 * [MacOS](#MacOS)
28+ * [MacOS 12 (Monterey)](#MacOS12)
2829 * [MacOS 11 (Big Sur)](#MacOS11)
2930 * [MacOS 10.14 (Mojave)](#MacOS1014)
3031* [Usage](#Usage)
@@ -45,8 +46,12 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
4546 (`--preserve_input_sort_order`)](#PreserveInputSortOrder)
4647 * [Using as a Library](#UsingAsLibrary)
4748 * [`SchemaGenerator.run()`](#SchemaGeneratorRun)
48- * [`SchemaGenerator.deduce_schema()` with File](#SchemaGeneratorDeduceSchemaFromFile)
49- * [`SchemaGenerator.deduce_schema()` with Dict](#SchemaGeneratorDeduceSchemaFromDict)
49+ * [`SchemaGenerator.deduce_schema()` from
50+ File](#SchemaGeneratorDeduceSchemaFromFile)
51+ * [`SchemaGenerator.deduce_schema()` from
52+ Dict](#SchemaGeneratorDeduceSchemaFromDict)
53+ * [`SchemaGenerator.deduce_schema()` from
54+ DictReader](#SchemaGeneratorDeduceSchemaFromCsvDictReader)
5055* [Schema Types](#SchemaTypes)
5156 * [Supported Types](#SupportedTypes)
5257 * [Type Inference](#TypeInference)
@@ -75,7 +80,7 @@ of the input data. In many cases, this is sufficient
7580because the data records were dumped from another database and the exact schema
7681of the source table was known. However, for data extracted from a service
7782(e.g. using a REST API) the record fields could have been organically added
78- at later dates. In this case, the first 100 records do not contain fields which
83+ at later dates. In this case, the first 500 records do not contain fields which
7984are present in later records. The ** bq load** auto-detection fails and the data
8085fails to load.
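
The failure mode described in this paragraph can be illustrated with a minimal sketch in plain Python. It does not call `bq load` or this package; the helper `fields_seen`, its `sample_size` parameter, and the synthetic records are hypothetical and exist only for illustration. The sketch compares the fields visible in a fixed-size sample with the fields present across all records.

```python
def fields_seen(records, sample_size=None):
    """Return the set of field names found in the first `sample_size` records
    (or in all records when `sample_size` is None)."""
    seen = set()
    for index, record in enumerate(records):
        if sample_size is not None and index >= sample_size:
            break
        seen.update(record.keys())
    return seen


# 600 synthetic records: the 'email' field only appears after record 500.
records = [{"id": i} for i in range(500)]
records += [{"id": i, "email": f"user{i}@example.com"} for i in range(500, 600)]

sampled = fields_seen(records, sample_size=500)  # what a 500-record sample sees
complete = fields_seen(records)                  # what a full scan sees
missing = complete - sampled
print(sorted(missing))
```

A schema built only from the sample would lack the `email` column, which is why scanning every record matters for organically grown data.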
8186
@@ -145,24 +150,77 @@ script may be installed in one the following locations:
145150### MacOS
146151
147152I don't have any Macs which are able to run the latest macOS, and I don't use
148- them much for software development these days, but here are some notes if they
149- help.
153+ them much for software development these days, but here are some notes on older
154+ versions of macOS in case they help.
150155
151- <a name="MacOS11"></a>
152- #### MacOS 11 (Big Sur)
156+ <a name="MacOS12"></a>
157+ #### MacOS 12 (Monterey)
158+
159+ Neither Python 2 nor Python 3 is installed by default on Monterey. If you try to
160+ run `python3` on the command line, a dialog box asks you to install the
161+ [Xcode](https://developer.apple.com/support/xcode/) development package. It
162+ apparently takes over an hour at 10 MB/s.
163+
164+ You can instead install Python 3 using
165+ [Homebrew](https://docs.brew.sh/Homebrew-and-Python), by installing `brew` and
166+ typing `$ brew install python`. Currently, it downloads Python 3.10 in about 1-2
167+ minutes and installs the `python3` and `pip3` binaries into
168+ `/usr/local/bin/python3` and `/usr/local/bin/pip3`. Using `brew` seems to be
169+ the easiest option, so let's assume that Python 3 was installed through that.
153170
154- I believe Big Sur comes preinstalled with Python 3.8. If you install
155- `bigquery_schema_generator` using:
171+ If you run:
172+ ```
173+ $ pip3 install bigquery_schema_generator
174+ ```
175+ the package will be installed at `/usr/local/lib/python3.10/site-packages/`, and
176+ the `generate-schema` script will be installed at
177+ `/usr/local/bin/generate-schema`.
156178
179+ If you use the `--user` flag:
157180```
158181$ pip3 install --user bigquery_schema_generator
159182```
183+ the package will be installed at
184+ `$HOME/Library/Python/3.10/lib/python/site-packages/`, and the `generate-schema`
185+ script will be installed at `$HOME/Library/Python/3.10/bin/generate-schema`.
186+
187+ You may need to add the `$HOME/Library/Python/3.10/bin` directory to your
188+ `$PATH` variable in your `$HOME/.bashrc` file.
189+
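
One way to check whether the per-user script directory is already on `PATH` is sketched below in Python. It relies on the standard `site.getuserbase()` call; the assumption that scripts live in a `bin` subdirectory of the user base holds on macOS and Linux but not on Windows, and the variable names are only illustrative.

```python
import os
import site

# Per-user base directory; on the macOS setup described above this would be
# something like $HOME/Library/Python/3.10 (an assumption about the platform).
user_base = site.getuserbase()

# Assumption: scripts are placed in a "bin" subdirectory of the user base
# (true on macOS and Linux; Windows uses a different layout).
user_bin = os.path.join(user_base, "bin")

# Compare against the entries of the current PATH.
path_entries = os.environ.get("PATH", "").split(os.pathsep)
on_path = user_bin in path_entries
print(f"{user_bin} on PATH: {on_path}")
```

If the check reports `False`, adding the printed directory to `PATH` in the shell startup file should make `generate-schema` visible.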
190+ <a name="MacOS11"></a>
191+ #### MacOS 11 (Big Sur)
192+
193+ Python 2.7.16 is installed by default on Big Sur as `/usr/bin/python`. If you
194+ try to run `python3` on the command line, a dialog box asks you to install
195+ the [Xcode](https://developer.apple.com/support/xcode/) development package,
196+ which I think installs Python 3.8 as `/usr/bin/python3` (I can't remember; it
197+ was installed a long time ago).
198+
199+ You can instead install Python 3 using
200+ [Homebrew](https://docs.brew.sh/Homebrew-and-Python), by installing `brew` and
201+ typing `$ brew install python`. Currently, it downloads Python 3.10 in about 1-2
202+ minutes and installs the `python3` and `pip3` binaries into
203+ `/usr/local/bin/python3` and `/usr/local/bin/pip3`. Using `brew` seems to be
204+ the easiest option, so let's assume that Python 3 was installed through that.
160205
161- then the `generate-schema` wrapper script will be installed at:
206+ If you run:
207+ ```
208+ $ pip3 install bigquery_schema_generator
209+ ```
210+ the package will be installed at `/usr/local/lib/python3.10/site-packages/`, and
211+ the `generate-schema` script will be installed at
212+ `/usr/local/bin/generate-schema`.
162213
214+ If you use the `--user` flag:
163215```
164- /User/{your-login}/Library/Python/3.8/bin/generate-schema
216+ $ pip3 install --user bigquery_schema_generator
165217```
218+ the package will be installed at
219+ `$HOME/Library/Python/3.10/lib/python/site-packages/`, and the `generate-schema`
220+ script will be installed at `$HOME/Library/Python/3.10/bin/generate-schema`.
221+
222+ You may need to add the `$HOME/Library/Python/3.10/bin` directory to your
223+ `$PATH` variable in your `$HOME/.bashrc` file.
166224
167225<a name="MacOS1014"></a>
168226#### MacOS 10.14 (Mojave)
@@ -717,9 +775,12 @@ See [generatorrun.py](examples/generatorrun.py) for an example.
717775<a name="SchemaGeneratorDeduceSchemaFromFile"></a>
718776#### `SchemaGenerator.deduce_schema()` from File
719777
720- If you need to process the generated schema programmatically, use the
721- `deduce_schema()` method and process the resulting `schema_map` and `error_log`
722- data structures like this:
778+ If you need to process the generated schema programmatically, create an instance
779+ of `SchemaGenerator` using the appropriate `input_format` option, use the
780+ `deduce_schema()` method to read in the file, then postprocess the resulting
781+ `schema_map` and `error_log` data structures.
782+
783+ The following reads in a JSON file (see [jsoneader.py](examples/jsoneader.py)):
723784
724785```python
725786import json
@@ -737,26 +798,39 @@ generator = SchemaGenerator(
737798with open(FILENAME) as file:
738799     schema_map, errors = generator.deduce_schema(file)
739800
740- schema_map, error_logs = generator.deduce_schema(input_data=input_data)
741-
742- for error in error_logs:
801+ for error in errors:
743802     logging.info("Problem on line %s: %s", error['line_number'], error['msg'])
744803
745804schema = generator.flatten_schema(schema_map)
746805json.dump(schema, sys.stdout, indent=2)
747806print()
748807```
749808
750- See [csvreader.py](examples/csvreader.py) and
751- [jsoneader.py](examples/jsoneader.py) for 2 examples.
809+ The following reads a CSV file (see [csvreader.py](examples/csvreader.py)):
810+
811+ ```python
812+ ...(same as above)...
813+
814+ generator = SchemaGenerator(
815+     input_format='csv',
816+     infer_mode=True,
817+     quoted_values_are_strings=True,
818+     sanitize_names=True,
819+ )
820+
821+ with open(FILENAME) as file:
822+     schema_map, errors = generator.deduce_schema(file)
823+
824+ ...(same as above)...
825+ ```
752826
753827The `deduce_schema()` method also supports starting from an existing `schema_map`
754828instead of starting from scratch. This is the internal version of the
755829`--existing_schema_path` functionality.
756830
757831```python
758- schema_map1, error_logs = generator.deduce_schema(input_data=data1)
759- schema_map2, error_logs = generator.deduce_schema(
832+ schema_map1, errors = generator.deduce_schema(input_data=data1)
833+ schema_map2, errors = generator.deduce_schema(
760834     input_data=data2, schema_map=schema_map1
761835)
762836```
@@ -765,10 +839,13 @@ The `input_data` must match the `input_format` given in the constructor. The
765839format is described in the [Input Format](#InputFormat) section above.
766840
767841<a name="SchemaGeneratorDeduceSchemaFromDict"></a>
768- #### `SchemaGenerator.deduce_schema()` from Dict
842+ #### `SchemaGenerator.deduce_schema()` from Iterable of Dict
843+
844+ If the JSON data set has already been read into memory as an array or iterable
845+ of Python `dict` objects, the `SchemaGenerator` can process that too using the
846+ `input_format='dict'` option. Here is an example from
847+ [dictreader.py](examples/dictreader.py):
769848
770- If the JSON data set has already been read into memory into a Python `dict`
771- object, the `SchemaGenerator` can process that too like this:
772849
773850```Python
774851import json
@@ -787,13 +864,55 @@ input_data = [
787864 'x': 3.1
788865 },
789866]
790- schema_map, error_logs = generator.deduce_schema(input_data)
867+ schema_map, errors = generator.deduce_schema(input_data)
791868schema = generator.flatten_schema(schema_map)
792869json.dump(schema, sys.stdout, indent=2)
793870print()
794871```
795872
796- See [dictreader.py](examples/dictreader.py) for an example.
873+ **Note**: The `input_format='dict'` option supports any `input_data` object
874+ which acts like an iterable of `dict`. The data does not have to be loaded into
875+ memory.
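
As a rough illustration of an iterable of `dict` that is not held in memory, the sketch below lazily yields one record per line of newline-delimited JSON. It uses only the standard library and does not call `SchemaGenerator`; the helper name `iter_json_lines` and the sample data are hypothetical. Based on the note above, a generator like this could presumably be passed as `input_data` with `input_format='dict'`, but that is an assumption rather than something tested here.

```python
import io
import json


def iter_json_lines(stream):
    """Lazily yield one dict per non-blank line of newline-delimited JSON."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for a large file; any text stream is consumed the same way.
stream = io.StringIO('{"s": "string", "b": true}\n{"x": 3.1}\n')
records = iter_json_lines(stream)

first = next(records)  # only the first line has been parsed at this point
rest = list(records)   # the remaining lines are parsed on demand
```

Because the generator parses one line at a time, memory use stays roughly constant regardless of the size of the input.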
876+
877+ <a name="SchemaGeneratorDeduceSchemaFromCsvDictReader"></a>
878+ #### `SchemaGenerator.deduce_schema()` from csv.DictReader
879+
880+ The `input_format='csvdictreader'` option is similar to `input_format='dict'`
881+ but is treated more like `input_format='csv'`. It supports any object that behaves
882+ like an iterable of `dict`, but it is intended to be used with the
883+ [csv.DictReader](https://docs.python.org/3/library/csv.html) object.
884+
885+ The difference between `'dict'` and `'csvdictreader'` is the assumption made
886+ about the shape of the data. The `'csvdictreader'` option assumes that the data
887+ is tabular like a CSV file, with every row usually containing an entry for every
888+ column. The `'dict'` option does not make that assumption, and the data can be
889+ more hierarchical with some rows containing partial sets of columns.
890+
891+ This semantic difference means that `'csvdictreader'` supports options which
892+ apply to `'csv'` files. In particular, with the `infer_mode=True` option, the
893+ `mode` of a column can be set to `REQUIRED` instead of `NULLABLE` when the
894+ script finds that the column is defined in every row.
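
The idea behind inferring `REQUIRED` versus `NULLABLE` for tabular data can be sketched as follows. This is not the package's implementation; `infer_modes` is a hypothetical helper, and treating `None` and the empty string as "not defined" is an assumption made for this illustration.

```python
def infer_modes(rows):
    """Map each column name to 'REQUIRED' or 'NULLABLE'.

    A column is REQUIRED only when every row has a non-empty value for it.
    """
    rows = list(rows)

    # Collect column names in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    modes = {}
    for name in columns:
        filled = all(row.get(name) not in (None, "") for row in rows)
        modes[name] = "REQUIRED" if filled else "NULLABLE"
    return modes


rows = [
    {"name": "alice", "age": "30"},
    {"name": "bob", "age": ""},  # empty cell, so 'age' cannot be REQUIRED
]
modes = infer_modes(rows)
```

This kind of inference only makes sense when every row is expected to carry every column, which is the tabular assumption that separates `'csvdictreader'` from `'dict'`.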
895+
896+ Here is an example from [tsvreader.py](examples/tsvreader.py) which reads a
897+ tab-separated file (TSV):
898+
899+ ```python
900+ import csv
901+ import json
902+ import sys
903+ from bigquery_schema_generator.generate_schema import SchemaGenerator
904+
905+ FILENAME = "tsvfile.tsv"
906+
907+ generator = SchemaGenerator(input_format='csvdictreader')
908+ with open(FILENAME) as file:
909+     reader = csv.DictReader(file, delimiter='\t')
910+     schema_map, errors = generator.deduce_schema(reader)
911+
912+ schema = generator.flatten_schema(schema_map)
913+ json.dump(schema, sys.stdout, indent=2)
914+ print()
915+ ```
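
To see what `csv.DictReader` hands to a consumer, the short standard-library sketch below parses an in-memory TSV string and collects the rows. The sample data is made up for illustration, and the schema generator itself is not invoked.

```python
import csv
import io

# Two data rows under a header line, separated by tab characters.
tsv_text = "name\tage\nalice\t30\nbob\t25\n"

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)  # each row maps column name to a string value

for row in rows:
    print(row["name"], row["age"])
```

Every value arrives as a string and every row carries every column, which matches the tabular shape that the `'csvdictreader'` option expects.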
797916
798917<a name="SchemaTypes"></a>
799918## Schema Types
@@ -1059,12 +1178,14 @@ I have tested it on:
10591178* Ubuntu 18.04, Python 3.7.7
10601179* Ubuntu 18.04, Python 3.6.7
10611180* Ubuntu 17.10, Python 3.6.3
1062- * MacOS 11.7.1 (Big Sur), Python 3.8.9
1181+ * MacOS 12.6.2 (Monterey), Python 3.10.9
1182+ * MacOS 11.7.2 (Big Sur), Python 3.10.9
1183+ * MacOS 11.7.2 (Big Sur), Python 3.8.9
10631184* MacOS 10.14.2 (Mojave), Python 3.6.4
10641185* MacOS 10.13.2 (High Sierra), Python 3.6.4
10651186
1066- The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7
1067- and 3.8.
1187+ The GitHub Actions continuous integration pipeline validates on Python 3.7,
1188+ 3.8, 3.9, and 3.10.
10681189
10691190The unit tests are invoked with the `$ make tests` target and depend only on the
10701191built-in Python `unittest` package.