@@ -12,10 +12,41 @@ $ generate-schema < file.data.json > file.schema.json
1212$ generate-schema --input_format csv < file.data.csv > file.schema.json
1313```
1414
15- Version: 1.3 (2020-12-05)
16-
17- Changelog: [CHANGELOG.md](CHANGELOG.md)
18-
15+ **Version**: 1.4 (2020-12-09)
16+
17+ **Changelog**: [CHANGELOG.md](CHANGELOG.md)
18+
19+ ## Table of Contents
20+
21+ * [Background](#Background)
22+ * [Installation](#Installation)
23+     * [Ubuntu Linux](#UbuntuLinux)
24+     * [MacOS](#MacOS)
25+ * [Usage](#Usage)
26+     * [Command Line](#CommandLine)
27+     * [Schema Output](#SchemaOutput)
28+     * [Command Line Flag Options](#FlagOptions)
29+         * [Help (`--help`)](#Help)
30+         * [Input Format (`--input_format`)](#InputFormat)
31+         * [Keep Nulls (`--keep_nulls`)](#KeepNulls)
32+         * [Quoted Values Are Strings (`--quoted_values_are_strings`)](#QuotedValuesAreStrings)
33+         * [Infer Mode (`--infer_mode`)](#InferMode)
34+         * [Debugging Interval (`--debugging_interval`)](#DebuggingInterval)
35+         * [Debugging Map (`--debugging_map`)](#DebuggingMap)
36+         * [Sanitize Names (`--sanitize_names`)](#SanitizedNames)
37+         * [Ignore Invalid Lines (`--ignore_invalid_lines`)](#IgnoreInvalidLines)
38+         * [Existing Schema Path (`--existing_schema_path`)](#ExistingSchemaPath)
39+     * [Using as a Library](#UsingAsLibrary)
40+ * [Schema Types](#SchemaTypes)
41+     * [Supported Types](#SupportedTypes)
42+     * [Type Inference](#TypeInferrence)
43+ * [Examples](#Examples)
44+ * [Benchmarks](#Benchmarks)
45+ * [System Requirements](#SystemRequirements)
46+ * [Authors](#Authors)
47+ * [License](#License)
48+
49+ <a name="Background"></a>
1950## Background
2051
2152Data can be imported into [BigQuery](https://cloud.google.com/bigquery/) using
@@ -44,6 +75,7 @@ in JSON format on the STDOUT. This schema file can be fed back into the **bq
4475load** tool to create a table that is more compatible with the data fields in
4576the input dataset.
4677
78+ <a name="Installation"></a>
4779## Installation
4880
4981**Prerequisite**: You need to have Python 3.6 or higher.
@@ -87,6 +119,7 @@ The shell script `generate-schema` will be installed somewhere in your system,
87119depending on how your Python environment is configured. See below for
88120some notes for Ubuntu Linux and MacOS.
89121
122+ <a name="UbuntuLinux"></a>
90123### Ubuntu Linux (18.04, 20.04)
91124
92125After running `pip3 install bigquery_schema_generator`, the `generate-schema`
@@ -97,6 +130,7 @@ script may be installed in one of the following locations:
97130* `$HOME/.local/bin/generate-schema`
98131* `$HOME/.virtualenvs/{your_virtual_env}/bin/generate-schema`
99132
133+ <a name="MacOS"></a>
100134### MacOS (10.14 Mojave)
101135
102136I don't use my Mac for software development these days, and I won't upgrade to
@@ -119,8 +153,12 @@ You can install Python3 using
119153`generate-schema` script will probably be installed in `/usr/local/bin` but I'm
120154not completely certain.
121155
156+ <a name="Usage"></a>
122157## Usage
123158
159+ <a name="CommandLine"></a>
160+ ### Command Line
161+
124162The `generate_schema.py` script accepts a newline-delimited JSON or
125163CSV data file on the STDIN. JSON input format has been tested extensively.
126164CSV input format was added more recently (in v0.4) using the `--input_format
@@ -161,6 +199,7 @@ then you can invoke the Python script directly:
161199$ ./generate_schema.py < file.data.json > file.schema.json
162200```
163201
202+ <a name="SchemaOutput"></a>
164203### Using the Schema Output
165204
166205The resulting schema file can be given to the ** bq load** command using the
@@ -226,11 +265,13 @@ $ bq show --schema mydataset.mytable | python3 -m json.tool
226265file. An alternative is the [jq command](https://stedolan.github.io/jq/).)
227266The resulting schema file should be identical to `file.schema.json`.
228267
229- ### Flag Options
268+ <a name="FlagOptions"></a>
269+ ### Command Line Flag Options
230270
231271The `generate_schema.py` script supports a handful of command line flags
232272as shown by the `--help` flag below.
233273
274+ <a name="Help"></a>
234275#### Help (`--help`)
235276
236277Print the built-in help strings:
@@ -268,6 +309,7 @@ optional arguments:
268309 <project_id>:<dataset>:<table_name>
269310```
270311
312+ <a name="InputFormat"></a>
271313#### Input Format (`--input_format`)
272314
273315Specifies the format of the input file, either `json` (default) or `csv`.
@@ -280,6 +322,7 @@ order, even if the column contains an empty value for every record.
280322See [Issue #26](https://github.com/bxparks/bigquery-schema-generator/issues/26)
281323for implementation details.
282324
325+ <a name="KeepNulls"></a>
283326#### Keep Nulls (`--keep_nulls`)
284327
285328Normally when the input data file contains a field which has a null, empty
@@ -327,6 +370,7 @@ INFO:root:Processed 1 lines
327370]
328371```
329372
373+ <a name="QuotedValuesAreStrings"></a>
330374#### Quoted Values Are Strings (`--quoted_values_are_strings`)
331375
332376By default, quoted values are inspected to determine if they can be interpreted
@@ -360,6 +404,7 @@ $ generate-schema --quoted_values_are_strings
360404]
361405```
362406
407+ <a name="InferMode"></a>
363408#### Infer Mode (`--infer_mode`)
364409
365410Set the schema `mode` of a field to `REQUIRED` instead of the default
@@ -379,6 +424,7 @@ either input_format, CSV or JSON.
379424See [Issue #28](https://github.com/bxparks/bigquery-schema-generator/issues/28)
380425for implementation details.
381426
427+ <a name="DebuggingInterval"></a>
382428#### Debugging Interval (`--debugging_interval`)
383429
384430By default, the `generate_schema.py` script prints a short progress message
@@ -389,6 +435,7 @@ every 1000 lines of input data. This interval can be changed using the
389435$ generate-schema --debugging_interval 50 < file.data.json > file.schema.json
390436```
391437
438+ <a name="DebuggingMap"></a>
392439#### Debugging Map (`--debugging_map`)
393440
394441Instead of printing out the BigQuery schema, the `--debugging_map` flag prints out
@@ -400,6 +447,7 @@ flag is intended to be used for debugging.
400447$ generate-schema --debugging_map < file.data.json > file.schema.json
401448```
402449
450+ <a name="SanitizedNames"></a>
403451#### Sanitize Names (`--sanitize_names`)
404452
405453BigQuery column names are [restricted to certain characters and
@@ -426,6 +474,7 @@ through the data files to cleanup the column names anyway. See
426474[Issue #14](https://github.com/bxparks/bigquery-schema-generator/issues/14) and
427475[Issue #33](https://github.com/bxparks/bigquery-schema-generator/issues/33).
428476
477+ <a name="IgnoreInvalidLines"></a>
429478#### Ignore Invalid Lines (`--ignore_invalid_lines`)
430479
431480By default, if an error is encountered on a particular line, processing stops
@@ -446,6 +495,7 @@ deduction logic will handle any missing or extra columns gracefully.
446495Fixes
447496[Issue #49](https://github.com/bxparks/bigquery-schema-generator/issues/49).
448497
498+ <a name="ExistingSchemaPath"></a>
449499#### Existing Schema Path (`--existing_schema_path`)
450500
451501There are cases where we would like to start from an existing BigQuery table
@@ -478,8 +528,72 @@ See discussion in
478528[PR #57](https://github.com/bxparks/bigquery-schema-generator/pull/57) for
479529more details.
480530
531+ <a name="UsingAsLibrary"></a>
532+ ### Using As a Library
533+
534+ The `bigquery_schema_generator` module can be used as a library by external
535+ Python client code by creating an instance of `SchemaGenerator` and calling the
536+ `run(input_file, output_file)` method:
537+
538+ ```python
539+ from bigquery_schema_generator.generate_schema import SchemaGenerator
540+
541+ generator = SchemaGenerator(
542+     input_format=input_format,
543+     infer_mode=infer_mode,
544+     keep_nulls=keep_nulls,
545+     quoted_values_are_strings=quoted_values_are_strings,
546+     debugging_interval=debugging_interval,
547+     debugging_map=debugging_map,
548+     sanitize_names=sanitize_names,
549+     ignore_invalid_lines=ignore_invalid_lines,
550+ )
551+ generator.run(input_file=input_file, output_file=output_file)
552+ ```
553+
554+ If you need to process the generated schema programmatically, use the
555+ `deduce_schema()` method and process the resulting `schema_map` and `error_logs`
556+ data structures like this:
557+
558+ ```python
559+ from bigquery_schema_generator.generate_schema import SchemaGenerator
560+ ...
561+ generator = SchemaGenerator(
562+     # ...(same as above)...
563+ )
564+
565+ schema_map, error_logs = generator.deduce_schema(input_data=input_data)
566+
567+ # Print errors if desired.
568+ for error in error_logs:
569+     logging.info("Problem on line %s: %s", error['line_number'], error['msg'])
570+
571+ schema = generator.flatten_schema(schema_map)
572+ json.dump(schema, output_file, indent=2)
573+ ```
574+
575+ The `deduce_schema()` method now supports starting from an existing `schema_map`
576+ instead of starting from scratch. This is the internal version of the
577+ `--existing_schema_path` functionality.
578+
579+ ```python
580+ schema_map1, error_logs = generator.deduce_schema(input_data=data1)
581+ schema_map2, error_logs = generator.deduce_schema(
582+     input_data=data1, schema_map=schema_map1
583+ )
584+ ```
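As a rough sketch of how the `schema_map` argument above might be used, the snippet below feeds records to `deduce_schema()` in batches and carries the `schema_map` from one call into the next. The `chunks()` helper and the sample records are hypothetical, and the snippet assumes that the `SchemaGenerator` constructor arguments other than `input_format` have usable defaults. It uses the `dict` input format described later in this section, and skips the library calls if the package is not installed.

```python
def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical records, already parsed into native Python dicts.
records = [{"id": 1}, {"id": 2, "name": "a"}, {"id": 3, "score": 1.5}]

try:
    from bigquery_schema_generator.generate_schema import SchemaGenerator
except ImportError:
    SchemaGenerator = None  # package not installed; skip the library calls

if SchemaGenerator is not None:
    # Assumption: the remaining constructor arguments have defaults.
    generator = SchemaGenerator(input_format='dict')
    schema_map = None
    for batch in chunks(records, 2):
        if schema_map is None:
            # First batch: start from scratch.
            schema_map, error_logs = generator.deduce_schema(input_data=batch)
        else:
            # Later batches: continue from the schema_map deduced so far.
            schema_map, error_logs = generator.deduce_schema(
                input_data=batch, schema_map=schema_map
            )
    print(generator.flatten_schema(schema_map))
```

Processing in batches like this is only one possible use; the same pattern applies to any two data sets that should share one schema.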
585+
586+ When using the `SchemaGenerator` object directly, the `input_format` parameter
587+ supports `dict` as a third input format in addition to the `json` and `csv`
588+ formats. The `dict` input format tells `SchemaGenerator.deduce_schema()` to
589+ accept a list of Python dict objects as the `input_data`. This is useful if the
590+ input data (usually JSON) has already been read into memory and parsed from
591+ newline-delimited JSON into native Python dict objects.
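A minimal sketch of the `dict` input format, under a few assumptions: the `parse_ndjson()` helper and the sample data are hypothetical, and the `SchemaGenerator` constructor is assumed to supply defaults for the arguments not listed. The newline-delimited JSON is parsed with the standard library first, then the resulting list of dicts is handed to `deduce_schema()` if the package is installed.

```python
import json


def parse_ndjson(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample data that is already in memory as a string.
raw = '{"name": "alice", "age": 30}\n{"name": "bob", "age": null}\n'
records = parse_ndjson(raw)

try:
    from bigquery_schema_generator.generate_schema import SchemaGenerator
except ImportError:
    SchemaGenerator = None  # package not installed; skip the library calls

if SchemaGenerator is not None:
    # Assumption: the remaining constructor arguments have defaults.
    generator = SchemaGenerator(input_format='dict')
    schema_map, error_logs = generator.deduce_schema(input_data=records)
    schema = generator.flatten_schema(schema_map)
    print(json.dumps(schema, indent=2))
```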
592+
593+ <a name="SchemaTypes"></a>
481594## Schema Types
482595
596+ <a name="SupportedTypes"></a>
483597### Supported Types
484598
485599The `bq show --schema` command produces a JSON schema file that uses the
@@ -531,6 +645,7 @@ The following types are _not_ supported at all:
531645* `BYTES`
532646* `DATETIME` (unable to distinguish from `TIMESTAMP`)
533647
648+ <a name="TypeInferrence"></a>
534649### Type Inference Rules
535650
536651The `generate-schema` script attempts to emulate the various type conversion and
@@ -572,6 +687,7 @@ compatibility rules implemented by **bq load**:
572687 * integers less than `-2^63` (-9223372036854775808)
573688 * (See [Issue #18](https://github.com/bxparks/bigquery-schema-generator/issues/18) for more details)
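The `-2^63` boundary in the list above is the lower limit of a signed 64-bit integer. A small sketch of that range check follows; the names `INT64_MIN`, `INT64_MAX` and `fits_in_int64()` are hypothetical and this is not the script's own code. Python integers have arbitrary precision, so the comparison is exact.

```python
# Limits of a signed 64-bit integer.
INT64_MIN = -2**63      # -9223372036854775808
INT64_MAX = 2**63 - 1   # 9223372036854775807


def fits_in_int64(value):
    """Return True if `value` is within the signed 64-bit integer range."""
    return INT64_MIN <= value <= INT64_MAX


print(fits_in_int64(-9223372036854775808))  # exactly -2^63, still in range
print(fits_in_int64(-9223372036854775809))  # one below -2^63, out of range
```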
574689
690+ <a name="Examples"></a>
575691## Examples
576692
577693Here is an example of a single JSON data record on the STDIN (the `^D` below
@@ -705,41 +821,7 @@ INFO:root:Processed 4 lines
705821]
706822```
707823
708- ## Using As a Library
709-
710- The `bigquery_schema_generator` module can be used as a library by an external
711- Python client code by creating an instance of `SchemaGenerator` and calling the
712- `run(input, output)` method:
713-
714- ```python
715- from bigquery_schema_generator.generate_schema import SchemaGenerator
716-
717- generator = SchemaGenerator(
718-     input_format=input_format,
719-     infer_mode=infer_mode,
720-     keep_nulls=keep_nulls,
721-     quoted_values_are_strings=quoted_values_are_strings,
722-     debugging_interval=debugging_interval,
723-     debugging_map=debugging_map)
724- generator.run(input_file, output_file)
725- ```
726-
727- If you need to process the generated schema programmatically, use the
728- `deduce_schema()` method and process the resulting `schema_map` and `error_log`
729- data structures like this:
730-
731- ```python
732- from bigquery_schema_generator.generate_schema import SchemaGenerator
733- ...
734- schema_map, error_logs = generator.deduce_schema(input_file)
735-
736- for error in error_logs:
737-     logging.info("Problem on line %s: %s", error['line'], error['msg'])
738-
739- schema = generator.flatten_schema(schema_map)
740- json.dump(schema, output_file, indent=2)
741- ```
742-
824+ <a name="Benchmarks"></a>
743825## Benchmarks
744826
745827I wrote the `bigquery_schema_generator/anonymize.py` script to create an
@@ -759,6 +841,7 @@ $ bigquery_schema_generator/generate_schema.py < anon1.data.json \
759841took 67s on a Dell Precision M4700 laptop with an Intel Core i7-3840QM CPU @
7608422.80GHz, 32GB of RAM, Ubuntu Linux 18.04, Python 3.6.7.
761843
844+ <a name="SystemRequirements"></a>
762845## System Requirements
763846
764847This project was initially developed on Ubuntu 17.04 using Python 3.5.3, but it
@@ -776,6 +859,12 @@ I have tested it on:
776859The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7
777860and 3.8.
778861
862+ <a name="License"></a>
863+ ## License
864+
865+ Apache License 2.0
866+
867+ <a name="Authors"></a>
779868## Authors
780869
781870* Created by Brian T. Park ([email protected] ).
@@ -793,8 +882,6 @@ and 3.8.
793882 (abroglesc@).
794883* Allow an existing schema file to be specified using `--existing_schema_path`,
795884 by Austin Brogle (abroglesc@) and Bozo Dragojevic (bozzzzo@).
885+ * Allow `SchemaGenerator.deduce_schema()` to accept a list of native Python
886+ `dict` objects, by Zigfrid Zvezdin (ZiggerZZ@).
796887
797-
798- ## License
799-
800- Apache License 2.0