@@ -12,7 +12,7 @@ $ generate-schema < file.data.json > file.schema.json
 $ generate-schema --input_format csv < file.data.csv > file.schema.json
 ```

-Version: 1.2 (2020-10-27)
+Version: 1.3 (2020-12-05)

 Changelog: [CHANGELOG.md](CHANGELOG.md)

@@ -235,13 +235,14 @@ as shown by the `--help` flag below.

 Print the built-in help strings:

-```
+```bash
 $ generate-schema --help
-usage: generate_schema.py [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
-                          [--quoted_values_are_strings] [--infer_mode]
-                          [--debugging_interval DEBUGGING_INTERVAL]
-                          [--debugging_map] [--sanitize_names]
-                          [--ignore_invalid_lines]
+usage: generate-schema [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
+                       [--quoted_values_are_strings] [--infer_mode]
+                       [--debugging_interval DEBUGGING_INTERVAL]
+                       [--debugging_map] [--sanitize_names]
+                       [--ignore_invalid_lines]
+                       [--existing_schema_path EXISTING_SCHEMA_PATH]

 Generate BigQuery schema from JSON or CSV file.

@@ -261,6 +262,10 @@ optional arguments:
                         standard
   --ignore_invalid_lines
                         Ignore lines that cannot be parsed instead of stopping
+  --existing_schema_path EXISTING_SCHEMA_PATH
+                        File that contains the existing BigQuery schema for a
+                        table. This can be fetched with: `bq show --schema
+                        <project_id>:<dataset>:<table_name>`
 ```

 #### Input Format (`--input_format`)
@@ -282,7 +287,7 @@ array or empty record as its value, the field is suppressed in the schema file.
 This flag enables this field to be included in the schema file.

 In other words, using a data file containing just nulls and empty values:
-```
+```bash
 $ generate_schema
 { "s": null, "a": [], "m": {} }
 ^D
@@ -291,7 +296,7 @@ INFO:root:Processed 1 lines
291296` ` `
292297
293298With the ` keep_nulls` flag, we get:
294- ```
299+ ` ` ` bash
295300$ generate-schema --keep_nulls
296301{ " s" : null, " a" : [], " m" : {} }
297302^D
@@ -331,7 +336,7 @@ consistent with the algorithm used by `bq load`. However, for the `BOOLEAN`,
 normal strings instead. This flag disables type inference for `BOOLEAN`,
 `INTEGER` and `FLOAT` types inside quoted strings.

-```
+```bash
 $ generate-schema
 { "name": "1" }
 ^D
@@ -365,6 +370,12 @@ feature for JSON files, but too difficult to implement in practice because
 fields are often completely missing from a given JSON record (instead of
 explicitly being defined to be `null`).

+In addition to the above, this option, when used in conjunction with
+`--existing_schema_path`, allows fields to be relaxed from REQUIRED to
+NULLABLE if they were REQUIRED in the existing schema and NULL rows are found
+in the new data that we are inferring a schema from. In this case it can be
+used with either input format, CSV or JSON.
+
 See [Issue #28](https://github.com/bxparks/bigquery-schema-generator/issues/28)
 for implementation details.

@@ -374,7 +385,7 @@ By default, the `generate_schema.py` script prints a short progress message
 every 1000 lines of input data. This interval can be changed using the
 `--debugging_interval` flag.

-```
+```bash
 $ generate-schema --debugging_interval 50 < file.data.json > file.schema.json
 ```

@@ -385,7 +396,7 @@ the bookkeeping metadata map which is used internally to keep track of the
 various fields and their types that were inferred using the data file. This
 flag is intended to be used for debugging.

-```
+```bash
 $ generate-schema --debugging_map < file.data.json > file.schema.json
 ```

@@ -411,9 +422,9 @@ generate the schema file. The transformations are:
 My recollection is that the `bq load` command does *not* normalize the JSON key
 names. Instead it prints an error message. So the `--sanitize_names` flag is
 useful mostly for CSV files. For JSON files, you'll have to do a second pass
-through the data files to cleanup the column names anyway. See [Issue
-#14](https://github.com/bxparks/bigquery-schema-generator/issues/14) and [Issue
-#33](https://github.com/bxparks/bigquery-schema-generator/issues/33).
+through the data files to clean up the column names anyway. See
+[Issue #14](https://github.com/bxparks/bigquery-schema-generator/issues/14) and
+[Issue #33](https://github.com/bxparks/bigquery-schema-generator/issues/33).

 #### Ignore Invalid Lines (`--ignore_invalid_lines`)

@@ -432,14 +443,46 @@ does throw an exception on a given line, we would not be able to catch it and
 continue processing. Fortunately, CSV files are fairly robust, and the schema
 deduction logic will handle any missing or extra columns gracefully.

-Fixes [Issue
-#49](https://github.com/bxparks/bigquery-schema-generator/issues/49).
+Fixes
+[Issue #49](https://github.com/bxparks/bigquery-schema-generator/issues/49).
+
+#### Existing Schema Path (`--existing_schema_path`)
+
+There are cases where we would like to start from an existing BigQuery table
+schema rather than starting from scratch with a new batch of data we would like
+to load. In this case we can specify the path to a local file on disk that
+contains our existing BigQuery table schema. This can be generated with the
+following `bq show --schema` command:
+```bash
+bq show --schema <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME> > existing_table_schema.json
+```
+
+We can then run generate-schema with the additional option
+```bash
+--existing_schema_path existing_table_schema.json
+```
+
+There is some subtle interaction between `--existing_schema_path` and fields
+which are marked with a `mode` of `REQUIRED` in the existing schema. If the new
+data contains a `null` value (either in a CSV or JSON data file), it is not
+clear whether the schema should be changed to `mode=NULLABLE` or whether the
+new data should be ignored and the schema should remain `mode=REQUIRED`. The
+choice is determined by overloading the `--infer_mode` flag:
+
+* If `--infer_mode` is given, the new schema will be allowed to revert back to
+  `NULLABLE`.
+* If `--infer_mode` is not given, the offending new record will be ignored
+  and the new schema will remain `REQUIRED`.
+
+See the discussion in
+[PR #57](https://github.com/bxparks/bigquery-schema-generator/pull/57) for
+more details.
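
The two bullet points above can be illustrated with a small Python sketch. The `merge_mode` helper is hypothetical, shown only to make the decision rule concrete; it is not part of the generate-schema code:

```python
def merge_mode(existing_mode, saw_null, infer_mode):
    """Decide the merged mode for a field, given its mode in the existing
    schema and whether the new data contained a null for it."""
    # Only a field that was REQUIRED in the existing schema and hit a null
    # in the new data is affected by the choice.
    if existing_mode == "REQUIRED" and saw_null:
        # --infer_mode permits relaxing to NULLABLE; otherwise the offending
        # record is ignored and the field stays REQUIRED.
        return "NULLABLE" if infer_mode else "REQUIRED"
    return existing_mode

print(merge_mode("REQUIRED", saw_null=True, infer_mode=True))   # NULLABLE
print(merge_mode("REQUIRED", saw_null=True, infer_mode=False))  # REQUIRED
```

Fields that were already `NULLABLE` in the existing schema are unaffected either way.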

 ## Schema Types

 ### Supported Types

-The **bq show --schema** command produces a JSON schema file that uses the
+The `bq show --schema` command produces a JSON schema file that uses the
 older [Legacy SQL data types](https://cloud.google.com/bigquery/data-types).
 For compatibility, the **generate-schema** script will also generate a schema
 file using the legacy data types.
@@ -534,7 +577,7 @@ compatibility rules implemented by **bq load**:
 Here is an example of a single JSON data record on the STDIN (the `^D` below
 means typing Control-D, which indicates "end of file" under Linux and MacOS):

-```
+```bash
 $ generate-schema
 { "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
 ^D
@@ -569,7 +612,7 @@ INFO:root:Processed 1 lines
 ```

 In most cases, the data file will be stored in a file:
-```
+```bash
 $ cat > file.data.json
 { "a": [1, 2] }
 { "i": 3 }
@@ -596,7 +639,7 @@ $ cat file.schema.json
 Here is the schema generated from a CSV input file. The first line is the header
 containing the names of the columns, and the schema lists the columns in the
 same order as the header:
-```
+```bash
 $ generate-schema --input_format csv
 e,b,c,d,a
 1,x,true,,2.0
@@ -634,7 +677,7 @@ INFO:root:Processed 3 lines
 ```

 Here is an example of the schema generated with the `--infer_mode` flag:
-```
+```bash
 $ generate-schema --input_format csv --infer_mode
 name,surname,age
 John
@@ -701,15 +744,15 @@ json.dump(schema, output_file, indent=2)

 I wrote the `bigquery_schema_generator/anonymize.py` script to create an
 anonymized data file `tests/testdata/anon1.data.json.gz`:
-```
+```bash
 $ ./bigquery_schema_generator/anonymize.py < original.data.json \
     > anon1.data.json
 $ gzip anon1.data.json
 ```
 This data file is 290MB (5.6MB compressed) with 103080 data records.

 Generating the schema using
-```
+```bash
 $ bigquery_schema_generator/generate_schema.py < anon1.data.json \
     > anon1.schema.json
 ```
@@ -748,6 +791,8 @@ and 3.8.
 * Bug fix in `--sanitize_names` by Riccardo M. Cefala (riccardomc@).
 * Print full path of nested JSON elements in error messages, by Austin Brogle
   (abroglesc@).
+* Allow an existing schema file to be specified using `--existing_schema_path`,
+  by Austin Brogle (abroglesc@) and Bozo Dragojevic (bozzzzo@).


 ## License