Commit acaa74b

Merge pull request #65 from bxparks/develop
merge v1.4 into master
2 parents d5c3cd3 + cf1c1ad commit acaa74b

7 files changed

+365
-77
lines changed

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,17 @@
11
# Changelog
22

33
* Unreleased
4+
* 1.4 (2020-12-09)
5+
* Add 'dict' as a third `input_format` when `SchemaGenerator` is used as a
6+
library. This can be useful when the data has already been transformed
7+
into a list of native Python `dict` objects (see #58, thanks to
8+
ZiggerZZ@).
9+
* Expand the pattern matchers for quoted integers and quoted floating point
10+
numbers to be more compatible with the patterns recognized by `bq load
11+
--autodetect`.
12+
* Add Table of Contents to README.md. Add usage info for the
13+
`schema_map=existing_schema_map` and the `input_format='dict'` parameters
14+
in the `SchemaGenerator()` constructor.
415
* 1.3 (2020-12-05)
516
* Allow an existing schema file to be specified using
617
`--existing_schema_path` flag, so that new data can be merged into it.

DEVELOPER.md

Lines changed: 10 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -19,24 +19,25 @@ There are a lot of instructions on the web that uses
1919
those are deprecated. The tool that seems to work for me is
2020
[Twine](https://github.com/pypa/twine).
2121

22-
[PyPI](https://pypi.python.org/pypi) does not support Markdown, so
23-
we use `pypandoc` and `pandoc` to convert Markdown to RST.
24-
`pypandoc` is a thin Python wrapper around `pandoc`.
22+
[PyPI](https://pypi.python.org/pypi) now supports Markdown so we no longer need
23+
to download `pypandoc` (Python package) and `pandoc` (apt package) to convert
24+
Markdown to RST.
2525

2626
Install the following packages:
2727
```
28-
$ sudo apt install pandoc
29-
$ sudo -H pip3 install setuptools wheel twine pypandoc
28+
$ sudo -H pip3 install setuptools wheel twine
3029
```
3130

3231
### Steps
3332

3433
1. Edit `setup.py` and increment the `version`.
3534
1. Push all changes to `develop` branch.
36-
1. Merge `develop` into `master` branch, and checkout the `master` branch.
35+
1. Create a GitHub pull request (PR) from `develop` into `master` branch.
36+
1. Merge the PR into `master`.
37+
1. Create a new Release in GitHub with the new tag label.
3738
1. Create the dist using `python3 setup.py sdist`.
38-
1. Upload to PyPI using `twine upload dist/*`.
39-
(Need to enter my PyPI login credentials).
39+
1. Upload to PyPI using `twine upload
40+
dist/bigquery-schema-generator-{version}.tar.gz`.
41+
* Enter my PyPI login credentials.
4042
* If `dist/` becomes too cluttered, we can remove the entire `dist/`
4143
directory and run `python3 setup.py sdist` again.
42-
1. Tag the `master` branch with the release on GitHub.

README.md

Lines changed: 131 additions & 44 deletions
Original file line number | Diff line number | Diff line change
@@ -12,10 +12,41 @@ $ generate-schema < file.data.json > file.schema.json
1212
$ generate-schema --input_format csv < file.data.csv > file.schema.json
1313
```
1414

15-
Version: 1.3 (2020-12-05)
16-
17-
Changelog: [CHANGELOG.md](CHANGELOG.md)
18-
15+
**Version**: 1.4 (2020-12-09)
16+
17+
**Changelog**: [CHANGELOG.md](CHANGELOG.md)
18+
19+
## Table of Contents
20+
21+
* [Background](#Background)
22+
* [Installation](#Installation)
23+
* [Ubuntu Linux](#UbuntuLinux)
24+
* [MacOS](#MacOS)
25+
* [Usage](#Usage)
26+
* [Command Line](#CommandLine)
27+
* [Schema Output](#SchemaOutput)
28+
* [Command Line Flag Options](#FlagOptions)
29+
* [Help (`--help`)](#Help)
30+
* [Input Format (`--input_format`)](#InputFormat)
31+
* [Keep Nulls (`--keep_nulls`)](#KeepNulls)
32+
* [Quoted Values Are Strings (`--quoted_values_are_strings`)](#QuotedValuesAreStrings)
33+
* [Infer Mode (`--infer_mode`)](#InferMode)
34+
* [Debugging Interval (`--debugging_interval`)](#DebuggingInterval)
35+
* [Debugging Map (`--debugging_map`)](#DebuggingMap)
36+
* [Sanitize Names (`--sanitize_names`)](#SanitizedNames)
37+
* [Ignore Invalid Lines (`--ignore_invalid_lines`)](#IgnoreInvalidLines)
38+
* [Existing Schema Path (`--existing_schema_path`)](#ExistingSchemaPath)
39+
* [Using as a Library](#UsingAsLibrary)
40+
* [Schema Types](#SchemaTypes)
41+
* [Supported Types](#SupportedTypes)
42+
* [Type Inference](#TypeInferrence)
43+
* [Examples](#Examples)
44+
* [Benchmarks](#Benchmarks)
45+
* [System Requirements](#SystemRequirements)
46+
* [Authors](#Authors)
47+
* [License](#License)
48+
49+
<a name="Background"></a>
1950
## Background
2051

2152
Data can be imported into [BigQuery](https://cloud.google.com/bigquery/) using
@@ -44,6 +75,7 @@ in JSON format on the STDOUT. This schema file can be fed back into the **bq
4475
load** tool to create a table that is more compatible with the data fields in
4576
the input dataset.
4677

78+
<a name="Installation"></a>
4779
## Installation
4880

4981
**Prerequisite**: You need to have Python 3.6 or higher.
@@ -87,6 +119,7 @@ The shell script `generate-schema` will be installed somewhere in your system,
87119
depending on how your Python environment is configured. See below for
88120
some notes for Ubuntu Linux and MacOS.
89121

122+
<a name="UbuntuLinux"></a>
90123
### Ubuntu Linux (18.04, 20.04)
91124

92125
After running `pip3 install bigquery_schema_generator`, the `generate-schema`
@@ -97,6 +130,7 @@ script may be installed in one of the following locations:
97130
* `$HOME/.local/bin/generate-schema`
98131
* `$HOME/.virtualenvs/{your_virtual_env}/bin/generate-schema`
99132

133+
<a name="MacOS"></a>
100134
### MacOS (10.14 Mojave)
101135

102136
I don't use my Mac for software development these days, and I won't upgrade to
@@ -119,8 +153,12 @@ You can install Python3 using
119153
`generate-schema` script will probably be installed in `/usr/local/bin` but I'm
120154
not completely certain.
121155

156+
<a name="Usage"></a>
122157
## Usage
123158

159+
<a name="CommandLine"></a>
160+
### Command Line
161+
124162
The `generate_schema.py` script accepts a newline-delimited JSON or
125163
CSV data file on the STDIN. JSON input format has been tested extensively.
126164
CSV input format was added more recently (in v0.4) using the `--input_format
@@ -161,6 +199,7 @@ then you can invoke the Python script directly:
161199
$ ./generate_schema.py < file.data.json > file.schema.json
162200
```
163201

202+
<a name="SchemaOutput"></a>
164203
### Using the Schema Output
165204

166205
The resulting schema file can be given to the **bq load** command using the
@@ -226,11 +265,13 @@ $ bq show --schema mydataset.mytable | python3 -m json.tool
226265
file. An alternative is the [jq command](https://stedolan.github.io/jq/).)
227266
The resulting schema file should be identical to `file.schema.json`.
228267

229-
### Flag Options
268+
<a name="FlagOptions"></a>
269+
### Command Line Flag Options
230270

231271
The `generate_schema.py` script supports a handful of command line flags
232272
as shown by the `--help` flag below.
233273

274+
<a name="Help"></a>
234275
#### Help (`--help`)
235276

236277
Print the built-in help strings:
@@ -268,6 +309,7 @@ optional arguments:
268309
<project_id>:<dataset>:<table_name>
269310
```
270311

312+
<a name="InputFormat"></a>
271313
#### Input Format (`--input_format`)
272314

273315
Specifies the format of the input file, either `json` (default) or `csv`.
@@ -280,6 +322,7 @@ order, even if the column contains an empty value for every record.
280322
See [Issue #26](https://github.com/bxparks/bigquery-schema-generator/issues/26)
281323
for implementation details.
282324

325+
<a name="KeepNulls"></a>
283326
#### Keep Nulls (`--keep_nulls`)
284327

285328
Normally when the input data file contains a field which has a null, empty
@@ -327,6 +370,7 @@ INFO:root:Processed 1 lines
327370
]
328371
```
329372
373+
<a name="QuotedValuesAreStrings"></a>
330374
#### Quoted Values Are Strings (`--quoted_values_are_strings`)
331375
332376
By default, quoted values are inspected to determine if they can be interpreted
@@ -360,6 +404,7 @@ $ generate-schema --quoted_values_are_strings
360404
]
361405
```
362406
407+
<a name="InferMode"></a>
363408
#### Infer Mode (`--infer_mode`)
364409
365410
Set the schema `mode` of a field to `REQUIRED` instead of the default
@@ -379,6 +424,7 @@ either input_format, CSV or JSON.
379424
See [Issue #28](https://github.com/bxparks/bigquery-schema-generator/issues/28)
380425
for implementation details.
381426
427+
<a name="DebuggingInterval"></a>
382428
#### Debugging Interval (`--debugging_interval`)
383429
384430
By default, the `generate_schema.py` script prints a short progress message
@@ -389,6 +435,7 @@ every 1000 lines of input data. This interval can be changed using the
389435
$ generate-schema --debugging_interval 50 < file.data.json > file.schema.json
390436
```
391437
438+
<a name="DebuggingMap"></a>
392439
#### Debugging Map (`--debugging_map`)
393440
394441
Instead of printing out the BigQuery schema, the `--debugging_map` prints out
@@ -400,6 +447,7 @@ flag is intended to be used for debugging.
400447
$ generate-schema --debugging_map < file.data.json > file.schema.json
401448
```
402449
450+
<a name="SanitizedNames"></a>
403451
#### Sanitize Names (`--sanitize_names`)
404452
405453
BigQuery column names are [restricted to certain characters and
@@ -426,6 +474,7 @@ through the data files to cleanup the column names anyway. See
426474
[Issue #14](https://github.com/bxparks/bigquery-schema-generator/issues/14) and
427475
[Issue #33](https://github.com/bxparks/bigquery-schema-generator/issues/33).
428476
477+
<a name="IgnoreInvalidLines"></a>
429478
#### Ignore Invalid Lines (`--ignore_invalid_lines`)
430479
431480
By default, if an error is encountered on a particular line, processing stops
@@ -446,6 +495,7 @@ deduction logic will handle any missing or extra columns gracefully.
446495
Fixes
447496
[Issue #49](https://github.com/bxparks/bigquery-schema-generator/issues/49).
448497
498+
<a name="ExistingSchemaPath"></a>
449499
#### Existing Schema Path (`--existing_schema_path`)
450500
451501
There are cases where we would like to start from an existing BigQuery table
@@ -478,8 +528,72 @@ See discussion in
478528
[PR #57](https://github.com/bxparks/bigquery-schema-generator/pull/57) for
479529
more details.
480530
531+
<a name="UsingAsLibrary"></a>
532+
### Using As a Library
533+
534+
The `bigquery_schema_generator` module can be used as a library by external
535+
Python client code by creating an instance of `SchemaGenerator` and calling the
536+
`run(input, output)` method:
537+
538+
```python
539+
from bigquery_schema_generator.generate_schema import SchemaGenerator
540+
541+
generator = SchemaGenerator(
542+
input_format=input_format,
543+
infer_mode=infer_mode,
544+
keep_nulls=keep_nulls,
545+
quoted_values_are_strings=quoted_values_are_strings,
546+
debugging_interval=debugging_interval,
547+
debugging_map=debugging_map,
548+
sanitize_names=sanitize_names,
549+
ignore_invalid_lines=ignore_invalid_lines,
550+
)
551+
generator.run(input_file=input_file, output_file=output_file)
552+
```
553+
554+
If you need to process the generated schema programmatically, use the
555+
`deduce_schema()` method and process the resulting `schema_map` and `error_logs`
556+
data structures like this:
557+
558+
```python
559+
from bigquery_schema_generator.generate_schema import SchemaGenerator
560+
...
561+
generator = SchemaGenerator(
562+
...(same as above)...
563+
)
564+
565+
schema_map, error_logs = generator.deduce_schema(input_data=input_data)
566+
567+
# Print errors if desired.
568+
for error in error_logs:
569+
logging.info("Problem on line %s: %s", error['line_number'], error['msg'])
570+
571+
schema = generator.flatten_schema(schema_map)
572+
json.dump(schema, output_file, indent=2)
573+
```
574+
575+
The `deduce_schema()` method now supports starting from an existing `schema_map`
576+
instead of starting from scratch. This is the internal version of the
577+
`--existing_schema_path` functionality.
578+
579+
```python
580+
schema_map1, error_logs = generator.deduce_schema(input_data=data1)
581+
schema_map2, error_logs = generator.deduce_schema(
582+
input_data=data2, schema_map=schema_map1
583+
)
584+
```
585+
586+
When using the `SchemaGenerator` object directly, the `input_format` parameter
587+
supports `dict` as a third input format in addition to the `json` and `csv`
588+
formats. The `dict` input format tells `SchemaGenerator.deduce_schema()` to
589+
accept a list of Python dict objects as the `input_data`. This is useful if the
590+
input data (usually JSON) has already been read into memory and parsed from
591+
newline-delimited JSON into native Python dict objects.
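
The `dict` input format and the `schema_map` parameter described above can be combined as in the following minimal sketch. It assumes that the package is installed and that the records start out as newline-delimited JSON text. The helper names `parse_ndjson` and `deduce_from_dicts` and the sample records are hypothetical; only `SchemaGenerator`, `deduce_schema()` and `flatten_schema()` are taken from the text above.

```python
import json


def parse_ndjson(text):
    """Parse newline-delimited JSON text into a list of Python dicts.

    Blank lines are skipped. (Hypothetical helper, not part of the library.)
    """
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def deduce_from_dicts(records, schema_map=None):
    """Deduce a schema from a list of dicts, optionally extending schema_map.

    Returns (flattened_schema, schema_map, error_logs).
    (Hypothetical helper wrapping the calls shown in the README.)
    """
    # Imported lazily so parse_ndjson() stays usable without the package.
    from bigquery_schema_generator.generate_schema import SchemaGenerator

    generator = SchemaGenerator(input_format='dict')
    if schema_map is None:
        schema_map, error_logs = generator.deduce_schema(input_data=records)
    else:
        schema_map, error_logs = generator.deduce_schema(
            input_data=records, schema_map=schema_map
        )
    return generator.flatten_schema(schema_map), schema_map, error_logs


# Two batches of already-parsed records (made-up sample data).
batch1 = parse_ndjson('{"name": "a", "count": 1}\n{"name": "b", "count": 2}\n')
batch2 = parse_ndjson('{"name": "c", "score": 1.5}\n')

try:
    # Deduce from the first batch, then extend that schema with the second.
    _, schema_map1, _ = deduce_from_dicts(batch1)
    schema, _, error_logs = deduce_from_dicts(batch2, schema_map=schema_map1)
    print(json.dumps(schema, indent=2))
except ImportError:
    print("bigquery_schema_generator is not installed; records were parsed only")
```

Keeping the parsing step separate from the schema deduction means the same list of dicts can be reused by other code without reading the input a second time.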
592+
593+
<a name="SchemaTypes"></a>
481594
## Schema Types
482595
596+
<a name="SupportedTypes"></a>
483597
### Supported Types
484598
485599
The `bq show --schema` command produces a JSON schema file that uses the
@@ -531,6 +645,7 @@ The following types are _not_ supported at all:
531645
* `BYTES`
532646
* `DATETIME` (unable to distinguish from `TIMESTAMP`)
533647
648+
<a name="TypeInferrence"></a>
534649
### Type Inference Rules
535650
536651
The `generate-schema` script attempts to emulate the various type conversion and
@@ -572,6 +687,7 @@ compatibility rules implemented by **bq load**:
572687
* integers less than `-2^63` (-9223372036854775808)
573688
* (See [Issue #18](https://github.com/bxparks/bigquery-schema-generator/issues/18) for more details)
574689
690+
<a name="Examples"></a>
575691
## Examples
576692
577693
Here is an example of a single JSON data record on the STDIN (the `^D` below
@@ -705,41 +821,7 @@ INFO:root:Processed 4 lines
705821
]
706822
```
707823
708-
## Using As a Library
709-
710-
The `bigquery_schema_generator` module can be used as a library by an external
711-
Python client code by creating an instance of `SchemaGenerator` and calling the
712-
`run(input, output)` method:
713-
714-
```python
715-
from bigquery_schema_generator.generate_schema import SchemaGenerator
716-
717-
generator = SchemaGenerator(
718-
input_format=input_format,
719-
infer_mode=infer_mode,
720-
keep_nulls=keep_nulls,
721-
quoted_values_are_strings=quoted_values_are_strings,
722-
debugging_interval=debugging_interval,
723-
debugging_map=debugging_map)
724-
generator.run(input_file, output_file)
725-
```
726-
727-
If you need to process the generated schema programmatically, use the
728-
`deduce_schema()` method and process the resulting `schema_map` and `error_log`
729-
data structures like this:
730-
731-
```python
732-
from bigquery_schema_generator.generate_schema import SchemaGenerator
733-
...
734-
schema_map, error_logs = generator.deduce_schema(input_file)
735-
736-
for error in error_logs:
737-
logging.info("Problem on line %s: %s", error['line'], error['msg'])
738-
739-
schema = generator.flatten_schema(schema_map)
740-
json.dump(schema, output_file, indent=2)
741-
```
742-
824+
<a name="Benchmarks"></a>
743825
## Benchmarks
744826
745827
I wrote the `bigquery_schema_generator/anonymize.py` script to create an
@@ -759,6 +841,7 @@ $ bigquery_schema_generator/generate_schema.py < anon1.data.json \
759841
took 67s on a Dell Precision M4700 laptop with an Intel Core i7-3840QM CPU @
760842
2.80GHz, 32GB of RAM, Ubuntu Linux 18.04, Python 3.6.7.
761843
844+
<a name="SystemRequirements"></a>
762845
## System Requirements
763846
764847
This project was initially developed on Ubuntu 17.04 using Python 3.5.3, but it
@@ -776,6 +859,12 @@ I have tested it on:
776859
The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7
777860
and 3.8.
778861
862+
<a name="License"></a>
863+
## License
864+
865+
Apache License 2.0
866+
867+
<a name="Authors"></a>
779868
## Authors
780869
781870
* Created by Brian T. Park ([email protected]).
@@ -793,8 +882,6 @@ and 3.8.
793882
(abroglesc@).
794883
* Allow an existing schema file to be specified using `--existing_schema_path`,
795884
by Austin Brogle (abroglesc@) and Bozo Dragojevic (bozzzzo@).
885+
* Allow `SchemaGenerator.deduce_schema()` to accept a list of native Python
886+
`dict` objects, by Zigfrid Zvezdin (ZiggerZZ@).
796887
797-
798-
## License
799-
800-
Apache License 2.0
