
Commit 2d983fa

Merge pull request #88 from bxparks/develop
merge 1.5.1 into master
2 parents 2830dd0 + a564447 commit 2d983fa

11 files changed, +251 -41 lines changed

.github/workflows/pythonpackage.yml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ jobs:
1919
python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
2020

2121
steps:
22-
- uses: actions/checkout@v2
22+
- uses: actions/checkout@v3
2323

2424
- name: Set up Python ${{ matrix.python-version }}
2525
uses: actions/setup-python@v2

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,6 +1,12 @@
11
# Changelog
22

33
* Unreleased
4+
* 1.5.1 (2022-12-04)
5+
* Add `examples/*.py` to demonstrate how to use `SchemaGenerator` as a
6+
library.
7+
* Update README.md to state that `bq load --autodetect` uses the first
8+
500 records. Previously, it scanned only the 100 records.
9+
* This is a maintenance release with no new features or bug fixes.
410
* 1.5 (2021-11-14)
511
* Make the column order in the BQ schema file match the order of appearance
612
in the JSON data file using the `--preserve_input_sort_order` flag.

Makefile

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ tests:
66
python3 -m unittest
77

88
flake8:
9-
flake8 bigquery_schema_generator tests \
9+
flake8 bigquery_schema_generator tests examples \
1010
--count \
1111
--ignore W503 \
1212
--show-source \

README.md

Lines changed: 123 additions & 38 deletions
@@ -4,17 +4,18 @@
44

55
This script generates the BigQuery schema from the newline-delimited data
66
records on the STDIN. The records can be in JSON format or CSV format. The
7-
BigQuery data importer (`bq load`) uses only the first 100 lines when the schema
8-
auto-detection feature is enabled. In contrast, this script uses all data
9-
records to generate the schema.
7+
BigQuery data importer (`bq load`) uses only the
8+
[first 500 records](https://cloud.google.com/bigquery/docs/schema-detect)
9+
when the schema auto-detection feature is enabled. In contrast, this script uses
10+
all data records to generate the schema.
1011

1112
Usage:
1213
```
1314
$ generate-schema < file.data.json > file.schema.json
1415
$ generate-schema --input_format csv < file.data.csv > file.schema.json
1516
```
1617

17-
**Version**: 1.5 (2021-11-14)
18+
**Version**: 1.5.1 (2022-12-04)
1819

1920
**Changelog**: [CHANGELOG.md](CHANGELOG.md)
2021

@@ -24,6 +25,8 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
2425
* [Installation](#Installation)
2526
* [Ubuntu Linux](#UbuntuLinux)
2627
* [MacOS](#MacOS)
28+
* [MacOS 11 (Big Sur)](#MacOS11)
29+
* [MacOS 10.14 (Mojave)](#MacOS1014)
2730
* [Usage](#Usage)
2831
* [Command Line](#CommandLine)
2932
* [Schema Output](#SchemaOutput)
@@ -42,10 +45,11 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
4245
(`--preserve_input_sort_order`)](#PreserveInputSortOrder)
4346
* [Using as a Library](#UsingAsLibrary)
4447
* [`SchemaGenerator.run()`](#SchemaGeneratorRun)
45-
* [`SchemaGenerator.deduce_schema()`](#SchemaGeneratorDeduceSchema)
48+
* [`SchemaGenerator.deduce_schema()` with File](#SchemaGeneratorDeduceSchemaFromFile)
49+
* [`SchemaGenerator.deduce_schema()` with Dict](#SchemaGeneratorDeduceSchemaFromDict)
4650
* [Schema Types](#SchemaTypes)
4751
* [Supported Types](#SupportedTypes)
48-
* [Type Inferrence](#TypeInferrence)
52+
* [Type Inference](#TypeInference)
4953
* [Examples](#Examples)
5054
* [Benchmarks](#Benchmarks)
5155
* [System Requirements](#SystemRequirements)
@@ -66,7 +70,7 @@ schema can be defined manually or the schema can be
6670
[auto-detected](https://cloud.google.com/bigquery/docs/schema-detect#auto-detect).
6771

6872
When the auto-detect feature is used, the BigQuery data importer examines only
69-
the [first 100 records](https://cloud.google.com/bigquery/docs/schema-detect)
73+
the [first 500 records](https://cloud.google.com/bigquery/docs/schema-detect)
7074
of the input data. In many cases, this is sufficient
7175
because the data records were dumped from another database and the exact schema
7276
of the source table was known. However, for data extracted from a service
@@ -127,7 +131,7 @@ depending on how your Python environment is configured. See below for
127131
some notes for Ubuntu Linux and MacOS.
128132

129133
<a name="UbuntuLinux"></a>
130-
### Ubuntu Linux (18.04, 20.04)
134+
### Ubuntu Linux (18.04, 20.04, 22.04)
131135

132136
After running `pip3 install bigquery_schema_generator`, the `generate-schema`
133137
script may be installed in one the following locations:
@@ -138,27 +142,59 @@ script may be installed in one the following locations:
138142
* `$HOME/.virtualenvs/{your_virtual_env}/bin/generate-schema`
139143

140144
<a name="MacOS"></a>
141-
### MacOS (10.14 Mojave)
145+
### MacOS
142146

143-
I don't use my Mac for software development these days, and I won't upgrade to
144-
Catalina (10.15) or later, but here are some notes if they help.
147+
I don't have any Macs which are able to run the latest macOS, and I don't use
148+
them much for software development these days, but here are some notes if they
149+
help.
145150

146-
If you installed Python from
147-
[Python Releases for Mac OS X](https://www.python.org/downloads/mac-osx/),
148-
then `/usr/local/bin/pip3` is a symlink to
149-
`/Library/Frameworks/Python.framework/Versions/3.6/bin/pip3`. So
150-
`generate-schema` is installed at
151+
<a name="MacOS11"></a>
152+
#### MacOS 11 (Big Sur)
153+
154+
I believe Big Sur comes preinstalled with Python 3.8. If you install
155+
`bigquery_schema_generator` using:
156+
157+
```
158+
$ pip3 install --user bigquery_schema_generator
159+
```
160+
161+
then the `generate-schema` wrapper script will be installed at:
162+
163+
```
164+
/User/{your-login}/Library/Python/3.8/bin/generate-schema
165+
```
166+
167+
<a name="MacOS1014"></a>
168+
#### MacOS 10.14 (Mojave)
169+
170+
This MacOS version comes with Python 2.7 only. To install Python 3, you can
171+
install using:
172+
173+
1)) Downloading the [macos installer directly from
174+
Python.org](https://www.python.org/downloads/macos/).
175+
176+
The python3 binary will be located at `/usr/local/bin/python3`, and the
177+
`/usr/local/bin/pip3` is a symlink to
178+
`/Library/Frameworks/Python.framework/Versions/3.6/bin/pip3`.
179+
180+
So running
181+
182+
```
183+
$ pip3 install --user bigquery_schema_generator
184+
```
185+
186+
will install `generate-schema` at
151187
`/Library/Frameworks/Python.framework/Versions/3.6/bin/generate-schema`.
152188

153189
The Python installer updates `$HOME/.bash_profile` to add
154190
`/Library/Frameworks/Python.framework/Versions/3.6/bin` to the `$PATH`
155191
environment variable. So you should be able to run the `generate-schema`
156192
command without typing in the full path.
157193

158-
You can install Python3 using
159-
[Homebrew](https://docs.brew.sh/Homebrew-and-Python). In this environment, the
160-
`generate-schema` script will probably be installed in `/usr/local/bin` but I'm
161-
not completely certain.
194+
2)) Using [Homebrew](https://docs.brew.sh/Homebrew-and-Python).
195+
196+
In this environment, the `generate-schema` script will probably be installed in
197+
`/usr/local/bin` but I'm not completely certain.
162198

163199
<a name="Usage"></a>
164200
## Usage
@@ -665,42 +701,56 @@ generator = SchemaGenerator(
665701
ignore_invalid_lines=ignore_invalid_lines,
666702
preserve_input_sort_order=preserve_input_sort_order,
667703
)
668-
generator.run(input_file=input_file, output_file=output_file)
704+
705+
FILENAME = "..."
706+
707+
with open(FILENAME) as input_file:
708+
generator.run(input_file=input_file, output_file=output_file)
669709
```
670710
671711
The `input_format` is one of `json`, `csv`, and `dict` as described in the
672712
[Input Format](#InputFormat) section above. The `input_file` must match the
673713
format given by this parameter.
674714
675-
See the `TestSchemaGeneratorDeduce.test_run_with_input_and_output()` test
676-
case in [examples/test_generate_schema.py](examples/test_generate_schema.py) for
677-
an example of an `input_file` of type `json`.
715+
See [generatorrun.py](examples/generatorrun.py) for an example.
678716
679-
<a name="SchemaGeneratorDeduceSchema"></a>
680-
#### `SchemaGenerator.deduce_schema()`
717+
<a name="SchemaGeneratorDeduceSchemaFromFile"></a>
718+
#### `SchemaGenerator.deduce_schema()` from File
681719
682720
If you need to process the generated schema programmatically, use the
683721
`deduce_schema()` method and process the resulting `schema_map` and `error_log`
684722
data structures like this:
685723
686724
```python
725+
import json
726+
import logging
727+
import sys
687728
from bigquery_schema_generator.generate_schema import SchemaGenerator
688-
...
729+
730+
FILENAME = "jsonfile.json"
731+
689732
generator = SchemaGenerator(
690-
...(same as above)...
733+
input_format='json',
734+
quoted_values_are_strings=True,
691735
)
692736
737+
with open(FILENAME) as file:
738+
schema_map, errors = generator.deduce_schema(file)
739+
693740
schema_map, error_logs = generator.deduce_schema(input_data=input_data)
694741
695-
# Print errors if desired.
696742
for error in error_logs:
697743
logging.info("Problem on line %s: %s", error['line_number'], error['msg'])
698744
699745
schema = generator.flatten_schema(schema_map)
700-
json.dump(schema, output_file, indent=2)
746+
json.dump(schema, sys.stdout, indent=2)
747+
print()
701748
```
702749
703-
The `deduce_schema()` now supports starting from an existing `schema_map`
750+
See [csvreader.py](examples/csvreader.py) and
751+
[jsoneader.py](examples/jsoneader.py) for 2 examples.
752+
753+
The `deduce_schema()` also supports starting from an existing `schema_map`
704754
instead of starting from scratch. This is the internal version of the
705755
`--existing_schema_path` functionality.
706756
@@ -714,9 +764,36 @@ schema_map2, error_logs = generator.deduce_schema(
714764
The `input_data` must match the `input_format` given in the constructor. The
715765
format is described in the [Input Format](#InputFormat) section above.
716766
717-
See the `TestSchemaGeneratorDeduce.test_deduce_schema_with_dict_input()` test
718-
case in [examples/test_generate_schema.py](examples/test_generate_schema.py) for
719-
an example of an `input_data` of type `dict`.
767+
<a name="SchemaGeneratorDeduceSchemaFromDict"></a>
768+
#### `SchemaGenerator.deduce_schema()` from Dict
769+
770+
If the JSON data set has already been read into memory into a Python `dict`
771+
object, the `SchemaGenerator` can process that too like this:
772+
773+
```Python
774+
import json
775+
import logging
776+
import sys
777+
from bigquery_schema_generator.generate_schema import SchemaGenerator
778+
779+
generator = SchemaGenerator(input_format='dict')
780+
input_data = [
781+
{
782+
's': 'string',
783+
'b': True,
784+
},
785+
{
786+
'd': '2021-08-18',
787+
'x': 3.1
788+
},
789+
]
790+
schema_map, error_logs = generator.deduce_schema(input_data)
791+
schema = generator.flatten_schema(schema_map)
792+
json.dump(schema, sys.stdout, indent=2)
793+
print()
794+
```
795+
796+
See [dictreader.py](examples/dictreader.py) for an example.
720797
721798
<a name="SchemaTypes"></a>
722799
## Schema Types
@@ -773,8 +850,8 @@ The following types are _not_ supported at all:
773850
* `BYTES`
774851
* `DATETIME` (unable to distinguish from `TIMESTAMP`)
775852
776-
<a name="TypeInferrence"></a>
777-
### Type Inferrence Rules
853+
<a name="TypeInference"></a>
854+
### Type Inference Rules
778855
779856
The `generate-schema` script attempts to emulate the various type conversion and
780857
compatibility rules implemented by **bq load**:
@@ -977,16 +1054,24 @@ now requires Python 3.6 or higher, I think mostly due to the use of f-strings.
9771054
9781055
I have tested it on:
9791056
1057+
* Ubuntu 22.04, Python 3.10.6
9801058
* Ubuntu 20.04, Python 3.8.5
9811059
* Ubuntu 18.04, Python 3.7.7
9821060
* Ubuntu 18.04, Python 3.6.7
9831061
* Ubuntu 17.10, Python 3.6.3
984-
* MacOS 10.14.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
985-
* MacOS 10.13.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
1062+
* MacOS 11.7.1 (Big Sur), Python 3.8.9
1063+
* MacOS 10.14.2 (Mojave), Python 3.6.4
1064+
* MacOS 10.13.2 (High Sierra), Python 3.6.4
9861065
9871066
The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7
9881067
and 3.8.
9891068
1069+
The unit tests are invoked with `$ make tests` target, and depends only on the
1070+
built-in Python `unittest` package.
1071+
1072+
The coding style check is invoked using `$ make flake8` and depends on the
1073+
`flake8` package. It can be installed using `$ pip3 install --user flake8`.
1074+
9901075
<a name="License"></a>
9911076
## License
9921077
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
__version__ = '1.5'
1+
__version__ = '1.5.1'

examples/csvfile.csv

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
name,surname,age
2+
John,Smith,23
3+
Michael,Johnson,27
4+
Maria,Smith,30
5+
Joanna,Anders,21

examples/csvreader.py

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
#!/usr/bin/env python3
2+
#
3+
# Example of using SchemaGenerator as a library instead of a command line
4+
# script. Read the CSV file named 'csvfile.csv' in the current directory, deduce
5+
# its schema, and print it out on the stdout.
6+
#
7+
# This is the equivalent of:
8+
# $ generate-schema
9+
# --input_format=csv
10+
# --infer_mode
11+
# --quoted_values_are_strings
12+
# --sanitize_names
13+
# < csvfile.csv
14+
15+
import json
16+
import logging
17+
import sys
18+
from bigquery_schema_generator.generate_schema import SchemaGenerator
19+
20+
FILENAME = "csvfile.csv"
21+
22+
generator = SchemaGenerator(
23+
input_format='csv',
24+
infer_mode=True,
25+
quoted_values_are_strings=True,
26+
sanitize_names=True,
27+
)
28+
29+
with open(FILENAME) as file:
30+
schema_map, errors = generator.deduce_schema(file)
31+
32+
for error in errors:
33+
logging.info("Problem on line %s: %s", error['line_number'], error['msg'])
34+
35+
schema = generator.flatten_schema(schema_map)
36+
json.dump(schema, sys.stdout, indent=2)
37+
print()
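
Note on the example above: it reports problems through `logging.info()`, and with Python's default logging configuration (root logger at `WARNING`) those messages are not emitted. The following is a minimal standard-library sketch of that behaviour, not part of this commit. The `errors` list is invented for illustration and only mimics the `line_number` and `msg` keys the example reads; the logger name `schema_example` is likewise hypothetical.

```python
import io
import logging

# Hypothetical error entries, shaped like the dicts the example iterates over
# (keys 'line_number' and 'msg'). The values are invented for illustration.
errors = [{"line_number": 3, "msg": "illustrative problem"}]

# Capture log output in memory so the effect of the level is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("schema_example")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

# WARNING is the root logger's default level. It is set explicitly here so the
# sketch does not depend on any ambient logging configuration.
logger.setLevel(logging.WARNING)
for error in errors:
    logger.info("Problem on line %s: %s", error["line_number"], error["msg"])
before = stream.getvalue()  # nothing was written: INFO is below WARNING

# Lowering the threshold to INFO lets the same call through.
logger.setLevel(logging.INFO)
for error in errors:
    logger.info("Problem on line %s: %s", error["line_number"], error["msg"])
after = stream.getvalue()

print(repr(before))
print(repr(after))
```

In a script like the one above, calling `logging.basicConfig(level=logging.INFO)` before the loop would have a similar effect on the root logger.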

examples/dictreader.py

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
#!/usr/bin/env python3
2+
#
3+
# Example of using SchemaGenerator programmatically instead of a command line
4+
# script. This example consumes a JSON data set that has *already* been read
5+
# into memory as a Python array of dict.
6+
7+
import json
8+
import sys
9+
from bigquery_schema_generator.generate_schema import SchemaGenerator
10+
11+
generator = SchemaGenerator(input_format='dict')
12+
input_data = [
13+
{
14+
's': 'string',
15+
'b': True,
16+
},
17+
{
18+
'd': '2021-08-18',
19+
'x': 3.1
20+
},
21+
]
22+
schema_map, error_logs = generator.deduce_schema(input_data)
23+
schema = generator.flatten_schema(schema_map)
24+
json.dump(schema, sys.stdout, indent=2)
25+
print()
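
Note on the README change in this commit: it contrasts `bq load` auto-detection, which examines the first 500 records, with this script, which reads every record. The following is a rough standard-library sketch of why that difference can matter. The `field_names()` helper is hypothetical and is not part of `bigquery_schema_generator`; it only collects top-level key names and does not reproduce how either tool infers types.

```python
import itertools
import json


def field_names(lines, limit=None):
    """Collect the top-level keys seen in newline-delimited JSON records.

    Hypothetical helper for illustration only. With limit=None every
    record is read; otherwise only the first `limit` records are read.
    """
    names = set()
    for line in itertools.islice(lines, limit):
        names.update(json.loads(line).keys())
    return names


# 600 records with a single field, then one record that adds a new field.
records = [json.dumps({"id": i}) for i in range(600)]
records.append(json.dumps({"id": 600, "note": "late field"}))

sampled = field_names(records, limit=500)  # sample of the first 500 records
full = field_names(records)                # scan of all records

# Fields that a 500-record sample would never see.
print(sorted(full - sampled))
```

Under these assumptions the sample never sees the `note` field, which is the kind of gap that scanning all records avoids.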
