Commit 82ce03d

Add 'generate-schema' script, installed by 'pip'. Update README.md with different ways to invoke script. Update version to 0.1.1.
1 parent f5f8696 commit 82ce03d

3 files changed: +47 -9 lines changed

README.md

Lines changed: 44 additions & 8 deletions
@@ -39,45 +39,81 @@ in JSON format on the STDOUT. This schema file can be fed back into the **bq
 load** tool to create a table that is more compatible with the data fields in
 the input dataset.
 
+## Installation
+
+Install from [PyPI](https://pypi.python.org/pypi) repository using:
+```
+$ pip3 install bigquery_schema_generator
+```
+
 ## Usage
 
 The `generate_schema.py` script accepts a newline-delimited JSON data file on
 the STDIN. (CSV is not supported currently.) It scans every record in the
 input data file to deduce the table's schema. It prints the JSON formatted
-schema file on the STDOUT:
+schema file on the STDOUT. There are at least 3 ways to run this script:
+
+If you installed using `pip3`, then it should have installed a small helper
+script named `generate-schema` in your local `./bin` directory of your current
+environment (depending on whether you are using a virtual environment).
+
 ```
-$ generate_schema.py < file.data.json > file.schema.json
+$ generate-schema < file.data.json > file.schema.json
 ```
 
-The schema file can be used in the **bq** command using:
+You can invoke the module directly using:
+```
+$ python3 -m bigquery_schema_generator.generate_schema < file.data.json > file.schema.json
+```
+
+If you retrieved this code from its [GitHub
+repository](https://github.com/bxparks/bigquery-schema-generator), then you can invoke
+the Python script directly:
+```
+$ ./generate_schema.py < file.data.json > file.schema.json
+```
+
+The resulting schema file can be used in the **bq load** command using the
+`--schema` flag:
 ```
 $ bq load --schema file.schema.json mydataset.mytable file.data.json
 ```
 
 where `mydataset.mytable` is the target table in BigQuery.
 
-A useful flag for **bq load** is `--ignore_unknown_values`, which causes `bq load`
+A useful flag for **bq load** is `--ignore_unknown_values`, which causes **bq load**
 to ignore fields in the input data which are not defined in the schema. When
 `generate_schema.py` detects an inconsistency in the definition of a particular
 field in the input data, it removes the field from the schema definition.
 Without the `--ignore_unknown_values`, the **bq load** fails when the
 inconsistent data record is read.
 
 After the BigQuery table is loaded, the schema can be retrieved using:
+
 ```
 $ bq show --schema mydataset.mytable | python -m json.tool
 ```
+
 (The `python -m json.tool` command will pretty-print the JSON formatted schema
 file.) This schema file should be identical to `file.schema.json`.
 
 ### Options
 
 The `generate_schema.py` script supports a handful of command line flags:
 
+* `--help` Prints the usage with the list of supported flags.
 * `--keep_nulls` Print the schema for null values, empty arrays or empty records.
 * `--debugging_interval lines` Number of lines between heartbeat debugging messages. Default 1000.
 * `--debugging_map` Print the metadata schema map for debugging purposes
 
+#### Help
+
+Print the built-in help strings:
+
+```
+$ ./generate_schema.py --help
+```
+
 #### Null Values
 
 Normally when the input data file contains a field which has a null, empty
@@ -122,7 +158,7 @@ With the ``keep_nulls``, the resulting schema file will be:
 Example:
 
 ```
-$ generate_schema.py --keep_nulls < file.data.json > file.schema.json
+$ ./generate_schema.py --keep_nulls < file.data.json > file.schema.json
 ```
 
 #### Debugging Interval
@@ -132,7 +168,7 @@ every 1000 lines of input data. This interval can be changed using the
 `--debugging_interval` flag.
 
 ```
-$ generate_schema.py --debugging_interval 1000 < file.data.json > file.schema.json
+$ ./generate_schema.py --debugging_interval 1000 < file.data.json > file.schema.json
 ```
 
 #### Debugging Map
@@ -143,7 +179,7 @@ various fields and theirs types that was inferred using the data file. This
 flag is intended to be used for debugging.
 
 ```
-$ generate_schema.py --debugging_map < file.data.json > file.schema.json
+$ ./generate_schema.py --debugging_map < file.data.json > file.schema.json
 ```
 
 ## Examples
@@ -212,7 +248,7 @@ $ cat file.schema.json
 ## System Requirements
 
 This project was developed on Ubuntu 17.04 using Python 3.5. It is likely
-compatible with other python environments but I have not yet verified those.
+compatible with other Python environments but I have not yet verified those.
 
 ## Author
 
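
Review note on the README text in this diff: the Usage section says the script scans every record to deduce the schema, and the `--ignore_unknown_values` paragraph says a field with an inconsistent definition is removed from the schema. The toy Python sketch below illustrates only that idea. It is not the project's `generate_schema.py` code: the names `bq_type` and `deduce_schema` are hypothetical, only scalar values are handled, and every field is assumed to be `NULLABLE`.

```python
import json


def bq_type(value):
    """Map a JSON scalar to a BigQuery type name, or None if unsupported here."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        return "STRING"
    return None  # nulls, arrays and records are ignored in this toy


def deduce_schema(lines):
    """Scan newline-delimited JSON records and return a schema as a list of dicts."""
    types = {}       # field name -> type seen so far
    dropped = set()  # fields whose records disagreed on the type
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for name, value in json.loads(line).items():
            field_type = bq_type(value)
            if field_type is None or name in dropped:
                continue
            if types.setdefault(name, field_type) != field_type:
                # Inconsistent definition: remove the field from the schema.
                del types[name]
                dropped.add(name)
    return [
        {"mode": "NULLABLE", "name": name, "type": field_type}
        for name, field_type in sorted(types.items())
    ]


sample = [
    '{"name": "a", "count": 1, "ok": true}',
    '{"name": "b", "count": "two"}',
]
# "count" is INTEGER in one record and STRING in the other, so it is dropped.
print(json.dumps(deduce_schema(sample), indent=2))
```

Passing `sys.stdin` instead of the sample list would mimic the `< file.data.json` invocation shown in the README.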

scripts/generate-schema

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+python3 -m bigquery_schema_generator.generate_schema
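
Review note: as committed, this one-line script does not appear to pass its own command line arguments on to the module, so a flag such as `--keep_nulls` given to `generate-schema` would presumably be dropped. The README examples that use flags all call `./generate_schema.py` instead. Below is a minimal Python sketch of a launcher that forwards arguments. The names `MODULE`, `build_command` and `main` are hypothetical, and using `sys.executable` in place of a literal `python3` is an assumption made so the child runs under the same interpreter.

```python
import subprocess
import sys

# Module named in the committed script.
MODULE = "bigquery_schema_generator.generate_schema"


def build_command(args):
    """Return the argv list equivalent to `python3 -m MODULE <args...>`."""
    return [sys.executable, "-m", MODULE] + list(args)


def main(argv=None):
    """Run the module with the caller's flags and return its exit status."""
    args = sys.argv[1:] if argv is None else argv
    # stdin and stdout are inherited by the child process, so shell
    # redirection such as `< file.data.json > file.schema.json` still applies.
    return subprocess.call(build_command(args))


# An installed launcher script would end with: sys.exit(main())
```

With this shape, `generate-schema --keep_nulls < file.data.json` would run the module with `--keep_nulls` intact.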

setup.py

Lines changed: 2 additions & 1 deletion
@@ -9,12 +9,13 @@
 long_description = f.read()
 
 setup(name='bigquery-schema-generator',
-version='0.1',
+version='0.1.1',
 description='BigQuery schema generator',
 long_description=long_description,
 url='https://github.com/bxparks/bigquery-schema-generator',
 author='Brian T. Park',
 author_email='[email protected]',
 license='Apache 2.0',
 packages=['bigquery_schema_generator'],
+scripts=['scripts/generate-schema'],
 python_requires='~=3.5')
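
Review note: `python_requires='~=3.5'` uses the compatible-release operator from PEP 440, which for a two-component version is equivalent to `>=3.5, ==3.*`. The sketch below spells that rule out for plain dotted numeric versions only. The helper names are hypothetical, and it ignores pre-releases, epochs and the other PEP 440 features that `pip` itself handles.

```python
def parse_release(version):
    """Turn a plain dotted version such as '3.5.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(spec, candidate):
    """Return True if `candidate` satisfies `~=spec` for numeric releases."""
    wanted = parse_release(spec)
    if len(wanted) < 2:
        raise ValueError("~= needs at least two version components")
    found = parse_release(candidate)
    # Pad with zeros so that '3.5' compares equal to '3.5.0'.
    found = found + (0,) * (len(wanted) - len(found))
    # All but the last component of the spec must match exactly,
    # and the candidate must not be older than the spec.
    return found[:len(wanted) - 1] == wanted[:-1] and found >= wanted


print(is_compatible("3.5", "3.6.1"))  # True
print(is_compatible("3.5", "3.4.9"))  # False
print(is_compatible("3.5", "4.0"))    # False
```

Under this reading, the package installs on Python 3.5 and later 3.x releases but not on a future 4.x.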
