Commit f5f8696

Merge pull request #2 from bxparks/develop
Initial release of version 0.1 to PyPI.
2 parents 409bebd + 10a4aec commit f5f8696

File tree

9 files changed: +95 −41 lines changed

.gitignore

Lines changed: 19 additions & 0 deletions

@@ -2,3 +2,22 @@
 __pycache__/
 *.py[cod]
 *$py.class
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST

README.md

Lines changed: 18 additions & 34 deletions

@@ -27,10 +27,10 @@ When the auto-detect feature is used, the BigQuery data importer examines only
 the first 100 records of the input data. In many cases, this is sufficient
 because the data records were dumped from another database and the exact schema
 of the source table was known. However, for data extracted from a service
-(e.g. using a REST API) the record fields were organically at later dates. In
-this case, the first 100 records do not contain fields which are present in
-later records. The **bq load** auto-detection fails and the data fails to
-load.
+(e.g. using a REST API) the record fields could have been organically added
+at later dates. In this case, the first 100 records do not contain fields which
+are present in later records. The **bq load** auto-detection fails and the data
+fails to load.

 The **bq load** tool does not support the ability to process the entire dataset
 to determine a more accurate schema. This script fills in that gap. It

@@ -119,19 +119,33 @@ With the ``keep_nulls``, the resulting schema file will be:
 ]
 ```

+Example:
+
+```
+$ generate_schema.py --keep_nulls < file.data.json > file.schema.json
+```
+
 #### Debugging Interval

 By default, the `generate_schema.py` script prints a short progress message
 every 1000 lines of input data. This interval can be changed using the
 `--debugging_interval` flag.

+```
+$ generate_schema.py --debugging_interval 1000 < file.data.json > file.schema.json
+```
+
 #### Debugging Map

 Instead of printing out the BigQuery schema, the `--debugging_map` prints out
 the bookkeeping metadata map which is used internally to keep track of the
 various fields and theirs types that was inferred using the data file. This
 flag is intended to be used for debugging.

+```
+$ generate_schema.py --debugging_map < file.data.json > file.schema.json
+```
+
 ## Examples

 Here is an example of a single JSON data record on the STDIN:

@@ -195,36 +209,6 @@ $ cat file.schema.json
 ]
 ```

-## Unit Tests
-
-Instead of embeddeding the input data records and the expected schema file into
-the `test_generate_schema.py` file, we placed them into the `testdata.txt`
-file. This has two advantages:
-
-* we can more easily update the input and output data records, and
-* the `testdata.txt` data could be reused for versions written in other languages
-
-The output of `test_generate_schema.py` should look something like this:
-```
-----------------------------------------------------------------------
-Ran 4 tests in 0.002s
-
-OK
-Test chunk 1: First record: { "s": null, "a": [], "m": {} }
-Test chunk 2: First record: { "s": null, "a": [], "m": {} }
-Test chunk 3: First record: { "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
-Test chunk 4: First record: { "a": [1, 2], "r": { "r0": "r0", "r1": "r1" } }
-Test chunk 5: First record: { "s": "string", "x": 3.2, "i": 3, "b": true, "a": [ "a", 1] }
-Test chunk 6: First record: { "a": [1, 2] }
-Test chunk 7: First record: { "r" : { "a": [1, 2] } }
-Test chunk 8: First record: { "i": 1 }
-Test chunk 9: First record: { "i": null }
-Test chunk 10: First record: { "i": 3 }
-Test chunk 11: First record: { "i": [1, 2] }
-Test chunk 12: First record: { "r" : { "i": 3 } }
-Test chunk 13: First record: { "r" : [{ "i": 4 }] }
-```
-
 ## System Requirements

 This project was developed on Ubuntu 17.04 using Python 3.5. It is likely
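The "process the entire dataset" idea described in the README diff above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the actual `SchemaGenerator` implementation: the function names (`infer_type`, `scan_records`) are invented here, and the real script's type-merging rules are considerably more elaborate.

```python
import json

def infer_type(value):
    """Map a JSON scalar to a BigQuery-like type name (simplified)."""
    if isinstance(value, bool):   # check bool before int: True is an int in Python
        return 'BOOLEAN'
    if isinstance(value, int):
        return 'INTEGER'
    if isinstance(value, float):
        return 'FLOAT'
    return 'STRING'

def scan_records(lines):
    """Scan ALL newline-delimited JSON records, not just the first 100,
    so fields added organically in later records are still discovered."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for name, value in record.items():
            if value is None:
                continue  # a null carries no type information
            inferred = infer_type(value)
            if name not in schema:
                schema[name] = inferred
            elif schema[name] == 'INTEGER' and inferred == 'FLOAT':
                schema[name] = 'FLOAT'  # widen when both types are seen
    return schema

lines = [
    '{"name": "alice", "age": 30}',
    '{"name": "bob", "age": 31.5, "city": "SF"}',  # "city" appears only later
]
print(scan_records(lines))
# → {'name': 'STRING', 'age': 'FLOAT', 'city': 'STRING'}
```

A first-100-records scanner would miss `city` entirely if it only appeared after record 100; scanning every record is the gap this project fills.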

bigquery_schema_generator/__init__.py

Whitespace-only changes.

generator/generate_schema.py renamed to bigquery_schema_generator/generate_schema.py

Lines changed: 2 additions & 2 deletions

@@ -1,4 +1,4 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
 #
 # Copyright 2017 Brian T. Park
 #

@@ -18,7 +18,7 @@
 Unlike the BigQuery importer which uses only the first 100 records, this script
 uses all available records in the data file.

-Usage: generator_schema.py [-h] [flags ...] < file.data.json > file.schema.json
+Usage: generate_schema.py [-h] [flags ...] < file.data.json > file.schema.json

 * file.data.json is a newline-delimited JSON data file, one JSON object per line.
 * file.schema.json is the schema definition of the table.

setup.py

Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@
+from setuptools import setup
+
+# Convert README.md to README.rst because PyPI does not support Markdown.
+try:
+    import pypandoc
+    long_description = pypandoc.convert('README.md', 'rst')
+except OSError:
+    with open('README.md', encoding="utf-8") as f:
+        long_description = f.read()
+
+setup(name='bigquery-schema-generator',
+      version='0.1',
+      description='BigQuery schema generator',
+      long_description=long_description,
+      url='https://github.com/bxparks/bigquery-schema-generator',
+      author='Brian T. Park',
+      author_email='[email protected]',
+      license='Apache 2.0',
+      packages=['bigquery_schema_generator'],
+      python_requires='~=3.5')
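One caveat on the fallback in the new `setup.py`: `pypandoc` raises an `OSError` when the pandoc binary is missing, but importing `pypandoc` when the package itself is not installed raises an `ImportError`, which the `except OSError` clause would not catch. A sketch of a more defensive variant follows; `load_long_description` is a hypothetical helper name, not part of the commit.

```python
def load_long_description(path='README.md'):
    """Prefer reStructuredText via pypandoc; fall back to raw Markdown.

    The broad except is deliberate: a packaging script should degrade
    gracefully whether pypandoc is missing (ImportError), the pandoc
    binary is missing (OSError), or the pypandoc API has changed.
    """
    try:
        import pypandoc
        return pypandoc.convert(path, 'rst')
    except Exception:
        with open(path, encoding='utf-8') as f:
            return f.read()
```

Either branch returns a usable `long_description`, so `setup()` never fails just because the Markdown-to-reST conversion is unavailable.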

tests/README.md

Lines changed: 31 additions & 0 deletions

@@ -0,0 +1,31 @@
+# Tests
+
+Instead of embedding the input data records and the expected schema into
+the `test_generate_schema.py` file, we placed them into the `testdata.txt`
+file which is parsed by the unit test program. This has two advantages:
+
+* we can more easily update the input and output data records, and
+* the `testdata.txt` data can be reused for versions written in other languages
+
+The output of `test_generate_schema.py` should look something like this:
+```
+----------------------------------------------------------------------
+Ran 4 tests in 0.002s
+
+OK
+Test chunk 1: First record: { "s": null, "a": [], "m": {} }
+Test chunk 2: First record: { "s": null, "a": [], "m": {} }
+Test chunk 3: First record: { "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
+Test chunk 4: First record: { "a": [1, 2], "r": { "r0": "r0", "r1": "r1" } }
+Test chunk 5: First record: { "s": "string", "x": 3.2, "i": 3, "b": true, "a": [ "a", 1] }
+Test chunk 6: First record: { "a": [1, 2] }
+Test chunk 7: First record: { "r" : { "a": [1, 2] } }
+Test chunk 8: First record: { "i": 1 }
+Test chunk 9: First record: { "i": null }
+Test chunk 10: First record: { "i": 3 }
+Test chunk 11: First record: { "i": [1, 2] }
+Test chunk 12: First record: { "r" : { "i": 3 } }
+Test chunk 13: First record: { "r" : [{ "i": 4 }] }
+```

generator/data_reader.py renamed to tests/data_reader.py

Lines changed: 2 additions & 2 deletions

@@ -1,4 +1,4 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
 #
 # Copyright 2017 Brian T. Park
 #

@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Parses the 'testdata.txt' date file used by the 'generate_schema_test.py'
+Parses the 'testdata.txt' date file used by the 'test_generate_schema.py'
 program.

 Usage:

generator/test_generate_schema.py renamed to tests/test_generate_schema.py

Lines changed: 3 additions & 3 deletions

@@ -1,4 +1,4 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
 #
 # Copyright 2017 Brian T. Park
 #

@@ -18,9 +18,9 @@
 import os
 import json
 from collections import OrderedDict
+from bigquery_schema_generator.generate_schema import SchemaGenerator
+from bigquery_schema_generator.generate_schema import sort_schema
 from data_reader import DataReader
-from generate_schema import SchemaGenerator
-from generate_schema import sort_schema


 class TestSchemaGenerator(unittest.TestCase):
File renamed without changes.
