Commit 15b68d1

Merge pull request #45 from bxparks/develop: merge v1.0 into master

2 parents e37cec6 + 3bf559d, commit 15b68d1

File tree: 11 files changed, +233 −110 lines
Lines changed: 48 additions & 0 deletions

@@ -0,0 +1,48 @@
+# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
+
+name: BigQuery Schema Generator CI
+
+on:
+  push:
+    branches: [ develop ]
+  pull_request:
+    branches: [ develop ]
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        # 3.5 does not support f-strings
+        python-version: [3.6, 3.7, 3.8]
+
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v1
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        # pip install -r requirements.txt
+    - name: Lint with flake8
+      run: |
+        pip install flake8
+        # Stop the build for most python errors.
+        # W503 and W504 are both enabled by default and contradictory, so we
+        # have to suppress one of them.
+        # E501 complains that 80 > 79 columns, but 80 is the default line wrap
+        # in vim.
+        flake8 . --count --ignore E501,W503 --show-source --statistics
+
+        # Exit-zero treats all errors as warnings. Vim editor defaults to 80.
+        # The complexity warning is not useful... in fact the whole thing is
+        # not useful, so turn it off.
+        # flake8 . --count --exit-zero --max-complexity=10 --max-line-length=80
+        #   --statistics
+    - name: Test with unittest
+      run: |
+        python -m unittest
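
The last step of the workflow above runs the test suite with plain `python -m unittest`, i.e. default test discovery. As a point of reference, a minimal standalone sketch of the equivalent call through the unittest API is shown here; the start directory and the `test*.py` pattern are the library defaults, assumed rather than taken from this repository's configuration.

    # Sketch: roughly what `python -m unittest` does with no arguments --
    # discover test modules under the start directory and run them.
    import sys
    import unittest

    def run_tests(start_dir='.', pattern='test*.py'):
        suite = unittest.TestLoader().discover(start_dir, pattern=pattern)
        result = unittest.TextTestRunner(verbosity=2).run(suite)
        return 0 if result.wasSuccessful() else 1

    if __name__ == '__main__':
        sys.exit(run_tests())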

CHANGELOG.md

Lines changed: 8 additions & 1 deletion

@@ -1,9 +1,16 @@
 # Changelog
 
 * Unreleased
+* 1.0 (2020-04-04)
+    * Fix `--sanitize_names` for recursive RECORD fields (Thanks riccardomc@,
+      see #43).
+    * Clean up how unit tests are run, trying my best to figure out
+      Python's convoluted package importing mechanism.
+    * Add GitHub Actions continuous integration pipelines with flake8 checks and
+      automated unit testing.
 * 0.5.1 (2019-06-17)
     * Add `--sanitize_names` to convert invalid characters in column names and
-      to shorten them if too long. (See #33; thanks @jonwarghed).
+      to shorten them if too long. (See #33; thanks jonwarghed@).
 * 0.5 (2019-06-06)
     * Add input and output parameters to run() to allow the client code using
       `SchemaGenerator` to redirect the input and output files. (See #30).
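
The `--sanitize_names` fix in the 1.0 entry above concerns the column-name cleanup that now also reaches nested RECORD fields. A minimal sketch of the transformation itself is shown here, reusing the character class and truncation that appear in the generate_schema.py diff further down; the `sanitize_name` helper is illustrative and is not the library's API.

    # Sketch of what --sanitize_names does to a single column name: replace
    # every character BigQuery disallows with '_' and truncate the result.
    import re

    FIELD_NAME_MATCHER = re.compile(r'[^a-zA-Z0-9_]')

    def sanitize_name(name):
        return FIELD_NAME_MATCHER.sub('_', name)[0:127]

    print(sanitize_name('order-id#1'))         # order_id_1
    print(sanitize_name('nested.field/name'))  # nested_field_name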

Makefile

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+.PHONY: tests
+
+tests:
+	python3 -m unittest

README.md

Lines changed: 31 additions & 12 deletions

@@ -12,7 +12,7 @@ $ generate-schema < file.data.json > file.schema.json
 $ generate-schema --input_format csv < file.data.csv > file.schema.json
 ```
 
-Version: 0.5.1 (2019-06-19)
+Version: 1.0 (2020-04-04)
 
 ## Background
 
@@ -44,18 +44,33 @@ the input dataset.
 
 ## Installation
 
-Install from [PyPI](https://pypi.python.org/pypi) repository using `pip3`.
-If you want to install the package for your entire system globally, use
+Install from [PyPI](https://pypi.python.org/pypi) repository using `pip3`. There
+are too many ways to install packages in Python. The following are listed from
+most to least recommended:
+
+1) If you are using a virtual environment (such as
+[venv](https://docs.python.org/3/library/venv.html)), then use:
 ```
-$ sudo -H pip3 install bigquery_schema_generator
+$ pip3 install bigquery_schema_generator
 ```
-If you are using a virtual environment (such as
-[venv](https://docs.python.org/3/library/venv.html)), then you don't need
-the `sudo` coommand, and you can just type:
+
+2) If you aren't using a virtual environment, you can install into
+your local Python directory:
+
 ```
-$ pip3 install bigquery_schema_generator
+$ pip3 install --user bigquery_schema_generator
 ```
 
+3) If you want to install the package for your entire system globally, use
+```
+$ sudo -H pip3 install bigquery_schema_generator
+```
+but realize that you will be running code from PyPI as `root`, so this has
+security implications.
+
+Sometimes your Python environment gets into a complete mess and the `pip3`
+command won't work. Try typing `python3 -m pip` instead.
+
 A successful install should print out something like the following (the version
 number may be different):
 ```
@@ -644,16 +659,20 @@ took 67s on a Dell Precision M4700 laptop with an Intel Core i7-3840QM CPU @
 
 ## System Requirements
 
-This project was initially developed on Ubuntu 17.04 using Python 3.5.3. I have
-tested it on:
+This project was initially developed on Ubuntu 17.04 using Python 3.5.3, but it
+now requires Python 3.6 or higher, mostly (I think) due to the use of f-strings.
+
+I have tested it on:
 
+* Ubuntu 18.04, Python 3.7.7
 * Ubuntu 18.04, Python 3.6.7
 * Ubuntu 17.10, Python 3.6.3
-* Ubuntu 17.04, Python 3.5.3
-* Ubuntu 16.04, Python 3.5.2
 * MacOS 10.14.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
 * MacOS 10.13.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
 
+The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7,
+and 3.8.
+
 ## Changelog
 
 See [CHANGELOG.md](CHANGELOG.md).

bigquery_schema_generator/generate_schema.py

Lines changed: 74 additions & 50 deletions

@@ -73,6 +73,9 @@ class SchemaGenerator:
     # Detect floats inside quotes.
     FLOAT_MATCHER = re.compile(r'^[-]?\d+\.\d+$')
 
+    # Valid field name characters of BigQuery
+    FIELD_NAME_MATCHER = re.compile(r'[^a-zA-Z0-9_]')
+
     def __init__(self,
                  input_format='json',
                  infer_mode=False,
@@ -114,8 +117,8 @@ def __init__(self,
 
         # This option generally wants to be turned on as any inferred schema
         # will not be accepted by `bq load` when it contains illegal characters.
-        # Characters such as #, / or -. Neither will it be accepted if the column name
-        # in the schema is larger than 128 characters.
+        # Characters such as #, / or -. Neither will it be accepted if the
+        # column name in the schema is larger than 128 characters.
         self.sanitize_names = sanitize_names
 
     def log_error(self, msg):
@@ -323,7 +326,6 @@ def get_schema_entry(self, key, value):
         if not value_mode or not value_type:
             return None
 
-        # yapf: disable
         if value_type == 'RECORD':
             # recursively figure out the RECORD
             fields = OrderedDict()
@@ -332,39 +334,48 @@ def get_schema_entry(self, key, value):
             else:
                 for val in value:
                     self.deduce_schema_for_line(val, fields)
-            schema_entry = OrderedDict([('status', 'hard'),
-                                        ('filled', True),
-                                        ('info', OrderedDict([
-                                            ('fields', fields),
-                                            ('mode', value_mode),
-                                            ('name', key),
-                                            ('type', value_type),
-                                        ]))])
+            # yapf: disable
+            schema_entry = OrderedDict([
+                ('status', 'hard'),
+                ('filled', True),
+                ('info', OrderedDict([
+                    ('fields', fields),
+                    ('mode', value_mode),
+                    ('name', key),
+                    ('type', value_type),
+                ])),
+            ])
         elif value_type == '__null__':
-            schema_entry = OrderedDict([('status', 'soft'),
-                                        ('filled', False),
-                                        ('info', OrderedDict([
-                                            ('mode', 'NULLABLE'),
-                                            ('name', key),
-                                            ('type', 'STRING'),
-                                        ]))])
+            schema_entry = OrderedDict([
+                ('status', 'soft'),
+                ('filled', False),
+                ('info', OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', key),
+                    ('type', 'STRING'),
+                ])),
+            ])
         elif value_type == '__empty_array__':
-            schema_entry = OrderedDict([('status', 'soft'),
-                                        ('filled', False),
-                                        ('info', OrderedDict([
-                                            ('mode', 'REPEATED'),
-                                            ('name', key),
-                                            ('type', 'STRING'),
-                                        ]))])
+            schema_entry = OrderedDict([
+                ('status', 'soft'),
+                ('filled', False),
+                ('info', OrderedDict([
+                    ('mode', 'REPEATED'),
+                    ('name', key),
+                    ('type', 'STRING'),
+                ])),
+            ])
         elif value_type == '__empty_record__':
-            schema_entry = OrderedDict([('status', 'soft'),
-                                        ('filled', False),
-                                        ('info', OrderedDict([
-                                            ('fields', OrderedDict()),
-                                            ('mode', value_mode),
-                                            ('name', key),
-                                            ('type', 'RECORD'),
-                                        ]))])
+            schema_entry = OrderedDict([
+                ('status', 'soft'),
+                ('filled', False),
+                ('info', OrderedDict([
+                    ('fields', OrderedDict()),
+                    ('mode', value_mode),
+                    ('name', key),
+                    ('type', 'RECORD'),
+                ])),
+            ])
         else:
             # Empty fields are returned as empty strings, and must be treated as
             # a (soft String) to allow clobbering by subsquent non-empty fields.
@@ -374,13 +385,15 @@ def get_schema_entry(self, key, value):
             else:
                 status = 'hard'
                 filled = True
-            schema_entry = OrderedDict([('status', status),
-                                        ('filled', filled),
-                                        ('info', OrderedDict([
-                                            ('mode', value_mode),
-                                            ('name', key),
-                                            ('type', value_type),
-                                        ]))])
+            schema_entry = OrderedDict([
+                ('status', status),
+                ('filled', filled),
+                ('info', OrderedDict([
+                    ('mode', value_mode),
+                    ('name', key),
+                    ('type', value_type),
+                ])),
+            ])
         # yapf: enable
         return schema_entry
 
@@ -435,8 +448,8 @@ def infer_value_type(self, value):
             # Implement the same type inference algorithm as 'bq load' for
             # quoted values that look like ints, floats or bools.
             if self.INTEGER_MATCHER.match(value):
-                if int(value) < self.INTEGER_MIN_VALUE or \
-                        self.INTEGER_MAX_VALUE < int(value):
+                if (int(value) < self.INTEGER_MIN_VALUE
+                        or self.INTEGER_MAX_VALUE < int(value)):
                     return 'QFLOAT'  # quoted float
                 else:
                     return 'QINTEGER'  # quoted integer
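
For context, the check above classifies a quoted numeric string as a quoted integer unless it falls outside BigQuery's INTEGER range, in which case it is treated as a quoted float. A standalone sketch follows; the regex and the INT64 bounds are assumptions standing in for the INTEGER_MATCHER, INTEGER_MIN_VALUE and INTEGER_MAX_VALUE class constants, which are not shown in this diff.

    # Sketch of the quoted-number classification, outside the class. The
    # regex and bounds below are assumed, not copied from the source.
    import re

    INTEGER_MATCHER = re.compile(r'^[-]?\d+$')
    INTEGER_MIN_VALUE = -2**63       # assumed BigQuery INT64 lower bound
    INTEGER_MAX_VALUE = 2**63 - 1    # assumed BigQuery INT64 upper bound

    def classify_quoted_number(value):
        if INTEGER_MATCHER.match(value):
            if (int(value) < INTEGER_MIN_VALUE
                    or INTEGER_MAX_VALUE < int(value)):
                return 'QFLOAT'      # quoted float
            return 'QINTEGER'        # quoted integer
        return None                  # not an integer-looking string

    print(classify_quoted_number('123'))                   # QINTEGER
    print(classify_quoted_number('99999999999999999999'))  # QFLOAT
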
@@ -618,11 +631,13 @@ def is_string_type(thetype):
     ]
 
 
-def flatten_schema_map(schema_map,
-                       keep_nulls=False,
-                       sorted_schema=True,
-                       infer_mode=False,
-                       sanitize_names=False):
+def flatten_schema_map(
+        schema_map,
+        keep_nulls=False,
+        sorted_schema=True,
+        infer_mode=False,
+        sanitize_names=False,
+):
     """Converts the 'schema_map' into a more flatten version which is
     compatible with BigQuery schema.
 
@@ -647,7 +662,8 @@ def flatten_schema_map(schema_map,
         else schema_map.items()
     for name, meta in map_items:
         # Skip over fields which have been explicitly removed
-        if not meta: continue
+        if not meta:
+            continue
 
         status = meta['status']
         filled = meta['filled']
@@ -679,16 +695,24 @@
                 else:
                     # Recursively flatten the sub-fields of a RECORD entry.
                     new_value = flatten_schema_map(
-                        value, keep_nulls, sorted_schema, sanitize_names)
+                        schema_map=value,
+                        keep_nulls=keep_nulls,
+                        sorted_schema=sorted_schema,
+                        infer_mode=infer_mode,
+                        sanitize_names=sanitize_names,
+                    )
             elif key == 'type' and value in ['QINTEGER', 'QFLOAT', 'QBOOLEAN']:
+                # Convert QINTEGER -> INTEGER, similarly for QFLOAT and QBOOLEAN.
                 new_value = value[1:]
             elif key == 'mode':
                 if infer_mode and value == 'NULLABLE' and filled:
                     new_value = 'REQUIRED'
                 else:
                     new_value = value
            elif key == 'name' and sanitize_names:
-                new_value = re.sub('[^a-zA-Z0-9_]', '_', value)[0:127]
+                new_value = SchemaGenerator.FIELD_NAME_MATCHER.sub(
+                    '_', value,
+                )[0:127]
             else:
                 new_value = value
             new_info[key] = new_value

setup.py

Lines changed: 16 additions & 15 deletions

@@ -4,28 +4,29 @@
 try:
     import pypandoc
     long_description = pypandoc.convert('README.md', 'rst', format='md')
-except:
+except:  # noqa: E722
     # If unable to convert, try inserting the raw README.md file.
     try:
         with open('README.md', encoding="utf-8") as f:
             long_description = f.read()
-    except:
+    except:  # noqa: E722
         # If all else fails, use some reasonable string.
         long_description = 'BigQuery schema generator.'
 
-setup(name='bigquery-schema-generator',
-      version='0.5.1',
-      description='BigQuery schema generator from JSON or CSV data',
-      long_description=long_description,
-      url='https://github.com/bxparks/bigquery-schema-generator',
-      author='Brian T. Park',
-      author_email='[email protected]',
-      license='Apache 2.0',
-      packages=['bigquery_schema_generator'],
-      python_requires='~=3.5',
-      entry_points={
-          'console_scripts': [
+setup(
+    name='bigquery-schema-generator',
+    version='1.0',
+    description='BigQuery schema generator from JSON or CSV data',
+    long_description=long_description,
+    url='https://github.com/bxparks/bigquery-schema-generator',
+    author='Brian T. Park',
+    author_email='[email protected]',
+    license='Apache 2.0',
+    packages=['bigquery_schema_generator'],
+    python_requires='~=3.6',
+    entry_points={
+        'console_scripts': [
         'generate-schema = bigquery_schema_generator.generate_schema:main'
     ]
-    }
+    },
 )
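
The `console_scripts` entry point above is what creates the `generate-schema` command at install time. Roughly, the generated wrapper does the equivalent of the following sketch, assuming the package is installed and importable.

    # Sketch: a setuptools console_scripts wrapper essentially imports the
    # target and calls it, passing its return value to sys.exit().
    import sys

    from bigquery_schema_generator.generate_schema import main

    if __name__ == '__main__':
        sys.exit(main())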
