Commit da3609f

Merge pull request #72 from bxparks/develop
merge v1.4.1 into master
2 parents acaa74b + 3b33efa commit da3609f

7 files changed: +314 -22 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,6 +1,10 @@
 # Changelog

 * Unreleased
+* 1.4.1 (2021-08-23)
+    * Add documentation for the `input_format='dict'` option.
+    * Add additional inpout format 'json' and 'dict' test cases.
+    * Maintenance release, no functional change in core code.
 * 1.4 (2020-12-09)
     * Add 'dict' as a third `input_format` when `SchemaGenerator` is used as a
       library. This can be useful when the data has already been transformed

DEVELOPER.md

Lines changed: 5 additions & 2 deletions
@@ -30,14 +30,17 @@ $ sudo -H pip3 install setuptools wheel twine

 ### Steps

-1. Edit `setup.py` and increment the `version`.
+1. Increment the version numbers in:
+    * `version.py`
+    * `README.md`
+    * `CHANGELOG.md`
 1. Push all changes to `develop` branch.
 1. Create a GitHub pull request (PR) from `develop` into `master` branch.
 1. Merge the PR into `master`.
 1. Create a new Release in GitHub with the new tag label.
 1. Create the dist using `python3 setup.py sdist`.
 1. Upload to PyPI using `twine upload
    dist/bigquery-schema-generator-{version}.tar.gz`.
-    * Enter my PyPI login creddentials.
+    * Enter my PyPI login credentials.
     * If `dist/` becomes too cluttered, we can remove the entire `dist/`
       directory and run `python3 setup.py sdist` again.

README.md

Lines changed: 62 additions & 10 deletions
@@ -1,5 +1,7 @@
 # BigQuery Schema Generator

+[![BigQuery Schema Generator CI](https://github.com/bxparks/bigquery-schema-generator/actions/workflows/pythonpackage.yml/badge.svg)](https://github.com/bxparks/bigquery-schema-generator/actions/workflows/pythonpackage.yml)
+
 This script generates the BigQuery schema from the newline-delimited data
 records on the STDIN. The records can be in JSON format or CSV format. The
 BigQuery data importer (`bq load`) uses only the first 100 lines when the schema
@@ -12,7 +14,7 @@ $ generate-schema < file.data.json > file.schema.json
 $ generate-schema --input_format csv < file.data.csv > file.schema.json
 ```

-**Version**: 1.4 (2020-12-09)
+**Version**: 1.4.1 (2021-08-23)

 **Changelog**: [CHANGELOG.md](CHANGELOG.md)

@@ -37,14 +39,17 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
         * [Ignore Invalid Lines (`--ignore_invalid_lines`)](#IgnoreInvalidLines)
         * [Existing Schema Path (`--existing_schema_path`)](#ExistingSchemaPath)
     * [Using as a Library](#UsingAsLibrary)
+        * [`SchemaGenerator.run()`](#SchemaGeneratorRun)
+        * [`SchemaGenerator.deduce_schema()`](#SchemaGeneratorDeduceSchema)
 * [Schema Types](#SchemaTypes)
     * [Supported Types](#SupportedTypes)
     * [Type Inferrence](#TypeInferrence)
 * [Examples](#Examples)
 * [Benchmarks](#Benchmarks)
 * [System Requirements](#SystemRequirements)
-* [Authors](#Authors)
 * [License](#License)
+* [Feedback and Support](#Feedback)
+* [Authors](#Authors)

 <a name="Background"></a>
 ## Background
@@ -290,7 +295,8 @@ Generate BigQuery schema from JSON or CSV file.
 optional arguments:
   -h, --help            show this help message and exit
   --input_format INPUT_FORMAT
-                        Specify an alternative input format ('csv', 'json')
+                        Specify an alternative input format ('csv', 'json',
+                        'dict')
   --keep_nulls          Print the schema for null values, empty arrays or
                         empty records
   --quoted_values_are_strings
@@ -312,7 +318,20 @@ optional arguments:
 <a name="InputFormat"></a>
 #### Input Format (`--input_format`)

-Specifies the format of the input file, either `json` (default) or `csv`.
+Specifies the format of the input file as a string. It must be one of `json`
+(default), `csv`, or `dict`:
+
+* `json`
+    * a "file-like" object containing newline-delimited JSON
+* `csv`
+    * a "file-like" object containing newline-delimited CSV
+* `dict`
+    * a `list` of Python `dict` objects corresponding to list of
+      newline-delimited JSON, in other words `List[Dict[str, Any]]`
+    * applies only if `SchemaGenerator` is used as a library through the
+      `run()` or `deduce_schema()` method
+    * useful if the input data (usually JSON) has already been read into memory
+      and parsed from newline-delimited JSON into native Python dict objects.

 If `csv` file is specified, the `--keep_nulls` flag is automatically activated.
 This is required because CSV columns are defined positionally, so the schema
@@ -531,6 +550,12 @@ more details.
 <a name="UsingAsLibrary"></a>
 ### Using As a Library

+The `SchemaGenerator` class can be used programmatically as a library from a
+larger Python application.
+
+<a name="SchemaGeneratorRun"></a>
+#### `SchemaGenerator.run()`
+
 The `bigquery_schema_generator` module can be used as a library by an external
 Python client code by creating an instance of `SchemaGenerator` and calling the
 `run(input, output)` method:
@@ -551,6 +576,17 @@ generator = SchemaGenerator(
 generator.run(input_file=input_file, output_file=output_file)
 ```

+The `input_format` is one of `json`, `csv`, and `dict` as described in the
+[Input Format](#InputFormat) section above. The `input_file` must match the
+format given by this parameter.
+
+See the `TestSchemaGeneratorDeduce.test_run_with_input_and_output()` test
+case in [examples/test_generate_schema.py](examples/test_generate_schema.py) for
+an example of an `input_file` of type `json`.
+
+<a name="SchemaGeneratorDeduceSchema"></a>
+#### `SchemaGenerator.deduce_schema()`
+
 If you need to process the generated schema programmatically, use the
 `deduce_schema()` method and process the resulting `schema_map` and `error_log`
 data structures like this:
@@ -583,12 +619,12 @@ schema_map2, error_logs = generator.deduce_schema(
 )
 ```

-When using the `SchemaGenerator` object directly, the `input_format` parameter
-supports `dict` as a third input format in addition to the `json` and `csv`
-formats. The `dict` input format tells `SchemaGenerator.deduce_schema()` to
-accept a list of Python dict objects as the `input_data`. This is useful if the
-input data (usually JSON) has already been read into memory and parsed from
-newline-delimited JSON into native Python dict objects.
+The `input_data` must match the `input_format` given in the constructor. The
+format is described in the [Input Format](#InputFormat) section above.
+
+See the `TestSchemaGeneratorDeduce.test_deduce_schema_with_dict_input()` test
+case in [examples/test_generate_schema.py](examples/test_generate_schema.py) for
+an example of an `input_data` of type `dict`.

 <a name="SchemaTypes"></a>
 ## Schema Types
@@ -864,6 +900,22 @@ and 3.8.

 Apache License 2.0

+<a name="Feedback"></a>
+## Feedback and Support
+
+If you have any questions, comments and other support questions about how to
+use this library, use the
+[GitHub Discussions](https://github.com/bxparks/bigquery-schema-generator/discussions)
+for this project. If you have bug reports or feature requests, file a ticket in
+[GitHub Issues](https://github.com/bxparks/bigquery-schema-generator/issues).
+I'd love to hear about how this software and its documentation can be improved.
+I can't promise that I will incorporate everything, but I will give your ideas
+serious consideration.
+
+Please refrain from emailing me directly unless the content is sensitive. The
+problem with email is that I cannot reference the email conversation when other
+people ask similar questions later.
+
 <a name="Authors"></a>
 ## Authors
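
As a rough illustration of the `dict` input format described in the README changes above, the sketch below parses newline-delimited JSON into a `List[Dict[str, Any]]` using only the standard library. The helper name `ndjson_to_dicts` is hypothetical and is not part of `bigquery_schema_generator`; per the README text, a list shaped like this is what a `SchemaGenerator` constructed with `input_format='dict'` expects as its `input_data`.

```python
import json
from io import StringIO
from typing import Any, Dict, List, TextIO


def ndjson_to_dicts(stream: TextIO) -> List[Dict[str, Any]]:
    """Parse newline-delimited JSON into a list of dicts.

    Hypothetical helper for illustration; not part of the library.
    """
    records = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        records.append(json.loads(line))
    return records


# The same two records used by the dict test case in this commit,
# written here as newline-delimited JSON.
ndjson = StringIO(
    '{"s": "string", "b": true}\n'
    '{"d": "2021-08-18", "x": 3.1}\n'
)
input_data = ndjson_to_dicts(ndjson)
print(input_data)
```

With the library installed, `input_data` could then be handed to `generator.deduce_schema(input_data)`, as `test_deduce_schema_with_dict_input()` in this commit does.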

bigquery_schema_generator/generate_schema.py

Lines changed: 1 addition & 1 deletion
@@ -1004,7 +1004,7 @@ def main():
         description='Generate BigQuery schema from JSON or CSV file.')
     parser.add_argument(
         '--input_format',
-        help="Specify an alternative input format ('csv', 'json')",
+        help="Specify an alternative input format ('csv', 'json', 'dict')",
         default='json')
     parser.add_argument(
         '--keep_nulls',
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '1.4'
+__version__ = '1.4.1'

tests/test_generate_schema.py

Lines changed: 51 additions & 8 deletions
@@ -29,7 +29,7 @@
 from .data_reader import DataReader


-class TestSchemaGenerator(unittest.TestCase):
+class TestSchemaGeneratorHelpers(unittest.TestCase):
     def test_timestamp_matcher_valid(self):
         self.assertTrue(
             SchemaGenerator.TIMESTAMP_MATCHER.match('2017-05-22T12:33:01'))
@@ -479,6 +479,17 @@ def test_is_string_type(self):
         self.assertTrue(is_string_type('DATE'))
         self.assertTrue(is_string_type('TIME'))

+    def test_json_full_path(self):
+        self.assertEqual('port', json_full_path(None, 'port'))
+        self.assertEqual('port', json_full_path("", 'port'))
+
+        # 'base_path' should never be '0', but if is do something reasonable.
+        self.assertEqual('0.port', json_full_path(0, 'port'))
+
+        self.assertEqual('server.port', json_full_path('server', 'port'))
+
+
+class TestSchemaGeneratorDeduce(unittest.TestCase):
     def test_run_with_input_and_output(self):
         generator = SchemaGenerator()
         input = StringIO('{ "name": "1" }')
@@ -507,14 +518,46 @@ def test_run_with_invalid_input_throws_exception(self):
         with self.assertRaises(Exception):
             generator.run(input, output)

-    def test_json_full_path(self):
-        self.assertEqual('port', json_full_path(None, 'port'))
-        self.assertEqual('port', json_full_path("", 'port'))
-
-        # 'base_path' should never be '0', but if is do something reasonable.
-        self.assertEqual('0.port', json_full_path(0, 'port'))
+    def test_deduce_schema_with_dict_input(self):
+        generator = SchemaGenerator(input_format='dict')
+        input_data = [
+            {
+                's': 'string',
+                'b': True,
+            },
+            {
+                'd': '2021-08-18',
+                'x': 3.1
+            },
+        ]
+        schema_map, error_logs = generator.deduce_schema(input_data)
+        schema = generator.flatten_schema(schema_map)

-        self.assertEqual('server.port', json_full_path('server', 'port'))
+        self.assertEqual(
+            schema,
+            [
+                OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', 'b'),
+                    ('type', 'BOOLEAN'),
+                ]),
+                OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', 'd'),
+                    ('type', 'DATE'),
+                ]),
+                OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', 's'),
+                    ('type', 'STRING'),
+                ]),
+                OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', 'x'),
+                    ('type', 'FLOAT'),
+                ]),
+            ],
+        )


 class TestDataChunksFromFile(unittest.TestCase):
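
The `test_json_full_path` assertions that this commit moves into `TestSchemaGeneratorHelpers` pin down the expected behavior of `json_full_path()`. The following is a minimal sketch that satisfies those four assertions. It is written from the tests alone, as an assumption about the behavior, and is not necessarily how the library implements the function.

```python
def json_full_path(base_path, key):
    """Join a parent path and a key with '.', as the test assertions expect.

    Sketch inferred from the tests in this commit, not copied from the library.
    """
    # None and the empty string both mean "no parent", so the key stands alone.
    if base_path is None or base_path == "":
        return key
    # Any other value, including the integer 0, is formatted as a prefix.
    return f"{base_path}.{key}"


for base in (None, "", 0, "server"):
    print(repr(base), "->", json_full_path(base, "port"))
```

The explicit comparison against `None` and `""` (rather than a plain truthiness check) is what lets a `base_path` of `0` produce `'0.port'`, matching the "do something reasonable" case in the test.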
