Merge pull request #72 from bxparks/develop

bxparks · web-flow · commit da3609f5c526 · 2021-08-23T09:50:25.000-07:00
merge v1.4.1 into master
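
The `dict` input format documented in the README.md changes of this release expects a `list` of Python `dict` objects, in other words `List[Dict[str, Any]]`. A minimal sketch of building such a list from newline-delimited JSON with only the standard library might look like the following. The helper name `records_from_ndjson` is hypothetical and not part of `bigquery_schema_generator`, and skipping blank lines is an assumption made for illustration.

```python
import json
from io import StringIO
from typing import Any, Dict, Iterable, List


def records_from_ndjson(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse newline-delimited JSON into a list of dicts.

    Hypothetical helper: blank lines are skipped (an assumption), and
    any line that is not a JSON object raises ValueError.
    """
    records: List[Dict[str, Any]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"expected a JSON object per line, got: {line!r}")
        records.append(record)
    return records


# Same records as the test_deduce_schema_with_dict_input() test case.
ndjson = StringIO(
    '{"s": "string", "b": true}\n'
    '{"d": "2021-08-18", "x": 3.1}\n'
)
input_data = records_from_ndjson(ndjson)
print(input_data)
```

A list built this way has the same shape as the `input_data` passed to `deduce_schema()` in the `test_deduce_schema_with_dict_input()` test case of this merge.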
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,10 @@
 # Changelog
 
 * Unreleased
+* 1.4.1 (2021-08-23)
+    * Add documentation for the `input_format='dict'` option.
+    * Add additional test cases for the 'json' and 'dict' input formats.
+    * Maintenance release, no functional change in core code.
 * 1.4 (2020-12-09)
     * Add 'dict' as a third `input_format` when `SchemaGenerator` is used as a
       library. This can be useful when the data has already been transformed
diff --git a/DEVELOPER.md b/DEVELOPER.md
@@ -30,14 +30,17 @@ $ sudo -H pip3 install setuptools wheel twine
 
 ### Steps
 
-1. Edit `setup.py` and increment the `version`.
+1. Increment the version numbers in:
+    * `version.py`
+    * `README.md`
+    * `CHANGELOG.md`
 1. Push all changes to `develop` branch.
 1. Create a GitHub pull request (PR) from `develop` into `master` branch.
 1. Merge the PR into `master`.
 1. Create a new Release in GitHub with the new tag label.
 1. Create the dist using `python3 setup.py sdist`.
 1. Upload to PyPI using `twine upload
    dist/bigquery-schema-generator-{version}.tar.gz`.
-    * Enter my PyPI login creddentials.
+    * Enter my PyPI login credentials.
     * If `dist/` becomes too cluttered, we can remove the entire `dist/`
       directory and run `python3 setup.py sdist` again.
diff --git a/README.md b/README.md
@@ -1,5 +1,7 @@
 # BigQuery Schema Generator
 
+[![BigQuery Schema Generator CI](https://github.com/bxparks/bigquery-schema-generator/actions/workflows/pythonpackage.yml/badge.svg)](https://github.com/bxparks/bigquery-schema-generator/actions/workflows/pythonpackage.yml)
+
 This script generates the BigQuery schema from the newline-delimited data
 records on the STDIN. The records can be in JSON format or CSV format. The
 BigQuery data importer (`bq load`) uses only the first 100 lines when the schema
@@ -12,7 +14,7 @@ $ generate-schema < file.data.json > file.schema.json
 $ generate-schema --input_format csv < file.data.csv > file.schema.json
 ```
 
-**Version**: 1.4 (2020-12-09)
+**Version**: 1.4.1 (2021-08-23)
 
 **Changelog**: [CHANGELOG.md](CHANGELOG.md)
 
@@ -37,14 +39,17 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
         * [Ignore Invalid Lines (`--ignore_invalid_lines`)](#IgnoreInvalidLines)
         * [Existing Schema Path (`--existing_schema_path`)](#ExistingSchemaPath)
     * [Using as a Library](#UsingAsLibrary)
+        * [`SchemaGenerator.run()`](#SchemaGeneratorRun)
+        * [`SchemaGenerator.deduce_schema()`](#SchemaGeneratorDeduceSchema)
 * [Schema Types](#SchemaTypes)
     * [Supported Types](#SupportedTypes)
     * [Type Inferrence](#TypeInferrence)
 * [Examples](#Examples)
 * [Benchmarks](#Benchmarks)
 * [System Requirements](#SystemRequirements)
-* [Authors](#Authors)
 * [License](#License)
+* [Feedback and Support](#Feedback)
+* [Authors](#Authors)
 
 <a name="Background"></a>
 ## Background
@@ -290,7 +295,8 @@ Generate BigQuery schema from JSON or CSV file.
 optional arguments:
   -h, --help            show this help message and exit
   --input_format INPUT_FORMAT
-                        Specify an alternative input format ('csv', 'json')
+                        Specify an alternative input format ('csv', 'json',
+                        'dict')
   --keep_nulls          Print the schema for null values, empty arrays or
                         empty records
   --quoted_values_are_strings
@@ -312,7 +318,20 @@ optional arguments:
 <a name="InputFormat"></a>
 #### Input Format (`--input_format`)
 
-Specifies the format of the input file, either `json` (default) or `csv`.
+Specifies the format of the input file as a string. It must be one of `json`
+(default), `csv`, or `dict`:
+
+* `json`
+    * a "file-like" object containing newline-delimited JSON
+* `csv`
+    * a "file-like" object containing newline-delimited CSV
+* `dict`
+    * a `list` of Python `dict` objects corresponding to a list of
+      newline-delimited JSON records, in other words `List[Dict[str, Any]]`
+    * applies only if `SchemaGenerator` is used as a library through the
+      `run()` or `deduce_schema()` method
+    * useful if the input data (usually JSON) has already been read into memory
+      and parsed from newline-delimited JSON into native Python dict objects.
 
 If a `csv` file is specified, the `--keep_nulls` flag is automatically activated.
 This is required because CSV columns are defined positionally, so the schema
@@ -531,6 +550,12 @@ more details.
 <a name="UsingAsLibrary"></a>
 ### Using As a Library
 
+The `SchemaGenerator` class can be used programmatically as a library from a
+larger Python application.
+
+<a name="SchemaGeneratorRun"></a>
+#### `SchemaGenerator.run()`
+
 The `bigquery_schema_generator` module can be used as a library by external
 Python client code by creating an instance of `SchemaGenerator` and calling the
 `run(input, output)` method:
@@ -551,6 +576,17 @@ generator = SchemaGenerator(
 generator.run(input_file=input_file, output_file=output_file)
 ```
 
+The `input_format` is one of `json`, `csv`, or `dict`, as described in the
+[Input Format](#InputFormat) section above. The `input_file` must match the
+format given by this parameter.
+
+See the `TestSchemaGeneratorDeduce.test_run_with_input_and_output()` test
+case in [tests/test_generate_schema.py](tests/test_generate_schema.py) for
+an example of an `input_file` of type `json`.
+
+<a name="SchemaGeneratorDeduceSchema"></a>
+#### `SchemaGenerator.deduce_schema()`
+
 If you need to process the generated schema programmatically, use the
 `deduce_schema()` method and process the resulting `schema_map` and `error_log`
 data structures like this:
@@ -583,12 +619,12 @@ schema_map2, error_logs = generator.deduce_schema(
 )
 ```
 
-When using the `SchemaGenerator` object directly, the `input_format` parameter
-supports `dict` as a third input format in addition to the `json` and `csv`
-formats. The `dict` input format tells `SchemaGenerator.deduce_schema()` to
-accept a list of Python dict objects as the `input_data`. This is useful if the
-input data (usually JSON) has already been read into memory and parsed from
-newline-delimited JSON into native Python dict objects.
+The `input_data` must match the `input_format` given in the constructor. The
+format is described in the [Input Format](#InputFormat) section above.
+
+See the `TestSchemaGeneratorDeduce.test_deduce_schema_with_dict_input()` test
+case in [tests/test_generate_schema.py](tests/test_generate_schema.py) for
+an example of an `input_data` of type `dict`.
 
 <a name="SchemaTypes"></a>
 ## Schema Types
@@ -864,6 +900,22 @@ and 3.8.
 
 Apache License 2.0
 
+<a name="Feedback"></a>
+## Feedback and Support
+
+If you have questions, comments, or other support requests about how to
+use this library, use the
+[GitHub Discussions](https://github.com/bxparks/bigquery-schema-generator/discussions)
+for this project. If you have bug reports or feature requests, file a ticket in
+[GitHub Issues](https://github.com/bxparks/bigquery-schema-generator/issues).
+I'd love to hear about how this software and its documentation can be improved.
+I can't promise that I will incorporate everything, but I will give your ideas
+serious consideration.
+
+Please refrain from emailing me directly unless the content is sensitive. The
+problem with email is that I cannot reference the email conversation when other
+people ask similar questions later.
+
 <a name="Authors"></a>
 ## Authors
 
diff --git a/bigquery_schema_generator/generate_schema.py b/bigquery_schema_generator/generate_schema.py
@@ -1004,7 +1004,7 @@ def main():
         description='Generate BigQuery schema from JSON or CSV file.')
     parser.add_argument(
         '--input_format',
-        help="Specify an alternative input format ('csv', 'json')",
+        help="Specify an alternative input format ('csv', 'json', 'dict')",
         default='json')
     parser.add_argument(
         '--keep_nulls',
diff --git a/bigquery_schema_generator/version.py b/bigquery_schema_generator/version.py
@@ -1 +1 @@
-__version__ = '1.4'
+__version__ = '1.4.1'
diff --git a/tests/test_generate_schema.py b/tests/test_generate_schema.py
@@ -29,7 +29,7 @@
 from .data_reader import DataReader
 
 
-class TestSchemaGenerator(unittest.TestCase):
+class TestSchemaGeneratorHelpers(unittest.TestCase):
     def test_timestamp_matcher_valid(self):
         self.assertTrue(
             SchemaGenerator.TIMESTAMP_MATCHER.match('2017-05-22T12:33:01'))
@@ -479,6 +479,17 @@ def test_is_string_type(self):
         self.assertTrue(is_string_type('DATE'))
         self.assertTrue(is_string_type('TIME'))
 
+    def test_json_full_path(self):
+        self.assertEqual('port', json_full_path(None, 'port'))
+        self.assertEqual('port', json_full_path("", 'port'))
+
+        # 'base_path' should never be '0', but if it is, do something reasonable.
+        self.assertEqual('0.port', json_full_path(0, 'port'))
+
+        self.assertEqual('server.port', json_full_path('server', 'port'))
+
+
+class TestSchemaGeneratorDeduce(unittest.TestCase):
     def test_run_with_input_and_output(self):
         generator = SchemaGenerator()
         input = StringIO('{ "name": "1" }')
@@ -507,14 +518,46 @@ def test_run_with_invalid_input_throws_exception(self):
         with self.assertRaises(Exception):
             generator.run(input, output)
 
-    def test_json_full_path(self):
-        self.assertEqual('port', json_full_path(None, 'port'))
-        self.assertEqual('port', json_full_path("", 'port'))
-
-        # 'base_path' should never be '0', but if is do something reasonable.
-        self.assertEqual('0.port', json_full_path(0, 'port'))
+    def test_deduce_schema_with_dict_input(self):
+        generator = SchemaGenerator(input_format='dict')
+        input_data = [
+            {
+                's': 'string',
+                'b': True,
+            },
+            {
+                'd': '2021-08-18',
+                'x': 3.1
+            },
+        ]
+        schema_map, error_logs = generator.deduce_schema(input_data)
+        schema = generator.flatten_schema(schema_map)
 
-        self.assertEqual('server.port', json_full_path('server', 'port'))
+        self.assertEqual(
+            schema,
+            [
+                OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', 'b'),
+                    ('type', 'BOOLEAN'),
+                ]),
+                OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', 'd'),
+                    ('type', 'DATE'),
+                ]),
+                OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', 's'),
+                    ('type', 'STRING'),
+                ]),
+                OrderedDict([
+                    ('mode', 'NULLABLE'),
+                    ('name', 'x'),
+                    ('type', 'FLOAT'),
+                ]),
+            ],
+        )
 
 
 class TestDataChunksFromFile(unittest.TestCase):
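
The four `test_json_full_path()` assertions moved in this diff pin down the helper's observable behavior: a `None` or empty base path yields the bare key, and any other base path (including the integer `0`) is joined to the key with a dot. A minimal sketch that satisfies those assertions might look like this. It is reconstructed from the test alone and is an assumption, not necessarily how `generate_schema.py` implements `json_full_path()`.

```python
def json_full_path(base_path, key):
    """Return the dotted path of `key` under `base_path`.

    Sketch reconstructed from the test assertions only. Comparing with
    `is None` and `== ""` (rather than a plain truthiness check) keeps a
    base path of 0 from being treated as empty.
    """
    if base_path is None or base_path == "":
        return key
    return f"{base_path}.{key}"


# The same four cases exercised by test_json_full_path().
assert json_full_path(None, 'port') == 'port'
assert json_full_path("", 'port') == 'port'
assert json_full_path(0, 'port') == '0.port'
assert json_full_path('server', 'port') == 'server.port'
```

A plain truthiness check such as `if not base_path` would return `'port'` for a base path of `0`, which the third assertion rules out.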
diff --git a/tests/testdata.txt b/tests/testdata.txt
