post_search.iterate() iterator does not terminate cleanly #500

Open
@chmreid

I am seeing an issue when iterating over DSS search results with post_search.iterate(). I believe this is a corner case that occurs when the number of results returned is exactly the same as the page size; the error is raised when the iterator tries to return the second page.

Here is the setup: I start by creating a DSS client, then write an Elasticsearch query that returns exactly 10 results (10 is the page size when metadata is included in the results). Here is the code to do that:

import hca.dss, json
client = hca.dss.DSSClient()

method = "*10x*"
organ = "liver"

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "files.library_preparation_protocol_json.library_construction_method.text": {
                            "value": method
                        }
                    }
                },
                {
                    "match": {
                        "files.specimen_from_organism_json.organ.text": organ
                    }
                }
            ]
        }
    }
}

executing query with post_search

Now if we execute this query with a call to post_search(), we can see that there are exactly 10 results returned:

search_results = client.post_search(
    es_query=query, replica='aws', output_format='raw')
print("post_search() found %d results"%(search_results['total_hits']))
print("post_search() returned %d results"%(len(search_results['results'])))

which results in

post_search() found 10 results
post_search() returned 10 results

executing query with post_search.iterate

Now if we want to iterate over all results returned by the query, we should use post_search.iterate() instead of post_search(). Swapping out the call:

results_generator = client.post_search.iterate(es_query=query, replica='aws', output_format='raw')

for bundle in results_generator:
    print(f"Now processing bundle {bundle['bundle_fqid']}")

which prints all 10 bundles and then raises the following exception:

Now processing bundle fd7a46db-1e90-4bfd-8e70-a77baa01faa5.2019-09-23T173116.106310Z
Now processing bundle fbda9910-5076-47a6-83d6-cfff39d17606.2019-09-26T051748.268160Z
Now processing bundle fb2ae8b7-06b0-4881-ad9f-1f37255b91b6.2019-09-23T173116.107225Z
Now processing bundle c65efd23-bbc4-459a-ac60-d3cde705193d.2019-09-23T173116.107641Z
Now processing bundle c59a8de8-d4f3-424b-b716-06b7152b980a.2019-09-23T173116.106782Z
Now processing bundle be9f2d04-77ee-4f59-a0f7-f0b58034cf8c.2019-09-23T173116.105576Z
Now processing bundle 82164816-64d4-4975-a248-b66c4fdad6f8.2019-09-26T054646.254919Z
Now processing bundle 56cce395-634e-4c53-976c-931727d22dfa.2019-09-26T074801.713933Z
Now processing bundle 3a7af639-ac18-49a7-aef9-2eb4b1ecf598.2019-09-26T072342.935554Z
Now processing bundle 2f62f508-6503-4c2e-a714-8298f55bdaa2.2019-09-26T064659.900169Z

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-5afaae8a16d1> in <module>
      1 results_generator = client.post_search.iterate(es_query=query, replica='aws', output_format='raw')
      2 
----> 3 for bundle in results_generator:
      4     print(f"Now processing bundle {bundle['bundle_fqid']}")

~/codes/data-consumer-vignettes/vp/lib/python3.6/site-packages/hca/util/__init__.py in iterate(self, **kwargs)
    235                     yield file
    236             else:
--> 237                 for collection in page.json().get('collections'):
    238                     yield collection
    239 

TypeError: 'NoneType' object is not iterable

If I modify the query to search for a different organ, the number of results returned is not a multiple of 10, and the bug does not occur. I have only seen the bug when the number of results is exactly equal to the page size. The error occurs because the iterator does not handle the case of the second page being completely empty, which only happens when the number of results is an exact multiple of the page size.
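To illustrate the kind of guard I have in mind, here is a minimal sketch. It is not the actual code in hca/util/__init__.py: the function `iterate_items`, the `pages` list, and the shape of the empty page body are all assumptions made for illustration. The only idea it shows is treating a missing or null key as an empty page instead of iterating over None.

```python
from typing import Any, Dict, Iterator, List


def iterate_items(pages: List[Dict[str, Any]], key: str = "results") -> Iterator[Any]:
    """Yield items from each page body, treating a missing or null key as an empty page.

    Hypothetical helper: `pages` stands in for the decoded JSON body of each
    paginated response, and `key` for whichever field holds the items.
    """
    for body in pages:
        # `or []` covers both a missing key (get returns None) and an explicit null value.
        for item in body.get(key) or []:
            yield item


# Assumed responses: one full page of 10 results, then a final page with nothing in it.
pages = [
    {"results": [{"bundle_fqid": f"bundle-{i}"} for i in range(10)]},
    {},  # empty final page, as when the result count is a multiple of the page size
]

fqids = [item["bundle_fqid"] for item in iterate_items(pages)]
print(f"iterated over {len(fqids)} bundles without raising")
```

With a guard like this, the loop finishes after the last real result rather than raising TypeError on the empty page.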
