The in operator in ResultSetCollection.append causes problems with numpy arrays#1049
The in operator in ResultSetCollection.append causes problems with numpy arrays#1049kesmit13 wants to merge 9 commits intoploomber:masterfrom
in operator in ResultSetCollection.append causes problems with numpy arrays#1049Conversation
in operator in ResultSetCollection causes problems with numpy arraysin operator in ResultSetCollection causes problems with numpy arrays
in operator in ResultSetCollection causes problems with numpy arraysin operator in ResultSetCollection causes problems with numpy arrays
in operator in ResultSetCollection causes problems with numpy arraysin operator in ResultSetCollection.append causes problems with numpy arrays
src/sql/connection/connection.py
Outdated
| for idx in reversed( | ||
| [ | ||
| i | ||
| for i, item in enumerate(self._result_sets) | ||
| if all(starmap(_eq, zip(_results(result), _results(item)))) | ||
| ] | ||
| ): | ||
| self._result_sets.pop(idx) |
There was a problem hiding this comment.
can you add some comments to explain what this is doing?
There was a problem hiding this comment.
I'll explain here and let you comment on it first. What this is doing is replacing if result in self._result_sets: self._result_sets.remove(result) which simply removes all items equal to result from the result set list. I'm not actually clear on why this is done, but the in operator requires a bool to be returned by the == operation. That doesn't happen with np.array, pd.Series, etc.
What the new code does is iterates over result and item and compares each item individually using the _eq function which returns a bool for both atomic values and array comparisons. If all of the _eq operations return true, the index of that item is added to the resulting list. The final list of indexes is reversed so that they can be popped from self._result_sets without messing up the indexes of the remaining items.
There was a problem hiding this comment.
Just to clarify, this isn't a completely backwards-compatible change. It does require that the _result_sets variable contains only iterable rows of data. I believe this is the only use of this object in the code, except in the unit tests where scalar values are used to test, which is why I also made changes to the unit tests.
There was a problem hiding this comment.
Another option which would make this change backwards compatible would be to wrap the original version of the code in a try / except block and only switch to the new version if the exception from the existence of numpy arrays occurs.
src/tests/test_connection.py
Outdated
| def test_result_set_collection_append_numpy(): | ||
| try: | ||
| import numpy as np | ||
|
|
||
| a1 = (np.array([1, 2]),) | ||
| a2 = (np.array([3, 4]),) | ||
|
|
||
| collection = ResultSetCollection() | ||
| collection.append(a1) | ||
| collection.append(a2) | ||
|
|
||
| assert len(collection._result_sets) == 2 | ||
| assert collection._result_sets[0] is a1 | ||
| assert collection._result_sets[1] is a2 | ||
|
|
||
| except ImportError: | ||
| pass |
There was a problem hiding this comment.
let's add numpy as a dev dependency and get rid of the ImportError:
Line 33 in 7e02910
| def test_result_set_collection_append(): | ||
| collection = ResultSetCollection() | ||
| collection.append(1) | ||
| collection.append(2) | ||
| collection.append((1,)) | ||
| collection.append((2,)) | ||
|
|
There was a problem hiding this comment.
let's keep the original test and add a new one
There was a problem hiding this comment.
The reason this was changed was because the code in append needs the values to be iterable objects. This is more consistent with what the result objects that get appended from the connection are. I'm not sure there is a case in real-world code where the value will be atomic like the original test.
src/tests/test_connection.py
Outdated
| def test_result_set_collection_iterate(): | ||
| collection = ResultSetCollection() | ||
| collection.append(1) | ||
| collection.append(2) | ||
| collection.append((1,)) | ||
| collection.append((2,)) | ||
|
|
||
| assert list(collection) == [1, 2] | ||
| assert list(collection) == [(1,), (2,)] | ||
|
|
||
|
|
||
| def test_result_set_collection_is_last(): | ||
| collection = ResultSetCollection() | ||
| first, second = object(), object() | ||
| first, second = (object(),), (object(),) | ||
| collection.append(first) | ||
|
|
||
| assert len(collection) == 1 |
|
I'm closing this because it's taking too long, I don't fully understand the changes, and because some of my comments are >2 weeks old and have not been addressed. I don't have the resources to keep checking on this |
Describe your changes
When using the
inoperator to test for equal results that contain numpy arrays, you will get the following error:This is due to the fact that
np.array == np.arrayreturns anp.arraynot a bool. The SingleStoreDB database uses numpy arrays for vector values.Issue number
None
Checklist before requesting a review
pkgmt format📚 Documentation preview 📚: https://jupysql--1049.org.readthedocs.build/en/1049/