Description
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
See below.
Issue Description
Reproducible example
import pandas as pd

# Left side: a pandas DataFrame
data = {
    "Column2": [10, 20, 30],
    "Column3": ["A", "B", "C"],
    "Column4": ["Lala", "YesYes", "NoNo"],
}
df1 = pd.DataFrame(data)

import polars as pl

# Right side: a Polars DataFrame that shares Column2 and Column3
data = {
    "Column1": ["Text1", "Text2", "Text3"],
    "Column2": [10, 20, 30],
    "Column3": ["A", "B", "C"],
}
df2 = pl.DataFrame(data)

# pandas.DataFrame.join is called with a Polars DataFrame as `other`
result = df1.join(df2, on=["Column2", "Column3"], how="inner")
Log output
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_11612\367032622.py in ?()
----> 1 result = df1.join(df2, on=["Column2", "Column3"], how="inner")
c:\Users\name\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\frame.py in ?(self, other, on, how, lsuffix, rsuffix, sort, validate)
10766 validate=validate,
10767 )
10768 else:
10769 if on is not None:
> 10770 raise ValueError(
10771 "Joining multiple DataFrames only supported for joining on index"
10772 )
10773
ValueError: Joining multiple DataFrames only supported for joining on index
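The traceback hints at why the message is off-topic: the raise sits in a fallback else branch, so anything that is not recognised as a pandas DataFrame appears to be treated as a list of frames. The following is a simplified model of that branch structure, not the pandas source; PandasFrame, PolarsFrame and join_dispatch are hypothetical stand-ins introduced only for illustration.

```python
class PandasFrame:
    """Hypothetical stand-in for pandas.DataFrame."""


class PolarsFrame:
    """Hypothetical stand-in for polars.DataFrame."""


def join_dispatch(other, on=None):
    """Simplified model of the if/else visible in the traceback."""
    if isinstance(other, PandasFrame):
        # A single recognised frame is handed to the merge path.
        return "merge"
    else:
        # Every other object lands here and is assumed to be a list of
        # frames, which is why the message talks about multiple DataFrames.
        if on is not None:
            raise ValueError(
                "Joining multiple DataFrames only supported for joining on index"
            )
        return "concat"


try:
    join_dispatch(PolarsFrame(), on=["Column2", "Column3"])
except ValueError as exc:
    message = str(exc)
    print(message)
```

In this model the foreign frame never reaches a type check, so the only error it can trigger is the one written for lists of frames.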
Expected Behavior
Expected Result
The error message is not correct.
It should say that joining a pandas DataFrame with a Polars DataFrame is not supported.
This is how Polars formulates the error when joining the other way around:
TypeError: expected other join table to be a DataFrame, not 'pandas.core.frame.DataFrame'
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.12.9
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Netherlands.1252
pandas : 2.2.3
numpy : 2.2.5
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 25.1.1
Cython : None
sphinx : None
IPython : 9.2.0
adbc-driver-postgresql: None
...
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None
Activity
AshishRaj97 commented on May 13, 2025
I have confirmed the bug on pandas version 2.2.3.
The error message when attempting to join a pandas DataFrame with a Polars DataFrame is misleading. I intend to work on a fix to provide a more appropriate error message that clearly indicates the incompatibility between pandas and Polars for such join operations.
I will submit a pull request with the proposed changes shortly.
rhshadrach commented on May 13, 2025
I'm somewhat negative here. The API docs for DataFrame.join say other can be "DataFrame, Series, or a list containing any combination of them", and I think it is reasonable to expect readers to know we mean "pandas DataFrame" whenever our docs say "DataFrame".
Similar situations have been discussed, and I believe the conclusion was that we can support improving the error message when we think a user is likely to make the mistake. In my opinion, this crosses the line and should not be supported. To support something like this across the pandas API would be a lot of code, a lot of runtime checks, all to support what I think is an unreasonable case.
cc @pandas-dev/pandas-core
Dr-Irv commented on May 13, 2025
I think doing an instance check on the type we expect, with an appropriate error message, is worthwhile. I think we can fix these as they come up. This isn't about passing a polars DataFrame versus a pandas DataFrame. It's that we aren't checking the type of the argument at runtime. For example, here is something that fails where an attempt is made to join a DataFrame with a list of ints, but the error message isn't saying "you didn't pass a DataFrame, Series, or list of such":
rhshadrach commented on May 13, 2025
Thanks @Dr-Irv. I think the benefits to the user are clear. But I do not see those benefits as being anywhere near the cost. We will be spending time on triaging issues, reviewing PRs, running tests, and maintaining more code. These checks also come with a runtime penalty. It's likely not all that significant, but it's also not zero. And all of this for making sure the user is using our API the way it's documented, which I think one can argue is the user's responsibility.
Dr-Irv commented on May 13, 2025
We're inconsistent in pandas as to whether we do these runtime checks. I think checking if the passed parameters are the proper types is reasonable. I think we should handle these via a whack-a-mole approach - fix them as they are reported. So we fix join() here and not worry about other places. For something like join(), the added check costs nothing in comparison to the overall join operation.
rhshadrach commented on May 13, 2025
I do not think doing runtime checks is unreasonable; I think they are not worth the cost. But I do not wish to argue this further, as I suspect it won't get much in the way of attention.
I've removed the Discussion Needed label. Contributions here are welcome.
iabhi4 commented on May 18, 2025
Opened a PR based on the conversation to address this specific case. It adds a clear TypeError when a non-pandas object is passed to DataFrame.join() with on=. Happy to make any further adjustments if needed.
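A rough sketch of the kind of instance check discussed in this thread is shown here. This is not the code from the PR; check_join_other and its allowed_types parameter are hypothetical names, and the Frame and Series classes are stand-ins so that the snippet runs without pandas installed.

```python
def check_join_other(other, allowed_types):
    """Raise TypeError unless `other` is an allowed type or a list/tuple of them."""
    items = other if isinstance(other, (list, tuple)) else [other]
    for item in items:
        if not isinstance(item, allowed_types):
            names = ", ".join(t.__name__ for t in allowed_types)
            raise TypeError(
                f"other must be one of ({names}) or a list of them, "
                f"not {type(item).__module__}.{type(item).__qualname__}"
            )


class Frame:
    """Hypothetical stand-in for a DataFrame type."""


class Series:
    """Hypothetical stand-in for a Series type."""


# A list of ints is rejected with a message that names the offending type.
try:
    check_join_other([1, 2, 3], (Frame, Series))
except TypeError as exc:
    error_text = str(exc)
    print(error_text)

# Valid inputs pass silently.
check_join_other(Frame(), (Frame, Series))
check_join_other([Frame(), Series()], (Frame, Series))
```

With pandas, the same helper could presumably be called with (pd.DataFrame, pd.Series) as allowed_types before delegating to join. The check is one isinstance call per item, which is small next to the join itself.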