-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG (string dtype): comparison of string column to mixed object column fails #60228 (fixed) #60392
base: main
Are you sure you want to change the base?
Changes from all commits
4bc49ed
0def761
c4da919
900f3b1
8db4edc
d4ea527
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -40,6 +40,7 @@ | |
is_numeric_v_string_like, | ||
is_object_dtype, | ||
is_scalar, | ||
is_string_dtype, | ||
) | ||
from pandas.core.dtypes.generic import ( | ||
ABCExtensionArray, | ||
|
@@ -53,7 +54,10 @@ | |
|
||
from pandas.core import roperator | ||
from pandas.core.computation import expressions | ||
from pandas.core.construction import ensure_wrapped_if_datetimelike | ||
from pandas.core.construction import ( | ||
array as pd_array, | ||
ensure_wrapped_if_datetimelike, | ||
) | ||
from pandas.core.ops import missing | ||
from pandas.core.ops.dispatch import should_extension_dispatch | ||
from pandas.core.ops.invalid import invalid_comparison | ||
|
@@ -321,6 +325,16 @@ def comparison_op(left: ArrayLike, right: Any, op) -> ArrayLike: | |
"Lengths must match to compare", lvalues.shape, rvalues.shape | ||
) | ||
|
||
if (is_string_dtype(lvalues) and is_object_dtype(rvalues)) or ( | ||
is_object_dtype(lvalues) and is_string_dtype(rvalues) | ||
): | ||
if lvalues.dtype.name == "string" and rvalues.dtype == object: | ||
lvalues = lvalues.astype("string") | ||
rvalues = pd_array(rvalues, dtype="string") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We might need to do the casting the other way around. Instead of casting the object to string and then compare both as strings, I think we have to cast the string to object and compare both as object dtype. The reason for this is that casting to string might actually convert values to strings, and then we are no longer doing the comparison for the original values. >>> ser_string = pd.Series(["1", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
0 False
1 True
dtype: bool
>>> ser_string == ser_mixed.astype("string")
0 True
1 True
dtype: bool So if we would do that casting under the hood, the result would change in this case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And we should add this case to the tests! |
||
elif rvalues.dtype.name == "string" and lvalues.dtype == object: | ||
rvalues = rvalues.astype("string") | ||
lvalues = pd_array(lvalues, dtype="string") | ||
|
||
if should_extension_dispatch(lvalues, rvalues) or ( | ||
(isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT) | ||
and lvalues.dtype != object | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -138,3 +138,17 @@ def test_compare_datetime64_and_string(): | |
tm.assert_series_equal(result_eq1, expected_eq) | ||
tm.assert_series_equal(result_eq2, expected_eq) | ||
tm.assert_series_equal(result_neq, expected_neq) | ||
|
||
|
||
def test_comparison_string_mixed_object(): | ||
# Issue https://github.com/pandas-dev/pandas/issues/60228 | ||
pd.options.future.infer_string = True | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You don't need to add this for CI, because we have a separate CI build that enables this option for the full test suite. Now, this can still be useful to test locally, but the way you can do this is with setting an environment variable (on linux I can do |
||
|
||
ser_string = pd.Series(["a", "b"], dtype="string") | ||
ser_mixed = pd.Series([1, "b"]) | ||
|
||
result = ser_string == ser_mixed | ||
expected = pd.Series([False, True], dtype="boolean") | ||
tm.assert_series_equal(result, expected) | ||
|
||
pd.options.future.infer_string = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking for string dtype for an array can be expensive in case the array is object dtype (at that point it will scan all values to check if they are strings). So we might want to try avoid that at this level.
I think we could handle the issue specifically for the ArrowExtensionArray itself (see the code I referenced in #60228 (comment))