-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-44501: [C#] Use an index lookup for O(1)
field index access
#44633
base: main
Are you sure you want to change the base?
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
O(1)
field index access
|
ce4dda4
to
c7a0c58
Compare
@CurtHagenlocher it looks like my tests are failing in CI but I can't replicate this locally. |
c7a0c58
to
21a86cd
Compare
Oh, I see the issue. The CI does a merge into The PR above has changed the lookup behaviour when the field is not matched. |
e6d024d
to
52b374a
Compare
@CurtHagenlocher and @georgevanburgh I've reverted to the original behaviour (throwing an |
52b374a
to
d099cdf
Compare
I'm a bit tied up with work but will look at this by Friday. |
Thank you very much! |
|
||
for (int i = 0; i < _fieldsList.Count; i++) | ||
for (var i = 0; i < _fieldsList.Count; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't make unnecessary stylistic changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will undo this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change! (and sorry for the delay) I left a few suggestions.
} | ||
|
||
return -1; | ||
throw new InvalidOperationException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change. Is there a specific reason for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually fixing a regression that was added in #44576
The function would previously throw InvalidOperationException
due to .First(...)
. The description of the PR is erroneous in asserting that -1
is returned in the case of no match as the exception would first be thrown by the argument of the .IndexOf(...)
function.
I noticed this because I wrote my unit tests before that PR was merged. Are you happy with keeping the return -1
regression?
|
||
_fieldsIndexLookup = _fieldsList | ||
.Select((x, idx) => (Name: x.Name, Index: idx)) | ||
.ToLookup(x => x.Name, x => x.Index, StringComparer.CurrentCulture); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for new code, we should use Ordinal
. I don't think CurrentCulture
really makes sense here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change to Ordinal
then this would be a breaking change as this is used in GetFieldIndex(string)
. Are you okay with changing the behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, I can then only use the lookup if the user has provided Ordinal
as the comparator but this would defeat the purpose of this PR which is to make lookup of Columns in RecordBatch O(1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or I could change this in a separate PR that implements your suggestions here #44650 (comment) ?
@@ -31,6 +31,7 @@ public partial class Schema : IRecordType | |||
private readonly List<Field> _fieldsList; | |||
|
|||
public ILookup<string, Field> FieldsLookup { get; } | |||
private readonly ILookup<string, int> _fieldsIndexLookup; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also expose this in a way that lets users get all the indexes for a particular field name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm happy to add this, though it feels like it's adding further to the scope of the PR. I'm happy to implement any API suggestions that you may have; what would be your preferred way?
Thank you very much for your review @CurtHagenlocher ! I added some replies to your comments. Let me know your thoughts and will proceed with changes. |
Hi @CurtHagenlocher , do you have some time to look into this? |
Fixes #44501
Rationale for this change
This should speed up the reasonably common operation of looking up a column via name in
Schema
orRecordBatch
.I would argue that the fact that this operation was previously O(N) (now O(1)) is unintuitive and could easily lead to accidental performance woes.
Note that I would like to also replace the existing Lookups with signature
string -> Field
but unfortunately those useStringComparer.Default
instead ofStringComparer.CurrentCulture
(which I'm using to make the changedpublic
function backwards compatible). Not sure if this is intentional.What changes are included in this PR?
This PR simply memoizes the index lookup of the fields. It should not break any existing behaviour.
It also fixes a recent regression about the behaviour of the changed function in the case of using a custom comparator and have no matches in the schema.
Are these changes tested?
All code paths changed are covered by new unit tests.
Are there any user-facing changes?
Yes, the lookup function throws an exception when there are no matches fuels rather than returning -1. This was a regression introduces O(days) ago.
Column(string)
method inRecordBatch
is linear to the number of columns #44501