-
Notifications
You must be signed in to change notification settings - Fork 804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster utf8 validation #6668
base: main
Are you sure you want to change the base?
Faster utf8 validation #6668
Conversation
Ok(_) => Ok(()), | ||
Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)), | ||
Err(_) => { | ||
let e = simdutf8::compat::from_utf8(val).unwrap_err(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call compat from_utf8
again to get the same error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the role of simdutf8::basic::from_utf8 and re-run with simdutf8::compat -- does this deserve a code comment?
(at least the .unwrap_err()
safety deserves one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same in offset_buffer.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree deserves some comments explaining why we rerun it in case of error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a positive sentiment about using simdutf8
for faster validation, I can do so.
@@ -69,6 +69,7 @@ paste = { version = "1.0" } | |||
half = { version = "2.1", default-features = false, features = ["num-traits"] } | |||
sysinfo = { version = "0.32.0", optional = true, default-features = false, features = ["system"] } | |||
crc32fast = { version = "1.4.2", optional = true, default-features = false } | |||
simdutf8 = { version = "0.1.5"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be optional as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How mature is the library and its dependencies?
My random spike led me to https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/aarch64/neon.rs#L16 and https://docs.rs/flexpect/latest/flexpect/ lacks documentation.
Should we help simdutf8 to bring it to arrow's maturity level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it is just some macro helper for clippy split off as crate / dependency. Doesn't seem too bad.
I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe... |
Hm yeah wondering about that. I think that 5% speed up for Parquet might be quite valuable though, given that it often translates in close to 5% faster query execution for queries where Parquet scan is a bottleneck (quite some DF benchmarks actually involving string data). |
FWIW some other projects are using |
I am not sure exactly the usecase here, but what about simply disabling utf8 validation for known good data? |
The "use case" of this PR is just that utf8 validation takes time, this PR improves the performance. I think having a option to disable it makes sense, but would be good to minimize the cost of validation as well. |
Which issue does this PR close?
Closes #6667
Rationale for this change
Improves performance for about 4-5% (on M1 Pro) on strings (plain encoding):
What changes are included in this PR?
Are there any user-facing changes?