Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster utf8 validation #6668

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Faster utf8 validation #6668

wants to merge 2 commits into from

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Oct 31, 2024

Which issue does this PR close?

Closes #6667

Rationale for this change

Improves performance for about 4-5% (on M1 Pro) on strings (plain encoding):

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs
                        time:   [740.81 µs 746.51 µs 752.11 µs]
                        change: [-5.8127% -5.2637% -4.6414%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/plain encoded, optional, no NULLs
                        time:   [743.62 µs 748.70 µs 754.14 µs]
                        change: [-4.2825% -3.6551% -3.0212%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/plain encoded, optional, half NULLs
                        time:   [633.43 µs 638.47 µs 643.71 µs]
                        change: [-5.1930% -4.5414% -3.8189%] (p = 0.00 < 0.05)

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 31, 2024
Ok(_) => Ok(()),
Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)),
Err(_) => {
let e = simdutf8::compat::from_utf8(val).unwrap_err();
Copy link
Contributor Author

@Dandandan Dandandan Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call compat from_utf8 again to get the same error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the role of simdutf8::basic::from_utf8 and re-run with simdutf8::compat -- does this deserve a code comment?

(at least the .unwrap_err() safety deserves one)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same in offset_buffer.rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree deserves some comments explaining why we rerun it in case of error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a positive sentiment about using simdutf8 for faster validation, I can do so.

@@ -69,6 +69,7 @@ paste = { version = "1.0" }
half = { version = "2.1", default-features = false, features = ["num-traits"] }
sysinfo = { version = "0.32.0", optional = true, default-features = false, features = ["system"] }
crc32fast = { version = "1.4.2", optional = true, default-features = false }
simdutf8 = { version = "0.1.5"}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be optional as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How mature is the library and its dependencies?
My random spike led me to https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/aarch64/neon.rs#L16 and https://docs.rs/flexpect/latest/flexpect/ lacks documentation.
Should we help simdutf8 to bring it to arrow's maturity level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it is just some macro helper for clippy split off as crate / dependency. Doesn't seem too bad.

@tustvold
Copy link
Contributor

I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe...

@Dandandan
Copy link
Contributor Author

I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe...

Hm yeah wondering about that.

I think that 5% speed up for Parquet might be quite valuable though, given that it often translates in close to 5% faster query execution for queries where Parquet scan is a bottleneck (quite some DF benchmarks actually involving string data).

@Dandandan
Copy link
Contributor Author

Dandandan commented Nov 1, 2024

FWIW some other projects are using simdutf8 as well, like polars https://github.com/pola-rs/polars/blob/main/Cargo.toml#L77 and simd-json

@alamb
Copy link
Contributor

alamb commented Nov 7, 2024

I am not sure exactly the usecase here, but what about simply disabling utf8 validation for known good data?

@Dandandan
Copy link
Contributor Author

Dandandan commented Nov 8, 2024

I am not sure exactly the usecase here, but what about simply disabling utf8 validation for known good data?

The "use case" of this PR is just that utf8 validation takes time, this PR improves the performance.

I think having a option to disable it makes sense, but would be good to minimize the cost of validation as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up Parquet utf8 validation
4 participants