Faster utf8 validation #6668

Dandandan · 2024-10-31T19:22:52Z

Which issue does this PR close?

Rationale for this change

Improves performance by about 4-5% (on an M1 Pro) when reading plain-encoded strings:

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs
                        time:   [740.81 µs 746.51 µs 752.11 µs]
                        change: [-5.8127% -5.2637% -4.6414%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/plain encoded, optional, no NULLs
                        time:   [743.62 µs 748.70 µs 754.14 µs]
                        change: [-4.2825% -3.6551% -3.0212%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/plain encoded, optional, half NULLs
                        time:   [633.43 µs 638.47 µs 643.71 µs]
                        change: [-5.1930% -4.5414% -3.8189%] (p = 0.00 < 0.05)

What changes are included in this PR?

Are there any user-facing changes?

Dandandan · 2024-10-31T19:34:46Z

parquet/src/arrow/array_reader/byte_view_array.rs

        Ok(_) => Ok(()),
-        Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)),
+        Err(_) => {
+            let e = simdutf8::compat::from_utf8(val).unwrap_err();


On failure, we call simdutf8::compat::from_utf8 again to get the same error details as before.

The role of simdutf8::basic::from_utf8 and the re-run with simdutf8::compat: does this deserve a code comment?

(at least the .unwrap_err() safety deserves one)

Same in offset_buffer.rs.

Yeah, I agree it deserves some comments explaining why we rerun it in case of an error.

If there is positive sentiment about using simdutf8 for faster validation, I can add them.
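To make the two-pass idea in this thread concrete, here is a minimal sketch of the pattern: a fast check that only answers yes or no, followed by a rerun with a detailed validator on the rare failure path. The helper names `fast_is_valid`, `detailed_error` and `check_valid_utf8` are hypothetical, and both helpers are stand-ins built on `std::str::from_utf8` so the sketch compiles without extra crates; in the PR those roles are played by `simdutf8::basic::from_utf8` and `simdutf8::compat::from_utf8`. The sketch also avoids `.unwrap_err()` by handling the case where the two validators disagree.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for the fast validator: it only answers yes or no,
/// which is the role `simdutf8::basic::from_utf8` plays in the diff above.
fn fast_is_valid(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Hypothetical stand-in for the detailed validator, which reports where the
/// input stops being valid (the role of `simdutf8::compat::from_utf8`).
fn detailed_error(bytes: &[u8]) -> Option<Utf8Error> {
    std::str::from_utf8(bytes).err()
}

/// Validate on the fast path; rerun the detailed validator only on failure,
/// purely to build a descriptive error message.
fn check_valid_utf8(val: &[u8]) -> Result<(), String> {
    if fast_is_valid(val) {
        return Ok(());
    }
    // Cold path: the input is scanned a second time, but only for bad data.
    match detailed_error(val) {
        Some(e) => Err(format!("encountered non UTF-8 data: {}", e)),
        // Only reachable if the two validators disagree; report the failure
        // without details instead of panicking.
        None => Err("encountered non UTF-8 data".to_string()),
    }
}
```

Calling `check_valid_utf8` on valid input returns `Ok(())` after a single scan; only invalid input pays for the second scan.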

Dandandan · 2024-10-31T19:35:15Z

parquet/Cargo.toml

@@ -69,6 +69,7 @@ paste = { version = "1.0" }
 half = { version = "2.1", default-features = false, features = ["num-traits"] }
 sysinfo = { version = "0.32.0", optional = true, default-features = false, features = ["system"] }
 crc32fast = { version = "1.4.2", optional = true, default-features = false }
+simdutf8 = { version = "0.1.5"}


It could be optional as well.

How mature are the library and its dependencies?
A quick look led me to https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/aarch64/neon.rs#L16, and https://docs.rs/flexpect/latest/flexpect/ lacks documentation.
Should we help bring simdutf8 up to arrow's maturity level?

It seems to be just a macro helper for clippy, split off as a separate crate / dependency. That doesn't seem too bad.
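As an illustration of the "it could be optional" remark in this thread, the Rust side might be gated on a Cargo feature roughly as follows. This is a sketch, not the code in the PR: the function name `is_valid_utf8` is hypothetical, and the feature name `simdutf8` assumes the dependency were declared with `optional = true` so that Cargo exposes a feature of the same name.

```rust
/// Fast path, compiled only when the (assumed) `simdutf8` feature is enabled.
#[cfg(feature = "simdutf8")]
fn is_valid_utf8(bytes: &[u8]) -> bool {
    simdutf8::basic::from_utf8(bytes).is_ok()
}

/// Fallback to the standard library when the feature is disabled.
#[cfg(not(feature = "simdutf8"))]
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

Callers would use `is_valid_utf8` either way, so the dependency stays out of builds that do not opt in.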

tustvold · 2024-10-31T20:17:55Z

I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe...

Dandandan · 2024-10-31T20:31:11Z

> I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe...

Hm yeah wondering about that.

I think a 5% speedup for Parquet might be quite valuable though, given that it often translates into close to 5% faster query execution for queries where the Parquet scan is the bottleneck (quite a few DF benchmarks actually involve string data).

Dandandan · 2024-11-01T12:12:01Z

FWIW, some other projects use simdutf8 as well, such as polars (https://github.com/pola-rs/polars/blob/main/Cargo.toml#L77) and simd-json.

alamb · 2024-11-07T22:57:06Z

I am not sure exactly what the use case is here, but what about simply disabling UTF-8 validation for known-good data?

Proposal: Add unsafe option to disable UTF8 validation on parquet read #6701

Dandandan · 2024-11-08T05:23:30Z

> I am not sure exactly what the use case is here, but what about simply disabling UTF-8 validation for known-good data?
>
> Proposal: Add unsafe option to disable UTF8 validation on parquet read #6701

The "use case" of this PR is just that UTF-8 validation takes time, and this PR improves its performance.

I think having an option to disable it makes sense, but it would be good to minimize the cost of validation as well.
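For readers comparing the two approaches, the following sketch shows what skipping validation for known-good data could look like at the lowest level. It is not the API proposed in #6701; the function `bytes_to_str` and its `validate` flag are hypothetical, and only `std::str::from_utf8` and `std::str::from_utf8_unchecked` are real standard-library calls.

```rust
use std::str::Utf8Error;

/// Convert bytes to `&str`, optionally skipping UTF-8 validation.
///
/// # Safety
/// When `validate` is false, the caller must guarantee that `bytes` is valid
/// UTF-8; otherwise the returned `&str` causes undefined behavior when used.
unsafe fn bytes_to_str(bytes: &[u8], validate: bool) -> Result<&str, Utf8Error> {
    if validate {
        // Checked path: pays the validation cost discussed in this PR.
        std::str::from_utf8(bytes)
    } else {
        // Unchecked path: no scan at all, relying on the caller's guarantee.
        Ok(unsafe { std::str::from_utf8_unchecked(bytes) })
    }
}
```

The unchecked path removes the cost entirely but moves the correctness burden to the caller, which is why faster validation remains useful for data that is not known to be good.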

Faster utf8 validation

eeb57e3

github-actions bot added the parquet Changes to the parquet crate label Oct 31, 2024

Move dependency

adbd07a
