checkpoint breaks writes on 0.22.0 #3030
Comments
For me this error also happens when doing overwrites from polars (simple table overwrites; appends work fine), without any checkpoint involved. I couldn't isolate the issue, but I wanted to note that it may not be 100% related to checkpointing.
Very unclear why this now fails. Trace:
@echai58 thanks for the bug report and the reproduction case! I'll spin up a branch with this test case.
Signed-off-by: R. Tyler Croy <[email protected]>
I'm narrowing this down: it looks like this behavior is manifesting in the creation of the remove action in the Rust core. I still have not cobbled together a Rust-based unit test that exhibits this behavior, though 🤔
The checkpoint is being re-read as having an empty DeletionVector, so I'm getting closer to understanding the issue but I'm not yet sure if it's due to the delta-kernel-rs upgrade or some other behavior change 🤔
@rtyler thanks for investigating - does it help narrow it down that the error is only raised when overwriting with |
Was just reading through this thread as an interested kernel person :) I'll let you guys do some more investigation, but please let me know if I can help from the kernel side at all! @rtyler
Looking at this output:
The empty string is not a valid
It looks to me like some code somewhere is translating "missing" to "default value" instead of NULL:
... which is the default value for an
Do we know what is physically written in the parquet and/or json files? That would tell us whether it's a read path or write path issue.
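One way to check what is physically in the JSON commit files is to scan `_delta_log` directly with nothing but the standard library. The sketch below is only an illustration: the helper names `iter_remove_actions` and `describe_deletion_vector` are made up for this example, the sample commit content is fabricated test data rather than output from the failing table, and it only covers the JSON commits, not the parquet checkpoint.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_remove_actions(table_path: str) -> Iterator[Tuple[str, dict]]:
    """Yield (commit file name, remove action) for each JSON commit in _delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        with commit.open() as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)  # one action per line
                if "remove" in action:
                    yield commit.name, action["remove"]


def describe_deletion_vector(remove: dict) -> str:
    """Distinguish a missing field, an explicit null, and a populated value."""
    if "deletionVector" not in remove:
        return "absent"
    if remove["deletionVector"] is None:
        return "null"
    return f"present: {remove['deletionVector']!r}"


# Demo against a throwaway directory with made-up commit content.
with tempfile.TemporaryDirectory() as table_path:
    log_dir = Path(table_path) / "_delta_log"
    log_dir.mkdir()
    sample_actions = [
        {"remove": {"path": "part-0.parquet", "dataChange": True}},
        {"remove": {"path": "part-1.parquet", "dataChange": True, "deletionVector": None}},
    ]
    commit_file = log_dir / "00000000000000000001.json"
    commit_file.write_text("\n".join(json.dumps(a) for a in sample_actions) + "\n")

    found = [
        (name, describe_deletion_vector(remove))
        for name, remove in iter_remove_actions(table_path)
    ]

for name, status in found:
    print(name, status)
```

If the commits on disk show the field as absent or null while the reader reports an empty value, that would point at the read path; a populated but empty value on disk would point at the write path.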
Or, even more likely, these were arrow reads that ignored a null mask and picked up whatever (zero-initialized bytes) happened to be in the corresponding data column. Note: "read" here could be the parquet writer consuming arrow data that should go to disk. Or it could be the actual read path consuming data that was later read back from the parquet.
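To make the null-mask hypothesis concrete, here is a toy model of a nullable column stored as a values buffer plus a separate validity mask. It is not Arrow's actual API or memory layout; `ToyStringColumn` and its methods are invented for illustration, and the empty string stands in for the zero-initialized bytes a null slot might hold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyStringColumn:
    """Toy nullable column: a values buffer plus a validity mask."""

    values: List[str]     # null slots still hold a placeholder ("")
    validity: List[bool]  # True means the slot holds a real value

    @classmethod
    def from_optional(cls, items: List[Optional[str]]) -> "ToyStringColumn":
        return cls(
            values=[item if item is not None else "" for item in items],
            validity=[item is not None for item in items],
        )

    def read_ignoring_mask(self) -> List[str]:
        # Buggy consumer: looks only at the values buffer, so nulls
        # surface as the placeholder default.
        return list(self.values)

    def read_respecting_mask(self) -> List[Optional[str]]:
        # Correct consumer: a slot is a value only if the mask says so.
        return [v if ok else None for v, ok in zip(self.values, self.validity)]


column = ToyStringColumn.from_optional(["abc", None, None])
buggy = column.read_ignoring_mask()
correct = column.read_respecting_mask()

print(buggy)    # ['abc', '', '']
print(correct)  # ['abc', None, None]
```

The buggy reader turns "missing" into an empty string, which is the kind of symptom described above; the same mistake could sit in a writer that consumes in-memory data or in a reader that consumes data read back from disk.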
Environment
Delta-rs version: 0.22.0
Binding: python
Bug
What happened: I am trying to upgrade to the latest release of delta-rs and it seems to introduce a breaking bug in checkpoints.
What you expected to happen: Checkpoints continue to work.
How to reproduce it: This introduces a breaking bug in both pyarrow and rust writer engines. In pyarrow, it does not overwrite successfully (two rows in output), and in rust, it panics.
At this point, the delta table looks correct, e.g.:
But, on the next write:
, when using the rust writer engine, we get the following exception:
and on pyarrow, it manifests itself with an incorrect overwrite:
with two rows showing up.
Side note: I think this sort of breaking bug ought to be caught by the test suite... it's a breaking bug in core usage of deltalake.