
checkpoint breaks writes on 0.22.0 #3030

Open
echai58 opened this issue Nov 25, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@echai58

echai58 commented Nov 25, 2024

Environment

Delta-rs version: 0.22.0

Binding: python


Bug

What happened: I am trying to upgrade to the latest release of delta-rs and it seems to introduce a breaking bug in checkpoints.

What you expected to happen: Checkpoints continue to work.

How to reproduce it: The following reproduces a breaking bug in both the pyarrow and rust writer engines. With pyarrow, the overwrite does not take effect (two rows appear in the output); with rust, the writer panics.

from deltalake import DeltaTable, write_deltalake
import pandas as pd 

write_deltalake(
    "test",
    pd.DataFrame(
        {
            "a": ["a"],
            "b": [3],
        }
    ),
)
DeltaTable("test").create_checkpoint()

At this point, the delta table looks correct, e.g.:
[screenshot: the table contains the single expected row]

But, on the next write:

write_deltalake(
    "test",
    pd.DataFrame(
        {
            "a": ["a"],
            "b": [100],
        }
    ),
    mode="overwrite",
)

When using the rust writer engine, this raises the following exception:

PanicException: called `Result::unwrap()` on an `Err` value: DeletionVector("Unknown storage format: ''.")

With pyarrow, it manifests as an incorrect overwrite:
[screenshot: the table shows two rows instead of one]

Side note: I think this sort of regression ought to be caught by the test suite... it's a breaking bug in core usage of deltalake.

@echai58 echai58 added the bug Something isn't working label Nov 25, 2024
@echai58 echai58 changed the title delta.checkpointInterval breaks writes on 0.22.0 delta checkpoint breaks writes on 0.22.0 Nov 25, 2024
@echai58 echai58 changed the title delta checkpoint breaks writes on 0.22.0 checkpoint breaks writes on 0.22.0 Nov 25, 2024
@TinoSM

TinoSM commented Nov 25, 2024

For me this error also happens when doing overwrites from Polars (simple table overwrites; appends work fine), without any checkpoint involved. I couldn't isolate the issue, but I'm noting it in case it is not 100% related to checkpointing.

@ion-elgreco
Collaborator

Very unclear why this now fails

Trace:

thread '<unnamed>' panicked at crates/core/src/kernel/snapshot/log_data.rs:98:55:
called `Result::unwrap()` on an `Err` value: DeletionVector("Unknown storage format: ''.")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@rtyler
Member

rtyler commented Nov 25, 2024

@echai58 thanks for the bug report and the reproduction case! I'll spin up a branch with this test case

@rtyler
Member

rtyler commented Nov 25, 2024

I'm narrowing this down: it looks like this behavior manifests during the creation of the remove action in the Rust core. I still haven't cobbled together a Rust-based unit test that exhibits this behavior, though 🤔

@rtyler
Member

rtyler commented Nov 25, 2024

The checkpoint is being re-read as having an empty DeletionVector. I'm getting closer to understanding the issue, but I'm not yet sure if it's due to the delta-kernel-rs upgrade or some other behavior change 🤔

Some(DeletionVector { storage_type: StringArray
[
  "",
], path_or_inline_dv: StringArray
[
  "",
], size_in_bytes: PrimitiveArray<Int32>
[
  0,
], cardinality: PrimitiveArray<Int64>
[
  0,
], offset: Some(PrimitiveArray<Int32>
[
  null,
]) })

@echai58
Author

echai58 commented Nov 25, 2024

@rtyler thanks for investigating - does it help narrow it down that the error is only raised when overwriting with engine=rust, whereas engine=pyarrow seemingly ignores the existing data and appends the new data even when mode=overwrite?

@zachschuermann

Was just reading through this thread as an interested kernel person :) I'll let you all do some more investigation, but please let me know if I can help from the kernel side at all! @rtyler

@scovich

scovich commented Nov 25, 2024

Looking at this output:

Some(DeletionVector { storage_type: StringArray
[
  "",
], path_or_inline_dv: StringArray

The empty string is not a valid storage_type -- according to the Delta spec:

Legal options are: ['u', 'i', 'p'].
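
That spec constraint can be sketched as a stdlib-only check (this is an illustrative helper, not delta-rs code; the function name is hypothetical) — an empty string is what a missing (NULL) descriptor looks like once it has been read as a default value, and it is not a legal option:

```python
# Hypothetical validator (not delta-rs code) for the deletionVector
# storage_type field, using the legal values from the Delta protocol.
VALID_STORAGE_TYPES = {"u", "i", "p"}

def parse_storage_type(value):
    """Return a valid storage type, or raise -- mirroring the panic above.

    '' is the tell-tale result of reading a NULL descriptor as a
    default-initialized value; it is not among the legal options.
    """
    if value not in VALID_STORAGE_TYPES:
        raise ValueError(f"Unknown storage format: '{value}'.")
    return value
```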

It looks to me like some code somewhere is translating "missing" into a default value (0, "", etc.) instead of NULL. Interestingly, the one nullable DV descriptor field does come through as NULL:

], offset: Some(PrimitiveArray<Int32>
[
  null,
]) })

... which is the default value for an Option. This makes me suspect some code somewhere is not gracefully handling the case of a non-nullable struct field inside a nullable struct.

Do we know what is physically written in the parquet and/or json files? That would tell us whether it's a read path or write path issue.
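
One stdlib-only way to answer that for the JSON side (a sketch; the helper name and the table path from the repro are illustrative, and the checkpoint parquet would need a separate parquet reader) is to dump what each add action records for deletionVector:

```python
import json
import pathlib

def add_action_deletion_vectors(log_dir):
    """Yield (commit file, deletionVector) for every add action found in
    the newline-delimited JSON commits under a table's _delta_log."""
    for commit in sorted(pathlib.Path(log_dir).glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                yield commit.name, action["add"].get("deletionVector")

# For the repro table, a healthy log should show None here, not a
# struct of ""/0 defaults:
# for name, dv in add_action_deletion_vectors("test/_delta_log"):
#     print(name, "deletionVector =", dv)
```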

@scovich

scovich commented Nov 25, 2024

Or, even more likely, these were Arrow reads that ignored a null mask and picked up whatever (zero-initialized) bytes happened to be in the corresponding data column.

Note: "read" here could be the parquet writer consuming arrow data that should go to disk. Or it could be the actual read path consuming data that was later read back from the parquet.
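
That failure mode can be sketched with a toy, stdlib-only model of an Arrow-style string column (validity bitmap plus offsets into a data buffer; all names here are illustrative, not pyarrow APIs). A null slot is zero-length in the data buffer, so a read that skips the validity check decodes it as "" — exactly the invalid storage_type seen above:

```python
class ToyStringColumn:
    """Toy model of an Arrow string column: validity + offsets + data."""

    def __init__(self, values):
        self.validity = [v is not None for v in values]
        self.data = "".join(v or "" for v in values)
        self.offsets = [0]
        for v in values:
            self.offsets.append(self.offsets[-1] + len(v or ""))

    def get(self, i, respect_validity=True):
        if respect_validity and not self.validity[i]:
            return None
        # Null slots occupy zero bytes, so ignoring the validity bitmap
        # silently decodes them as the empty string.
        return self.data[self.offsets[i]:self.offsets[i + 1]]

col = ToyStringColumn([None])                     # a NULL storage_type slot
assert col.get(0) is None                          # mask-respecting read
assert col.get(0, respect_validity=False) == ""    # the suspected buggy read
```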


6 participants