[Question] Appending to existing parquet files in Azure Blob Storage #471
-
I'm working on a project and want to use your library to read & write parquet files in Azure Blob Storage for storing time series data. Parquet seems to be a very good fit for the kinds of workloads my service will need to execute. I want to append large chunks of rows (as a new row group) to existing parquet files stored in Azure Blob Storage. I can't do this directly because the blob writer creates a stream that is not seekable, as was pointed out in this issue. Can anyone recommend workarounds other than downloading the parquet file into a memory stream, appending, and persisting the result? It's been a few years since that issue was raised.
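For concreteness, the memory-stream workaround I'm referring to looks roughly like this. This is a sketch only, assuming a recent Parquet.Net 4.x async API and Azure.Storage.Blobs; the connection string, container, blob name and schema are made-up examples:

```csharp
using Azure.Storage.Blobs;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Placeholder connection details and sample data - replace with your own.
string connectionString = "<storage-connection-string>";
DateTime[] newTimestamps = { DateTime.UtcNow };
double[] newValues = { 42.0 };

var blob = new BlobClient(connectionString, "timeseries", "sensor1.parquet");

var timestamp = new DataField<DateTime>("timestamp");
var value = new DataField<double>("value");
var schema = new ParquetSchema(timestamp, value);

// 1. Download the whole existing file into a seekable in-memory stream.
using var ms = new MemoryStream();
await blob.DownloadToAsync(ms);

// 2. Re-open it in append mode and write the new rows as an extra row group.
ms.Position = 0;
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schema, ms, append: true))
{
    using ParquetRowGroupWriter rg = writer.CreateRowGroup();
    await rg.WriteColumnAsync(new DataColumn(timestamp, newTimestamps));
    await rg.WriteColumnAsync(new DataColumn(value, newValues));
}

// 3. Upload the whole file back, replacing the original blob.
ms.Position = 0;
await blob.UploadAsync(ms, overwrite: true);
```

This works, but it means downloading and re-uploading the entire file on every append, which is what I'd like to avoid.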
Replies: 1 comment 1 reply
-
I don't think that's possible, because block blobs do not support random access, which is required for append. In general, you should treat parquet files as immutable. Most big data engines implement the "append" operation by creating a new file and uploading it to blob storage. This is in fact cheaper for the writer and easier for the reader to manage, as the entire large file does not need to be downloaded and rewritten every time.

In addition to that, most parquet libraries do not support append mode (see https://issues.apache.org/jira/browse/PARQUET-1022), and parquet.net implements it as a bonus ;)
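So instead of appending in place, each batch of rows can become its own small file next to the existing ones. A minimal sketch of that pattern, again assuming a recent Parquet.Net 4.x async API and Azure.Storage.Blobs (connection string, container, blob prefix and schema are made-up examples):

```csharp
using Azure.Storage.Blobs;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Placeholder connection details and sample data - replace with your own.
string connectionString = "<storage-connection-string>";
DateTime[] batchTimestamps = { DateTime.UtcNow };
double[] batchValues = { 42.0 };

var container = new BlobContainerClient(connectionString, "timeseries");

var timestamp = new DataField<DateTime>("timestamp");
var value = new DataField<double>("value");
var schema = new ParquetSchema(timestamp, value);

// Write just this batch to its own small parquet file in memory...
using var ms = new MemoryStream();
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schema, ms))
{
    using ParquetRowGroupWriter rg = writer.CreateRowGroup();
    await rg.WriteColumnAsync(new DataColumn(timestamp, batchTimestamps));
    await rg.WriteColumnAsync(new DataColumn(value, batchValues));
}

// ...then upload it under a unique, time-ordered name. Readers list the
// "sensor1/" prefix and treat all files under it as one logical dataset.
ms.Position = 0;
string blobName = $"sensor1/{DateTime.UtcNow:yyyyMMddHHmmssfff}.parquet";
await container.UploadBlobAsync(blobName, ms);
```

If the number of small files eventually becomes a problem for readers, a periodic compaction job can merge them into larger files offline.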