-
Simply put, because of the internal binary layout of a parquet file, columns must be written in the order they are declared in the schema. Generally writing shouldn't be slow, as it's the fastest parquet writer on the market across all implementations, according to performance tests ;) Also, parquet files are generally optimised for reading: you can parallelise reads by column chunk as much as you want, but writing is expected to be slower due to how data is packed and compressed, both logically and physically. Are you sure it's the library that is slow, and not the way you are preparing the data before it's written?
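If the bottleneck really is data preparation, one common pattern is to build the column arrays in parallel and then write them serially, in schema order. Below is a minimal sketch assuming Parquet.Net 4.x top-level statements; the field names and data are made up for illustration, not the actual PostgreSQL columns from this thread:

```csharp
using System.Linq;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Hypothetical schema; the real fields (niv_agregat, type_ca, ...) would go here.
var idField = new DataField<int>("id");
var labelField = new DataField<string>("label");
var schema = new ParquetSchema(idField, labelField);

// Prepare the column data in parallel (usually the expensive part)...
Task<int[]> prepIds = Task.Run(() => Enumerable.Range(0, 1_000).ToArray());
Task<string[]> prepLabels = Task.Run(() =>
    Enumerable.Range(0, 1_000).Select(i => $"row{i}").ToArray());
await Task.WhenAll(prepIds, prepLabels);

// ...but write the columns serially, in schema declaration order.
using Stream stream = File.Create("out.parquet");
using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, stream);
using ParquetRowGroupWriter rowGroup = writer.CreateRowGroup();
await rowGroup.WriteColumnAsync(new DataColumn(idField, prepIds.Result));
await rowGroup.WriteColumnAsync(new DataColumn(labelField, prepLabels.Result));
```

This keeps the writer happy (one ordered write path) while still letting the CPU-heavy preparation run concurrently.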
-
I have used Parquet.Net successfully to write parquet files from a PostgreSQL table or SQL query, but I found WriteColumnAsync quite "slow".
Currently my working serial code is like this:
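A minimal serial write with Parquet.Net looks roughly like the following. This is a sketch assuming v4.x of the library, with hypothetical fields rather than the actual columns from the query:

```csharp
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Hypothetical fields, not the actual postgres columns.
var idField = new DataField<int>("id");
var nameField = new DataField<string>("name");
var schema = new ParquetSchema(idField, nameField);

using Stream stream = File.Create("data.parquet");
using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, stream);
using ParquetRowGroupWriter rowGroup = writer.CreateRowGroup();

// Each column is written in the order it is declared in the schema.
await rowGroup.WriteColumnAsync(new DataColumn(idField, new[] { 1, 2, 3 }));
await rowGroup.WriteColumnAsync(new DataColumn(nameField, new[] { "a", "b", "c" }));
```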
My idea was to use a parallel writer like this:
Unfortunately this does not work, and WriteColumnAsync throws an error:
System.ArgumentException: cannot write this column, expected 'niv_agregat', passed: 'type_ca' (Parameter 'column')
at Parquet.ParquetRowGroupWriter.WriteColumnAsync(DataColumn column, Dictionary`2 customMetadata, CancellationToken cancellationToken)
In the source code comments it is mentioned several times that "Writes next data column to parquet stream. Note that columns must be written in the order they are declared in the file schema." But I'm wondering: why such a constraint?