-
Simply put, because of the internal binary layout of a parquet file, columns must be written in the order they are declared in the schema. Generally writing shouldn't be slow, as it's the fastest parquet writer on the market across all implementations, according to performance tests ;) Also, parquet files are generally optimised for reading: you can parallelise reads by column chunk as much as you want, but writing is expected to be slower due to how data is packed and compressed, both logically and physically. Are you sure it's the library that is slow, and not the way you are preparing the data before it's written?
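If the bottleneck really is data preparation, one common pattern is to build the column arrays in parallel and then write them serially, in schema order. Below is a minimal sketch assuming Parquet.Net 4.x top-level statements; the field names and data are made up for illustration, not the actual PostgreSQL columns from this thread:

```csharp
using System.Linq;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Hypothetical schema; the real fields (niv_agregat, type_ca, ...) would go here.
var idField = new DataField<int>("id");
var labelField = new DataField<string>("label");
var schema = new ParquetSchema(idField, labelField);

// Prepare the column data in parallel (usually the expensive part)...
Task<int[]> prepIds = Task.Run(() => Enumerable.Range(0, 1_000).ToArray());
Task<string[]> prepLabels = Task.Run(() =>
    Enumerable.Range(0, 1_000).Select(i => $"row{i}").ToArray());
await Task.WhenAll(prepIds, prepLabels);

// ...but write the columns serially, in schema declaration order.
using Stream stream = File.Create("out.parquet");
using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, stream);
using ParquetRowGroupWriter rowGroup = writer.CreateRowGroup();
await rowGroup.WriteColumnAsync(new DataColumn(idField, prepIds.Result));
await rowGroup.WriteColumnAsync(new DataColumn(labelField, prepLabels.Result));
```

This keeps the writer happy (one ordered write path) while still letting the CPU-heavy preparation run concurrently.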
-
I have used Parquet.Net successfully to write parquet files from a PostgreSQL table or SQL query, but I found WriteColumnAsync quite "slow".
Currently my working serial code is like this:
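A minimal serial write with Parquet.Net looks roughly like the following. This is a sketch assuming v4.x of the library, with hypothetical fields rather than the actual columns from the query:

```csharp
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Hypothetical fields, not the actual postgres columns.
var idField = new DataField<int>("id");
var nameField = new DataField<string>("name");
var schema = new ParquetSchema(idField, nameField);

using Stream stream = File.Create("data.parquet");
using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, stream);
using ParquetRowGroupWriter rowGroup = writer.CreateRowGroup();

// Each column is written in the order it is declared in the schema.
await rowGroup.WriteColumnAsync(new DataColumn(idField, new[] { 1, 2, 3 }));
await rowGroup.WriteColumnAsync(new DataColumn(nameField, new[] { "a", "b", "c" }));
```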
My idea was to use a parallel writer like this:
Unfortunately this does not work, and WriteColumnAsync throws an error:
System.ArgumentException: cannot write this column, expected 'niv_agregat', passed: 'type_ca' (Parameter 'column')
at Parquet.ParquetRowGroupWriter.WriteColumnAsync(DataColumn column, Dictionary`2 customMetadata, CancellationToken cancellationToken)
In the source code comments it is mentioned several times that "Writes next data column to parquet stream. Note that columns must be written in the order they are declared in the file schema." But I'm wondering: why such a constraint?