Proposed ArrowRecordBatchAdapter to replace ParquetReader with read_from_memory_tables=True
#297
Labels: adapter: parquet
Is your feature request related to a problem? Please describe.
The ability to stream a sequence of in-memory Arrow tables into csp is very powerful, but it is currently somewhat hidden inside the ParquetReader implementation (enabled by setting `read_from_memory_tables=True`), making it hard for users to discover. It can also be tricky to use correctly, since many of the ParquetReader arguments are ignored in this mode.
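For reference, the current workaround looks roughly like the sketch below. The constructor and subscription calls are assumptions based on the existing reader and may not match the current API exactly; `Trade`, `make_tables`, and the column names are invented for illustration:

```python
from datetime import datetime

import pyarrow as pa

import csp
from csp.adapters.parquet import ParquetReader


class Trade(csp.Struct):
    price: float
    size: int


def make_tables():
    # Hypothetical generator yielding in-memory arrow tables
    yield pa.table(
        {
            "time": pa.array(
                [datetime(2023, 1, 1, 9, 30), datetime(2023, 1, 1, 9, 31)],
                type=pa.timestamp("ns"),
            ),
            "price": pa.array([100.0, 101.5]),
            "size": pa.array([10, 20]),
        }
    )


@csp.graph
def my_graph():
    # The arrow streaming path is buried behind this flag; most other
    # ParquetReader arguments are silently ignored in this mode.
    reader = ParquetReader(
        make_tables, time_column="time", read_from_memory_tables=True
    )
    trades = reader.subscribe_all(Trade)
    csp.print("trades", trades)
```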
Describe the solution you'd like
A dedicated pull adapter that consumes a sequence of Arrow record batches (or tables) as efficiently as possible, e.g. `ArrowRecordBatchAdapter`.
An initial implementation could simply delegate to the existing ParquetReader under the hood.
Ideally, the solution should also be zero-copy on the underlying Arrow tables (I am not positive this is the case at the moment). A possible shape for the adapter is sketched below.
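To make the desired ergonomics concrete, here is one possible shape for the new adapter. Everything below (module path, class name, constructor arguments) is hypothetical and does not exist in csp today:

```python
import csp
# Hypothetical module path and class name for the proposed adapter
from csp.adapters.arrow import ArrowRecordBatchAdapter


@csp.graph
def my_graph():
    batches = make_tables()  # any iterable of pyarrow tables/record batches
    # Hypothetical constructor: time_column names the timestamp column
    # that should drive the csp engine clock
    reader = ArrowRecordBatchAdapter(batches, time_column="time")
    trades = reader.subscribe_all(Trade)  # Trade as defined in the sketch above
    csp.print("trades", trades)
```

As suggested above, an initial version could simply forward to `ParquetReader(..., read_from_memory_tables=True)` internally, leaving room for a dedicated zero-copy implementation later.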
Describe alternatives you've considered
Continuing to use the parquet reader is confusing and unintuitive, especially when the Arrow tables come from sources other than Parquet files.
Additional context
Other features to consider
Note that since a number of other tools can efficiently and flexibly produce sequences of Arrow tables or record batches from different sources (polars, duckdb, arrow-odbc, ray), a generic ArrowRecordBatchAdapter would enable a much larger number of historical data connections with very little additional effort; see the sketch below.
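For illustration, the batch-producing calls below are real duckdb/polars/pyarrow APIs, while `ArrowRecordBatchAdapter` is still the hypothetical adapter proposed above (and `'trades.parquet'` is a made-up file):

```python
import duckdb
import polars as pl

# DuckDB can stream query results as arrow record batches
con = duckdb.connect()
reader = con.execute(
    "SELECT * FROM 'trades.parquet' ORDER BY time"
).fetch_record_batch()

# Polars DataFrames convert to arrow tables, which split into batches
df = pl.DataFrame({"price": [100.0, 101.5], "size": [10, 20]})
batches = df.to_arrow().to_batches()

# Either source could then feed the proposed adapter, e.g.:
#   ArrowRecordBatchAdapter(reader, time_column="time")
```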