Proposed ArrowRecordBatchAdapter to replace ParquetReader with read_from_memory_tables=True
#297
Labels: adapter: parquet
Is your feature request related to a problem? Please describe.
The ability to stream a sequence of in-memory Arrow tables into csp is very powerful, but it is currently somewhat hidden inside the ParquetReader implementation (enabled by setting `read_from_memory_tables=True`), making it hard for users to discover. It can also be tricky to use correctly, since many of the ParquetReader arguments are ignored in this mode.
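For reference, the current workaround looks roughly like the sketch below. The constructor and subscription calls are assumptions based on the existing reader and may not match the current API exactly; `Trade`, `make_tables`, and the column names are invented for illustration:

```python
from datetime import datetime

import pyarrow as pa

import csp
from csp.adapters.parquet import ParquetReader


class Trade(csp.Struct):
    price: float
    size: int


def make_tables():
    # Hypothetical generator yielding in-memory arrow tables
    yield pa.table(
        {
            "time": pa.array(
                [datetime(2023, 1, 1, 9, 30), datetime(2023, 1, 1, 9, 31)],
                type=pa.timestamp("ns"),
            ),
            "price": pa.array([100.0, 101.5]),
            "size": pa.array([10, 20]),
        }
    )


@csp.graph
def my_graph():
    # The arrow streaming path is buried behind this flag; most other
    # ParquetReader arguments are silently ignored in this mode.
    reader = ParquetReader(
        make_tables, time_column="time", read_from_memory_tables=True
    )
    trades = reader.subscribe_all(Trade)
    csp.print("trades", trades)
```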
Describe the solution you'd like
A dedicated pull adapter that consumes a sequence of Arrow record batches (or tables) as efficiently as possible, e.g. `ArrowRecordBatchAdapter`.
An initial implementation could simply delegate to the existing ParquetReader under the hood.
Ideally, the solution should also be zero-copy on the underlying Arrow tables (I am not positive this is the case at the moment). A possible shape for the adapter is sketched below.
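To make the desired ergonomics concrete, here is one possible shape for the new adapter. Everything below (module path, class name, constructor arguments) is hypothetical and does not exist in csp today:

```python
import csp
# Hypothetical module path and class name for the proposed adapter
from csp.adapters.arrow import ArrowRecordBatchAdapter


@csp.graph
def my_graph():
    batches = make_tables()  # any iterable of pyarrow tables/record batches
    # Hypothetical constructor: time_column names the timestamp column
    # that should drive the csp engine clock
    reader = ArrowRecordBatchAdapter(batches, time_column="time")
    trades = reader.subscribe_all(Trade)  # Trade as defined in the sketch above
    csp.print("trades", trades)
```

As suggested above, an initial version could simply forward to `ParquetReader(..., read_from_memory_tables=True)` internally, leaving room for a dedicated zero-copy implementation later.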
Describe alternatives you've considered
Continuing to use the parquet reader is confusing and unintuitive, especially when the Arrow tables come from sources other than Parquet files.
Additional context
Other features to consider
Note that since a number of other tools can efficiently and flexibly produce sequences of Arrow tables or record batches from different sources (polars, duckdb, arrow-odbc, ray), a generic ArrowRecordBatchAdapter would enable a much larger number of historical data connections with very little additional effort; see the sketch below.
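For illustration, the batch-producing calls below are real duckdb/polars/pyarrow APIs, while `ArrowRecordBatchAdapter` is still the hypothetical adapter proposed above (and `'trades.parquet'` is a made-up file):

```python
import duckdb
import polars as pl

# DuckDB can stream query results as arrow record batches
con = duckdb.connect()
reader = con.execute(
    "SELECT * FROM 'trades.parquet' ORDER BY time"
).fetch_record_batch()

# Polars DataFrames convert to arrow tables, which split into batches
df = pl.DataFrame({"price": [100.0, 101.5], "size": [10, 20]})
batches = df.to_arrow().to_batches()

# Either source could then feed the proposed adapter, e.g.:
#   ArrowRecordBatchAdapter(reader, time_column="time")
```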