Replies: 2 comments
-
Which of the APIs in your dream pseudocode are you missing? Maybe I'm not understanding, but it seems like everything in there is supported by the library (getting the list of data fields, appending fields to the list, constructing a new schema from fields, etc.)
-
Thanks for the quick reply, and sorry for the delay. I did explore the column API and was able to build something more ergonomic where I could programmatically construct the output schema from the input schema. It was much less ugly than my first pass, where I had to explicitly define every column, and doing it by columns was more memory efficient, so thanks - the library was helpful. A few things (and if this is still too vague I'd be happy to elaborate), looking at my final code for the column-based approach, which is, at a pseudocode level, roughly the shape sketched below.
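A minimal reconstruction of that shape, assuming the Parquet.Net column API (`ParquetReader`, `ParquetWriter`, `DataField`); the paths and the derived column are placeholders, and a single row group is assumed:

```fsharp
open System.IO
open Parquet
open Parquet.Schema
open Parquet.Data

// Copy every column of an existing file and append one derived column,
// building the output schema from the input schema rather than by hand.
let copyWithDerivedColumn (inputPath: string) (outputPath: string) = task {
    use inStream = File.OpenRead inputPath
    use! reader = ParquetReader.CreateAsync inStream
    let oldFields = reader.Schema.GetDataFields()

    // output schema = input fields + programmatically added fields
    let newField = DataField<float>("derived")
    let outFields =
        Array.append (oldFields |> Array.map (fun f -> f :> Field)) [| newField |]
    let outSchema = ParquetSchema(outFields)

    use outStream = File.Create outputPath
    use! writer = ParquetWriter.CreateAsync(outSchema, outStream)
    use rgReader = reader.OpenRowGroupReader 0   // assumes a single row group
    use rgWriter = writer.CreateRowGroup()

    // loop over fields instead of dozens of index-fragile column statements
    for field in oldFields do
        let! column = rgReader.ReadColumnAsync field
        do! rgWriter.WriteColumnAsync column

    // derived values computed elsewhere; zeros as a stand-in
    let derived = Array.zeroCreate<float> (int rgReader.RowCount)
    do! rgWriter.WriteColumnAsync(DataColumn(newField, derived))
}
```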
I like that the library supports both low-level and high-level approaches, but it feels a bit arbitrary that only the row-oriented side is high level. Small thing, but a one-line equivalent for the column approach (at least to get as far as reading the schema) would be nice ergonomically. I use pandas constantly just to peek at a file, and having to open a file stream and then the pq file on top of it is friction. Something like:

```fsharp
let! table = Parquet.ParquetReader.ReadTableFromFileAsync pqPath
```

Likewise for the output: it's nice to have control over all of this at a low level, but for lots of applications a one-line open for writing (with a single row group) would be handy. There's a lot of ceremony to handle the stream, then the schema, then the row group. Something like:

```fsharp
let outSchema = new ParquetSchema(oldFields @ newFields)
use! outPQ = ColumnWriter(outputPath, outSchema)
outPQ.WriteColumn(..)
```

Also a small thing, but could you add a lookup for columns based on name? E.g. something like `let! column = myPqFile["mycolumn"]`. I found myself writing a bunch of filters to find columns by index, and then having to remember the indices. The python APIs all use strings for column ids (I know that has its own pitfalls; an F# type provider that made types with the columns as field names would be a dream, but that's a whole other level of effort). I'll play with it some more, but overall it's a wonderful tool, and I'm just wondering how to get a little closer to the python versions, which are typically one line to read and one line to write. The downside of course is that it's then harder to incrementally process files, so you do need different flavors of high and low level. I guess the column and row oriented interfaces are just a little too tightly coupled with level right now, but I might also be missing something. Thanks!
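For what it's worth, both wishes can be approximated today with small user-side helpers. A sketch, assuming `ParquetReader.CreateAsync` and the `ReadAsTableAsync` extension from `Parquet.Rows`; `readTableFromFile` and `tryFindField` are hypothetical helper names, not library APIs:

```fsharp
open Parquet
open Parquet.Rows
open Parquet.Schema

// One-line-ish read: hide the stream/reader ceremony behind a helper.
let readTableFromFile (path: string) = task {
    use stream = System.IO.File.OpenRead path
    use! reader = ParquetReader.CreateAsync stream
    return! reader.ReadAsTableAsync()
}

// Name-based field lookup instead of index bookkeeping.
let tryFindField (schema: ParquetSchema) (name: string) =
    schema.GetDataFields() |> Array.tryFind (fun f -> f.Name = name)
```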
-
Issue description
I had a (largely) good experience using parquet-dotnet for a project recently, but my final code looks (to me) horrendous, and I figure I must be missing something. I looked at the documentation and couldn't find a path, so any pointers are appreciated. I've worked extensively with pandas and am not a great fan; I do most of my work in F#, and type-safe code is preferred. What I found a little frustrating was the need to enumerate all of the columns when updating an existing file.
Conceptually, what I am doing is reading an existing parquet file, computing some new columns, and writing a new file that contains all of the original columns plus the new ones.
I started with a table reader because it seemed the simplest way to load data in an agnostic fashion.
I could not for the life of me work out how to programmatically define an output schema using just the data from the input schema. What I would like to do is something like the sketch below.
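A hedged reconstruction of that intent (the original snippet isn't preserved here); it assumes `open Parquet.Schema` and an already-open `ParquetReader` named `reader`, and `"newColumn"` is a made-up field:

```fsharp
// Build the output schema from the input schema plus programmatic additions,
// instead of re-declaring every existing column by hand.
let oldFields =
    reader.Schema.GetDataFields() |> Array.map (fun f -> f :> Field)
let newFields = [| DataField<float>("newColumn") :> Field |]
let outSchema = ParquetSchema(Array.append oldFields newFields)
```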
What I ended up having to do, which was painful, was to redefine all columns explicitly for the output schema. This was tedious, as there are many columns in the input file. I could almost be resigned to that if defining types explicitly for a parquet file were the price of type safety. In this case, though, all of the type information comes from the existing file, so the library is just checking my homework, and it could easily help me define the new schema. Again, apologies if there is an easy way to do this, but I couldn't find an API for defining a schema from an array of table columns. What I conceptually want to do is clone an existing table and just add columns.
On the output side, I again couldn't easily use a loop to generate all of the existing columns. I ended up writing them out one by one because I had to get it to work, but that was another 50 lines of explicit column writing with indexes that are very fragile: if I move columns around, I have to adjust the indices.
The final point of friction was just pulling out a column of data. I know the row-oriented interface in Tables is meant for row processing, but given the underlying column-based structure, it feels like there should still be an API for getting a column of data, preferably by name, as a specific type. I couldn't work out how to do that, and I might just be missing things, so let me know if there is an easy way.
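As an illustration of the kind of call that would help, a sketch assuming the column API (same opens as above); `readColumnByName` is a hypothetical helper, and a single row group is assumed:

```fsharp
// Fetch one column by name as a typed array.
let readColumnByName<'T> (reader: ParquetReader) (name: string) = task {
    let field =
        reader.Schema.GetDataFields()
        |> Array.find (fun f -> f.Name = name)
    use rg = reader.OpenRowGroupReader 0   // assumes a single row group
    let! column = rg.ReadColumnAsync field
    return column.Data :?> 'T[]
}
```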
I ended up defining my output schema like this, from scratch:
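An illustrative stand-in for that from-scratch definition (the real code isn't shown; the field names here are made up):

```fsharp
// Every column re-declared by hand, even though the input file already
// carries all of this type information.
let outSchema =
    ParquetSchema(
        DataField<int64>("id"),
        DataField<string>("label"),
        DataField<float>("value")
        // ...and so on, one hand-written line per column
    )
```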
Maybe I should have started with the column-based API? The main disadvantage there is that it's lower level, so I didn't see a one-line read option. I dislike a lot of things about the pandas/python world, but for loading/saving/accessing data they generally offer one-line operations, and that doesn't seem impossible in the dotnet world. I work with a wide variety of parquet files that typically have double- or triple-digit numbers of columns, and it would be nice to be able to handle column definitions more programmatically.
thanks in advance for any thoughts.
P.S. I can't share the original code, so sorry for the obfuscation, but if you're interested I could make something comparable in shape and send example code so you can see the true ugliness of my current implementation.