Replies: 2 comments
-
Which of the APIs in your dream pseudocode are you missing? Maybe I'm not understanding, but it seems like everything in there is supported by the library (getting the list of data fields, appending fields to the list, constructing a new schema from fields, etc.)
-
Thanks for the quick reply, and sorry for the delay. I did explore the column API and was able to build something more ergonomic where I could programmatically construct the output schema from the input schema. It was much less ugly than my first pass, where I had to explicitly define every column, and doing it by columns was more memory efficient, so thanks - the library was helpful. A few things (and if this is still too vague I'd be happy to elaborate), looking at my final code for the column-based approach, which is, at a pseudocode level, roughly the shape sketched below.
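A minimal reconstruction of that shape, assuming the Parquet.Net column API (`ParquetReader`, `ParquetWriter`, `DataField`); the paths and the derived column are placeholders, and a single row group is assumed:

```fsharp
open System.IO
open Parquet
open Parquet.Schema
open Parquet.Data

// Copy every column of an existing file and append one derived column,
// building the output schema from the input schema rather than by hand.
let copyWithDerivedColumn (inputPath: string) (outputPath: string) = task {
    use inStream = File.OpenRead inputPath
    use! reader = ParquetReader.CreateAsync inStream
    let oldFields = reader.Schema.GetDataFields()

    // output schema = input fields + programmatically added fields
    let newField = DataField<float>("derived")
    let outFields =
        Array.append (oldFields |> Array.map (fun f -> f :> Field)) [| newField |]
    let outSchema = ParquetSchema(outFields)

    use outStream = File.Create outputPath
    use! writer = ParquetWriter.CreateAsync(outSchema, outStream)
    use rgReader = reader.OpenRowGroupReader 0   // assumes a single row group
    use rgWriter = writer.CreateRowGroup()

    // loop over fields instead of dozens of index-fragile column statements
    for field in oldFields do
        let! column = rgReader.ReadColumnAsync field
        do! rgWriter.WriteColumnAsync column

    // derived values computed elsewhere; zeros as a stand-in
    let derived = Array.zeroCreate<float> (int rgReader.RowCount)
    do! rgWriter.WriteColumnAsync(DataColumn(newField, derived))
}
```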
I like that the library supports both low-level and high-level approaches, but it feels a bit arbitrary that only the row-oriented side is high level. Small thing, but a one-line equivalent for the column approach (at least to get as far as reading the schema) would be nice ergonomically. I use pandas constantly just to peek at a file, and having to open a file stream and then the pq file on top of it is friction. Something like:

```fsharp
let! table = Parquet.ParquetReader.ReadTableFromFileAsync pqPath
```

Likewise for the output: it's nice to have control over all of this at a low level, but for lots of applications a one-line open for writing (with a single row group) would be handy. There's a lot of ceremony to handle the stream, then the schema, then the row group. Something like:

```fsharp
let outSchema = new ParquetSchema(oldFields @ newFields)
use! outPQ = ColumnWriter(outputPath, outSchema)
outPQ.WriteColumn(..)
```

Also a small thing, but could you add a lookup for columns based on name? E.g. something like `let! column = myPqFile["mycolumn"]`. I found myself writing a bunch of filters to find columns by index, and then having to remember the indices. The python APIs all use strings for column ids (I know that has its own pitfalls; an F# type provider that made types with the columns as field names would be a dream, but that's a whole other level of effort). I'll play with it some more, but overall it's a wonderful tool, and I'm just wondering how to get a little closer to the python versions, which are typically one line to read and one line to write. The downside of course is that it's then harder to incrementally process files, so you do need different flavors of high and low level. I guess the column and row oriented interfaces are just a little too tightly coupled with level right now, but I might also be missing something. Thanks!
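For what it's worth, both wishes can be approximated today with small user-side helpers. A sketch, assuming `ParquetReader.CreateAsync` and the `ReadAsTableAsync` extension from `Parquet.Rows`; `readTableFromFile` and `tryFindField` are hypothetical helper names, not library APIs:

```fsharp
open Parquet
open Parquet.Rows
open Parquet.Schema

// One-line-ish read: hide the stream/reader ceremony behind a helper.
let readTableFromFile (path: string) = task {
    use stream = System.IO.File.OpenRead path
    use! reader = ParquetReader.CreateAsync stream
    return! reader.ReadAsTableAsync()
}

// Name-based field lookup instead of index bookkeeping.
let tryFindField (schema: ParquetSchema) (name: string) =
    schema.GetDataFields() |> Array.tryFind (fun f -> f.Name = name)
```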
-
Issue description
I had a (largely) good experience using parquet-dotnet for a project recently, but my final code looks (to me) horrendous, and I figure I must be missing something. I looked at the documentation and couldn't find a path, so any pointers are appreciated. I've worked extensively with pandas and am not a great fan; I do most of my work in F#, and type-safe code is preferred. What I found a little frustrating was the need to enumerate all of the columns when updating an existing file.
Conceptually, what I am doing is reading an existing parquet file, computing some new columns, and writing a new file that contains all of the original columns plus the new ones.
I started with a table reader because it seemed the simplest way to load data in an agnostic fashion.
I could not for the life of me work out how to programmatically define an output schema using just the data from the input schema. What I would like to do is something like the sketch below.
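A hedged reconstruction of that intent (the original snippet isn't preserved here); it assumes `open Parquet.Schema` and an already-open `ParquetReader` named `reader`, and `"newColumn"` is a made-up field:

```fsharp
// Build the output schema from the input schema plus programmatic additions,
// instead of re-declaring every existing column by hand.
let oldFields =
    reader.Schema.GetDataFields() |> Array.map (fun f -> f :> Field)
let newFields = [| DataField<float>("newColumn") :> Field |]
let outSchema = ParquetSchema(Array.append oldFields newFields)
```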
What I ended up having to do, which was painful, was to redefine all columns explicitly for the output schema. This was tedious, as there are many columns in the input file. I could almost be resigned to that if defining types explicitly for a parquet file were the price of type safety. In this case, though, all of the type information comes from the existing file, so the library is just checking my homework, and it could easily help me define the new schema. Again, apologies if there is an easy way to do this, but I couldn't find an API for defining a schema from an array of table columns. What I conceptually want to do is clone an existing table and just add columns.
On the output side, I again couldn't easily use a loop to generate all of the existing columns. I ended up writing them out one by one because I had to get it to work, but that was another 50 lines of explicit column writing with indexes that are very fragile: if I move columns around, I have to adjust the indices.
The final point of friction was just pulling out a column of data. I know the row-oriented interface in Tables is meant for row processing, but given the underlying column-based structure, it feels like there should still be an API for getting a column of data, preferably by name, as a specific type. I couldn't work out how to do that, and I might just be missing things, so let me know if there is an easy way.
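As an illustration of the kind of call that would help, a sketch assuming the column API (same opens as above); `readColumnByName` is a hypothetical helper, and a single row group is assumed:

```fsharp
// Fetch one column by name as a typed array.
let readColumnByName<'T> (reader: ParquetReader) (name: string) = task {
    let field =
        reader.Schema.GetDataFields()
        |> Array.find (fun f -> f.Name = name)
    use rg = reader.OpenRowGroupReader 0   // assumes a single row group
    let! column = rg.ReadColumnAsync field
    return column.Data :?> 'T[]
}
```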
I ended up defining my output schema like this, from scratch:
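An illustrative stand-in for that from-scratch definition (the real code isn't shown; the field names here are made up):

```fsharp
// Every column re-declared by hand, even though the input file already
// carries all of this type information.
let outSchema =
    ParquetSchema(
        DataField<int64>("id"),
        DataField<string>("label"),
        DataField<float>("value")
        // ...and so on, one hand-written line per column
    )
```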
Maybe I should have started with the column-based API? The main disadvantage there is that it's lower level, so I didn't see a one-line read option. I dislike a lot of things about the pandas/python world, but for loading/saving/accessing data they generally offer one-line operations, and that doesn't seem impossible in the dotnet world. I work with a wide variety of parquet files that typically have double- or triple-digit numbers of columns, and it would be nice to be able to handle column definitions more programmatically.
thanks in advance for any thoughts.
P.S. I can't share the original code, so sorry for the obfuscation, but if you're interested I could make something comparable in shape and send example code so you can see the true ugliness of my current implementation.