[Python] OOM with Dataset.scanner but ok with ParquetFile.iter_batches #44799

Open
theogaraj opened this issue Nov 20, 2024 · 4 comments

@theogaraj

Describe the bug, including details regarding any error messages, version, and platform.

Setup

I am using pyarrow version 18.0.0.

I am running my tests on an AWS r6g.large instance running Amazon Linux. (I also attempted using instances with larger memory in case the problem was that there was some base-level memory needed irrespective of minimal batch sizes and readahead, but this didn't help.)

My data consists of parquet files in S3, varying in size from a few hundred kB to ~ 1GB, for a total of 3.4GB. This is a sample subset of my actual dataset which is ~ 50GB.

Problem description

I have a set of parquet files with very small row-groups, and I am attempting to use the pyarrow.dataset API to transform this into a set of files with larger row-groups. My basic approach is dataset -> scanner -> write_dataset. After running into OOM problems with default parameters, I ratcheted down the read and write batch sizes and concurrent readahead:

from pyarrow import dataset as ds

data = ds.dataset(INPATH, format='parquet')

# note the small batch size and minimal values for readahead
scanner = data.scanner(
    batch_size=50,
    batch_readahead=1,
    fragment_readahead=1
)

# again, note extremely small values for output batch sizes
ds.write_dataset(
    scanner,
    base_dir=str(OUTPATH),
    format='parquet',
    min_rows_per_group=1000,
    max_rows_per_group=1000
)

Running this results in increasing memory consumption (monitored using top) until the process maxes out available memory and is finally killed.

What worked to keep memory use under control was to replace the dataset scanner with ParquetFile.iter_batches as below:

from pyarrow import dataset as ds
import pyarrow.parquet as pq

def batcherator(filepath, batch_size):
    for f in filepath.glob('*.parquet'):
        with pq.ParquetFile(f) as pf:
            yield from pf.iter_batches(batch_size=batch_size)

scanner = batcherator(INPATH, 2000)   # it's fine with higher batch size than previous

ds.write_dataset(
    scanner,
    base_dir=str(OUTPATH),
    format='parquet',
    min_rows_per_group=10_000,   # again, higher values of write batch sizes
    max_rows_per_group=10_000
)

Since nothing's really changing on the dataset.write_dataset side, it seems like there's some issue with runaway memory use on the scanner side of things?

The closest I could find online was DuckDB issue duckdb/duckdb#7856, which in turn pointed to Arrow issue #31486. That one seems to hint more at a problem with write_dataset, which for me seemed ok once I replaced how I am reading in the data.

Component(s)

Python

@raulcd
Member

raulcd commented Nov 21, 2024

@pitrou @jorisvandenbossche FYI
I'll add this to my list of things to investigate and try to reproduce.

@raulcd changed the title from "OOM with Dataset.scanner but ok with ParquetFile.iter_batches" to "[Python] OOM with Dataset.scanner but ok with ParquetFile.iter_batches" on Nov 21, 2024
@theogaraj
Author

@raulcd thanks for taking a look at this. Let me know if there's any additional info I can provide. I think it would also be ok for me to share the data sample so you could see the actual data I'm working with. I can't give you access to our S3 but could probably upload it anyplace you like, it's ~ 3.4GB. Let me know!

@mapleFU
Member

mapleFU commented Nov 27, 2024

Dataset.scanner tries its best to increase the scan depth so that it can scan the files asynchronously. Maybe we can set a smaller prefetch option.

@theogaraj
Author

@mapleFU I thought that the purpose of the batch_readahead and fragment_readahead parameters of Dataset.scanner was to control the level of parallel/advance reading of the data, in order to control memory usage. Is that not what these parameters do?

3 participants