[Feature Request]: Interactive Beam supports asynchronous computations. #33103

Open
1 of 17 tasks
ganesh4991 opened this issue Nov 13, 2024 · 2 comments

What would you like to happen?

Summary

This feature request proposes adding asynchronous computation to Apache Beam's Interactive Beam API, so that long-running tasks can execute in the background without blocking the user interface.

Motivation

Interactive Beam is a powerful tool for iterative pipeline development and debugging. However, long-running collect operations can block the interactive environment, hindering productivity and exploration. Introducing asynchronous computation would significantly improve the user experience by allowing developers to continue building the pipeline while computations are executed in the background.

Proposed Solution

Interactive Beam would offer a compute API that runs asynchronously in the background and produces no result to display in the interactive interface (e.g., Colab).

def compute(
    *pcolls,
    wait_for_inputs: bool = True,
    blocking: bool = False,
    runner=None,
    options=None,
    force_compute=False) -> None

This API introduces two new options:

  • wait_for_inputs: Whether to wait until the asynchronous dependencies have
    been computed. Setting this to False schedules the computation
    immediately, but may run the same pipeline stages more than once.
  • blocking: If False, the computation runs in a non-blocking fashion. In a
    Colab/IPython environment this mode also provides controls for the
    running pipeline. If True, the call blocks until the pipeline is done.
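The interaction of the two flags can be illustrated with a toy model. This is only a sketch built on Python's standard thread pool, not Beam code: compute_sketch, run_stage, run_counts, and the upstream/pending parameters are all hypothetical names, and unlike the proposed API the sketch returns a Future so the example can observe completion.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, Optional

_pool = ThreadPoolExecutor(max_workers=4)
_lock = threading.Lock()
run_counts: Dict[str, int] = {}  # stage name -> number of executions


def run_stage(name: str) -> str:
    """Stand-in for executing one pipeline stage; only counts the run."""
    with _lock:
        run_counts[name] = run_counts.get(name, 0) + 1
    return name


def compute_sketch(stage: str,
                   upstream: Optional[str] = None,
                   pending: Optional[Future] = None,
                   wait_for_inputs: bool = True,
                   blocking: bool = False) -> Future:
    """Hypothetical toy version of the proposed compute() semantics."""
    def job() -> str:
        if upstream is not None:
            if wait_for_inputs and pending is not None:
                pending.result()     # reuse the in-flight upstream computation
            else:
                run_stage(upstream)  # start immediately, recomputing upstream
        return run_stage(stage)

    future = _pool.submit(job)
    if blocking:
        future.result()              # blocking=True: return only when done
    return future


read = compute_sketch("read")                                   # background
counted = compute_sketch("count", upstream="read", pending=read)
compute_sketch("sum", upstream="read", pending=read,
               wait_for_inputs=False, blocking=True)
counted.result()
print(run_counts["read"])  # prints 2: once on its own, once recomputed by "sum"
```

In this model, "count" waits for the pending "read" and reuses it, while "sum" skips the wait and pays for a second run of "read", which is the trade-off the wait_for_inputs description refers to.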

compute operations can subsequently be followed by collect operations on the same PCollection so that users can view the result.

Benefits

  • The ability to compute time-consuming PCollections asynchronously
  • Sink operations that do not produce any meaningful output can use compute instead of collect

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@ganesh4991
Author

cc @robertwb

@damondouglas
Contributor

cc: @damondouglas
