[Bug] Cosmos tasks failing to heartbeat and killed eventually as zombies #1324

Open
1 task
pankajkoti opened this issue Nov 14, 2024 · 1 comment
Labels
  • area:execution: Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc
  • bug: Something isn't working
  • triage-needed: Items need to be reviewed / assigned to milestone

Comments

@pankajkoti
Contributor

pankajkoti commented Nov 14, 2024

Astronomer Cosmos Version

1.7.1

dbt-core version

NA

Versions of dbt adapters

No response

LoadMode

AUTOMATIC

ExecutionMode

LOCAL

InvocationMode

SUBPROCESS

airflow version

NA

Operating System

NA

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Amazon (AWS) MWAA

Deployment details

No response

What happened?

Users have reported in the #airflow-dbt channel of the Apache Airflow Slack workspace that tasks are failing to report a heartbeat, and the executor is then treating them as zombie processes.

Slack conversation: https://apache-airflow.slack.com/archives/C059CC42E9W/p1731162253771519

Relevant log output

INFO - Task exited with return code -9
Error :  Task did not emit heartbeat within time limit (300 seconds) and will be terminated. See https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks

How to reproduce

We are still waiting on input from users to confirm which invocation mode is in use. The initial internal guess is that it may be the SUBPROCESS invocation mode, and @tatiana pointed out that the likely cause could be

self.sub_process.wait()
where we call wait() on the sub_process. The wait() call blocks the current execution until the sub_process completes. Regardless of whether this is the root cause of the issue, we should refactor this code to use poll() instead of wait(), which would let us check on the process in a non-blocking way.

The refactor could look something like this

import time

# poll() returns None while the process is still running
ret_code = self.sub_process.poll()
while ret_code is None:
    print("Process is still running...")
    # Wait for a short interval before checking again
    time.sleep(2)
    ret_code = self.sub_process.poll()
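
To make the idea concrete outside of Cosmos, here is a minimal, self-contained sketch of the same poll() loop. The helper name `wait_with_poll`, the `on_tick` callback, and the short-lived child command are all hypothetical and only stand in for the real dbt subprocess; this is not the Cosmos implementation.

```python
import subprocess
import sys
import time


def wait_with_poll(process, interval=0.1, on_tick=None):
    """Poll until the process exits, optionally doing work between checks.

    Hypothetical helper for illustration; not part of Cosmos.
    """
    ret_code = process.poll()  # None means the child is still running
    while ret_code is None:
        if on_tick is not None:
            on_tick()  # e.g. log progress or check for cancellation
        time.sleep(interval)
        ret_code = process.poll()
    return ret_code


# Stand-in for the dbt command: a child process that sleeps briefly.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])
ticks = []
exit_code = wait_with_poll(child, on_tick=lambda: ticks.append(time.monotonic()))
print(f"exit code: {exit_code}, checks while running: {len(ticks)}")
```

Unlike wait(), the loop hands control back to the caller between checks, so the calling code can do other work while the child runs. Whether that has any effect on heartbeats is exactly what still needs to be verified.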

Anything else :)?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contact Details

No response

@pankajkoti pankajkoti added bug Something isn't working triage-needed Items need to be reviewed / assigned to milestone labels Nov 14, 2024
@dosubot dosubot bot added the area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc label Nov 14, 2024
@pankajkoti
Contributor Author

One of the pending action items here is to verify whether the blocking wait() is the actual reason the task is not heartbeating. On rethinking, we feel this may not be the actual cause, since another process could be responsible for sending heartbeats. A probable cause could be OOM kills; this needs further investigation.
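
As a rough aid for that investigation, the sketch below decodes a return code such as the -9 seen in the log output. It assumes the POSIX convention used by Python's subprocess module, where a negative return code -N means the process was terminated by signal N; the helper name `describe_return_code` is hypothetical. Signal 9 is SIGKILL, which is what the Linux OOM killer sends, although a SIGKILL on its own does not prove an OOM kill.

```python
import signal


def describe_return_code(return_code):
    """Describe a subprocess-style return code (hypothetical helper)."""
    if return_code is None:
        return "still running"
    if return_code >= 0:
        return f"exited with status {return_code}"
    # Negative values mean the process was terminated by a signal (POSIX).
    try:
        name = signal.Signals(-return_code).name
    except ValueError:
        # Signal number not known on this platform
        name = f"signal {-return_code}"
    return f"killed by {name}"


print(describe_return_code(-9))
```

If the result points to SIGKILL, the worker's memory metrics or kernel logs would be the next place to look for evidence of an OOM kill.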
