Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2495: Update merge script to python3 #1373

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

emkornfield
Copy link
Contributor

  • Fixes differences between python2 and python3
  • Unifies to only using git as gitbox is no longer necessary.
  • Used to merge PARQUET-2483
  • Adds options to clear existing branches used from merge, if for some reason they exist (useful while iterating on updates to the script)
  • Adds the ability to pass PR number as an arg on the command line

It is possible JIRA updates still have some work here, as it errored out while trying to do those.

Make sure you have checked all steps below.

Jira

Tests

  • [ x] My PR adds the following unit tests OR does not need testing for this extremely good reason: No existing unit
    tests and somewhat hard to test as it requires REST calls/etc

Commits

  • [x ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • [x ] My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • [ x] In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

- Fixes differences between python2 and python3
- Unifies to only using git as gitbox is no longer necessary.
- Used to merge PARQUET-2483
- Adds options to clear existing branches used from merge,
  if for some reason they exist (useful while iterating on
  updates to the script)
- Adds the ability to pass PR number as an arg on the command line

It is possible JIRA updates still have some work here, as
it errored out while trying to do those.
@wgtmac
Copy link
Member

wgtmac commented Jun 13, 2024

I'm not an expert in python. Just want to ask a general question on whether we should enforce using the merge script for parquet-java (or other parquet-xxx projects). It has been broken for a long time.

@gszadovszky @julienledem Thoughts?

@emkornfield
Copy link
Contributor Author

I think the main thing the script does (which I haven't fully tested) is allow for easy merging of backports, and JIRA closing (which given current ML thread might not be useful for super long).

@gszadovszky
Copy link
Contributor

There is a description about it in the dev/README. It seems this is for the time when we were not able to simply push the PRs from github. Github was a read only mirror for a time being and we used the apache git repo to push the commits into.

I don't remember using this script. I think we do not need it anymore.

@emkornfield
Copy link
Contributor Author

For reference there is a similar script in Arrow which I think probably has similar origins. The additional benefit even for github, is it looks like it tags issues with appropriate fix versions. Not sure if parquet-java has other ways of tracking this?

@wgtmac
Copy link
Member

wgtmac commented Jun 17, 2024

There is a similar script in Apache ORC: https://github.com/apache/orc/blob/main/dev/merge_orc_pr.py. It can help keep the PR description, update JIRA state and backport commit to other branches in a single shot. Perhaps we do not want to enforce using the script but it might help in the above cases.

@Fokko
Copy link
Contributor

Fokko commented Jun 17, 2024

I haven't used this script and would prefer to see if we can have a more Github-oriented workflow. Rok just raised a vote to migrate the issues from Jira to GitHub.

@emkornfield
Copy link
Contributor Author

@Fokko in your experience how have issues been tagged with fixes for GH oriented workflows?

@emkornfield
Copy link
Contributor Author

@Fokko in your experience how have issues been tagged with fixes for GH oriented workflows?

@Fokko ping, want to understand what you where thinking here?

@Fokko
Copy link
Contributor

Fokko commented Jun 27, 2024

@emkornfield Sorry for the late reply here, the mailbox is a bit swamped, thanks for the ping.

For Iceberg we make these Github releases (example for 1.5.0). Github will generate this list for you, and then the release manager groups and curates this a bit to make it easier to read. Merging would just go through the Github UI since we don't need to keep track of things in Jira, and this way we don't need to use the script.

@emkornfield
Copy link
Contributor Author

@Fokko Thanks, I still think the Arrow issues which use the tool might keep better records.

For instance if you look at a recently resolved issue in Arrow you see it is tagged with a fix version (I believe this is due to using a script). I don't see something similar in a recent Iceberg. I might be missing something but I think the Arrow approach is nicer as it allows for me to directly find an issue and determine which release I might need to use to make use of it.

I also think being able to easily backport fixes to other branches via one step process is nicer even if it doesn't use the UI (but this is likely a rarer use-case).

Thoughts?

@Fokko
Copy link
Contributor

Fokko commented Jun 28, 2024

In Iceberg we create backports by hand. We checkout the branch, and backport using cherry-picking the commits, and create a PR. Can I ask how you resolve merge conflicts with older versions?

@wgtmac
Copy link
Member

wgtmac commented Jun 28, 2024

In Iceberg we create backports by hand. We checkout the branch, and backport using cherry-picking the commits, and create a PR. Can I ask how you resolve merge conflicts with older versions?

In the merge script of Apache ORC, it asks which branch to cherry-pick the commit and tries to do that automatically. If it succeeds, then we are done. Or when it fails, it prints error message and asks the developer to resolve the cherry-pick conflict manually and then come back to the script by answering Y to the question it asked to proceed: https://github.com/apache/orc/blob/main/dev/merge_orc_pr.py#L128-L136

BTW, as I have asked in #2932, it would be good to automatically close the Github issue related to the PR.

@emkornfield
Copy link
Contributor Author

My biggest concern is properly tagging issues to milestones. I think the main question we should answer here is do we want a script. If we do, I think we should check this in and I can cherry-pick the necessary features from some of the other merge scripts (we should also ideally consolidate with the script in parquet-format with maybe convenience wrappers for running both)

@emkornfield
Copy link
Contributor Author

I started a discussion thread on parquet-dev to get a broader audience to discuss a path forward here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants