Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify stop-here behavior to use accumulo-service #5094

Open
wants to merge 2 commits into
base: 2.1
Choose a base branch
from

Conversation

ddanielr
Copy link
Contributor

Fixes a bug in stop-here where accumulo-cluster would shut down other parts of the cluster.
Modifies the behavior to only stop local processes.

Fixes a bug in stop-here where the process would shut down other parts
of the cluster.

Modifies the behavior to only stop local processes.
@ddanielr ddanielr added this to the 2.1.4 milestone Nov 22, 2024
@@ -573,49 +573,32 @@ function stop_all() {
}

function stop_here() {
# Determine hostname without errors to user
hosts_to_check=("$(hostname -a 2>/dev/null | head -1)" "$(hostname -f)")

if echo "${TSERVER_HOSTS}" | grep -Eq 'localhost|127[.]0[.]0[.]1'; then
Copy link
Contributor

@dlmarion dlmarion Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the issue is really with this check. start_here gets the local addresses using:

local_hosts="$(hostname -a 2>/dev/null) $(hostname) localhost 127.0.0.1 $(hostname -I)"

then iterates over ${TSERVER_HOSTS}, and if there is a match, then it starts the service. I'm thinking we should be doing the same thing here instead of what we are currently doing. I'm concerned about calling accumulo-service here vs calling it in control_service because as this change gets merged up to main I don't think it will work the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to "local_hosts", the logic in control_service attempts to do another validation to figure out if it should talk locally or create an SSH connection.

if [[ $host == localhost || $host == "$(hostname -s)" || $host == "$(hostname -f)" || "$(hostname -I)" =~ $host ]]; then
 // do local connection
 else 
 // SSH to host

Having two different host validation methods doesn't seem ideal and since stop-here should be locally scoped it seemed a better form of control to always run the local accumulo-service command vs possibly ssh to the target host.

If the concern is that we are calling accumulo-service two separate ways, I could change the structure of control_service to accept a --local flag which would never allow the SSH option to run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike start-here, there are only two reasons why this method needs knowledge of its own hostname.
The first is for issuing an admin stop command to try and gracefully stop the tserver via a FATE operation.
The second is for control_service to figure out if it needs to run a local command or not.

However I'm not sure of the value of running that command as both this commit and the previous code did not wait for the stop command to complete successfully before starting to issue direct stop and kill commands to accumulo-service.

If we bypass the fate step, then there's no reason to bother with looking at the hostname for stop-here as accumulo-service lists all of the running processes by looking at pid files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change the structure of control_service to accept a --local flag which would never allow the SSH option to run

This makes sense to me. Obviously if the user is executing a command for the local host, we should not be ssh'ing.

[[ -n ${!var_name} ]] && NUM_SSERVERS=${!var_name}
G="SSERVER_HOSTS_${group}"
for sserver in ${!G}; do
end_service $end_cmd "$sserver" sserver "-g" "$group"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the bugs that is fixed is here where $sserver is being used instead of $host.
This causes all scan servers to be terminated across the cluster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants