Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checks for problems in Accumulo #4957

Open
wants to merge 3 commits into
base: 3.1
Choose a base branch
from

Conversation

kevinrr888
Copy link
Member

This PR:

  • Moves existing checks (checkTablets and the fate check for dangling locks) into the appropriate new admin check command
  • Adds new checks
  • New tests in AdminCheckIT
  • SYSTEM_CONFIG now checks for
    • valid locked table/namespace ids (the locked table/namespaces exist)
    • locked table/namespaces are associated with a fate op
  • ROOT_METADATA now checks for
    • offline tablets
    • missing "columns"
    • invalid "columns"
  • ROOT_TABLE now checks for
    • offline tablets
    • tablets for metadata table have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet
    • missing columns
    • invalid columns
  • METADATA_TABLE now checks for
    • offline tablets
    • tablets for user tables (and scanref) have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet
    • missing columns
    • invalid columns
  • SYSTEM_FILES now checks for
    • missing system files
  • USER_FILES now checks for
    • missing user files

There are still quite a few checks that need to be added (mentioned in #4687) and probably more. This is a first/starting PR for these checks. More checks will be added in follow-ons. Something else still left todo are tests for these checks for FAILING cases.

Part of #4892

This commit:
- Moves existing checks (`checkTablets` and the fate check for dangling locks) into the appropriate new `admin check` command
- Adds new checks
- New tests in AdminCheckIT
- SYSTEM_CONFIG now checks for
	- valid locked table/namespace ids (the locked table/namespaces exist)
	- locked table/namespaces are associated with a fate op
- ROOT_METADATA now checks for
	- offline tablets
	- missing "columns"
	- invalid "columns"
- ROOT_TABLE now checks for
	- offline tablets
	- tablets for metadata table have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet
	- missing columns
	- invalid columns
- METADATA_TABLE now checks for
	- offline tablets
	- tablets for user tables (and scanref) have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet
	- missing columns
	- invalid columns
- SYSTEM_FILES now checks for
	- missing system files
- USER_FILES now checks for
	- missing user files

Part of apache#4892
@kevinrr888 kevinrr888 self-assigned this Oct 8, 2024
@kevinrr888 kevinrr888 added this to the 3.1.0 milestone Oct 8, 2024
Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinrr888 these changes look good. I was running the changes locally and I noticed at the top level the accumulo check-server-config command exists. Was wondering if that should be rolled into this check command, if it should could add that to the checklist on #4892.

- `System.out` -> `log.trace/warn` to avoid flooding output with unnecessary/detailed info. The most important info (e.g., output of `admin check list` command, and the final run status table from the `admin check run` command) is still printed to stdout. Problems found are now logged at warn instead of stdout. Detailed, non-error info logged at trace.
- Created new check `Check.TABLE_LOCKS` which ensures that table and namespace locks are valid and are associated with a FATE op.
- New check `assertNoOtherChecksRan()` in `AdminCheckIT` which ensures only the expected checks ran
- A few misc review changes: `MetadataCheckRunner` code improved to only fetch required columns when scanning, object creation moved outside of a loop
@kevinrr888
Copy link
Member Author

@kevinrr888 these changes look good. I was running the changes locally and I noticed at the top level the accumulo check-server-config command exists. Was wondering if that should be rolled into this check command, if it should could add that to the checklist on #4892.

Thanks, I had missed that. Added to #4892. Also noticed accumulo check-compaction-config, which I also added to the checklist. Did you think this should be included as well or no?

@kevinrr888
Copy link
Member Author

Changes in 11f1f76:

  • System.out -> log.trace/warn to avoid flooding output with unnecessary/detailed info. The most important info (e.g., output of admin check list command, and the final run status table from the admin check run command) is still printed to stdout. Problems found are now logged at warn instead of stdout. Detailed, non-error info logged at trace.
  • Created new check Check.TABLE_LOCKS which ensures that table and namespace locks are valid and are associated with a FATE op. This depends on SYSTEM_CONFIG, which seemed like the most fitting location for it in the dependency tree.
  • New check assertNoOtherChecksRan() in AdminCheckIT which ensures only the expected checks ran
  • A few misc review changes: MetadataCheckRunner code improved to only fetch required columns when scanning, object creation moved outside of a loop

Let me know how these changes look. Something still left TODO is tests for failing checks. Tests should be added to ensure everything that is supposed to be checked correctly fails when it is expected to. I imagine this will take quite a bit of time to do and the changes are already pretty large, so thinking it might be best to leave for follow-on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants