Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28615: OTEL: Expose FS counters for queries #5535

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tanishq-chugh
Copy link
Contributor

What changes were proposed in this pull request?

Expose FS counters information via OTEL at query level

Why are the changes needed?

Better usability for OTEL

Does this PR introduce any user-facing change?

Yes, FS related information will be visible at OTEL recievers

Is the change a dependency upgrade?

No

How was this patch tested?

Manual testing

Copy link

sonarcloud bot commented Nov 7, 2024

Copy link
Member

@ayushtkn ayushtkn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having doubts around the Heap consumption due to this at HS2, we can't do it without a flag, else people with OTEL disabled will also get impacted.

Let me think about it & discuss with folks & circle back

@@ -32,6 +42,7 @@ public class QueryInfo {
private Long endTime;
private String state;
private QueryDisplay queryDisplay;
private QueryPlan queryPlan;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having doubts around this, persisting the entire queryPlan for huge queries might lead to some heap challenges at HS2..

Copy link
Contributor

@abstractdog abstractdog Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, but only because of memory reasons

be aware that counters like this are only available after finishing the DAG (when it's aggregated in the AM), so including this in the scope of OTEL looks strange, as this metrics collection is supposed to be a live thing (while running the query)
I guess if we run this code on a cluster while running a query, it will return NPE somewhere, or non-sense counter values

I think for proper metrics users should refer to the query history when its implementation is done

Copy link
Contributor

@abstractdog abstractdog Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed in the meantime: OTEL integration also applies finished queries, in which case we might want to include counters
2 notes:

  1. I agree with not using QueryPlan: as part of the query history service I already implemented a RuntimeContext that's not committed yet but it could be reused here, would you mind me refactoring it to a separate change so you can use it here? in this case, both query history and this code path would use the same

  2. if we're adding counters, are we about to consider adding more? let me attach a query summary here to let you see and decide what could be useful, primarily I'm thinking of:
    DAGCounter
    TaskCounter
    LlapIOCounters
    LlapWmCounters

INFO  : org.apache.tez.common.counters.DAGCounter:
INFO  :    NUM_SUCCEEDED_TASKS: 23
INFO  :    TOTAL_LAUNCHED_TASKS: 23
INFO  :    DATA_LOCAL_TASKS: 18
INFO  :    AM_CPU_MILLISECONDS: 8920
INFO  :    AM_GC_TIME_MILLIS: 0
INFO  :    INITIAL_HELD_CONTAINERS: 0
INFO  :    TOTAL_CONTAINERS_USED: 23
INFO  :    TOTAL_CONTAINER_LAUNCH_COUNT: 23
INFO  :    TOTAL_CONTAINER_RELEASE_COUNT: 23
INFO  :    NODE_USED_COUNT: 2
INFO  :    NODE_TOTAL_COUNT: 2
INFO  : File System Counters:
INFO  :    FILE_BYTES_READ: 3848955
INFO  :    FILE_BYTES_WRITTEN: 14729274
INFO  :    FILE_READ_OPS: 0
INFO  :    FILE_LARGE_READ_OPS: 0
INFO  :    FILE_WRITE_OPS: 0
INFO  :    S3A_BYTES_READ: 21267950
INFO  :    S3A_BYTES_WRITTEN: 0
INFO  :    S3A_READ_OPS: 108
INFO  :    S3A_LARGE_READ_OPS: 0
INFO  :    S3A_WRITE_OPS: 0
INFO  : org.apache.tez.common.counters.TaskCounter:
INFO  :    REDUCE_INPUT_GROUPS: 1069247
INFO  :    REDUCE_INPUT_RECORDS: 1079849
INFO  :    COMBINE_INPUT_RECORDS: 0
INFO  :    SPILLED_RECORDS: 1623096
INFO  :    NUM_SHUFFLED_INPUTS: 60
INFO  :    NUM_SKIPPED_INPUTS: 2
INFO  :    NUM_FAILED_SHUFFLE_INPUTS: 0
INFO  :    MERGED_MAP_OUTPUTS: 44
INFO  :    TASK_DURATION_MILLIS: 122643
INFO  :    INPUT_RECORDS_PROCESSED: 1004974
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 135671627
INFO  :    OUTPUT_RECORDS: 1580632
INFO  :    APPROXIMATE_INPUT_RECORDS: 1001564
INFO  :    OUTPUT_LARGE_RECORDS: 0
INFO  :    OUTPUT_BYTES: 39368971
INFO  :    OUTPUT_BYTES_WITH_OVERHEAD: 42530547
INFO  :    OUTPUT_BYTES_PHYSICAL: 14727862
INFO  :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO  :    ADDITIONAL_SPILLS_BYTES_READ: 5770624
INFO  :    ADDITIONAL_SPILL_COUNT: 0
INFO  :    SHUFFLE_CHUNK_COUNT: 24
INFO  :    SHUFFLE_BYTES: 17983447
INFO  :    SHUFFLE_BYTES_DECOMPRESSED: 57039563
INFO  :    SHUFFLE_BYTES_TO_MEM: 8957046
INFO  :    SHUFFLE_BYTES_TO_DISK: 0
INFO  :    SHUFFLE_BYTES_DISK_DIRECT: 9026401
INFO  :    NUM_MEM_TO_DISK_MERGES: 0
INFO  :    NUM_DISK_TO_DISK_MERGES: 0
INFO  :    SHUFFLE_PHASE_TIME: 8867
INFO  :    MERGE_PHASE_TIME: 5033
INFO  :    FIRST_EVENT_RECEIVED: 3608
INFO  :    LAST_EVENT_RECEIVED: 4347
INFO  :    DATA_BYTES_VIA_EVENT: 0
INFO  : HIVE:
INFO  :    DESERIALIZE_ERRORS: 0
INFO  :    RECORDS_IN_Map_1: 102
INFO  :    RECORDS_IN_Map_2: 2875432
INFO  :    RECORDS_IN_Map_6: 10000
INFO  :    RECORDS_IN_Map_7: 500000
INFO  :    RECORDS_IN_Map_8: 10000
INFO  :    RECORDS_OUT_0: 0
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_1: 0
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_2: 1079849
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_6: 366
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_7: 500000
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_8: 366
INFO  :    RECORDS_OUT_INTERMEDIATE_Reducer_3: 0
INFO  :    RECORDS_OUT_INTERMEDIATE_Reducer_4: 0
INFO  :    RECORDS_OUT_INTERMEDIATE_Reducer_5: 51
INFO  :    RECORDS_OUT_OPERATOR_FIL_143: 366
INFO  :    RECORDS_OUT_OPERATOR_FIL_146: 366
INFO  :    RECORDS_OUT_OPERATOR_FIL_149: 2699840
INFO  :    RECORDS_OUT_OPERATOR_FIL_150: 2724950
INFO  :    RECORDS_OUT_OPERATOR_FIL_159: 0
INFO  :    RECORDS_OUT_OPERATOR_FIL_165: 51
INFO  :    RECORDS_OUT_OPERATOR_FIL_168: 500000
INFO  :    RECORDS_OUT_OPERATOR_FIL_172: 2040
INFO  :    RECORDS_OUT_OPERATOR_FIL_177: 0
INFO  :    RECORDS_OUT_OPERATOR_FS_183: 0
INFO  :    RECORDS_OUT_OPERATOR_GBY_155: 539644
INFO  :    RECORDS_OUT_OPERATOR_GBY_156: 540205
INFO  :    RECORDS_OUT_OPERATOR_GBY_162: 534649
INFO  :    RECORDS_OUT_OPERATOR_GBY_164: 51
INFO  :    RECORDS_OUT_OPERATOR_GBY_171: 2048
INFO  :    RECORDS_OUT_OPERATOR_LIM_182: 0
INFO  :    RECORDS_OUT_OPERATOR_MAPJOIN_153: 542954
INFO  :    RECORDS_OUT_OPERATOR_MAPJOIN_154: 547920
INFO  :    RECORDS_OUT_OPERATOR_MAPJOIN_174: 0
INFO  :    RECORDS_OUT_OPERATOR_MAPJOIN_175: 0
INFO  :    RECORDS_OUT_OPERATOR_MAPJOIN_176: 0
INFO  :    RECORDS_OUT_OPERATOR_MAP_0: 0
INFO  :    RECORDS_OUT_OPERATOR_RS_145: 366
INFO  :    RECORDS_OUT_OPERATOR_RS_148: 366
INFO  :    RECORDS_OUT_OPERATOR_RS_157: 539644
INFO  :    RECORDS_OUT_OPERATOR_RS_158: 540205
INFO  :    RECORDS_OUT_OPERATOR_RS_161: 0
INFO  :    RECORDS_OUT_OPERATOR_RS_167: 51
INFO  :    RECORDS_OUT_OPERATOR_RS_170: 500000
INFO  :    RECORDS_OUT_OPERATOR_RS_180: 0
INFO  :    RECORDS_OUT_OPERATOR_SEL_144: 366
INFO  :    RECORDS_OUT_OPERATOR_SEL_147: 366
INFO  :    RECORDS_OUT_OPERATOR_SEL_151: 2699840
INFO  :    RECORDS_OUT_OPERATOR_SEL_152: 2724950
INFO  :    RECORDS_OUT_OPERATOR_SEL_160: 0
INFO  :    RECORDS_OUT_OPERATOR_SEL_163: 534649
INFO  :    RECORDS_OUT_OPERATOR_SEL_166: 51
INFO  :    RECORDS_OUT_OPERATOR_SEL_169: 500000
INFO  :    RECORDS_OUT_OPERATOR_SEL_173: 2040
INFO  :    RECORDS_OUT_OPERATOR_SEL_179: 0
INFO  :    RECORDS_OUT_OPERATOR_SEL_181: 0
INFO  :    RECORDS_OUT_OPERATOR_TNK_178: 0
INFO  :    RECORDS_OUT_OPERATOR_TS_0: 102
INFO  :    RECORDS_OUT_OPERATOR_TS_18: 500000
INFO  :    RECORDS_OUT_OPERATOR_TS_24: 10000
INFO  :    RECORDS_OUT_OPERATOR_TS_3: 2875432
INFO  :    RECORDS_OUT_OPERATOR_TS_6: 10000
INFO  : Shuffle Errors:
INFO  :    BAD_ID: 0
INFO  :    CONNECTION: 0
INFO  :    IO_ERROR: 0
INFO  :    WRONG_LENGTH: 0
INFO  :    WRONG_MAP: 0
INFO  :    WRONG_REDUCE: 0
INFO  : org.apache.hadoop.hive.llap.counters.LlapIOCounters:
INFO  :    ALLOCATED_BYTES: 38535168
INFO  :    ALLOCATED_USED_BYTES: 34036847
INFO  :    CACHE_HIT_BYTES: 0
INFO  :    CACHE_MISS_BYTES: 20493187
INFO  :    CONSUMER_TIME_NS: 3531916885
INFO  :    DECODE_TIME_NS: 1180278124
INFO  :    HDFS_TIME_NS: 21459595705
INFO  :    IO_CPU_NS: 3237492937
INFO  :    IO_USER_NS: 2940000000
INFO  :    METADATA_CACHE_MISS: 58
INFO  :    NUM_DECODED_BATCHES: 355
INFO  :    NUM_DECODED_ROWS: 3395534
INFO  :    NUM_INPUT_ROWS: 3395534
INFO  :    NUM_VECTOR_BATCHES: 3408
INFO  :    ROWS_EMITTED: 3395534
INFO  :    SELECTED_ROWGROUPS: 355
INFO  :    TOTAL_IO_TIME_NS: 13162230633
INFO  : org.apache.hadoop.hive.llap.counters.LlapWmCounters:
INFO  :    GUARANTEED_QUEUED_NS: 0
INFO  :    GUARANTEED_RUNNING_NS: 0
INFO  :    SPECULATIVE_QUEUED_NS: 35824072
INFO  :    SPECULATIVE_RUNNING_NS: 118969998048
INFO  : org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable$LlapExecutorCounters:
INFO  :    EXECUTOR_CPU_NS: 29116150979
INFO  :    EXECUTOR_USER_NS: 27090000000
INFO  : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters:
INFO  :    GROUPED_INPUT_SPLITS_Map_1: 1
INFO  :    GROUPED_INPUT_SPLITS_Map_2: 11
INFO  :    GROUPED_INPUT_SPLITS_Map_6: 1
INFO  :    GROUPED_INPUT_SPLITS_Map_7: 4
INFO  :    GROUPED_INPUT_SPLITS_Map_8: 1
INFO  :    INPUT_DIRECTORIES_Map_1: 1
INFO  :    INPUT_DIRECTORIES_Map_2: 1
INFO  :    INPUT_DIRECTORIES_Map_6: 1
INFO  :    INPUT_DIRECTORIES_Map_7: 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just created HIVE-28623, let me block you for 1-2 days with it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants