-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: enable cloning of try regions #110020
base: main
Are you sure you want to change the base?
Conversation
Add a utility to clone a try region and all associated regions. Test this by enabling duplication of loops with try regions.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
I am currently running with this enabled by default to get a bit more test coverage. I will flip this to off by default before merging, as cloning loops with EH needs some heuristics that I haven't yet worked out. The code will then only be tested by the new test cases, which give pretty good but not exhaustive coverage of all the possibilities. |
Crypto failure in coreclr x64 release seems unrelated. NAOT failure may be related. |
@jakobbotsch PTAL (note there will be one last PR to disable this by default) The EH region representation is fairly brittle and makes things like this more difficult than they should be, but I have resisted the temptation to revise that. I think we can enable cloning loops with EH in cases where exceptions do not continue loop execution, and/or we can show that the cloned loop will not throw exceptions and we can get rid of the EH. But that is future work. Also there are a lot of potential cloning opportunities that are blocked right now because header and preheader end up in different regions. I think we can finesse that in many cases by splitting the header and putting it into that enclosing region (basically, push the try inside the loop). Haven't tried it yet. |
@jakobbotsch ping I also have a few local changes as I have also integrated this into array de-abstraction... I can try and add that here if you like. |
// so we can keep track of what the "before" picture looked like. | ||
// | ||
if (insertionPoint->hasTryIndex() || insertionPoint->hasHndIndex()) | ||
{ | ||
bool inTry = false; | ||
unsigned region = comp->ehGetMostNestedRegionIndex(insertionPoint, &inTry); | ||
|
||
if (region != 0) | ||
{ | ||
// Convert to true region index | ||
region--; | ||
|
||
while (true) | ||
{ | ||
EHblkDsc* const ebd = comp->ehGetDsc(region); | ||
|
||
if (inTry) | ||
{ | ||
JITDUMP("Noting that enclosing try EH#%02u ends at " FMT_BB "\n", region, ebd->ebdTryLast->bbNum); | ||
regionEnds.Emplace(region, ebd->ebdTryLast, true); | ||
} | ||
else | ||
{ | ||
JITDUMP("Noting that enclsoing handler EH#%02u ends at " FMT_BB "\n", region, | ||
ebd->ebdHndLast->bbNum); | ||
regionEnds.Emplace(region, ebd->ebdHndLast, false); | ||
} | ||
|
||
region = comp->ehGetEnclosingRegionIndex(region, &inTry); | ||
|
||
if (region == EHblkDsc::NO_ENCLOSING_INDEX) | ||
{ | ||
break; | ||
} | ||
} | ||
} | ||
} | ||
|
||
BasicBlock* bottom = GetLexicallyBottomMostBlock(); | ||
// Keep track of how much the region end EH indices change because of EH region cloning. | ||
// | ||
unsigned ehRegionShift = 0; | ||
|
||
// Keep track of which blocks were handled by EH region cloning | ||
// | ||
BitVecTraits traits(comp->compBasicBlockID, comp); | ||
BitVec visited(BitVecOps::MakeEmpty(&traits)); | ||
|
||
VisitLoopBlocksLexical([=, &traits, &visited, &clonedTry, &ehRegionShift](BasicBlock* blk) { | ||
if (canCloneTry) | ||
{ | ||
// If we allow cloning loops with EH, we may have already handled | ||
// this loop block as part of a containing try region. | ||
// | ||
if (BitVecOps::IsMember(&traits, visited, blk->bbNum)) | ||
{ | ||
return BasicBlockVisit::Continue; | ||
} | ||
|
||
// If this is a try region entry, clone the entire region now. | ||
// Defer adding edges and extending EH regions until later. | ||
// | ||
// Updates map, visited, and insertAfter. | ||
// | ||
if (comp->bbIsTryBeg(blk)) | ||
{ | ||
Compiler::CloneTryInfo info(traits, visited); | ||
info.m_map = map; | ||
info.m_addEdges = false; | ||
info.m_profileScale = weightScale; | ||
|
||
BasicBlock* const clonedBlock = comp->fgCloneTryRegion(blk, info, insertAfter); | ||
|
||
assert(clonedBlock != nullptr); | ||
ehRegionShift += info.m_ehRegionShift; | ||
clonedTry = true; | ||
return BasicBlockVisit::Continue; | ||
} | ||
} | ||
else | ||
{ | ||
// We're not expecting to find enclosed EH regions | ||
// | ||
assert(!comp->bbIsTryBeg(blk)); | ||
assert(!comp->bbIsHandlerBeg(blk)); | ||
assert(!BitVecOps::IsMember(&traits, visited, blk->bbNum)); | ||
} | ||
|
||
VisitLoopBlocksLexical([=](BasicBlock* blk) { | ||
// Initialize newBlk as BBJ_ALWAYS without jump target, and fix up jump target later | ||
// with BasicBlock::CopyTarget(). | ||
BasicBlock* newBlk = comp->fgNewBBafter(BBJ_ALWAYS, *insertAfter, /*extendRegion*/ true); | ||
// `blk` was not in loop-enclosed try region or companion region. | ||
// | ||
// Initialize newBlk as BBJ_ALWAYS without jump target; these are fixed up subsequently. | ||
// | ||
// CloneBlockState puts newBlk in the proper EH region. We will fix enclosing region extents | ||
// once cloning is done. | ||
// | ||
BasicBlock* newBlk = comp->fgNewBBafter(BBJ_ALWAYS, *insertAfter, /* extendRegion */ false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems very complicated. Is it possible to extract this to a separate function so that the actual enabled-by-default logic remains readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably so, though I'd like to have the second use case far enough along that whatever I do here works nicely for that case too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mostly worried about the maintainability. I spent a fair amount of effort untangling this logic from what it used to be in loop cloning.
src/coreclr/jit/jitconfigvalues.h
Outdated
@@ -59,6 +59,7 @@ CONFIG_INTEGER(JitBreakMorphTree, "JitBreakMorphTree", 0xffffffff) | |||
CONFIG_INTEGER(JitBreakOnBadCode, "JitBreakOnBadCode", 0) | |||
CONFIG_INTEGER(JitBreakOnMinOpts, "JITBreakOnMinOpts", 0) // Halt if jit switches to MinOpts | |||
CONFIG_INTEGER(JitCloneLoops, "JitCloneLoops", 1) // If 0, don't clone. Otherwise clone loops for optimizations. | |||
CONFIG_INTEGER(JitCloneLoopsWithEH, "JitCloneLoopsWithEH", 1) // If 0, don't clone loops containing EH regions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't there be diffs given that this is 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are quite a number of diffs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you looked at how many loops we clone in total and the fraction of loops cloned with EH in them?
Here's a sample diff ;;; baseline
; Assembly listing for method LoopsWithEH:Sum_LxTC(int[],int):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rbp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T02] ( 4, 18 ) ref -> rcx class-hnd single-def <int[]>
; V01 arg1 [V01,T03] ( 4, 11 ) int -> rdx single-def
; V02 loc0 [V02,T01] ( 6, 34 ) int -> rax
;* V03 loc1 [V03,T04] ( 0, 0 ) int -> zero-ref
;* V04 loc2 [V04 ] ( 0, 0 ) int -> zero-ref
; V05 OutArgs [V05 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V06 tmp1 [V06 ] ( 0, 0 ) ref -> zero-ref class-hnd "impSpillSpecialSideEff" <System.Exception>
; V07 PSPSym [V07,T05] ( 1, 1 ) long -> [rbp-0x10] do-not-enreg[V] "PSPSym"
; V08 rat0 [V08,T00] ( 6, 41 ) long -> r8 "Widened IV V03"
;
; Lcl frame size = 48
G_M61877_IG01: ;; offset=0x0000
push rbp
sub rsp, 48
lea rbp, [rsp+0x30]
mov qword ptr [rbp-0x10], rsp
;; size=14 bbWeight=1 PerfScore 2.75
G_M61877_IG02: ;; offset=0x000E
xor eax, eax
xor r8d, r8d
test edx, edx
jle SHORT G_M61877_IG07
align [9 bytes for IG03]
;; size=18 bbWeight=1 PerfScore 2.00
G_M61877_IG03: ;; offset=0x0020
inc eax
;; size=2 bbWeight=8 PerfScore 2.00
G_M61877_IG04: ;; offset=0x0022
cmp r8d, dword ptr [rcx+0x08]
jae SHORT G_M61877_IG05
add eax, dword ptr [rcx+4*r8+0x10]
jmp SHORT G_M61877_IG06
;; size=13 bbWeight=8 PerfScore 72.00
G_M61877_IG05: ;; offset=0x002F
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG06: ;; offset=0x0035
inc r8d
cmp r8d, edx
jl SHORT G_M61877_IG03
;; size=8 bbWeight=8 PerfScore 12.00
G_M61877_IG07: ;; offset=0x003D
add rsp, 48
pop rbp
ret
;; size=6 bbWeight=1 PerfScore 1.75
G_M61877_IG08: ;; offset=0x0043
mov eax, -1
;; size=5 bbWeight=0 PerfScore 0.00
G_M61877_IG09: ;; offset=0x0048
add rsp, 48
pop rbp
ret
;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG10: ;; offset=0x004E
push rbp
sub rsp, 48
mov rbp, qword ptr [rcx+0x20]
mov qword ptr [rsp+0x20], rbp
lea rbp, [rbp+0x30]
;; size=18 bbWeight=0 PerfScore 0.00
G_M61877_IG11: ;; offset=0x0060
lea rax, G_M61877_IG08
;; size=7 bbWeight=0 PerfScore 0.00
G_M61877_IG12: ;; offset=0x0067
add rsp, 48
pop rbp
ret
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 109, prolog size 14, PerfScore 92.50, instruction count 35, allocated bytes for code 109 (MethodHash=e8e90e4a) for method LoopsWithEH:Sum_LxTC(int[],int):int (FullOpts)
; ============================================================ In the diff, note the loop is cloned and the cloned version optimized, and there are two funclets instead of one. We can probably get rid of the try region for the cloned loop (on my todo-list). ;;; diff
; Assembly listing for method LoopsWithEH:Sum_LxTC(int[],int):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rbp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T04] ( 7, 3.66) ref -> rcx class-hnd single-def <int[]>
; V01 arg1 [V01,T03] ( 6, 4.08) int -> rdx single-def
; V02 loc0 [V02,T00] ( 10, 34 ) int -> rax
; V03 loc1 [V03,T05] ( 6, 1.40) int -> r8
;* V04 loc2 [V04 ] ( 0, 0 ) int -> zero-ref
; V05 OutArgs [V05 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V06 tmp1 [V06 ] ( 0, 0 ) ref -> zero-ref class-hnd "impSpillSpecialSideEff" <System.Exception>
; V07 PSPSym [V07,T06] ( 1, 1 ) long -> [rbp-0x10] do-not-enreg[V] "PSPSym"
; V08 rat0 [V08,T01] ( 4, 24.26) byref -> rcx "Strength reduced derived IV"
; V09 rat1 [V09,T02] ( 4, 24.26) int -> rdx "Trip count IV"
;
; Lcl frame size = 48
G_M61877_IG01: ;; offset=0x0000
push rbp
sub rsp, 48
lea rbp, [rsp+0x30]
mov qword ptr [rbp-0x10], rsp
;; size=14 bbWeight=1 PerfScore 2.75
G_M61877_IG02: ;; offset=0x000E
xor eax, eax
xor r8d, r8d
test edx, edx
jle SHORT G_M61877_IG07
;; size=9 bbWeight=1 PerfScore 1.75
G_M61877_IG03: ;; offset=0x0017
test rcx, rcx
je SHORT G_M61877_IG08
cmp dword ptr [rcx+0x08], edx
jl SHORT G_M61877_IG08
add rcx, 16
align [0 bytes for IG04]
;; size=14 bbWeight=0.50 PerfScore 2.75
G_M61877_IG04: ;; offset=0x0025
inc eax
;; size=2 bbWeight=7.92 PerfScore 1.98
G_M61877_IG05: ;; offset=0x0027
add eax, dword ptr [rcx]
;; size=2 bbWeight=7.92 PerfScore 23.76
G_M61877_IG06: ;; offset=0x0029
add rcx, 4
dec edx
jne SHORT G_M61877_IG04
;; size=8 bbWeight=7.92 PerfScore 11.88
G_M61877_IG07: ;; offset=0x0031
add rsp, 48
pop rbp
ret
;; size=6 bbWeight=1 PerfScore 1.75
G_M61877_IG08: ;; offset=0x0037
inc eax
;; size=2 bbWeight=0.08 PerfScore 0.02
G_M61877_IG09: ;; offset=0x0039
cmp r8d, dword ptr [rcx+0x08]
jae SHORT G_M61877_IG10
mov r10d, r8d
add eax, dword ptr [rcx+4*r10+0x10]
jmp SHORT G_M61877_IG11
;; size=16 bbWeight=0.08 PerfScore 0.74
G_M61877_IG10: ;; offset=0x0049
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG11: ;; offset=0x004F
inc r8d
cmp r8d, edx
jl SHORT G_M61877_IG08
jmp SHORT G_M61877_IG07
;; size=10 bbWeight=0.08 PerfScore 0.28
G_M61877_IG12: ;; offset=0x0059
mov eax, -1
;; size=5 bbWeight=0 PerfScore 0.00
G_M61877_IG13: ;; offset=0x005E
add rsp, 48
pop rbp
ret
;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG14: ;; offset=0x0064
push rbp
sub rsp, 48
mov rbp, qword ptr [rcx+0x20]
mov qword ptr [rsp+0x20], rbp
lea rbp, [rbp+0x30]
;; size=18 bbWeight=0 PerfScore 0.00
G_M61877_IG15: ;; offset=0x0076
lea rax, G_M61877_IG12
;; size=7 bbWeight=0 PerfScore 0.00
G_M61877_IG16: ;; offset=0x007D
add rsp, 48
pop rbp
ret
;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG17: ;; offset=0x0083
push rbp
sub rsp, 48
mov rbp, qword ptr [rcx+0x20]
mov qword ptr [rsp+0x20], rbp
lea rbp, [rbp+0x30]
;; size=18 bbWeight=0 PerfScore 0.00
G_M61877_IG18: ;; offset=0x0095
lea rax, G_M61877_IG12
;; size=7 bbWeight=0 PerfScore 0.00
G_M61877_IG19: ;; offset=0x009C
add rsp, 48
pop rbp
ret
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 162, prolog size 14, PerfScore 47.66, instruction count 56, allocated bytes for code 162 (MethodHash=e8e90e4a) for method LoopsWithEH:Sum_LxTC(int[],int):int (FullOpts)
; ============================================================
|
bool clonedTry = false; | ||
BasicBlock* const insertionPoint = *insertAfter; | ||
|
||
INDEBUG(canCloneTry = (JitConfig.JitCloneLoopsWithEH() > 0);) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a parameter. Loop cloning is not the only user of this utility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Unrolling gave me more problems than cloning, and unrolling loops with EH is likely almost never a good idea, or perhaps should be handled differently (say only unroll if we can make the try region cover all the unrolls).
{ | ||
for (BasicBlock* const blk : BlockToBlockMap::KeyIteration(map)) | ||
{ | ||
if (!this->ContainsBlock(blk)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!this->ContainsBlock(blk)) | |
if (!ContainsBlock(blk)) |
CloneTryInfo(BitVecTraits& traits, BitVec& visited) : m_traits(traits), m_visited(visited) {} | ||
BitVecTraits m_traits; | ||
BitVec& m_visited; | ||
BlockToBlockMap* m_map = nullptr; | ||
jitstd::vector<BasicBlock*>* m_blocksToClone = nullptr; | ||
bool m_addEdges = false; | ||
weight_t m_profileScale = 0.0; | ||
unsigned m_ehRegionShift = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The m_
prefix looks a bit odd to me for public fields meant to be accessed externally.
Add a utility to clone a try region and all associated regions.
The main complications are making sure all the relevant bits get cloned (all enclosed and mutually-protecting enclosing regions, associated handlers, filters, callfinallys, and acd entries), adding and adjusting EH clauses, and having the try cloning work both as standalone and as part of a larger cloning activity, including ensuring that nearby cloned blocks end up in the proper regions, and all regions have proper extents.
Contributes to #108913. We will need to clone try regions as part of array enumeration de-abstraction.
I have been testing this by enabling duplication of loops with try regions, but will disable it by default until proper heuristics can be developed.