Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: enable cloning of try regions #110020

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

AndyAyersMS
Copy link
Member

@AndyAyersMS AndyAyersMS commented Nov 20, 2024

Add a utility to clone a try region and all associated regions.

The main complications are making sure all the relevant bits get cloned (all enclosed and mutually-protecting enclosing regions, associated handlers, filters, callfinallys, and acd entries), adding and adjusting EH clauses, and having the try cloning work both as standalone and as part of a larger cloning activity, including ensuring that nearby cloned blocks end up in the proper regions, and all regions have proper extents.

Contributes to #108913. We will need to clone try regions as part of array enumeration de-abstraction.

I have been testing this by enabling duplication of loops with try regions, but will disable it by default until proper heuristics can be developed.

Add a utility to clone a try region and all associated regions.

Test this by enabling duplication of loops with try regions.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 20, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Copy link
Member Author

I am currently running with this enabled by default to get a bit more test coverage.

I will flip this to off by default before merging, as cloning loops with EH needs some heuristics that I haven't yet worked out. The code will then only be tested by the new test cases, which give pretty good but not exhaustive coverage of all the possibilities.

@AndyAyersMS AndyAyersMS marked this pull request as ready for review November 22, 2024 16:45
@AndyAyersMS
Copy link
Member Author

Crypto failure in coreclr x64 release seems unrelated. NAOT failure may be related.

@AndyAyersMS
Copy link
Member Author

@jakobbotsch PTAL
cc @dotnet/jit-contrib

(note there will be one last PR to disable this by default)

The EH region representation is fairly brittle and makes things like this more difficult than they should be, but I have resisted the temptation to revise that.

I think we can enable cloning loops with EH in cases where exceptions do not continue loop execution, and/or we can show that the cloned loop will not throw exceptions and we can get rid of the EH. But that is future work.

Also there are a lot of potential cloning opportunities that are blocked right now because header and preheader end up in different regions. I think we can finesse that in many cases by splitting the header and putting it into that enclosing region (basically, push the try inside the loop). Haven't tried it yet.

@AndyAyersMS
Copy link
Member Author

@jakobbotsch ping

I also have a few local changes as I have also integrated this into array de-abstraction... I can try and add that here if you like.

Comment on lines +6065 to +6186
// so we can keep track of what the "before" picture looked like.
//
if (insertionPoint->hasTryIndex() || insertionPoint->hasHndIndex())
{
bool inTry = false;
unsigned region = comp->ehGetMostNestedRegionIndex(insertionPoint, &inTry);

if (region != 0)
{
// Convert to true region index
region--;

while (true)
{
EHblkDsc* const ebd = comp->ehGetDsc(region);

if (inTry)
{
JITDUMP("Noting that enclosing try EH#%02u ends at " FMT_BB "\n", region, ebd->ebdTryLast->bbNum);
regionEnds.Emplace(region, ebd->ebdTryLast, true);
}
else
{
JITDUMP("Noting that enclsoing handler EH#%02u ends at " FMT_BB "\n", region,
ebd->ebdHndLast->bbNum);
regionEnds.Emplace(region, ebd->ebdHndLast, false);
}

region = comp->ehGetEnclosingRegionIndex(region, &inTry);

if (region == EHblkDsc::NO_ENCLOSING_INDEX)
{
break;
}
}
}
}

BasicBlock* bottom = GetLexicallyBottomMostBlock();
// Keep track of how much the region end EH indices change because of EH region cloning.
//
unsigned ehRegionShift = 0;

// Keep track of which blocks were handled by EH region cloning
//
BitVecTraits traits(comp->compBasicBlockID, comp);
BitVec visited(BitVecOps::MakeEmpty(&traits));

VisitLoopBlocksLexical([=, &traits, &visited, &clonedTry, &ehRegionShift](BasicBlock* blk) {
if (canCloneTry)
{
// If we allow cloning loops with EH, we may have already handled
// this loop block as part of a containing try region.
//
if (BitVecOps::IsMember(&traits, visited, blk->bbNum))
{
return BasicBlockVisit::Continue;
}

// If this is a try region entry, clone the entire region now.
// Defer adding edges and extending EH regions until later.
//
// Updates map, visited, and insertAfter.
//
if (comp->bbIsTryBeg(blk))
{
Compiler::CloneTryInfo info(traits, visited);
info.m_map = map;
info.m_addEdges = false;
info.m_profileScale = weightScale;

BasicBlock* const clonedBlock = comp->fgCloneTryRegion(blk, info, insertAfter);

assert(clonedBlock != nullptr);
ehRegionShift += info.m_ehRegionShift;
clonedTry = true;
return BasicBlockVisit::Continue;
}
}
else
{
// We're not expecting to find enclosed EH regions
//
assert(!comp->bbIsTryBeg(blk));
assert(!comp->bbIsHandlerBeg(blk));
assert(!BitVecOps::IsMember(&traits, visited, blk->bbNum));
}

VisitLoopBlocksLexical([=](BasicBlock* blk) {
// Initialize newBlk as BBJ_ALWAYS without jump target, and fix up jump target later
// with BasicBlock::CopyTarget().
BasicBlock* newBlk = comp->fgNewBBafter(BBJ_ALWAYS, *insertAfter, /*extendRegion*/ true);
// `blk` was not in loop-enclosed try region or companion region.
//
// Initialize newBlk as BBJ_ALWAYS without jump target; these are fixed up subsequently.
//
// CloneBlockState puts newBlk in the proper EH region. We will fix enclosing region extents
// once cloning is done.
//
BasicBlock* newBlk = comp->fgNewBBafter(BBJ_ALWAYS, *insertAfter, /* extendRegion */ false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems very complicated. Is it possible to extract this to a separate function so that the actual enabled-by-default logic remains readable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably so, though I'd like to have the second use case far enough along that whatever I do here works nicely for that case too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly worried about the maintainability. I spent a fair amount of effort untangling this logic from what it used to be in loop cloning.

@@ -59,6 +59,7 @@ CONFIG_INTEGER(JitBreakMorphTree, "JitBreakMorphTree", 0xffffffff)
CONFIG_INTEGER(JitBreakOnBadCode, "JitBreakOnBadCode", 0)
CONFIG_INTEGER(JitBreakOnMinOpts, "JITBreakOnMinOpts", 0) // Halt if jit switches to MinOpts
CONFIG_INTEGER(JitCloneLoops, "JitCloneLoops", 1) // If 0, don't clone. Otherwise clone loops for optimizations.
CONFIG_INTEGER(JitCloneLoopsWithEH, "JitCloneLoopsWithEH", 1) // If 0, don't clone loops containing EH regions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be diffs given that this is 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are quite a number of diffs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked at how many loops we clone in total and the fraction of loops cloned with EH in them?

@AndyAyersMS
Copy link
Member Author

Here's a sample diff

;;; baseline

; Assembly listing for method LoopsWithEH:Sum_LxTC(int[],int):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rbp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T02] (  4, 18   )     ref  ->  rcx         class-hnd single-def <int[]>
;  V01 arg1         [V01,T03] (  4, 11   )     int  ->  rdx         single-def
;  V02 loc0         [V02,T01] (  6, 34   )     int  ->  rax        
;* V03 loc1         [V03,T04] (  0,  0   )     int  ->  zero-ref   
;* V04 loc2         [V04    ] (  0,  0   )     int  ->  zero-ref   
;  V05 OutArgs      [V05    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V06 tmp1         [V06    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "impSpillSpecialSideEff" <System.Exception>
;  V07 PSPSym       [V07,T05] (  1,  1   )    long  ->  [rbp-0x10]  do-not-enreg[V] "PSPSym"
;  V08 rat0         [V08,T00] (  6, 41   )    long  ->   r8         "Widened IV V03"
;
; Lcl frame size = 48

G_M61877_IG01:  ;; offset=0x0000
       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+0x30]
       mov      qword ptr [rbp-0x10], rsp
						;; size=14 bbWeight=1 PerfScore 2.75
G_M61877_IG02:  ;; offset=0x000E
       xor      eax, eax
       xor      r8d, r8d
       test     edx, edx
       jle      SHORT G_M61877_IG07
       align    [9 bytes for IG03]
						;; size=18 bbWeight=1 PerfScore 2.00
G_M61877_IG03:  ;; offset=0x0020
       inc      eax
						;; size=2 bbWeight=8 PerfScore 2.00
G_M61877_IG04:  ;; offset=0x0022
       cmp      r8d, dword ptr [rcx+0x08]
       jae      SHORT G_M61877_IG05
       add      eax, dword ptr [rcx+4*r8+0x10]
       jmp      SHORT G_M61877_IG06
						;; size=13 bbWeight=8 PerfScore 72.00
G_M61877_IG05:  ;; offset=0x002F
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
						;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG06:  ;; offset=0x0035
       inc      r8d
       cmp      r8d, edx
       jl       SHORT G_M61877_IG03
						;; size=8 bbWeight=8 PerfScore 12.00
G_M61877_IG07:  ;; offset=0x003D
       add      rsp, 48
       pop      rbp
       ret      
						;; size=6 bbWeight=1 PerfScore 1.75
G_M61877_IG08:  ;; offset=0x0043
       mov      eax, -1
						;; size=5 bbWeight=0 PerfScore 0.00
G_M61877_IG09:  ;; offset=0x0048
       add      rsp, 48
       pop      rbp
       ret      
						;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG10:  ;; offset=0x004E
       push     rbp
       sub      rsp, 48
       mov      rbp, qword ptr [rcx+0x20]
       mov      qword ptr [rsp+0x20], rbp
       lea      rbp, [rbp+0x30]
						;; size=18 bbWeight=0 PerfScore 0.00
G_M61877_IG11:  ;; offset=0x0060
       lea      rax, G_M61877_IG08
						;; size=7 bbWeight=0 PerfScore 0.00
G_M61877_IG12:  ;; offset=0x0067
       add      rsp, 48
       pop      rbp
       ret      
						;; size=6 bbWeight=0 PerfScore 0.00

; Total bytes of code 109, prolog size 14, PerfScore 92.50, instruction count 35, allocated bytes for code 109 (MethodHash=e8e90e4a) for method LoopsWithEH:Sum_LxTC(int[],int):int (FullOpts)
; ============================================================

In the diff, note the loop is cloned and the cloned version optimized, and there are two funclets instead of one.

We can probably get rid of the try region for the cloned loop (on my todo-list).

;;; diff

; Assembly listing for method LoopsWithEH:Sum_LxTC(int[],int):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rbp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T04] (  7,  3.66)     ref  ->  rcx         class-hnd single-def <int[]>
;  V01 arg1         [V01,T03] (  6,  4.08)     int  ->  rdx         single-def
;  V02 loc0         [V02,T00] ( 10, 34   )     int  ->  rax        
;  V03 loc1         [V03,T05] (  6,  1.40)     int  ->   r8        
;* V04 loc2         [V04    ] (  0,  0   )     int  ->  zero-ref   
;  V05 OutArgs      [V05    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V06 tmp1         [V06    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "impSpillSpecialSideEff" <System.Exception>
;  V07 PSPSym       [V07,T06] (  1,  1   )    long  ->  [rbp-0x10]  do-not-enreg[V] "PSPSym"
;  V08 rat0         [V08,T01] (  4, 24.26)   byref  ->  rcx         "Strength reduced derived IV"
;  V09 rat1         [V09,T02] (  4, 24.26)     int  ->  rdx         "Trip count IV"
;
; Lcl frame size = 48

G_M61877_IG01:  ;; offset=0x0000
       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+0x30]
       mov      qword ptr [rbp-0x10], rsp
						;; size=14 bbWeight=1 PerfScore 2.75
G_M61877_IG02:  ;; offset=0x000E
       xor      eax, eax
       xor      r8d, r8d
       test     edx, edx
       jle      SHORT G_M61877_IG07
						;; size=9 bbWeight=1 PerfScore 1.75
G_M61877_IG03:  ;; offset=0x0017
       test     rcx, rcx
       je       SHORT G_M61877_IG08
       cmp      dword ptr [rcx+0x08], edx
       jl       SHORT G_M61877_IG08
       add      rcx, 16
       align    [0 bytes for IG04]
						;; size=14 bbWeight=0.50 PerfScore 2.75
G_M61877_IG04:  ;; offset=0x0025
       inc      eax
						;; size=2 bbWeight=7.92 PerfScore 1.98
G_M61877_IG05:  ;; offset=0x0027
       add      eax, dword ptr [rcx]
						;; size=2 bbWeight=7.92 PerfScore 23.76
G_M61877_IG06:  ;; offset=0x0029
       add      rcx, 4
       dec      edx
       jne      SHORT G_M61877_IG04
						;; size=8 bbWeight=7.92 PerfScore 11.88
G_M61877_IG07:  ;; offset=0x0031
       add      rsp, 48
       pop      rbp
       ret      
						;; size=6 bbWeight=1 PerfScore 1.75
G_M61877_IG08:  ;; offset=0x0037
       inc      eax
						;; size=2 bbWeight=0.08 PerfScore 0.02
G_M61877_IG09:  ;; offset=0x0039
       cmp      r8d, dword ptr [rcx+0x08]
       jae      SHORT G_M61877_IG10
       mov      r10d, r8d
       add      eax, dword ptr [rcx+4*r10+0x10]
       jmp      SHORT G_M61877_IG11
						;; size=16 bbWeight=0.08 PerfScore 0.74
G_M61877_IG10:  ;; offset=0x0049
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
						;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG11:  ;; offset=0x004F
       inc      r8d
       cmp      r8d, edx
       jl       SHORT G_M61877_IG08
       jmp      SHORT G_M61877_IG07
						;; size=10 bbWeight=0.08 PerfScore 0.28
G_M61877_IG12:  ;; offset=0x0059
       mov      eax, -1
						;; size=5 bbWeight=0 PerfScore 0.00
G_M61877_IG13:  ;; offset=0x005E
       add      rsp, 48
       pop      rbp
       ret      
						;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG14:  ;; offset=0x0064
       push     rbp
       sub      rsp, 48
       mov      rbp, qword ptr [rcx+0x20]
       mov      qword ptr [rsp+0x20], rbp
       lea      rbp, [rbp+0x30]
						;; size=18 bbWeight=0 PerfScore 0.00
G_M61877_IG15:  ;; offset=0x0076
       lea      rax, G_M61877_IG12
						;; size=7 bbWeight=0 PerfScore 0.00
G_M61877_IG16:  ;; offset=0x007D
       add      rsp, 48
       pop      rbp
       ret      
						;; size=6 bbWeight=0 PerfScore 0.00
G_M61877_IG17:  ;; offset=0x0083
       push     rbp
       sub      rsp, 48
       mov      rbp, qword ptr [rcx+0x20]
       mov      qword ptr [rsp+0x20], rbp
       lea      rbp, [rbp+0x30]
						;; size=18 bbWeight=0 PerfScore 0.00
G_M61877_IG18:  ;; offset=0x0095
       lea      rax, G_M61877_IG12
						;; size=7 bbWeight=0 PerfScore 0.00
G_M61877_IG19:  ;; offset=0x009C
       add      rsp, 48
       pop      rbp
       ret      
						;; size=6 bbWeight=0 PerfScore 0.00

; Total bytes of code 162, prolog size 14, PerfScore 47.66, instruction count 56, allocated bytes for code 162 (MethodHash=e8e90e4a) for method LoopsWithEH:Sum_LxTC(int[],int):int (FullOpts)
; ============================================================

bool clonedTry = false;
BasicBlock* const insertionPoint = *insertAfter;

INDEBUG(canCloneTry = (JitConfig.JitCloneLoopsWithEH() > 0);)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a parameter. Loop cloning is not the only user of this utility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Unrolling gave me more problems than cloning, and unrolling loops with EH is likely almost never a good idea, or perhaps should be handled differently (say only unroll if we can make the try region cover all the unrolls).

{
for (BasicBlock* const blk : BlockToBlockMap::KeyIteration(map))
{
if (!this->ContainsBlock(blk))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!this->ContainsBlock(blk))
if (!ContainsBlock(blk))

Comment on lines +5363 to +5370
CloneTryInfo(BitVecTraits& traits, BitVec& visited) : m_traits(traits), m_visited(visited) {}
BitVecTraits m_traits;
BitVec& m_visited;
BlockToBlockMap* m_map = nullptr;
jitstd::vector<BasicBlock*>* m_blocksToClone = nullptr;
bool m_addEdges = false;
weight_t m_profileScale = 0.0;
unsigned m_ehRegionShift = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The m_ prefix looks a bit odd to me for public fields meant to be accessed externally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants