Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-8904: Only update weak dependencies once per reconfigure. #4584

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 27 additions & 2 deletions robot/impl/local_robot.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ import (
"github.com/Masterminds/semver/v3"
"github.com/pkg/errors"
"go.uber.org/multierr"
"go.uber.org/zap/zapcore"
packagespb "go.viam.com/api/app/packages/v1"
modulepb "go.viam.com/api/module/v1"
goutils "go.viam.com/utils"
Expand Down Expand Up @@ -74,6 +75,7 @@ type localRobot struct {

// lastWeakDependentsRound stores the value of the resource graph's
// logical clock when updateWeakDependents was called.
lastWeakDependentsMu sync.Mutex
lastWeakDependentsRound atomic.Int64

// configRevision stores the revision of the latest config ingested during
Expand Down Expand Up @@ -754,7 +756,23 @@ func (r *localRobot) updateWeakDependents(ctx context.Context) {
// Track the current value of the resource graph's logical clock. This will
// later be used to determine if updateWeakDependents should be called during
// getDependencies.
r.lastWeakDependentsRound.Store(r.manager.resources.CurrLogicalClockValue())
r.lastWeakDependentsMu.Lock()
logicalNow := r.manager.resources.CurrLogicalClockValue()
if r.lastWeakDependentsRound.Load() < logicalNow {
// RSDK-8904: This method can be called concurrently such that resources are `Reconfigure`d
// in parallel. This logical clock existed for the purpose of avoiding double-reconfigure
// inside `updateWeakDependents` for the same robot `Reconfigure` call.
//
// Dan: Using the clock in this way will early return the "loser" goroutine. The "loser"
// goroutine will proceed despite weak dependents not yet having their `Reconfigure`
// called. I expect this is actually safe despite being a surprising lack of guarantee. I'm
// concerned about blocking here because I also* expect this logical clock atomic integer
Copy link
Member

@cheukt cheukt Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be incorrect in the sense that if two components depends on the motion service and are reconfigured at the same time, one might get an updated motion service while the other will early return and thus get a non-updated version.

I don't know if we were excessively concerned about blocking here versus trying to avoid using more locks when introducing the logical clock atomic integer so I would be ok with blocking here. This is basically an implicit dependency that could not be properly expressed. @benjirewis may have more context around the original introduction

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one might get an updated motion service while the other will early return and thus get a non-updated version.

I also think this is true unless the first goroutine hols the lastWeakDependentsMu until the Reconfigure has actually been called on the motion service. But, is that not just an extension of what @dgottlieb is already saying with

The "loser" goroutine will proceed despite weak dependents not yet having their Reconfigure called.

I think if we cared about avoiding that situation we would have to block even more and not r.lastWeakDependsMu.Unlock() until the end of this function.

I don't know if we were excessively concerned about blocking here versus trying to avoid using more locks when introducing the logical clock atomic integer so I would be ok with blocking here.

I don't have a ton of context around the original introduction of this field, but I agree with, you, @cheukt. Originally, there was likely very little concurrency leveraging this code, so it was probably assumed that calls to updateWeakDependents and getDependencies would be sequential. As such, it was probably "good enough" to store a clock value here and assume that the next (sequential) getDependencies call would not call updateWeakDependents due to the updated lastWeakDependentsRound value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, unless we hold the lock for the entire function and somehow lock for the Reconfiguration of the first component, the second component will never get the "most" up-to-date version of the motion service.

Ok with just leaving the comment there

// was introduced to avoid blocking.
r.lastWeakDependentsMu.Unlock()
return
}
r.lastWeakDependentsRound.Store(logicalNow)
r.lastWeakDependentsMu.Unlock()

allResources := map[resource.Name]resource.Resource{}
internalResources := map[resource.Name]resource.Resource{}
Expand Down Expand Up @@ -856,8 +874,14 @@ func (r *localRobot) updateWeakDependents(ctx context.Context) {
if len(r.getWeakDependencyMatchers(conf.API, conf.Model)) == 0 {
return
}
r.Logger().CDebugw(ctx, "handling weak update for resource", "resource", resName)
deps, err := r.getDependencies(ctx, resName, resNode)
if r.Logger().Level() == zapcore.DebugLevel {
var depNames []string
for name, _ := range deps {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for name, _ := range deps {
for _, name := range deps {

Probably obvious, but don't do for name := range deps like lint is suggesting. Does this log line definitely work/look ok?

Copy link
Member Author

@dgottlieb dgottlieb Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to get the resource name. deps is of type Dependencies:

type Dependencies map[Name]Resource

So getting the key part was intentional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2024-11-26T20:43:59.119-0500	DEBUG	TestCheckMaxInstanceValid	impl/local_robot.go:883	handling weak update for resource	{"resource":"rdk:service:motion/fake1","deps":["rdk-internal:service:frame_system/builtin","rdk:component:arm/fake2"],"err":null}

depNames = append(depNames, name.String())
}
r.Logger().CDebugw(ctx, "handling weak update for resource", "resource", resName, "deps", depNames, "err", err)
}
if err != nil {
r.Logger().CErrorw(ctx, "failed to get dependencies during weak dependencies update; skipping", "resource", resName, "error", err)
return
Expand Down Expand Up @@ -893,6 +917,7 @@ func (r *localRobot) updateWeakDependents(ctx context.Context) {
resChan <- struct{}{}
r.reconfigureWorkers.Done()
}()

updateResourceWeakDependents(ctxWithTimeout, conf)
})
select {
Expand Down
Loading