Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

steamcompmgr, wlserver: fix various data races found w/ thread sanitizer #1405

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sharkautarch
Copy link

@sharkautarch sharkautarch commented Jul 4, 2024

For testing w/ TSAN, I added on the meson flags -Db_sanitize=thread --buildtype=debugoptimized when compiling gamescope

I've only done tests w/ running gamescope in nested mode on my X11 host desktop, so this PR might need testing on a wayland host w/ the gamescope wayland backend.

I used the following command to test the latest commit from master w/ TSAN:
TSAN_OPTIONS="suppressions=${HOME}/gamescope/suppress_mangoapp" build/src/gamescope -- vkcube |& tee gamescope_tsan.log
note that I had to suppress mangoapp_update() (the suppress_mangoapp files is only one line: race:mangoapp_update) to avoid the log getting filled with messages about it

I observed at least three unique data race messages, occurring after gamescope + vkcube were actively running (so after initialization), that would reliably get triggered whenever I did the following in order:

  1. Cause gamescope's window(s) to go out-of-focus
  2. wait
  3. bring the window back into focus
  4. move the cursor around a bit

SUMMARY: ThreadSanitizer: data race ../src/wlserver.cpp:337 in bump_input_counter

SUMMARY: ThreadSanitizer: data race ../src/wlserver.cpp:2398 in wlserver_mousebutton(int, bool, unsigned int)

SUMMARY: ThreadSanitizer: data race ../src/wlserver.cpp:2379 in wlserver_mousewarp(double, double, unsigned int, bool)

Full log: https://gist.github.com/sharkautarch/4a8cf292d8b70d133904708ac29465ed

While there were a lot of TSAN warnings which could be false positives and/or non-issues,
the data races addressed by this PR seemed like a bigger deal

EDIT:
I just realized that if (bPrevState & suspended) should be if (!bPrevState & suspended) OOPSIES
Will fix that soon
^FIXED

Also, some explanations for some of the weird things in this PR:

  • I had to make bCursorHidden be std::atomic<char> in order to be able to use atomic fetch_or. This should be safe because I made sure that bCursorHidden is always initialized and assigned bool values which means that the variable will always be either 1 or 0
  • I use atomic_ref w/ inputCounter instead of a normal atomic because inputCounter is passed into an Xlib function. That means that the Xlib function is doing a non-atomic write to inputCounter, but TSAN doesn’t seem to complain, so either it’s a false negative or the threads just never actually race at that spot… I’ve been using gcc’s thread sanitizer, so maybe this might warrant checking w/ clang’s thread sanitizer…!
  • This PR uses a lot of acquire/release memory ordering. This is just to minimize any performance impact, since this PR adds a significant amount of atomic ops. I’m pretty sure that only two threads could simultaneously access memory (at least in the areas that this PR deals with) so, from what I understand, using acquire/release should be fine

EDIT: for the second bullet point, now that I think about it, Xlib is a shared library, which means that the only way for TSAN to catch races from Xlib function calls would be to rebuild Xlib w/ TSAN. So I think I’ll need to edit this PR again
^FIXED
^Just realized that XChangeProperty doesn't modify its input data (or at least I don't think it does). I updated the PR again

last update:
to deal with the one remaining edge case I saw, I changed all instances of

wlserver.bCursorHidden.store(!wlserver.bCursorHasImage, std::memory_order_release);
to:
std::atomic<char> atomicTrue{true};
atomicTrue.compare_exchange_strong(wlserver.bCursorHidden, wlserver.bCursorHasImage);

which makes the line a single CAS RMW operation

Not sure if the change is actually necessary,
and the compare-and-exchange is definitely slower than a load+store.
So let me know whether you think I should undo that...
(edit: for now I've decided to change compare_exchange_strong to compare_exchange_weak, tho this only makes a difference w/ non-x86 architectures)
^ I undid the CAS thing because I realized it didn't work the way I was thinking it would

@sharkautarch sharkautarch force-pushed the wlserver_fix_data_races branch 13 times, most recently from c0251ce to a542451 Compare July 5, 2024 21:04
@misyltoad
Copy link
Collaborator

Please don't do this alignas(std::atomic_ref<bool>::required_alignment)

Just make them atomics.

@misyltoad
Copy link
Collaborator

You also don't really need to do explicit acquire/release here. We can optimize those later. Just keeo them default (seq_cst)

@sharkautarch
Copy link
Author

sharkautarch commented Jul 6, 2024

@Joshua-Ashton

Please don't do this alignas(std::atomic_ref::required_alignment)

ok

You also don't really need to do explicit acquire/release here
I don't think we need seq_cst, all seq_cst adds compared to acquire/release is synchronization between different variables, whereas acquire/release already does synchronization between memory read/writes to the same variable.

If you're sure, then I'll change everything to seq_cst

but on the otherhand:
Just doing some quick calculations, if someone was using a 1000hz polling rate mouse, assuming that translates into wlserver_mousemotion() running 1000 times per second (1hz = 1 cycle/second), using seq_cst atomic writes would mean adding 46000 cpu cycles/second to wlserver_mousemotion() on x86 on an intel coffeelake cpu

  • seq_cst atomic writes to memory on x86 compile down to a mov+xchg (the latter does synchronization) (memory order doesn't make any difference for atomic reads on x86 [as far as the cpu is concerned, that is])
  • looking up the latency of xchg r,m on https://www.agner.org/optimize/instruction_tables.pdf for intel coffeelake, shows that it has a latency of 23 cycles
  • there are two atomic stores in wlserver_mousemotion()
  • 2 x 23 x 1000=46000
  • though now that I think about it, the lowest clock rates of all the coffeelake cpus is 1.7GHz, so: ( (46000Hz)/(1.7GHz) ) * 1second =~ 0.0273ms ... Ok I think I may have just defeated my own point lol
  • RIP anyone who's still using a pentium cpu: xchg r,m is either >100 cpu cycles (pentium 4) or just high, whatever that means...

on the steam deck's zen 2-based-cpu xchg r,m has a latency of about 18 cycles, tho I guess it's unlikely that someone would use a 1000hz input device on the steam deck

@sharkautarch
Copy link
Author

@Joshua-Ashton
Ok, I have changed the bCursorHasImage stuff to normal std::atomic stuff
and I changed the acquire/release stuff to seq_cst

@misyltoad
Copy link
Collaborator

You don't need to specify it... you can just use =

You can also simplify a lot of your code with std::exchange

@sharkautarch
Copy link
Author

sharkautarch commented Jul 6, 2024

@Joshua-Ashton

You don't need to specify it... you can just use =

ok

You can also simplify a lot of your code with std::exchange

Uhhhhh I'm not sure if you meant to say std::atomic<T>::exchange instead,
because I don't think std::exchange works with atomics:
https://godbolt.org/z/YaMfzvnzd

if you actually meant to say std::atomic<T>::exchange, then I don't see any place where that'd be useful,
because for the places where there is an atomic load + atomic store, I don't want to modify the variable that I do .load() from

I had tried to replace some atomic load+atomic stores with one compare and exchange to get around that issue, but wasn't able to use it without getting into the same issue as w/ atomic exchange.

EDIT: I just realized that std::atomic<T>::exchange only modifies the atomic itself -- not whatever's passed into it.
and it also returns the old value of the atomic.
So I was able to replace const bool bCursorWasHidden = wlserver.bCursorHidden.fetch_or(true); with const bool bCursorWasHidden = wlserver.bCursorHidden.exchange(true);

However, c++ std::atomic exchange's function prototype is: T std::atomic<T>::exchange(T desired)
so attempting to replace, say, wlserver.bCursorHidden.store(wlserver.bCursorHasImage.load()) with wlserver.bCursorHidden.exchange(wlserver.bCursorHasImage) I think would be a compile error

@sharkautarch sharkautarch force-pushed the wlserver_fix_data_races branch 8 times, most recently from ddc0f15 to 69a4a18 Compare July 7, 2024 19:30
@sharkautarch
Copy link
Author

@Joshua-Ashton
I’ve edited this PR based on your last feedback
Anything else that I need to change?

src/wlserver.hpp Outdated
bool bCursorHidden = true;
bool bCursorHasImage = true;
std::atomic<uint64_t> ulLastMovedCursorTime = 0;
std::atomic<char> bCursorHidden = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why char?

Copy link
Author

@sharkautarch sharkautarch Jul 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why char?

For some reason, you can't do fetch_or on a boolean atomic...
Which in the one case where I use it, I can't cleanly replace it with something like an atomic exchange.
only integral atomics can do stuff like fetch_or
https://en.cppreference.com/w/cpp/atomic/atomic#Specializations_for_integral_types
(note that technically on x86_64, w/ how I grab the old value of the atomic when using fetch_or, fetch_or actually gets compiled into an atomic compare exchange (which is still lock-free) because x86_64 lock or instruction does not return the old value of the atomic. Funnily enough, aarch64 and riscv [w/ the right atomic isa support] have their own version of lock or that actually returns the old value of the atomic.)

So I use char, since a bool, when stored in a struct/class (in most implementations), actually occupies the same amount of memory as a char.
Would you prefer I use uint8_t instead of char?

I believe most modern cpus should support lock-free atomics on 8 bit values, so I think it should be fine?

I know that basically all x86_64 cpus support it,
plus the same aarch64 isa extension that provided support for stuff like atomic increment also gave support for operations on a wider variety of atomic sizes, including 8bit:
https://devblogs.microsoft.com/oldnewthing/20220811-00/?p=106963

@sharkautarch
Copy link
Author

@Joshua-Ashton
Update:
I changed the one remaining fetch_or to a ternary of suspended ? exchange(true) : load
So now bCursorHidden is std::atomic<bool>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants