Multi-threading
Multi-threaded concepts are important: e.g., atomics, locks, issues with different designs, how to make things thread safe. Cache locality is another huge thing these days. Asynchronous architectures and callbacks are what you will be dealing with every day.
- What is cache locality?
- How do multicore systems ensure their caches are in sync?
- How do you get around this problem?
- Why are signals slow and why is context switching bad?
- What exactly happens during a context switch?
"A computer program or subroutine is called reentrant) if multiple invocations can safely run concurrently on a single processor system."
"Race condition: an unfortunate order of execution causes undesirable behaviour."
Starting in C++11, scoped static initialization is now thread-safe, but it comes with a cost: reentrancy now invokes undefined behavior.
Issues
- Ensure critical sections are as small as possible
- Only a problem for multiple writers (multiple readers OK)
- Too few threads: algorithm is sub-optimal
- Too many threads: overhead of creating/managing and partitioning the data is greater than processing advantage; software threads outnumber the available hardware threads and the OS must intervene
- Data races, deadlocks and livelocks -- unsynchonised access to shared memory can introduce race conditions and undefined behaviour (program results depend non-deterministically on the relative timings of two or more threads)
- Threads versus processes
- STO: strict timestamp ordering (younger transactions wait for older transactions)
DCLP
- Singleton isn’t thread safe
Deadlocks
- Deadlocks
- "Livelock is a risk with some algorithms that detect and recover from deadlock"
- Resource starvation)
- Spinlock -- busy waiting, but useful in situations where the lock will become available in a very short time
Detecting deadlocks
Execution policy
std::execution::sequenced_policy
: The execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm's execution may not be parallelized. The invocations of element access functions in parallel algorithms invoked with this policy (usually specified as std::execution::seq) are indeterminately sequenced in the calling thread.std::execution::parallel_policy
: The execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be parallelized. The invocations of element access functions in parallel algorithms invoked with this policy (usually specified as std::execution::par) are permitted to execute in either the invoking thread or in a thread implicitly created by the library to support parallel algorithm execution. Any such invocations executing in the same thread are indeterminately sequenced with respect to each other.std::execution::parallel_unsequenced_policy
: The execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be parallelized, vectorized, or migrated across threads (such as by a parent-stealing scheduler). The invocations of element access functions in parallel algorithms invoked with this policy are permitted to execute in an unordered fashion in unspecified threads, and unsequenced with respect to one another within each thread.std::execution::unsequenced_policy
: The execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be vectorized, e.g., executed on a single thread using instructions that operate on multiple data items.
Prevention
- Try to avoid calling out to external code while holding a lock
- Try to avoid holding locks for longer than you need to
- If you ever need to acquire two locks at once, document the ordering thoroughly and make sure you always use the same order
- Immutability is great for multi-threading: immutable means thread safety; functional programming works well concurrently partly due to the emphasis on immutability
Mutex
See std::mutex
or std::atomic
in C++.
Exercise: write a producer-consumer solution.
- A mutex is a locking mechanism used to synchronize access to a resource
- A semaphore is signaling mechanism: "I am done, you can carry on" kind of signal
C++ threads
Launch with std::thread
, attach with std::join
. C++20 introduced std::jthread
which automatically rejoins on destruction.
std::async
runs the function f asynchronously (potentially in a separate
thread which might be a part of a thread pool) and returns a std::future
that
will eventually hold the result of that function call. Not entirely intuitive, see std::async
.
Threads versus processes
A thread is a branch of execution. A process can consist of multiple threads.
Threads
- Don't try to go lock-free unless you really have to. Locks are expensive, but rarely the bottleneck.
- Document the thread-safety of your types. Most types don't need to be thread-safe, they just need to not be thread hostile: i.e., "You can use them from multiple threads, but it's your responsible for taking out locks if you want to share them.
- Don't access the UI (except in documented thread-safe ways) from a non-UI thread. In Windows Forms, use Control.Invoke/BeginInvoke
- Use locks when you access mutable shared data, both for reads and writes.
C++ algorithms that can take an execution policy
- std::sort
- std::copy
- std::transform
- std::accumulate
- std::for_each
- std::reduce
- std::inclusive_scan
- std::exclusive_scan
- std::transform_reduce
- std::replace_if
- std::replace
- std::remove_if
- std::remove
- std::count_if
- std::count
- std::max_element
- std::min_element
- std::find_if
- std::find
- std::generate
References
- A Fast General Purpose Lock-Free Queue for C++
- Thread safety
- Memory barrier
- https://stackoverflow.com/questions/26013650/threadsafe-lazy-initialization-static-vs-stdcall-once-vs-double-checked-locki/
- https://devblogs.microsoft.com/oldnewthing/20040308-00/?p=40363#:~:text=Starting%20in%20C%2B%2B11,time%20execution%20reaches%20their%20declaration
- The Amazing Performance of C++17 Parallel Algorithms, is it Possible?