Caches

Small amounts of unusually fast memory (Data D$, Instruction I$, Translation Lookaside Buffer TLB), cache misses, speculatively prefetch, does it fit in cache (small is fast), no time/space tradeoff at hardware level, locality counts (stay in the cache), predictability helps.

A modern multi-core machine will have a multi-level cache hierarchy, where the faster and smaller caches belong to individual processors. When one processor modifies a value in its cache, other processors cannot use the old value any more. That memory location will be invalidated in all of the caches.

Furthermore, since caches operate on the granularity of cache lines and not individual bytes, the entire cache line will be invalidated in all caches.

Another definition is: "a multiprocessor is cache consistent if all writes to the same memory location are performed in some sequential order".

TLB
Content-addressable memory
MSI: modified shared invalid
Cache interventions
Exclusive state: issue read, if nobody else has it get write
Owner state (AMD)
Atomic transaction bus
Split transaction bus: everybody observes all the requests at the same time, responses come back some time later
Non atomicity > transient states
Scaling cache coherence > directory based coherence
Premature optimisation

Lectures

What do you mean by "cache friendly"? -- Björn Fahller (code::dive 2019)

Assume any pointer indirection is a cache miss
Smaller working data set is better
Use as much of a cache entry as you can: don't read a whole cache line to check one bit
Doing more work can be faster than doing less
Sequential memory can be very fast due to prefetching
Fewer evicted cache lines means more data in hot cache for the rest of the program
Mispredicted branches can evict cache entries (see speculative execution)

Tools

Valgrind is in essence a virtual machine using just-in-time (JIT) compilation techniques, including dynamic recompilation. Nothing from the original program ever gets run directly on the host processor.
perf
Tracy profiler

Cache coherence

In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

MESI cache coherence protocol

Each cache line has state, tag, data.

Modified - write back the data before anybody else is allowed to read main memory
Exclusive - mark if nobody is sharing, can change with impunity
Shared
Invalid - unused

Snooping

https://en.wikipedia.org/wiki/Bus_snooping

Write-invalidate (common)
Write-update (causes greater bus activity)

Directory based coherence

Used for large CPUs.

See wiki.

Write-back versus write-through

All Intel-compatible CPUs during the last one/two decades used a write-back strategy for caches (which presumes fetching a cache line first to allow partial writes). Of course that's the theory, reality is slightly more complex than that.

Cache misses

There are three kinds of cache misses:

Instruction read miss
Data read miss
Data write miss

Cache read misses from an instruction cache generally cause the largest delay, because the processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from main memory. Cache read misses from a data cache usually cause a smaller delay, because instructions not dependent on the cache read can be issued and continue execution until the data is returned from main memory, and the dependent instructions can resume execution. Cache write misses to a data cache generally cause the shortest delay, because the write can be queued and there are few limitations on the execution of subsequent instructions: the processor can continue until the queue is full. There is a more detailed introduction to the types of misses here.

Data cache

Straight line code (no branches)
Linear array traversal

Instruction cache

Avoid iteration over heterogeneous sequences with virtual calls (sort first)
Make "fast paths": branch-free sequences
Inline cautiously
Take advantage of PGO and WPO

Virtual functions

We can use CRTP to avoid the first two.

Load vtable pointer: potentially data cache miss
Load virtual function pointer: potentially data cache miss
Then code of function is called: potential instruction cache miss

Memory banks
Associativity
Inclusive versus exclusive content
Dirty cache not a problem for read-only
Hyper-threading
False cache line sharing - cache ping pong

Cache performance evaluate

Why it's critical
Why it's hard
Tools?

Cache associativity

Direct mapped cache
N-way set associative cache
Fully associative cache

Cache oblivious algorithm design

Typically a cache-oblivious algorithm works by a recursive divide and conquer algorithm, where the problem is divided into smaller and smaller sub-problems. Eventually, one reaches a sub-problem size that fits into cache, regardless of the cache size.

Profiling

PGO: profile guided optimisation
WPO: whole program optimisation
OProfile

Data-oriented design

Design to get the most value out of every single cacheline that is read. Split apart the functions and data.

Performance of a program is dependent on

time/program = instructions/program _ cycles/instruction _ time/cycle

Compiler, optimiser
Cycles per instruction: pipelining
Time per clock cycle

CPI = CPI_ideal + CPI_stall

CPI_stall contributors

Data hazards
Control hazards: branches, exceptions
Memory latency: cache misses

References

https://github.com/rollbear/cache-friendly-samples/blob/master/bheap_aux.cpp
Crunching bytes
Speculative execution
MIT pipelines
Pipeline stall
Predication)
Vector processor
MIMD
What every programmer should know about memory
Be nice to your cache
wiki/CPU_cache
False sharing is not fun -- Herb Setter
Gallery of processor cache effects -- Igor Ostrovsky
Writing faster managed code Jan Gray (section: Of Cache misses, page faults and computer architecture)
Optimising for instruction caches -- Amir Kleen
Memory is not free (more on Vista performance) -- Sergey Solynik

Caches

Caches

Lectures

Tools

Cache coherence

MESI cache coherence protocol

Snooping

Directory based coherence

Write-back versus write-through

Cache misses

Data cache

Instruction cache

Virtual functions

Cache performance evaluate

Cache associativity

Cache oblivious algorithm design

Profiling

Data-oriented design

Performance of a program is dependent on

CPI_stall contributors

References

results matching ""

No results matching ""

Caches

Lectures

Tools

Cache coherence

MESI cache coherence protocol

Snooping

Directory based coherence

Write-back versus write-through

Cache misses

Data cache

Instruction cache

Virtual functions

Cache related issues

Cache performance evaluate

Cache associativity

Cache oblivious algorithm design

Profiling

Data-oriented design

Performance of a program is dependent on

CPI_stall contributors

References

results matching ""

No results matching ""