-Ogofaster

Notes from Andrei at code::dive 2015. Watch the whole lecture (recommended).

Integrals

Prefer 32-bit ints to other sizes

32-bits is the sweet spot as 64-bit ALU can handle two calculations
8, 16-bit computations convert to 32-bits so don't use smaller
Use small ints in arrays

Prefer unsigned to signed

Unsigned is faster
Except when converting to floating point

Most numbers are small

If you find optimisations that work with small numbers, use them

Floating point

Double- and single-precision equivalent speed
80-bit extended only slightly slower
But don't mix them (because conversions)
Int to float is cheap
Floating point to any integral is expensive

Strength reduction

Use minimum strength operations when optimising as the stronger ones are more costly.

Speed hierarchy

operation	cycles
Comparisons	less than 1
(u)int add, subtract, bitops, shift (2 x 32-bits)	1
Floating point add, subtract	TBC
(u)int32 multiply, floating point multiply	TBC
Floating point division, remainder	TBC
(u)int division, remainder	TBC

Why is it cheaper to do floating point division than integer?

Format of floating point: mantissa and exponent, already friendly to division.

Division in integers is unnatural. Division is a search!

Minimise indirect writes

Writes through a pointer: indirection
Hard to speculate, hard to enregister (put in a register)
Whenever you write it's really a read and a write
Because you transfer a 64-byte cache line: CPU has to read the cache line, update the word and write it back; if you can keep all values for a calculation in one CL all the better (even inject fake writes!)
The more dirty cache you have, the more writes you need to do

Improvements

Fewer indirect writes
Regular access patterns

Checkpoint

You can't improve what you can't measure (you can't measure what you don't measure)
Always choose good baselines
Optimise code judiciously for today's dynamics

The forgotten parallelism

Reservation station
Instruction-level parallelism on a single core

How do we use ALUs?

ILP: instruction level parallelisation.

Pipelining
Superscaler execution
Out-of-order execution
Register renaming
Speculative execution
... and many more

better instruction level parallelism = fewer data dependencies

Part 2

https://www.youtube.com/watch?v=9tvbz8CSI8M

Measure
Use good baselines
Reduce strength
Reduce dependencies
Knowing math is knowing optimisation

Unsigned integer overflow has well-defined behaviour; signed integer overflow is undefined.

Part 3

https://www.youtube.com/watch?v=FJJTYQYB1JQ

Incorporate conditionals into expressions rather than using ifs: + (size & 1)
Always use infinite loops (except most times)

1st order conclusions

Code that wants to be fast is left-leaning (goes to the left of the page: no ifs, decision points etc)
Don't mix hot and cold code, break ASAP
"Can't write heavily nested code in 80 characters"

2nd order conclusions

Tension with generic programming: "generic programming is why we can't have nice things"

-Ogofaster

-Ogofaster

Integrals

Prefer 32-bit ints to other sizes

Prefer unsigned to signed

Most numbers are small

Floating point

Strength reduction

Speed hierarchy

Why is it cheaper to do floating point division than integer?

Minimise indirect writes

Improvements

Checkpoint

The forgotten parallelism

How do we use ALUs?

Part 2

Part 3

1st order conclusions

2nd order conclusions

Timeline

results matching ""

No results matching ""