Continuing from my earlier post about code optimization techniques here.
This is all about making code more data-oriented, which means organizing its data in memory with CPU cache access patterns and data sizes in mind.
Extracting benefits from any of these techniques is something of a black art, and some may backfire in specific circumstances. So, benchmark: be absolutely sure to measure (twice) before cutting (once).
Time to consider a few more ways to make code run faster...
- Inlining functions: Nothing too special about this, but worth noting. Keep an eye on its impact on the size of the executable.
- Mark functions as constexpr: The newer C++ standards make it possible to go further than inlining. If a function can be computed statically from its inputs, mark it constexpr so that the computation can finish at compile time. Of course, the arguments that callers pass to such a function must themselves be known at compile time. Neither keyword is a hard guarantee, though: inline is only a hint to the compiler, and a constexpr function is still evaluated at runtime when its arguments are not constant expressions. Compile-time evaluation is guaranteed only where a constant expression is required (for example, when the result initializes a constexpr variable). C++20 adds consteval for functions that must always be evaluated at compile time.
- Provide specialized versions of a function per data type. This is not just about your function taking a polymorphic base type as an argument. C++ allows overloaded function names, and it is sometimes possible to write a function slightly differently for a particular data type so that it runs faster. Perhaps it saves you a remote call or two. At the very minimum, if it helps avoid runtime conditional checks, it could be worth doing.
- Unrolling loops: Doing more work in one iteration of a loop.
for (int i = 0; i < 4*x; ++i) {
    process(arr[i]);
}
changes to...
for (int i = 0; i < 4*x; i += 4) {
    process(arr[i]);
    process(arr[i + 1]);
    process(arr[i + 2]);
    process(arr[i + 3]);
}
which just saves some of the book-keeping (the condition check and the increment) that the for-loop was doing on every iteration.
- Pass values to avoid aliasing: Sometimes you have to assure the compiler that some value is certainly not going to change across the iterations of a big loop. Then the compiler does not have to emit a fresh load of that value on every iteration.
void func(int arr[N], int* p) {  // 'p' is passed as a pointer, so it might alias some arr[i]
    for (int i = 0; i < N; ++i) {
        update(arr[i]);  // also uses 'p'
        // the value behind 'p' is loaded again here to be sure it is the latest
    }
}
Here, we could pass the value that p points to by value instead of passing the pointer; that is good enough for the compiler.
- Avoid making the CPU guess at branches: Sometimes, computing the results of all conditional paths in a piece of code and deciding only at the very end which one to return avoids the cost of a mispredicted branch.
- Process large arrays in parallel. Batch them and take care to avoid false sharing, i.e. threads that write to different variables which happen to sit on the same cache line, so that the line keeps bouncing between cores. Make sure the data each thread writes gets a full cache line to itself.
- If you process something in parallel as in the previous point, you might have to reduce at the end to get a combined result. Reduce pair-wise instead of making all threads block on a single shared variable.
- Unroll loops based on the latency of the inner operations, and order the operations so that registers and cache are used as fully as possible at each point within the iteration.
- Separate fast and slow code. Move exception handling out of tight code.
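To make the constexpr point above concrete, here is a minimal sketch. The factorial function and the names kFact10 and factorial_at_runtime are made up for illustration, and C++14 or later is assumed so that the loop is allowed inside a constexpr function. It shows the same function being forced to evaluate at compile time in one place and called like any ordinary function in another.

```cpp
#include <cstdint>

// Hypothetical example function: factorial with a plain loop (needs C++14 or later).
constexpr std::uint64_t factorial(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// The result initializes a constexpr variable, so the compiler has to
// evaluate the call during compilation.
constexpr std::uint64_t kFact10 = factorial(10);
static_assert(kFact10 == 3628800, "checked at compile time");

// Same function with an argument known only at runtime: this is an ordinary
// call, which the optimizer may or may not inline or fold.
std::uint64_t factorial_at_runtime(unsigned n) {
    return factorial(n);
}
```

The static_assert is a cheap way to confirm that a value really was produced at compile time.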
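For the branch prediction point above, a small sketch of what "compute everything, decide at the end" can look like. The function names and the threshold example are my own assumptions. Whether the second version is actually faster depends on the data and on the compiler, which may already turn the first version into branch-free code, so measure.

```cpp
#include <cstdint>
#include <vector>

// Branchy version: the 'if' is hard to predict when the values are random.
std::int64_t sum_above_branchy(const std::vector<int>& values, int threshold) {
    std::int64_t sum = 0;
    for (int v : values) {
        if (v > threshold) {
            sum += v;
        }
    }
    return sum;
}

// Branch-free version: always compute the contribution, then select it
// with arithmetic instead of a conditional jump.
std::int64_t sum_above_branchless(const std::vector<int>& values, int threshold) {
    std::int64_t sum = 0;
    for (int v : values) {
        const std::int64_t take = (v > threshold);  // 1 if taken, 0 otherwise
        sum += take * v;
    }
    return sum;
}
```

Both functions return the same result; only the shape of the loop body differs.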
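The two points above about parallel processing and pair-wise reduction could be sketched roughly like this. Everything here is an assumption for illustration: the 64-byte cache line (common on x86-64 but not universal), the names PaddedSum and parallel_sum, and C++17 so that std::vector honors the over-aligned type. For brevity the tree reduction runs on the calling thread after the workers have joined; a real implementation could run each level in parallel too.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache line size; check your target hardware.
constexpr std::size_t kCacheLine = 64;

// One accumulator per thread, aligned and padded so that no two
// accumulators share a cache line (this avoids false sharing).
struct alignas(kCacheLine) PaddedSum {
    std::int64_t value = 0;
};
static_assert(sizeof(PaddedSum) == kCacheLine, "one accumulator per cache line");

std::int64_t parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<PaddedSum> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            std::int64_t local = 0;  // accumulate locally, write the slot once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t].value = local;
        });
    }
    for (auto& w : workers) w.join();

    // Pair-wise (tree) reduction: add slots that are 'stride' apart,
    // doubling the stride at each level. No shared lock is involved.
    for (std::size_t stride = 1; stride < num_threads; stride *= 2) {
        for (std::size_t i = 0; i + stride < num_threads; i += 2 * stride) {
            partial[i].value += partial[i + stride].value;
        }
    }
    return partial[0].value;
}
```

Each worker touches only its own slot, so no thread ever waits on another until the final join.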
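For the point above about unrolling based on latency, one common form is to keep several independent accumulators so that consecutive additions do not wait on each other. This is only a sketch: the function name and the choice of four accumulators are assumptions, and the right number depends on the latency and throughput of the operation on your CPU. Note that reordering floating-point additions can change the result slightly compared to a plain sequential sum.

```cpp
#include <cstddef>
#include <vector>

double sum_multi_acc(const std::vector<double>& v) {
    // Four independent dependency chains instead of one.
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of 4.
    double tail = 0.0;
    for (; i < n; ++i) {
        tail += v[i];
    }
    return (s0 + s1) + (s2 + s3) + tail;
}
```

Unlike the plain unrolled loop earlier, the gain here comes from breaking the single chain of dependent additions, not from saving loop book-keeping.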
That's it for now.
Hoping to be mindful of such finer points.