Sunday, August 31, 2014

Code optimization patterns - my quick notes

These patterns are based on what I read about here. I am still reproducing (dumping !) it here (with as high fidelity as I can manage), primarily because these patterns are what I do use in assorted situations and also that I want to keep in deep touch with all of them and keep applying them as much as I can. These are just my notes.

Some of these ideas are thrown up as directly conflicting with Object Orientation principles, by breaking encapsulation and trashing inheritance, in favor of composition. It really is a "black art" to know the appropriate situations where I need to apply them.

Here they are :

1/ Data locality -
a/ Converting a list of pointers to a list of offsets(4 byte integers) into contiguously allocated arrays. Iterating over an array of pointers is bad - it makes memory access to random locations - array of pointers "thrash the cache" (too bad, it rhymes !). Allocating array of objects in contiguous memory locations makes it likely that successive objects will be "found" in the cache.
b/ Hot-cold separation - Perhaps the object is large and the array of objects will not fit in the cache anyway. Perhaps for most iterations, only a few specific attributes of each object are processed. Splitting a larger object into smaller sets so that more frequently accessed parts are together. This might make the entire array of objects fit in the cache too.
c/ Reduced branch mis-prediction by
c1/ sorting a list of objects based on some criteria such that iteration only happens for a subset without enclosed 'if' condition
c2/ Creating separate lists of objects based on same criteria so that sorting an entire list is avoided
c3/ Just placing the iteration loop inside the 'if' condition, not other way - simple enough.
c4/ Not mixing different types of objects together in an array if they are to be iterated upon, even if they have same base class types and having separate arrays of each type.

2/ Dirty flag - I can think of it as the poor man's lazy evaluation. Basic dirty flagging is to have an additional boolean flag to tell if the object / cache is stale and requires refreshing. It helps avoid computation until it is required and only if it is required. Another flavor to the pattern is time-stamping or sequence numbering - We have a number or time-stamp that only increments without any danger of repeating and use it to mark the 'version' of the object. Whenever the sequence number is less than some other 'master', we update the cache.

3/ Object pooling - This is basically re-using allocations of some sort - either resources like database connections, threads or simple objects or even plain memory. We just want to avoid creating and destroying objects in memory when we know they really are required in future in some other contexts. There could be low and high watermarks set so that a minimum and a maximum limit on the number of objects in the pool is enforced, for not consuming too much memory and also being ready for multiple requests in burst. If we think of multi-threading scenarios, then each thread dynamically allocating memory from Operating systems is known to be slow, due to locking required for using the common heap. If each thread can use its own pool of memory, then no locking is required, re-use is achieved and thread-specific memory pool should not end up being fragmented.

4/ Spatial partitioning - Organize objects that are "close" to each other so that given an object, the "neighbors" can be efficiently located and processed. This is obviously helpful with real geometric spaces such as nearby particles being checked for collisions. However, the distance as a metric can be extended in other ways - Given a tree like structure, the distance between any two nodes may be expressed in terms of the number of common parents between them. This helps us partition the nodes better, so that queries like finding the number of child nodes for any given node may be made more efficient. This partitioning may be limited to nodes at same levels in the tree too. This pattern does present difficulties when the nodes can actually change distances to other nodes and we have to come up with efficient ways of updating the partitioning depending on the problem context.

Monday, August 11, 2014

USA Computing Olympiad Training Problem 3.3.3 (Camelot)

This is about an algorithmic coding problem that I saw on the USA Computing Olympiad training pages. There were several others that I solved earlier to reach this stage but this one is special ... named Camelot, in section 3.3.3. It seems to have been first posed in the IOI 1998.

Attempting to solve it, swept me through the full range of emotions, going from cocky over-confidence to being nastily surprised to bewilderment to despair, frustration and finally to immense relief of eventually getting rid of it by solving it ! There was absolutely no elation at all on solving it, just massive relief, that's all !

The statement goes like this :

Centuries ago, King Arthur and the Knights of the Round Table used to meet every year on New Year's Day to celebrate their fellowship. In remembrance of these events, we consider a board game for one player, on which one chess piece king and several knight pieces are placed on squares, no two knights on the same square.

During the play, the player can place more than one piece in the same square. The board squares are assumed big enough so that a piece is never an obstacle for any other piece to move freely.

The player's goal is to move the pieces so as to gather them all in the same square - in the minimal number of moves. To achieve this, he must move the pieces as prescribed above. Additionally, whenever the king and one or more knights are placed in the same square, the player may choose to move the king and one of the knights together from that point on, as a single knight, up to the final gathering point. Moving the knight together with the king counts as a single move.

Write a program to compute the minimum number of moves the player must perform to produce the gathering. The pieces can gather on any square, of course.


Line 1: Two space-separated integers: R,C, the number of rows and columns on the board. There will be no more than 26 columns and no more than 30 rows.
Line 2..end: The input file contains a sequence of space-separated letter/digit pairs, 1 or more per line. The first pair represents the board position of the king; subsequent pairs represent positions of knights. There might be 0 knights or the knights might fill the board. Rows are numbered starting at 1; columns are specified as upper case characters starting with `A'.

8 8
D 4
A 3 A 8
H 1 H 8

The king is positioned at D4. There are four knights, positioned at A3, A8, H1, and H8.

A single line with the number of moves to aggregate the pieces.
SAMPLE OUTPUT (file camelot.out)


They gather at B5. 
Knight 1: A3 - B5 (1 move) 
Knight 2: A8 - C7 - B5 (2 moves) 
Knight 3: H1 - G3 - F5 - D4 (picking up king) - B5 (4 moves) 
Knight 4: H8 - F7 - D6 - B5 (3 moves) 
1 + 2 + 4 + 3 = 10 moves.

You could assume that the King and Knight pieces move just like the standard chess moves. The board co-ordinate annotation is exactly the same too : The columns (A-Z) start go from left to right, while the rows (1 - 30) go from down to upwards.

Basically it requires full appreciation of breadth first search. To solve it, you need to use breadth first search in novel ways, tweaking it to suit the fact that the King and the Knight pieces can join/merge while moving and adjust shortest distances accordingly.

I struggled with the problem since initially I (wrongly) assumed that even two or more Knights could join just like a King and a Knight. That could make this problem even tougher. But as such, the join only happens between King and a Knight.

The input limits on the problem are very delicately adjusted to time-out the solutions that have even one excessive nested iteration. Suppose for 'R' rows, 'C' columns and 'P' pieces, the expected time complexity should be O(R*C*P) ... For the maximum input dimensions, it works out to be approximately 600K operations, should be comfortable for a time limit of 1 second. My initial solution was working OK, but had an extra iteration on number of pieces in the innermost loop, that was timing out my solution.

Finally, I got the code below, that does these things :
1. Records the distance to each square for the King.
2. For each square on the board, records the shortest path to every other square, using breadth first search, and saves all of it.
3. Tweaks breadth first search to record shortest path from each Knight to all other squares & assuming that this Knight picks up the King. This uses information stored in (1)
4. Iterates over the entire board, checking if the square can be the final gathering point. In the innermost loop, each Knight is examined if that Knight can pick up the King to reach the candidate gathering square in the cheapest way.

#include <fstream>
#include <vector>
#include <queue>
#include <math.h>

const int INF = 10000000;

int min(int a, int b){if(a < b) return a; return b;}

int max(int a, int b){if(a > b) return a; return b;}

int kr[8] = {-2, -1, 1, 2, 2, 1, -1, -2};
int kc[8] = {1, 2, 2, 1, -1, -2, -2, -1};

class co_ords{
 int r; int c;

class Knight_King{
 int for_knight;
 int for_king;
 int r;
 int c;

void init(Knight_King dists[30][26], int rows, int cols, int king_row, int king_col){
 for(int rr = 0; rr < rows; ++rr){
  for(int cc = 0; cc < cols; ++cc){
   dists[rr][cc].for_king = max(abs(rr - king_row), abs(cc - king_col));
   dists[rr][cc].for_knight = INF;

class Knight_Cost{
 Knight_King KK[30][26];

bool valid(int r, int c, int rows, int cols){
 if(r >= 0 && c >= 0 && r < rows && c < cols)
  return true;
 return false;
void reset(int arr[30][26], int rows, int cols){
 for(int x = 0; x < rows; ++x){
  for(int y = 0; y < cols; ++y){
   arr[x][y] = INF;
void knights_shortest(int dists[30][26], int rows, int cols, int r, int c, std::queue& q){
 dists[r][c] = 0;
 co_ords rc; rc.r = r; rc.c = c;
 int rr, cc;
  co_ords rc = q.front();
  for(int i = 0; i < 8; ++i){
   rr  = rc.r + kr[i];
   cc = rc.c + kc[i];
   if(valid(rr, cc, rows, cols) && dists[rr][cc] == INF){
    co_ords rrcc; rrcc.r = rr; rrcc.c = cc;
    dists[rr][cc] = min(dists[rr][cc], dists[rc.r][rc.c] + 1);

void knights_paths(Knight_King costs[30][26], int rows, int cols, int r, int c, std::queue& q){
 costs[r][c].for_knight = 0;

 Knight_King rc; rc.r = r; rc.c = c; rc.for_king = costs[r][c].for_king; rc.for_knight = costs[r][c].for_knight;
 int rr, cc;
  Knight_King rc = q.front();
  for(int i = 0; i < 8; ++i){
   rr  = rc.r + kr[i];
   cc = rc.c + kc[i];
   if(valid(rr, cc, rows, cols)){
    bool enq = ((min(rc.for_king, costs[rr][cc].for_king) + rc.for_knight + 1) < (costs[rr][cc].for_king + costs[rr][cc].for_knight));
     Knight_King rrcc; rrcc.r = rr; rrcc.c = cc; rrcc.for_king = min(rc.for_king, costs[rr][cc].for_king); rrcc.for_knight = rc.for_knight + 1;
     costs[rr][cc].for_king = min(rc.for_king, costs[rr][cc].for_king);
     costs[rr][cc].for_knight = rrcc.for_knight;

int get_sum(int dists[30][26], const std::vector& knights, int num_knights){
 int ret = 0;
 for(int i = 0; i < num_knights; ++i){
  ret += dists[knights[i].r][knights[i].c];
 return ret;

class square{
 int dists[30][26];
 square(int rows, int cols){
  for(int i = 0; i < rows; ++i){
   for(int j = 0; j < cols; ++j){
    dists[i][j] = INF;

int main(){
 int rows, cols;
 char col;
 std::vector knights;
 std::queue qkk;
 std::queue q;

 int row, num_knights = 0;
 int king_row, king_col;
 std::ifstream ifs("");
 ifs >> rows >> cols;

 ifs >> col >> row;
 king_col = (col - 'A');
 king_row = (rows - row);
 while(ifs >> col){
  ifs >> row;
  co_ords rc;
  rc.r = rows - row;
  rc.c = col - 'A';
 std::ofstream ofs("camelot.out");
 if(num_knights == 0){
  ofs << "0\n";
  return 0;

 std::vector board[30];
 for(int i = 0; i < 30; ++i){
  for(int j = 0; j < 26; ++j){
   square s(rows, cols);
   knights_shortest(board[i][j].dists, rows, cols, i, j, q);

 std::vector cost(num_knights);

 for(int x = 0; x < num_knights; ++x){
  init(cost[x].KK, rows, cols, king_row, king_col);
  knights_paths(cost[x].KK, rows, cols, knights[x].r, knights[x].c, qkk);

 int min_so_far = INF;
 int sum = INF, tmp_sum, sim_king_dist;

 for(int r = 0; r < rows; ++r){
  for(int c = 0; c < cols; ++c){
   sim_king_dist = max(abs(king_row - r), abs(king_col - c));
   sum = get_sum(board[r][c].dists, knights, num_knights) + sim_king_dist;

   for(int k = 0; k < num_knights; ++k){
    tmp_sum = sum;

    sum -= (sim_king_dist + board[r][c].dists[knights[k].r][knights[k].c]);

    sum += (cost[k].KK[r][c].for_king + cost[k].KK[r][c].for_knight);

    if(sum < min_so_far)
     min_so_far = sum;
    sum = tmp_sum;

 ofs << min_so_far << "\n";
 return 0;

A result of a rivetting journey over the weekend...

Sunday, July 27, 2014

A Red-Black Tree - slightly unusual

The Red-Black Tree is not something I can implement without referring a standard textbook - it's implementation is just way too intricate. Even so, when you read about it, everything conceptually clicks, somehow, but you wonder how someone could have got around to actually inventing this beast !

The Red-Black Tree is the data structure of choice when a worst case O(log N) complexity is desired. As such AVL trees have preceded them and they make the same guarantees but it seems the AVL trees have lost out to the Red-Black Trees in recent implementations. The AVL tree seems to be too eager to re-balance the heights of the sub-trees resulting in larger constant factor in typical usages. Red-Black trees just don't care too much about imbalances, by only guaranteeing that any path to leaf contains a maximum of twice the number of nodes than the other paths from the same node.

So, we see that the Red-Black Tree is incorporated in some very commonly used C++ - STL structures (maps).

I have implemented the tree now, it's a standard implementation, but with a few twists on how it is used. See my GitHub source shared here...

The Red-Black tree in my implementation has the following features :

1. It has a C++ interface, but there are no pointers in the interface or internal code. There are only 4 byte unsigned integers serving as offsets into a linear structure, such as a vector or array.

2. Tree operations, even insertions, do not require copying an element or compound objects. As such the objects are maintained in a vector by the client code and the tree only maintains offsets into the client's vector. Goes without saying that the objects to be inserted need not have a default constructor either.

3. Comparison function (actually a class) is supplied by the client code and invoked by the tree. Nothing new here though ...

4. Removing elements from the tree is complemented by a free-list feature. The free-list holds a list of freed elements so that subsequent insertions can re-use memory from removed elements.

I hope to derive these advantages from these characteristics :

1. Interleaved insertions-deletions should behave better, with reduced need to allocate-release memory, maybe to the OS or to a shared memory pool.

2. Not using any pointers presents multiple benefits - the memory consumed by the tree will not change just by changing between a 32-bit and a 64-bit OS ! Since the 4 byte unsigned integral offsets all refer to memory within a single contiguously allocated array, the memory references should be more compact/localized. Cache hits should be better than what could have been if memory for nodes was allocated individually.

3. No copying of objects involved. Default constructor on elements to be inserted is not required. Only 4 byte offsets referring to the container are stored. Unfortunately, such references mean that offsets of existing elements cannot be changed, that might happen if the container's elements were to be deleted.

This is something in work too - I am planning to implement a companion container to this tree, that can support deletions too. Also to come are a few other data structures that are not commonly found in libraries...

Saturday, July 26, 2014

About API design and copy semantics

This is a tale of how a combination of counter-intuitive copy semantics and API design kept me debugging a piece of code, late into the night, for several hours... Late night coding might be fun but it there is never any fun in working in a crisis mode, with time pressure and against a bug.

I suspect that the reasons behind my ordeal, are also the reasons why most programmers tend to vastly under-estimate the time it will take them to finish a programming task.

Sequences or contiguously allocated arrays usually offer some facilities to copy their internal representation, that is efficient, and shallow. Certainly more efficient than simply iterating over the entire sequence and copying each object by value. Typically such sequences have a raw buffer where the data is managed, a 4 or 8 byte 'length' to know the size of that buffer and whether the sequence 'owns' the buffer or not.

Ownership of the raw buffer is the "hot potato", the sequence which ends up with it should be the one releasing the buffer's memory. The consequences are ghastly - If both of them release or if both fail to release.

When a copy of a sequence is required, into another sequence, the pointer to the raw buffer is assigned, the 'length' is copied and the ownership is relinquished by the original sequence depending on what the caller wants.

Consider such feature from a CORBA sequence, and an implementation from 'Orbacus',
typedef OB::FixSeq< ::CORBA::ULong, int > MSeq; /* CORBA's fixed sequence of integers */

MSeq source;
source.length(2); /* Fill some data to test copy */
source[0] = 1;
source[1] = 2;

MSeq target; /* Empty sequence */

bool takeOwnership = true; /* Let's have target sequence acquire ownership, and source sequence relinquish it too ! */

/* Now actual call to copy sequence into target  */

I noted the following painful aspects of the 'replace()' call, that is supposed to do the efficient copy :

1. The name itself, 'replace' is suggestive of an action to 'change the guts' of the target sequence, but implies nothing about the fact that even the source sequence can be totally 'gutted' through passing certain choice of options transferring ownership.

2. How much of grunt work is expected from the caller ! The function takes 4 arguments including the length, maximum allocation of the source sequence and buffer, all internal details of the source sequence... Why does it not do with just a reference to source sequence and ownership option, and use all such internal detail from the source sequence itself ? After all, 'replace()' is a member function.

3. I find that boolean option of 'takeOwnership' is accepted twice in the same function call - first to tell the source that it should relinquish ownership and second to have the target sequence acquire it. Now, I can't think of a situation where the caller might desire to pass 'takeOwnership' as 'true' in first and 'false' in second or the other way. In fact, using different values for the ownership caused a crash that I debugged. The code was deep into the implementation and the results are totally counter-intuitive.

/* This code crashes  */
/* We are copying between 3 sequences : original -> intermediate -> final */
#include "OB/CORBA.h"

int main(){

typedef OB::FixSeq< ::CORBA::ULong, int > MSeq;

MSeq original;
original[0] = 1;
original[1] = 2;

MSeq intermediate; 

std::cout << intermediate.length() << " " << intermediate[0] << " " << intermediate[1] << "\n";

MSeq final;

std::cout << final.length() << "\n"; /* Length is as expected ==> 2 */

std::cout << final[0] << " " << final[1] << "\n"; // Still ... Crashes here...

return 0;


To build :
g++ -I  -L  -lOB -lJTC -ldl -lpthread 


What is certainly bad is that the length of the 'final' sequence is 2 but the internal buffer is just not there, any access causes a crash. The resulting sequence is simply not internally consistent.

The 'replace()' function API is designed such that it accepted too many arguments, that should have been deduced from a source reference. And that made it easy for a caller to mis-use the API. In that sense, it fails to have certain characteristics mentioned in an excellent piece from Joshua Bloch here !

Saturday, January 18, 2014

Beginning with Haskell

I just gave myself permission to spend some time learning a new programming language, and here are my notes on my initial effort. I bet anyone who has read this, has on their radar, to learn some functional programming language. I chose Haskell. Judging by the experiences of others attempting the same thing, and with vastly more neural firepower than me, maybe that is a brave decision !

So, yes, Haskell is daunting from the outside, but I am only first checking out if it is really possible to make any progress without being an expert in Category Theory or Type Systems ... After all, Haskell is supposed to be very general purpose. Can it give me some prop by allowing a humble "Hello World" come out easy ?

Here is one of the first programs I attempted : calculating the 'n'th Fibonacci number.

With Haskell, this is a first attempt,

fib n
       | n == 0 = 0
       | n == 1 = 1
       | n > 1 = fib(n - 1) + fib(n - 2)

So terse and clean. I could challenge someone to write a specification of the Fibonacci program in plain English and in fewer characters. Even without any introduction to the syntax, we can see through this affair easily, as if someone just wrote the pseudo code. Of course, as horribly performing as a naive C++ program following the same pattern ! Calculating the 30th Fibonacci takes upwards of 5 seconds and if someone asks for the 60th Fibonacci, this program won't return before the Haley's comet and much later. Exponential algorithmic complexity can bite hardest, even with Haskell.

So here is a result after more hours of reading this fantastic introductory online book named "Learn You a Haskell" :

fib_improved n = fst $ fib_helper n (0, 1)

fib_helper 0 (a, b) = (a, b)
fib_helper n (a, b) = fib_helper (n - 1) (b, a + b)

Somehow, this is equally short as the earlier version but is super fast. Asking for the 100th Fibonacci instantly gets me this result : "354224848179261915075" !

The last line of this code tells the whole story, which is using recursion to iterate over a "tuple" repeatedly and transforming the tuple in the process. Probably this textual explanation does a far worse job (at explaining) than what the single line of Haskell makes obvious. The first two lines are just special cases and presentation.

The patterns I see so far are twofold :
1. Recursion is cool, but the algorithmic considerations are paramount. Haskell, with its principles about immutable data, presents additional twists to the efficient coding game. We need to worry about unnecessary copying (to implement data immutability), and how not to blow up the stack even while using recursion.
2. One of the techniques to make things efficient in Haskell is to pass additional arguments to function calls and "build" an answer through them, instead of first calling the function and then processing the returned data from a function. That is what the second version of the Fibonacci program does differently from the first.

Encouraging !