Saturday, June 20, 2009

File locking protocol with only shell scripting primitives

Writing this post out of some faint feeling of guilt and disappointment of not being able to keep up to the writing habit and discipline. It happens to be my first post ofthe year 2009. A lot many things have happened over these months, but as usual,the more things change, the more they seem to remain the same. So I jump directly to the point that pushed me to post this.

Encountered a problem today, regarding certain shell script running on a linux box.The script could be invoked by many users simultaneously. Now the problem is that the script used to write some text to a file. The changes could be updates through a redirect ('>>') or even deletes through 'gnu_sed'. As you might have guessed, the file being written to by the script was common to all users as well. I was looking for a way to serialize the 'write' access. Several incidents showed that multiple concurrent users eventually messed up the file contents through the concurrent shell script invocations.

Now, someone suggested a file locking mechanism. Each invocation of the script will lock the file and other invocations wait till it is released. I don't know for sure, but there is no shell interface for locking some file. I am aware of seeing something like that interface only at the system/C programming level through fcntl/flock calls. I had to think for a while before getting to the following algorithm, that is purely based on shell scripting primitives and guarantees a predictable first-in-first-out ordering. It also ensures that starvation cannot occur. Deadlocks are out of question, with there being only one resource to lock.

Here it is,
1/ Lets assume that a shell script (S) modifies the file (F) and we have a set of processes {P0, P1, ... Pn} that can concurrently invoke S to modify F according to some arbitrary logic.
2/ We make a kind of wrapper script (S2) to control access to S and add some logic.
3/ S2 works like this,
It creates a temporary file in the same directory with a name whose uniqueness is guaranteed e.g - a file named lock., pid - known quantity of the process P.
4/ Next, it checks for the creation dates on all lock.<*> files in that directory.
case 1 : If it finds that the date on its own lock file is earliest, it can invoke the script.
case 2 : If it finds another lock file with an earlier date, it sleeps for a configurable duration and repeats step '4/' once it wakes up. If the dates of another lock file co-incide and they are earliest among all locks, we can make the process with the lower pid the winner. The higher pid process waits and repeats 4/ after sleep.
5/ S2 finally invokes S which does all he work in exclusive mode for 'writes'.
6/ Delete the lock file created by self (having the own pid) after done and S returns control to S2.

We can assert that the FIFO ordering will be maintained, ensured by having the creation dates as the decider.Starvation is prevented by having the processes first create a lock file and thus marking the availability for ascript invocation.A process also has the luxury of relinquishing lock on the file by simply deleting its lock file, regaining lock by simply re-creating the lock file when it needs and queueing up again for access. If a process wants to justread the file, we can allow the script to make a copy of the file and release the lock as soon as possible.

Issues -
1/ The script S should be 'well-behaved'. Otherwise, it can simply use 'exit' to finish the processinstead of returning control to S2 for cleaning up the lock. file of that process. Ifany such pid file is left out in the system ,we can end up with a total starvation because noprocess will find its lock file better than the dead process' lock. A simple workaround is thatif a process P finds that is is deprived of a lock because of only one other process P, it verifies if the process who owns the lock P is still alive with the 'ps -ax' commands.If the process P is not alive, this process P can delete the lock file created by the dead process and assumeto be owner. Only one process will do this check because only one process will be in the 'second' position from gaining access.

2/ Another problem is much tougher to crack. The overall response/efficiency of this locking protocol heavily dependson the sleep time for the processes. The process has to go to a sleep if it finds that the script is already locked. We need the sleep time to be equal to the approximate time that the script S takes to run. If the sleep time is too low, processes will spend computing resources checking locks too frequently, only to find that theyhave to continue sleeping. If it is too high, the processing is under-utilised. In any case, the sleep has to happen, since I cannot find a way that the OS notifies the sleepingprocesses with this scheme. The OS does not participate in any locking here so it cannot help. Overall, If the requesting processes are assumed to be arriving frequently and steadily, then I will prefer having the sleep time on the higher-end. This is because once enough processes are sleeping, there will be a steady state when enough processes keep waking up and continue processing without the under-utilization problem. On the other hand, if the incoming processes are bursty in nature, then I think the sleep time needs tobe kept on the lower-end.
The above protocol certanly solves the central problem mentionned in the link,

- QED -