In a multi-processor system locking between processors to serialize access to critical data structures and variables is necessary. In 4.4BSD Lite² this has been implemented by the addition of a generic lock manager, and the existing file system code has been updated to use it. In this paper we will attempt to describe the lock manager implementation thoroughly, and look at how some of the file system code uses it.
In particular, we will examine how vnode
reclamation
is handled in situations where more than one CPU may be trying to
free a vnode
. We will start by examining the
low-level atomic locks in detail. After that we will see how the
atomic locks are used by the generic locking package, and examine
how multiple, shared and exclusive locks are implemented.
The locking package also supports upgrading a shared lock to an exclusive lock, by draining out other shared locks. This will also be examined in detail.
We will also examine how the locking code is currently used in
the updated file system code, specifically in the functions
vgone()
, vclean()
,
vrele()
, vget()
and
getnewvnode()
to protect against multiple CPUs
trying to reclaim a vnode
at the same time.
The following is pseudo-code describing the lock manager, as well as part of the file system code. Following that there will be annotations of each part of the pseudo-code, describing in detail what the code does.
All multi-processor systems need some amount of locking between processors to make sure access to some data structures or hardware is done atomically. The low-level locking code described here is responsible for serializing such access. Note that the low-level locking code is using spin-locks (a processor will busy-wait while trying to acquire such a lock); an operation that can tie up resources on some architectures. Because of this, and also to implement shared locks, the low-level locking routines described here is usually used only by a high-level lock manager and to protect short critical sections of code.
simple_lock_init
The simple_lock_init()
function is used to initialize a low-level spinlock. This must be done before any other operation on a lock can be performed. The contents of a struct simplelock
is highly architecture dependant, and usually consist of the smallest data type that can be atomically modified.
/* * Initialize a simple lock structure so it can be used */ void simple_lock_init(struct simplelock *lock) { [1] clear lock->data }
simple_lock
The simple_lock()
function is used to acquire
a low-level spinlock. It will spin waiting for a lock to be
released if it is currently held by another processor. This
is the only kind of lock that can be used inside an
interrupt handler. For a low-level lock to be used in an
interrupt handler, care must be taken to disable the
interrupt, acquire the lock, do any processing, release the
lock and re-enable the interrupt. This is necessary to avoid
deadlock between the interrupt handler and other code
executing on the same processor. Note that if there are no
other processors in a system and the lock has already
obtained, the function will never return as the lock cannot
be released.
The test_and_set()
function must be able to
test and set the contents of a struct
simplelock
atomically across processors regardless of
what bus is used to connect them, and return TRUE as long as
the lock is already being held by a process.
/* * Attempt to acquire a lock, spin until lock obtained */ void simple_lock(struct simplelock *lock) { [2] while (test_and_set returns TRUE) ; }
simple_lock_try
The simple_lock_try()
function is used to
acquire a low-level spinlock. If the lock has already been
acquired the function will return failure immediately. Look
at simple_lock()
for a description of
test_and_set()
.
/* * Attempt to acquire a lock * * Return TRUE if success, FALSE otherwise */ int simple_lock_try(struct simplelock *lock) { [3] if (test_and_set returns FALSE) return TRUE return FALSE }
test_and_test()
once, attempting to
acquire a lock. If the lock is acquired, return TRUE,
otherwise FALSE. This operation will never block.
simple_unlock
simple_unlock()
will essentially perform the
same operation as simple_lock_init()
above.
However, some architectures might need to have the state of
a low-level spinlock propagated to other processors in ways
that would not be needed for the initialization of a lock,
but only when setting or clearing it.
Many multiprocessor architectures, like systems based on
Intel®'s MP specification do not need to do this, but
need only clear the contents of the struct
simplelock
.
/* * Unlock */ void simple_unlock(struct simplelock *lock) { [4] clear lock->data }
simple_lock()
or
simple_lock_try()
. This will set the state
of the lock to the same as when the lock was first
initialized with simple_lock_init()
.
The following section will describe the lock manager as found in 4.4BSD Lite². This lock manager supplies both exclusive and shared locks, with recursive exclusive locks within a single process. It also allows upgrading a shared lock to an exclusive lock, as well as downgrading an exclusive lock to a shared lock.
lockinit
As with the low-level spinlocks, a high-level lock needs to be
initialized before use. In the 4.4BSD Lite²
implementation, this will initialize the associated
struct simplelock
and save some parameters for the
lock.
The parameters that can be specified for a high-level lock are priority, which is used to determine a process' priority when it is woken up after sleeping on a lock, a message used when a process goes to sleep waiting for a lock, so that the exact reason a process is sleeping can easily be identified, a timeout, and flags specifying options governing the lock's behavior.
/* * Initialize a lock */ void lockinit(struct lock *lock, int pri, char *msg, int timeout, int flags) { [5] call simple_lock_init on lock->simplelock set lock->priority to pri set lock->wmesg to msg set lock->timo to timeout set lock->flags to flags }
simple_lock_init()
on the embedded
low-level lock. Initialize the remaining fields in a lock
structure from the arguments provided; the priority to wake
up at after being put to sleep on a lock, the message to
pass to the sleep routing when blocking on a lock, the
timeout and any auxiliary flags on a lock.
lockmgr
The lockmgr()
function is used to perform all
operations on a high-level lock after it has been initialized
with lockinit()
. Operations supported are
LK_SHARED
to obtain a shared lock,
LK_DOWNGRADE
to downgrade an exclusive lock to a
shared lock, LK_EXCLUPGRADE
and
LK_UPGRADE
to upgrade a shared lock to an
exclusive lock, LK_EXCLUSIVE
to obtain an
exclusive lock, LK_RELEASE
to free a lock and
LK_DRAIN
to drain out all locks active locks.
The LK_EXCLUPGRADE
and LK_UPGRADE
calls are essentially the same, but
LK_EXCLUPGRADE
will fail if any other process
has already requested an upgrade to an exclusive lock.
A process can also specify that it does not want to be put to
sleep waiting for a lock, such that if a lock request can not
be honored immediately, the lock manager will return an
EBUSY
error to the calling function. This is
described by the boolean condition polling in the
following pseudo-code.
If at any time the process is put to sleep waiting for a
lock to become available, the message specified in
lockinit()
will be used to provide information
about the reason the process was put to sleep. When a
process wakes up after sleeping on a lock, it's priority will
be set to that specified when the lock was first initialized.
/* * Acquire, change or release a lock * * Return 0 if successful, error code if not */ int lockmgr(struct lock *lock, u_int flags, struct simplelock *slock, struct proc *p) { clear error; [6] call simple_lock on lock->interlock [7] if (flags & LK_INTERLOCK) call simple_unlock on slock [8] save combined new and old lock flags switch (lock request) { case LK_SHARED: [9] if (current process does not own a lock) { if (someone else is waiting) [10] if (polling) { set error to EBUSY break; } [11] sleep until exclusive locks and upgrades are cleared; [12] if (error is set) break; [13] increment shared lock count; break; } [14] increment shared lock count; case LK_DOWNGRADE: [15] increment shared lock count by exclusive lock count; [16] reset exclusive lock count and flag; [17] if (anyone waiting for exclusive locks to clear) wake them up; break; case LK_EXCLUPGRADE: [18] if (anyone else is waiting for exclusive upgrade) { reduce shared lock count by one; set error to EBUSY; break; } case LK_UPGRADE: [19] reduce shared lock count by one; if (we would block) [20] if (polling) { set error to EBUSY; break; } [21] if (noone is waiting for upgrade) { set WANT_UPGRADE flag sleep until shared locks are drained; if (error is set) break; [22] set exclusive lock flag and count; break; } [23] if (anyone else is waiting for upgrade) wake them up; case LK_EXCLUSIVE: [24] if (we have exclusive lock) { increment exclusive lock count; break; } if (we would block) [25] if (polling) { set error to EBUSY; break; } [26] if (anyone is waiting for exclusive lock) { sleep until WANT_EXCLUSIVE is clear; if (error is set) break; } [27] set WANT_EXCLUSIVE flag; sleep until shared locks and upgrades are drained; reset WANT_EXCLUSIVE flag; if (error is set) break; [28] set exclusive lock count to one; break; case LK_RELEASE: [29] if (exclusive locks exist) { decrement exclusive lock count; [30] if (lock count zero) clear exclusive lock flag; } [31] else if (shared locks exist) decrement shared lock count; [32] if (anyone waiting for locks to drain) wake them up; break; case LK_DRAIN: if (we would block) [33] if (polling) { set error to EBUSY; break; } [34] sleep until waits for exclusive locks and upgrades are drained; reset error; [35] while (any exclusive locks, upgrades or shared locks exist) { set WAIT_DRAIN flag; [36] call simple_unlock on lock->simplelock; [37] sleep until a lock has drained; if (error is set) return error; [38] call simple_lock on lock->simplelock; } [39] set draining and exclusive lock flag; set exclusive lock count to one; break; } [40] if (we are draining locks and all locks are gone) wake them up anyone waiting for locks to change; [41] call simple_unlock on lock->interlock return error; }
EBUSY
to indicate this.
EBUSY
. Otherwise fall
through to the generic upgrade code.
EBUSY
instead of blocking.
WANT_EXCLUSIVE
flag on the lock and
go to sleep until all other upgrades and shared locks have
been drained. When we wake up, clear the
WANT_EXCLUSIVE
flag. If we get an error while
sleeping, abort processing. Note that we clear the
WANT_EXCLUSIVE
flag regardless of whether we
get an error or not.
EBUSY
.
WAIT_DRAIN
flag and release the
interlock on the lock structure.
lockstatus
The lockstatus()
function is used to retrieve
the status of a lock. The lock must have been previously
initialized by lockinit()
, but need not
otherwise have been used.
The function will return a status value indicating whether any locks are held, and if so, what kind of lock (shared or exclusive). All accesses to internal lock data are protected by asserting the high-level lock's interlock while examining the lock data.
/* * Get the status of a lock */ int lockstatus(struct lock *lock) { [42] call simple_lock on lock->simplelock; [43] if (lock->exclusive_count) { call simple_unlock on lock->simplelock; return exclusive status; } [44] if (lock->shared_count) { call simple_unlock on lock->simplelock; return shared status; } [45] call simple_unlock on lock->simplelock; return unlocked status; }
simple_lock()
on the embedded interlock
to guarantee the lock does not change out from under us.
The file system code in 4.4BSD Lite² has been modified
to be multi-processor safe. In the following section we will
examine how the vnode
reclamation code uses
locks around critical sections to serialize access to
vnodes
.
vgone
The system calls vgone()
to eliminate all
activity on a vnode
. This is done whenever a
vnode
will be re-used, possibly for a
different file system. The operation needs to be protected
to avoid race conditions where the system might attempt to
re-use the vnode
due to a process re-opening
the file previously associated with it on a different
processor or another processor attempts to reclaim the
vnode
at the same time.
We do this by utilizing the struct simplelock
embedded in the vnode
and using the low-level
lock functions to serialize access to it. Note that a
low-level spinlock may not be held for the duration of the
reclamation process. We also have to verify that the
vnode
is not already being reclaimed after
acquiring the spinlock. If we determine that reclamation is
already in progress, we release the spinlock and go to
sleep until the reclamation process has completed.
We will also move the vnode
to the head of the
free list if it is already on it, to force the system to
re-use it as soon as possible, since it is no longer
associated with any underlying file system. Access to the
vnode
free list also needs to be serialized
using a struct simplelock
.
At the end of processing, the vnode
will be
marked as being "BAD
". We have
already associated the vnode
with the
"dead" file system by calling
vclean()
on it.
/* * Eliminate all activity associated with a vnode * in preparation for re-use */ void vgone(struct vnode *vp) { [46] call simple_lock on vp->interlock; [47] if (vgone or vclean in progress) { call simple_unlock on vp->interlock; [48] wait for vgone or vclean to finish; return; } [49] call vclean on vp to close file; [50] remove from mount list if on it; if (vnode on freelist) { call simple_lock on free_list_lock; [51] move vnode to head of free list; call simple_unlock on free_list_lock; } [52] set vnode type to BAD; }
vnode
interlock to ensure the
vnode
does not change out from under us.
vnode
is already being reclaimed we
must let that complete and then release the
vnode
interlock.
vgone()
or
vclean()
is done and return.
vclean()
to close the underlying file
and clear out any underlying file system specific data
from the vnode
.
vnode
is on the mount point
vnode
list, remove it.
vnode
is already on the free list, we
want to move it to the head of the list so that it will
be re-used sooner. Lock the free list, move the
vnode
to the head of the list and unlock the
free list.
vnode
type to
"BAD
". It has already been
associated with the "dead" file system by the
call to vclean()
.
vclean
The vclean()
function is used to disassociate
the attached underlying file system from a
vnode
. Note that the embedded struct
simplelock
is always held on entry to the function.
We also set the VX_LOCK
flag on the
vnode
, marking it as currently being reclaimed
before we continue processing the reclamation.
If we have been asked to close the file associated with the
vnode
, we will invalidate all associated
buffers, forcing them to be written to the underlying file
system. If the file associated with the vnode
was open on entry, we call the underlying file system,
closing the file, and deactivating the vnode
.
This operation will release the embedded struct
simplelock
. If the vnode
is not
currently active, we release the embedded lock.
We then call the underlying file system to reclaim the file
system specific portion of the vnode
, and if
the vnode
was active we call
vrele()
to release it and invalidate any
references to the vnode
in the buffer cache.
If there were any lock structures associated with the
vnode
, we free them, and point the
vnode
at the "dead" file system so
that any attempt to do operations on the
vnode
will fail.
The vnode
has now been completely reclaimed,
and we can clear the lock flag we set at the start of
processing and wake up any processes waiting for the
reclamation to complete.
/* * Disassociate the underlying file system from a vnode. * The vnode interlock is held on entry */ void vclean(struct vnode *vp, int flags, struct proc *p) { [53] save vnode active count; [54] obtain lock on vnode; [55] if (vnode is closing) invalidate associated buffers; [56] if (vnode is active) { [57] if (vnode is closing) call close routine; [58] inactivate vnode; } else release lock on vnode; [59] reclaim vnode; if (vnode was active) [60] call vrele to release vnode; [61] purge cache information about vnode; [62] free lock structures associated with vnode; [63] make vnode reference dead filesystem; [64] clear the reclaimation lock; wake up any sleepers on vnode; }
vnode
so we can use this to decide how to
proceed after we have deactivated it. Also increment the
reference count on the vnode
so that no other
process tries to recycle it before we have completed.
vnode
as being locked for
reclamation. Note that the low-level vnode
interlock has already been acquired at this point. This
lock ensures that no new processes will attempt to
acquire any locks on the vnode
. Thus, once
all shared and exclusive high-level locks has been
drained, it will be safe to deallocate the high-level lock
structures in the VOP_RECLAIM()
routine.
vnode
is
being closed, invalidate all associated buffers in the
buffer cache.
vnode
is currently active. If it is
not, release the vnode
interlock.
vnode
interlock that was acquired on entry.
vnode
as inactive.
vnode
.
vnode
was currently active, call
vrele()
to release it.
vnode
from
the vnode
cache.
vnode
.
vnode
at the "dead" file
system so that any references to the vnode
will fail.
vrele
vrele()
is called to remove references to a
vnode
. Changing the reference count has to be
done inside a spinlock to serialize access to the
vnode
. If we removed the last reference to the
vnode
, acquire a spinlock on the
vnode
free list, append the vnode
to the free list, and, release the free list spinlock. If
we can get an exclusive lock on the vnode
, call
the underlying file system to inform it that we removed the
last reference to the vnode
.
/* * Release a vnode. If count drops to zero, call inactive * routine and return to free list */ void vrele(struct vnode *vp) { call simple_lock on vp->interlock; [65] decrement vp->usecount; [66] if (vp->usecount > 0) { call simple_unlock on vp->interlock; return; } [67] call simple_lock on free_list_lock; [68] append vp to free list; call simple_unlock on free_list_lock; [69] if (exclusive lock on vp succeeds) call inactive on vp; }
vnode
interlock, decrement
the count of active references to the vnode
.
vnode
, release the vnode
interlock and return.
vnodes
from it.
vnode
to the vnode
free
list and release the lock on the free list.
vnode
; if successful mark it as inactive.
Note that if the exclusive lock fails, the
vnode
interlock is released. The
vnode
inactivation will release the
vnode
interlock.
vget
The vget()
function is called when we want to
reuse a specific vnode
, i.e. when we open a
file that has been open recently and for which the
associated vnode
has not been reclaimed for a
different file.
The first thing that is done is to acquire the
vnode
interlock unless it was already set
before vget()
was called. We then check to see
whether reclamation is in progress by vclean()
.
If it is, we flag that we want to be woken up once
reclamation has finished, after which we release the
vnode
interlock. We then go to sleep until the
reclamation process has finished and return
ENOENT
to indicate the vnode
is
no longer usable.
If there are currently no references to the
vnode
, it is on the vnode
free
list. Acquire a lock on the free list, remove the
vnode
and release the lock.
Increment the reference count on the vnode, and see of we
were told to acquire a high-level lock on it. If so, try
acquiring the requested locks, and call
vrele()
is not successful. Note that acquiring
the high-level lock has the side effect of clearing the
vnode
interlock. Return the status from
attempting to acquire the high level vnode
lock.
All processing has now been completed, release the
vnode
interlock and return.
/* * Grab a particular vnode from the free list, increment its * reference count and lock it. If the vnode lock bit is set the * vnode is being eliminated in vgone. The process is awakened * when the transition is completed, and an error returned to * indicate that the vnode is no longer usable (possibly having * been changed to a new file system type). */ int vget(struct vnode *vp, int flags, struct proc *p) { [70] if (vnode interlock not set) acquire interlock; [71] if (vnode reclamation in progress) { flag we want it; release vnode interlock; [72] sleep until vnode reclamation done; return ENOENT status; } [73] if (no references to vnode) { acquire vnode free list lock; [74] remove vnode from free list; release vnode free list lock; } increment number of references to vnode; [75] if (high-level locks wanted) { if (we cannot acquire the high-level lock) call vrele on vnode; return status from acquiring high-level lock; } [76] release vnode interlock; return OK status; }
vnode
interlock
was already set at entry, acquire the vnode
interlock now.
VXLOCK
flag is set in the
vnode
, reclamation is in progress. We then
set the VXWANT
flag indicating that we will
be sleeping until reclamation is complete and release the
interlock.
ENOENT
indicating that the vnode
is no longer
usable.
vnode
, it has been put on the free list and
we need to take it off.
vnode
free list, remove
the specified vnode
from it and release the
lock.
vnode
, attempt to do so now. Note that
attempting to acquire the high-level lock will release the
vnode
interlock we acquired at the start. If
we were unable to acquire the high-level lock, call
vrele()
to put the vnode
back on
the free list. Return the status code from attempting to
acquire the high-level lock on the vnode
.
vnode
is ready for reuse. Release the
interlock and return with the good news.
getnewvnode
The getnewvnode()
is used to allocate a new
vnode
. It will attempt to reclaim
vnodes
from the vnode
free list
unless the free list is currently empty or the current
number of vnodes
is less than the desired
number of vnodes
on the system.
Start out by acquiring the lock on the vnode
free list. Then, if the free list is empty and the current
number of vnodes
is less than twice the
desired maximum number of vnodes
on the system,
or the current count is less than the desired number of
vnodes
, release the lock on the
vnode
free list. Allocate a new
vnode
and clear it and increment the total
number of vnodes
on the system.
If we did not allocate a new vnode
as
outlined above, we need to find one on the
vnode
free list. Walk down the free list,
attempting to acquire the interlock on each
vnode
in turn. If we find a
vnode
we could lock, stop the process. If we
walk to the end of the free list without being able to lock
a vnode
, release the lock on the free list and
return ENFILE
status.
Having found a vnode
on the free list, we take
it off the list. We also mark the vnode
so
that vgone()
will not move it on the free list.
We are now done with the free list, and release the lock we
acquired at the start of processing.
If the vnode
found above has not been
completely cleared out, we now call vgone()
on
it to release any file system specific data associated with
it. Note that this will release the vnode
interlock. If the vnode
had already been
cleared out, we simply release the interlock.
Regardless of whether we allocated a new vnode
or picked one out from the free list, we now set values
according to the arguments passed. We start by setting the
type to NONE
, indicating that it is not
currently associated with any file. We call
cache_purge()
to make sure any old data
associated with the vnode
has removed from the
cache. We then set up the vnode
operations
table according to the vops argument passed and
insert the vnode
on the specified mount
queue.
Last, we set the current number of references to the
vnode
to one and return status indicating
success.
/* * Return the next vnode from the free list */ int getnewvnode(enum vtagtype tag, struct mount *mp, int (**vops)(), struct vnode **vpp) { acquire lock on vnode free list; [77] if ((vnode free list empty && current vnode count < twice the desired count) || current vnode count < desired vnode count) { release lock on vnode free list; [78] allocate new vnode and clear it; increment the vnode count; } else { [79] walk vnode free list until we can acquire interlock on vnode; [80] if (no vnode could be locked) { release lock on vnode free list; return ENFILE status; } [81] remove vnode from vnode free list; [82] mark vnode so vgone() will skip it over; release lock on vnode free list; [83] if (vnode not pointing to dead file system) call vgone to release associated file system data; else release vnode interlock; [84] clear vnode data; } [85] set vnode type to NONE; [86] call cache_purge on vnode; initialize vnode from arguments; [87] insert vnode into the specified mount queue; set reference count on vnode to 1; return OK; }
vnode
free list. Then, if the current number of
vnodes
is less than the desired number of
vnodes
or the free list is empty and the
current number of vnodes
is less than twice
the number of desired number, allocate a new
vnode
.
vnode
free list and
allocate a new vnode
. Clear the newly
allocated vnode
and increment the total count
of vnodes
we have.
vnode
from the free list. Start walking down the free list,
attempting to acquire an interlock on each
vnode
visited. If we find one we are able to
lock, stop walking the free list and continue processing.
vnode
above, release the lock on the free
list and return ENFILE
, indicating that no
more new vnodes
can be allocated or opened.
vnode
on the free
list, take it off the free list.
vnode
with the value
0xdeadb which is special-cased in
vgone()
, making sure it does not get get
reordered on the free list.
vnode
from the free list was not
marked as "BAD
", indicating it has
been disassociated from the underlying file system, call
vgone()
on it to release any associated file
system data. This has the side effect of releasing the
interlock on the vnode
. If it had already
been disassociated from the underlying file system, just
release the interlock acquired above.
vnode
,
clear out fields that might have been used previously.
vnode
type to
"NONE
", indicating it is not
currently associated with any specific file type.
cache_purge()
on the
vnode
, ensuring that any data previously
associated with it is invalidated. Then use the arguments
passed into getnewvnode()
to associate the
vnode
with the specified file system.
insmntque()
to insert the
vnode
onto the specified mount queue and set
the reference count to one. Return success to the calling
function.