summaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems/orangefs.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems/orangefs.txt')
-rw-r--r--Documentation/filesystems/orangefs.txt79
1 files changed, 66 insertions, 13 deletions
diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt
index 925a53e52097..e1a0056a365f 100644
--- a/Documentation/filesystems/orangefs.txt
+++ b/Documentation/filesystems/orangefs.txt
@@ -221,18 +221,71 @@ contains the "downcall" which expresses the results of the request.
The slab allocator is used to keep a cache of op structures handy.
-The life cycle of a typical op goes like this:
-
- - obtain and initialize an op structure from the op_cache.
-
- - queue the op to the pvfs device so that its upcall data can be
- read by userspace.
-
- - wait for userspace to write downcall data back to the pvfs device.
-
- - consume the downcall and return the op struct to the op_cache.
-
-Some ops are atypical with respect to their payloads: readdir and io ops.
+At init time the kernel module defines and initializes a request list
+and an in_progress hash table to keep track of all the ops that are
+in flight at any given time.
+
+Ops are stateful:
+
+ * unknown - op was just initialized
+ * waiting - op is on request_list (upward bound)
+ * inprogr - op is in progress (waiting for downcall)
+ * serviced - op has matching downcall; ok
+ * purged - op has to start a timer since client-core
+ exited uncleanly before servicing op
+ * given up - submitter has given up waiting for it
+
+When some arbitrary userspace program needs to perform a
+filesystem operation on Orangefs (readdir, I/O, create, whatever)
+an op structure is initialized and tagged with a distinguishing ID
+number. The upcall part of the op is filled out, and the op is
+passed to the "service_operation" function.
+
+Service_operation changes the op's state to "waiting", puts
+it on the request list, and signals the Orangefs file_operations.poll
+function through a wait queue. Userspace is polling the pseudo-device
+and thus becomes aware of the upcall request that needs to be read.
+
+When the Orangefs file_operations.read function is triggered, the
+request list is searched for an op that seems ready-to-process.
+The op is removed from the request list. The tag from the op and
+the filled-out upcall struct are copy_to_user'ed back to userspace.
+
+If any of these (and some additional protocol) copy_to_users fail,
+the op's state is set to "waiting" and the op is added back to
+the request list. Otherwise, the op's state is changed to "in progress",
+and the op is hashed on its tag and put onto the end of a list in the
+in_progress hash table at the index the tag hashed to.
+
+When userspace has assembled the response to the upcall, it
+writes the response, which includes the distinguishing tag, back to
+the pseudo device in a series of io_vecs. This triggers the Orangefs
+file_operations.write_iter function to find the op with the associated
+tag and remove it from the in_progress hash table. As long as the op's
+state is not "canceled" or "given up", its state is set to "serviced".
+The file_operations.write_iter function returns to the waiting vfs,
+and back to service_operation through wait_for_matching_downcall.
+
+Service operation returns to its caller with the op's downcall
+part (the response to the upcall) filled out.
+
+The "client-core" is the bridge between the kernel module and
+userspace. The client-core is a daemon. The client-core has an
+associated watchdog daemon. If the client-core is ever signaled
+to die, the watchdog daemon restarts the client-core. Even though
+the client-core is restarted "right away", there is a period of
+time during such an event that the client-core is dead. A dead client-core
+can't be triggered by the Orangefs file_operations.poll function.
+Ops that pass through service_operation during a "dead spell" can timeout
+on the wait queue and one attempt is made to recycle them. Obviously,
+if the client-core stays dead too long, the arbitrary userspace processes
+trying to use Orangefs will be negatively affected. Waiting ops
+that can't be serviced will be removed from the request list and
+have their states set to "given up". In-progress ops that can't
+be serviced will be removed from the in_progress hash table and
+have their states set to "given up".
+
+Readdir and I/O ops are atypical with respect to their payloads.
- readdir ops use the smaller of the two pre-allocated pre-partitioned
memory buffers. The readdir buffer is only available to userspace.
@@ -311,7 +364,7 @@ particular response.
jamb everything needed to represent a pvfs2_readdir_response_t into
the readdir buffer descriptor specified in the upcall.
-writev() on /dev/pvfs2-req is used to pass responses to the requests
+Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.
A buffer_list containing: