diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/filesystems/orangefs.txt | 79 |
1 files changed, 66 insertions, 13 deletions
diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt index 925a53e52097..e1a0056a365f 100644 --- a/Documentation/filesystems/orangefs.txt +++ b/Documentation/filesystems/orangefs.txt @@ -221,18 +221,71 @@ contains the "downcall" which expresses the results of the request. The slab allocator is used to keep a cache of op structures handy. -The life cycle of a typical op goes like this: - - - obtain and initialize an op structure from the op_cache. - - - queue the op to the pvfs device so that its upcall data can be - read by userspace. - - - wait for userspace to write downcall data back to the pvfs device. - - - consume the downcall and return the op struct to the op_cache. - -Some ops are atypical with respect to their payloads: readdir and io ops. +At init time the kernel module defines and initializes a request list +and an in_progress hash table to keep track of all the ops that are +in flight at any given time. + +Ops are stateful: + + * unknown - op was just initialized + * waiting - op is on request_list (upward bound) + * inprogr - op is in progress (waiting for downcall) + * serviced - op has matching downcall; ok + * purged - op has to start a timer since client-core + exited uncleanly before servicing op + * given up - submitter has given up waiting for it + +When some arbitrary userspace program needs to perform a +filesystem operation on Orangefs (readdir, I/O, create, whatever) +an op structure is initialized and tagged with a distinguishing ID +number. The upcall part of the op is filled out, and the op is +passed to the "service_operation" function. + +Service_operation changes the op's state to "waiting", puts +it on the request list, and signals the Orangefs file_operations.poll +function through a wait queue. Userspace is polling the pseudo-device +and thus becomes aware of the upcall request that needs to be read. + +When the Orangefs file_operations.read function is triggered, the +request list is searched for an op that seems ready-to-process. +The op is removed from the request list. The tag from the op and +the filled-out upcall struct are copy_to_user'ed back to userspace. + +If any of these (and some additional protocol) copy_to_users fail, +the op's state is set to "waiting" and the op is added back to +the request list. Otherwise, the op's state is changed to "in progress", +and the op is hashed on its tag and put onto the end of a list in the +in_progress hash table at the index the tag hashed to. + +When userspace has assembled the response to the upcall, it +writes the response, which includes the distinguishing tag, back to +the pseudo device in a series of io_vecs. This triggers the Orangefs +file_operations.write_iter function to find the op with the associated +tag and remove it from the in_progress hash table. As long as the op's +state is not "canceled" or "given up", its state is set to "serviced". +The file_operations.write_iter function returns to the waiting vfs, +and back to service_operation through wait_for_matching_downcall. + +Service operation returns to its caller with the op's downcall +part (the response to the upcall) filled out. + +The "client-core" is the bridge between the kernel module and +userspace. The client-core is a daemon. The client-core has an +associated watchdog daemon. If the client-core is ever signaled +to die, the watchdog daemon restarts the client-core. Even though +the client-core is restarted "right away", there is a period of +time during such an event that the client-core is dead. A dead client-core +can't be triggered by the Orangefs file_operations.poll function. +Ops that pass through service_operation during a "dead spell" can timeout +on the wait queue and one attempt is made to recycle them. Obviously, +if the client-core stays dead too long, the arbitrary userspace processes +trying to use Orangefs will be negatively affected. Waiting ops +that can't be serviced will be removed from the request list and +have their states set to "given up". In-progress ops that can't +be serviced will be removed from the in_progress hash table and +have their states set to "given up". + +Readdir and I/O ops are atypical with respect to their payloads. - readdir ops use the smaller of the two pre-allocated pre-partitioned memory buffers. The readdir buffer is only available to userspace. @@ -311,7 +364,7 @@ particular response. jamb everything needed to represent a pvfs2_readdir_response_t into the readdir buffer descriptor specified in the upcall. -writev() on /dev/pvfs2-req is used to pass responses to the requests +Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests made by the kernel side. A buffer_list containing: |