summaryrefslogtreecommitdiffstats
path: root/fs/dlm (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Treewide: Stop corrupting socket's task_fragBenjamin Coddington2022-12-201-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the GFP_NOIO flag on sk_allocation which the networking system uses to decide when it is safe to use current->task_frag. The results of this are unexpected corruption in task_frag when SUNRPC is involved in memory reclaim. The corruption can be seen in crashes, but the root cause is often difficult to ascertain as a crashing machine's stack trace will have no evidence of being near NFS or SUNRPC code. I believe this problem to be much more pervasive than reports to the community may indicate. Fix this by having kernel users of sockets that may corrupt task_frag due to reclaim set sk_use_task_frag = false. Preemptively correcting this situation for users that still set sk_allocation allows them to convert to memalloc_nofs_save/restore without the same unexpected corruptions that are sure to follow, unlikely to show up in testing, and difficult to bisect. CC: Philipp Reisner <philipp.reisner@linbit.com> CC: Lars Ellenberg <lars.ellenberg@linbit.com> CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com> CC: Jens Axboe <axboe@kernel.dk> CC: Josef Bacik <josef@toxicpanda.com> CC: Keith Busch <kbusch@kernel.org> CC: Christoph Hellwig <hch@lst.de> CC: Sagi Grimberg <sagi@grimberg.me> CC: Lee Duncan <lduncan@suse.com> CC: Chris Leech <cleech@redhat.com> CC: Mike Christie <michael.christie@oracle.com> CC: "James E.J. Bottomley" <jejb@linux.ibm.com> CC: "Martin K. Petersen" <martin.petersen@oracle.com> CC: Valentina Manea <valentina.manea.m@gmail.com> CC: Shuah Khan <shuah@kernel.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: David Howells <dhowells@redhat.com> CC: Marc Dionne <marc.dionne@auristor.com> CC: Steve French <sfrench@samba.org> CC: Christine Caulfield <ccaulfie@redhat.com> CC: David Teigland <teigland@redhat.com> CC: Mark Fasheh <mark@fasheh.com> CC: Joel Becker <jlbec@evilplan.org> CC: Joseph Qi <joseph.qi@linux.alibaba.com> CC: Eric Van Hensbergen <ericvh@gmail.com> CC: Latchesar Ionkov <lucho@ionkov.net> CC: Dominique Martinet <asmadeus@codewreck.org> CC: Ilya Dryomov <idryomov@gmail.com> CC: Xiubo Li <xiubli@redhat.com> CC: Chuck Lever <chuck.lever@oracle.com> CC: Jeff Layton <jlayton@kernel.org> CC: Trond Myklebust <trond.myklebust@hammerspace.com> CC: Anna Schumaker <anna@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* fs: dlm: fix building without lockdepAlexander Aring2022-11-221-1/+5
| | | | | | | | | | | | | | This patch uses assert_spin_locked() instead of lockdep_is_held() where it's available to use because lockdep_is_held() is only available if CONFIG_LOCKDEP is set. In other cases like lockdep_sock_is_held() we surround it by a CONFIG_LOCKDEP idef. Fixes: dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: parallelize lowcomms socket handlingAlexander Aring2022-11-213-484/+586
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is rework of lowcomms handling, the main goal was here to handle recvmsg() and sendpage() to run parallel. Parallel in two senses: 1. per connection and 2. that recvmsg()/sendpage() doesn't block each other. Currently recvmsg()/sendpage() cannot run parallel because two workqueues "dlm_recv" and "dlm_send" are ordered workqueues. That means only one work item can be executed. The amount of queue items will be increased about the amount of nodes being inside the cluster. The current two workqueues for sending and receiving can also block each other if the same connection is executed at the same time in dlm_recv and dlm_send workqueue because a per connection mutex for the socket handling. To make it more parallel we introduce one "dlm_io" workqueue which is not an ordered workqueue, the amount of workers are not limited. Due per connection flags SEND/RECV pending we schedule workers ordered per connection and per send and receive task. To get rid of the mutex blocking same workers to do socket handling we switched to a semaphore which handles socket operations as read lock and sock releases as write operations, to prevent sock_release() being called while the socket is being used. There might be more optimization removing the semaphore and replacing it with other synchronization mechanism, however due other circumstances e.g. othercon behaviour it seems complicated to doing this change. I added comments to remove the othercon handling and moving to a different synchronization mechanism as this is done. We need to do that to the next dlm major version upgrade because it is not backwards compatible with the current connect mechanism. The processing of dlm messages need to be still handled by a ordered workqueue. An dlm_process ordered workqueue was introduced which gets filled by the receive worker. This is probably the next bottleneck of DLM but the application can't currently parse dlm messages parallel. A comment was introduced to lift the workqueue context of dlm processing in a non-sleepable softirq to get messages processing done fast. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: don't init error valueAlexander Aring2022-11-211-1/+1
| | | | | | | | This patch removes a init of an error value to -EINVAL which is not necessary. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use saved sk_error_report()Alexander Aring2022-11-211-5/+1
| | | | | | | | | | This patch changes the handling of calling the original sk_error_report() by not putting it on the stack and calling it later. If the listen_sock.sk_error_report() is NULL in this moment it indicates a bug in our implementation. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use sock2con without checking nullAlexander Aring2022-11-211-13/+4
| | | | | | | | | This patch removes null checks on private data for sockets. If we have a null dereference there we having a bug in our implementation that such callback occurs in this state. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove dlm_node_addrs lookup listAlexander Aring2022-11-211-154/+136
| | | | | | | | | | | | | | | This patch merges the dlm_node_addrs lookup list to the connection structure. It is a per node mapping to some configuration setup by configfs. We don't need two lookup structures. The connection hash has now a lifetime like the dlm_node_addrs entries. Means we add only new entries when configure cluster and not while new connections are coming in, remove connection when a node got fenced and cleanup all connection when the dlm exits. It should work the same and even will show more issues because we don't try to somehow keep those two data structures in sync with the current cluster configuration. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: don't put dlm_local_addrs on heapAlexander Aring2022-11-211-26/+12
| | | | | | | | | This patch removes to allocate the dlm_local_addr[] pointers on the heap. Instead we directly store the type of "struct sockaddr_storage". This removes function deinit_local() because it was freeing memory only. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: cleanup listen sock handlingAlexander Aring2022-11-211-34/+17
| | | | | | | | | | | | | This patch removes save_listen_callbacks() and add_listen_sock() as they are only used once in lowcomms functionality. For shutdown lowcomms it's not necessary to whole flush the workqueues to synchronize with restoring the old sk_data_ready() callback. Only the listen con receive work need to be cancelled. For each individual node shutdown we should be sure that last ack was been transmitted which is done by flushing per connection swork worker. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove socket shutdown handlingAlexander Aring2022-11-213-107/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") we have functionality like TCP offers for half-closed sockets on dlm application protocol layer. This feature is required because the cluster manager events about leaving resource memberships can be locally already occurred but other cluster nodes having a pending leaving membership over the cluster manager protocol happening. In this time the local dlm node already shutdown it's connection and don't transmit anymore any new dlm messages, but however it still needs to be able to accept dlm messages because the pending leave membership request of the cluster manager protocol which the dlm kernel implementation has no control about it. We have this functionality on the application for two reasons, the main reason is that SCTP does not support such functionality on socket layer. But we can do it inside application layer. Another small issue is that this feature is broken in the TCP world because some NAT devices does not implement such functionality correctly. This is the same reason why the reliable connection session layer in DLM exists. We give up on middle devices in the networking which sends e.g. TCP resets out. In DLM we cannot have any message dropping and we ensure it over a session layer that it can't happen. Back to the half-closed grace shutdown handling. It's not necessary anymore to do it on socket layer (which is only support for TCP sockets) because we do it on application layer. This patch removes this handling, if there are still issues then we have a problem on the application layer for such handling. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use listen sock as dlm running indicatorAlexander Aring2022-11-213-15/+10
| | | | | | | | | | | This patch will switch from dlm_allow_conn to check if dlm lowcomms is running or not to if we actually have a listen socket set or not. The list socket will be set and unset in lowcomms start and shutdown functionality. To synchronize with data_ready() callback we will set the socket callback to NULL while socket lock is held. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use list_first_entry_or_nullAlexander Aring2022-11-211-6/+3
| | | | | | | | Instead of check on list_empty() we can do the same with list_first_entry_or_null() and return NULL if the returned value is NULL. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove twice INIT_WORKAlexander Aring2022-11-211-1/+0
| | | | | | | | | This patch removed a twice INIT_WORK() functionality. We already doing this inside of dlm_lowcomms_init() functionality which is called only once dlm is loaded. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: add midcomms init/start functionsAlexander Aring2022-11-216-12/+37
| | | | | | | | | | | | This patch introduces leftovers of init, start, stop and exit functionality. The dlm application layer should always call the midcomms layer which getting aware of such event and redirect it to the lowcomms layer. Some functionality which is currently handled inside the start functionality of midcomms and lowcomms should be handled in the init functionality as it only need to be initialized once when dlm is loaded. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: add dst nodeid for msg tracingAlexander Aring2022-11-211-4/+6
| | | | | | | | | | | | | | | | | In DLM when we send a dlm message it is easy to add the lock resource name, but additional lookup is required when to trace the receive message side. The idea here is to move the lookup work to the user by using a lookup to find the right send message with recv message. As note DLM can't drop any message which is guaranteed by a special session layer. For doing the lookup a 3 tupel is required as an unique identification which is dst nodeid, src nodeid and sequence number. This patch adds the destination nodeid to the dlm message trace points. The source nodeid is given by the h_nodeid field inside the header. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: rename DLM_IFL_NEED_SCHED to DLM_IFL_CB_PENDINGAlexander Aring2022-11-213-8/+6
| | | | | | | | | | | | | | This patch renames DLM_IFL_NEED_SCHED to DLM_IFL_CB_PENDING because CB_PENDING is a proper name to describe this flag. This flag is set when callback enqueue will return DLM_ENQUEUE_CALLBACK_NEED_SCHED because the callback worker need to be queued. The flag tells that callbacks are currently pending to be called and will be unset if the callback work for the specific lkb is done. The term need schedule is part of this time but a proper name is to say that there are some callbacks pending to being called. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: ast do WARN_ON_ONCE() on hotpathAlexander Aring2022-11-212-7/+7
| | | | | | | | | | This patch changes the ast hotpath functionality in very unlikely cases that we do WARN_ON_ONCE() instead of WARN_ON() to not spamming the console output if we run into states that it would occur over and over again. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: drop lkb ref in bug caseAlexander Aring2022-11-211-1/+2
| | | | | | | | | This patch will drop the lkb reference in an very unlikely case which should in practice not happened. However if it happens we cleanup the reference just in case. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: avoid false-positive checker warningAlexander Aring2022-11-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch avoid the false-positive checker warning about writing 112 bytes into a 88 bytes field "e->request", see: [ 54.891560] dlm: csmb1: dlm_recover_directory 23 out 2 messages [ 54.990542] ------------[ cut here ]------------ [ 54.991012] memcpy: detected field-spanning write (size 112) of single field "&e->request" at fs/dlm/requestqueue.c:47 (size 88) [ 54.992150] WARNING: CPU: 0 PID: 297 at fs/dlm/requestqueue.c:47 dlm_add_requestqueue+0x177/0x180 [ 54.993002] CPU: 0 PID: 297 Comm: kworker/u4:3 Not tainted 6.1.0-rc5-00008-ge01d50cbd6ee #248 [ 54.993878] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014 [ 54.994718] Workqueue: dlm_recv process_recv_sockets [ 54.995230] RIP: 0010:dlm_add_requestqueue+0x177/0x180 [ 54.995731] Code: e7 01 0f 85 3b ff ff ff b9 58 00 00 00 48 c7 c2 c0 41 74 82 4c 89 ee 48 c7 c7 20 42 74 82 c6 05 8b 8d 30 02 01 e8 51 07 be 00 <0f> 0b e9 12 ff ff ff 66 90 0f 1f 44 00 00 41 57 48 8d 87 10 08 00 [ 54.997483] RSP: 0018:ffffc90000b1fbe8 EFLAGS: 00010282 [ 54.997990] RAX: 0000000000000000 RBX: ffff888024fc3d00 RCX: 0000000000000000 [ 54.998667] RDX: 0000000000000001 RSI: ffffffff81155014 RDI: fffff52000163f73 [ 54.999342] RBP: ffff88800dbac000 R08: 0000000000000001 R09: ffffc90000b1fa5f [ 54.999997] R10: fffff52000163f4b R11: 203a7970636d656d R12: ffff88800cfb0018 [ 55.000673] R13: 0000000000000070 R14: ffff888024fc3d18 R15: 0000000000000000 [ 55.001344] FS: 0000000000000000(0000) GS:ffff88806d600000(0000) knlGS:0000000000000000 [ 55.002078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.002603] CR2: 00007f35d4f0b9a0 CR3: 0000000025495002 CR4: 0000000000770ef0 [ 55.003258] PKRU: 55555554 [ 55.003514] Call Trace: [ 55.003756] <TASK> [ 55.003953] dlm_receive_buffer+0x1c0/0x200 [ 55.004348] dlm_process_incoming_buffer+0x46d/0x780 [ 55.004786] ? kernel_recvmsg+0x8b/0xc0 [ 55.005150] receive_from_sock.isra.0+0x168/0x420 [ 55.005582] ? process_listen_recv_socket+0x10/0x10 [ 55.006018] ? finish_task_switch.isra.0+0xe0/0x400 [ 55.006469] ? __switch_to+0x2fe/0x6a0 [ 55.006808] ? read_word_at_a_time+0xe/0x20 [ 55.007197] ? strscpy+0x146/0x190 [ 55.007505] process_one_work+0x3d0/0x6b0 [ 55.007863] worker_thread+0x8d/0x620 [ 55.008209] ? __kthread_parkme+0xd8/0xf0 [ 55.008565] ? process_one_work+0x6b0/0x6b0 [ 55.008937] kthread+0x171/0x1a0 [ 55.009251] ? kthread_exit+0x60/0x60 [ 55.009582] ret_from_fork+0x1f/0x30 [ 55.009903] </TASK> [ 55.010120] ---[ end trace 0000000000000000 ]--- [ 55.025783] dlm: csmb1: dlm_recover 5 generation 3 done: 201 ms [ 55.026466] gfs2: fsid=smbcluster:csmb1.0: recover generation 3 done It seems the checker is unable to detect the additional length bytes which was allocated additionally for the flexible array in struct dlm_message. To solve it we split the memcpy() into copy for the 88 bytes struct and another memcpy() for the flexible array m_extra field. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use WARN_ON_ONCE() instead of WARN_ON()Alexander Aring2022-11-081-9/+9
| | | | | | | | | To not get the console spammed about WARN_ON() of invalid states in the dlm midcomms hot path handling we switch to WARN_ON_ONCE() to get it only once that there might be an issue with the midcomms state handling. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: fix log of lowcomms vs midcommsAlexander Aring2022-11-081-1/+1
| | | | | | | | | | This patch will fix a small issue when printing out that dlm_midcomms_start() failed to start and it was printing out that the dlm subcomponent lowcomms was failed but lowcomms is behind the midcomms layer. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: catch dlm_add_member() errorAlexander Aring2022-11-081-1/+4
| | | | | | | | This patch will catch a possible dlm_add_member() and delivers it to the dlm recovery handling. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: relax sending to allow receivingAlexander Aring2022-11-081-5/+10
| | | | | | | | | | | | | | | This patch drops additionally the sock_mutex when there is a sending message burst. Since we have acknowledge handling we free sending buffers only when we receive an ack back, but if we are stuck in send_to_sock() looping because dlm sends a lot of messages and we never leave the loop the sending buffer fill up very quickly. We can't receive during this iteration because the sock_mutex is held. This patch will unlock the sock_mutex so it should be possible to receive messages when a burst of sending messages happens. This will allow to free up memory because acks which are already received can be processed. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove ls_remove_wait waitqueueAlexander Aring2022-11-083-61/+2
| | | | | | | | | | | | | | | | | | | | | | This patch removes the ls_remove_wait waitqueue handling. The current handling tries to wait before a lookup is send out for a identically resource name which is going to be removed. Hereby the remove message should be send out before the new lookup message. The reason is that after a lookup request and response will actually use the specific remote rsb. A followed remove message would delete the rsb on the remote side but it's still being used. To reach a similar behaviour we simple send the remove message out while the rsb lookup lock is held and the rsb is removed from the toss list. Other find_rsb() calls would never have the change to get a rsb back to live while a remove message will be send out (without holding the lock). This behaviour requires a non-sleepable context which should be provided now and might be the reason why it was not implemented so in the first place. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: allow different allocation context per _create_messageAlexander Aring2022-11-084-16/+23
| | | | | | | | | This patch allows to give the use control about the allocation context based on a per message basis. Currently all messages forced to be created under GFP_NOFS context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use a non-static queue for callbacksAlexander Aring2022-11-089-217/+222
| | | | | | | | | | | | | | | | | | | | | | | | This patch will introducde a queue implementation for callbacks by using the Linux lists. The current callback queue handling is implemented by a static limit of 6 entries, see DLM_CALLBACKS_SIZE. The sequence number inside the callback structure was used to see if the entries inside the static entry is valid or not. We don't need any sequence numbers anymore with a dynamic datastructure with grows and shrinks during runtime to offer such functionality. We assume that every callback will be delivered to the DLM user if once queued. Therefore the callback flag DLM_CB_SKIP was dropped and the check for skipping bast was moved before worker handling and not skip while the callback worker executes. This will reduce unnecessary queues of the callback worker. All last callback saves are pointers now and don't need to copied over. There is a reference counter for callback structures which will care about to free the callback structures at the right time if they are not referenced anymore. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: move last cast bast time to function callAlexander Aring2022-11-081-6/+4
| | | | | | | | This patch moves the debugging information of the last cast and bast time when calling the last and bast function call. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use spin lock instead of mutexAlexander Aring2022-11-083-6/+6
| | | | | | | | | There is no need to use a mutex in those hot path sections. We change it to spin lock to serve callbacks more faster by not allowing schedule. The locked sections will not be locked for a long time. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: convert ls_cb_mutex mutex to spinlockAlexander Aring2022-11-083-8/+8
| | | | | | | | This patch converts the ls_cb_mutex mutex to a spinlock, there is no sleepable context when this lock is held. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use list_first_entry marcoAlexander Aring2022-11-081-1/+1
| | | | | | | | Instead of using list_entry() this patch moves to using the list_first_entry() macro. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: let dlm_add_cb queue work after resume onlyAlexander Aring2022-11-081-2/+2
| | | | | | | | | We should allow dlm_add_cb() to call queue_work() only after the recovery queued pending for delayed lkbs. This patch will move the switch LSFL_CB_DELAY after the delayed lkb work was processed. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fd: dlm: trace send/recv of dlm message and rcomAlexander Aring2022-11-084-17/+56
| | | | | | | | | | | | | | | | | | | | | | This patch adds tracepoints for send and recv cases of dlm messages and dlm rcom messages. In case of send and dlm message we add the dlm rsb resource name this dlm messages belongs to. This has the advantage to follow dlm messages on a per lock basis. In case of recv message the resource name can be extracted by follow the send message sequence number. The dlm message DLM_MSG_PURGE doesn't belong to a lock request and will not set the resource name in a dlm_message trace. The same for all rcom messages. There is additional handling required for this debugging functionality which is tried to be small as possible. Also the midcomms layer gets aware of lock resource names, for now this is required to make a connection between sequence number and lock resource names. It is for debugging purpose only. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: use packet in dlm_mhandleAlexander Aring2022-11-081-3/+3
| | | | | | | | | To allow more than just dereferencing the inner header we directly point to the inner dlm packet which allows us to dereference the header, rcom or message structure. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: remove send repeat remove handlingAlexander Aring2022-11-081-74/+0
| | | | | | | | | | | | | | | | This patch removes the send repeat remove handling. This handling is there to repeatingly DLM_MSG_REMOVE messages in cases the dlm stack thinks it was not received at the first time. In cases of message drops this functionality is necessary, but since the DLM midcomms layer guarantees there are no messages drops between cluster nodes this feature became not strict necessary anymore. Due message delays/processing it could be that two send_repeat_remove() are sent out while the other should be still on it's way. We remove the repeat remove handling because we are sure that the message cannot be dropped due communication errors. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: retry accept() until -EAGAIN or error returnsAlexander Aring2022-11-081-1/+5
| | | | | | | | | | | | | | | | This patch fixes a race if we get two times an socket data ready event while the listen connection worker is queued. Currently it will be served only once but we need to do it (in this case twice) until we hit -EAGAIN which tells us there is no pending accept going on. This patch wraps an do while loop until we receive a return value which is different than 0 as it was done before commit d11ccd451b65 ("fs: dlm: listen socket out of connection hash"). Cc: stable@vger.kernel.org Fixes: d11ccd451b65 ("fs: dlm: listen socket out of connection hash") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* fs: dlm: fix sock release if listen failsAlexander Aring2022-11-081-2/+1
| | | | | | | | | | | | This patch fixes a double sock_release() call when the listen() is called for the dlm lowcomms listen socket. The caller of dlm_listen_for_all should never care about releasing the socket if dlm_listen_for_all() fails, it's done now only once if listen() fails. Cc: stable@vger.kernel.org Fixes: 2dc6b1158c28 ("fs: dlm: introduce generic listen") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* dlm: replace one-element array with fixed size arrayPaulo Miguel Almeida2022-11-082-2/+2
| | | | | | | | | | | | | | | | One-element arrays are deprecated. So, replace one-element array with fixed size array member in struct dlm_ls, and refactor the rest of the code, accordingly. Link: https://github.com/KSPP/linux/issues/79 Link: https://github.com/KSPP/linux/issues/228 Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101836 Link: https://lore.kernel.org/lkml/Y0W5jkiXUkpNl4ap@mail.google.com/ Signed-off-by: Paulo Miguel Almeida <paulo.miguel.almeida.rodenas@gmail.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Teigland <teigland@redhat.com>
* Merge tag 'net-next-6.1' of ↵Linus Torvalds2022-10-041-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Introduce and use a single page frag cache for allocating small skb heads, clawing back the 10-20% performance regression in UDP flood test from previous fixes. - Run packets which already went thru HW coalescing thru SW GRO. This significantly improves TCP segment coalescing and simplifies deployments as different workloads benefit from HW or SW GRO. - Shrink the size of the base zero-copy send structure. - Move TCP init under a new slow / sleepable version of DO_ONCE(). BPF: - Add BPF-specific, any-context-safe memory allocator. - Add helpers/kfuncs for PKCS#7 signature verification from BPF programs. - Define a new map type and related helpers for user space -> kernel communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF). - Allow targeting BPF iterators to loop through resources of one task/thread. - Add ability to call selected destructive functions. Expose crash_kexec() to allow BPF to trigger a kernel dump. Use CAP_SYS_BOOT check on the loading process to judge permissions. - Enable BPF to collect custom hierarchical cgroup stats efficiently by integrating with the rstat framework. - Support struct arguments for trampoline based programs. Only structs with size <= 16B and x86 are supported. - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping sockets (instead of just TCP and UDP sockets). - Add a helper for accessing CLOCK_TAI for time sensitive network related programs. - Support accessing network tunnel metadata's flags. - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open. - Add support for writing to Netfilter's nf_conn:mark. Protocols: - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation (MLO) work (802.11be, WiFi 7). - vsock: improve support for SO_RCVLOWAT. - SMC: support SO_REUSEPORT. - Netlink: define and document how to use netlink in a "modern" way. Support reporting missing attributes via extended ACK. - IPSec: support collect metadata mode for xfrm interfaces. - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST packets. - TCP: introduce optional per-netns connection hash table to allow better isolation between namespaces (opt-in, at the cost of memory and cache pressure). - MPTCP: support TCP_FASTOPEN_CONNECT. - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior. - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets. - Open vSwitch: - Allow specifying ifindex of new interfaces. - Allow conntrack and metering in non-initial user namespace. - TLS: support the Korean ARIA-GCM crypto algorithm. - Remove DECnet support. Driver API: - Allow selecting the conduit interface used by each port in DSA switches, at runtime. - Ethernet Power Sourcing Equipment and Power Device support. - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per traffic class max frame size for time-based packet schedules. - Support PHY rate matching - adapting between differing host-side and link-side speeds. - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode. - Validate OF (device tree) nodes for DSA shared ports; make phylink-related properties mandatory on DSA and CPU ports. Enforcing more uniformity should allow transitioning to phylink. - Require that flash component name used during update matches one of the components for which version is reported by info_get(). - Remove "weight" argument from driver-facing NAPI API as much as possible. It's one of those magic knobs which seemed like a good idea at the time but is too indirect to use in practice. - Support offload of TLS connections with 256 bit keys. New hardware / drivers: - Ethernet: - Microchip KSZ9896 6-port Gigabit Ethernet Switch - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs - Analog Devices ADIN1110 and ADIN2111 industrial single pair Ethernet (10BASE-T1L) MAC+PHY. - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP). - Ethernet SFPs / modules: - RollBall / Hilink / Turris 10G copper SFPs - HALNy GPON module - WiFi: - CYW43439 SDIO chipset (brcmfmac) - CYW89459 PCIe chipset (brcmfmac) - BCM4378 on Apple platforms (brcmfmac) Drivers: - CAN: - gs_usb: HW timestamp support - Ethernet PHYs: - lan8814: cable diagnostics - Ethernet NICs: - Intel (100G): - implement control of FCS/CRC stripping - port splitting via devlink - L2TPv3 filtering offload - nVidia/Mellanox: - tunnel offload for sub-functions - MACSec offload, w/ Extended packet number and replay window offload - significantly restructure, and optimize the AF_XDP support, align the behavior with other vendors - Huawei: - configuring DSCP map for traffic class selection - querying standard FEC statistics - querying SerDes lane number via ethtool - Marvell/Cavium: - egress priority flow control - MACSec offload - AMD/SolarFlare: - PTP over IPv6 and raw Ethernet - small / embedded: - ax88772: convert to phylink (to support SFP cages) - altera: tse: convert to phylink - ftgmac100: support fixed link - enetc: standard Ethtool counters - macb: ZynqMP SGMII dynamic configuration support - tsnep: support multi-queue and use page pool - lan743x: Rx IP & TCP checksum offload - igc: add xdp frags support to ndo_xdp_xmit - Ethernet high-speed switches: - Marvell (prestera): - support SPAN port features (traffic mirroring) - nexthop object offloading - Microchip (sparx5): - multicast forwarding offload - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets) - Ethernet embedded switches: - Marvell (mv88e6xxx): - support RGMII cmode - NXP (felix): - standardized ethtool counters - Microchip (lan966x): - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets) - traffic policing and mirroring - link aggregation / bonding offload - QUSGMII PHY mode support - Qualcomm 802.11ax WiFi (ath11k): - cold boot calibration support on WCN6750 - support to connect to a non-transmit MBSSID AP profile - enable remain-on-channel support on WCN6750 - Wake-on-WLAN support for WCN6750 - support to provide transmit power from firmware via nl80211 - support to get power save duration for each client - spectral scan support for 160 MHz - MediaTek WiFi (mt76): - WiFi-to-Ethernet bridging offload for MT7986 chips - RealTek WiFi (rtw89): - P2P support" * tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits) eth: pse: add missing static inlines once: rename _SLOW to _SLEEPABLE net: pse-pd: add regulator based PSE driver dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller ethtool: add interface to interact with Ethernet Power Equipment net: mdiobus: search for PSE nodes by parsing PHY nodes. net: mdiobus: fwnode_mdiobus_register_phy() rework error handling net: add framework to support Ethernet PSE and PDs devices dt-bindings: net: phy: add PoDL PSE property net: marvell: prestera: Propagate nh state from hw to kernel net: marvell: prestera: Add neighbour cache accounting net: marvell: prestera: add stub handler neighbour events net: marvell: prestera: Add heplers to interact with fib_notifier_info net: marvell: prestera: Add length macros for prestera_ip_addr net: marvell: prestera: add delayed wq and flush wq on deinit net: marvell: prestera: Add strict cleanup of fib arbiter net: marvell: prestera: Add cleanup of allocated fib_nodes net: marvell: prestera: Add router nexthops ABI eth: octeon: fix build after netif_napi_add() changes net/mlx5: E-Switch, Return EBUSY if can't get mode lock ...
| * genetlink: start to validate reserved header bytesJakub Kicinski2022-08-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We had historically not checked that genlmsghdr.reserved is 0 on input which prevents us from using those precious bytes in the future. One use case would be to extend the cmd field, which is currently just 8 bits wide and 256 is not a lot of commands for some core families. To make sure that new families do the right thing by default put the onus of opting out of validation on existing families. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel) Signed-off-by: David S. Miller <davem@davemloft.net>
* | fs: dlm: fix possible use after free if tracingAlexander Aring2022-09-261-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a possible use after free if tracing for the specific event is enabled. To avoid the use after free we introduce a out_put label like all other user lock specific requests and safe in a boolean to do a put or not which depends on the execution path of dlm_user_request(). Cc: stable@vger.kernel.org Fixes: 7a3de7324c2b ("fs: dlm: trace user space callbacks") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: const void resource name parameterAlexander Aring2022-08-232-11/+14
| | | | | | | | | | | | | | | | | | | | The resource name parameter should never be changed by DLM so we declare it as const. At some point it is handled as a char pointer, a resource name can be a non printable ascii string as well. This patch change it to handle it as void pointer as it is offered by DLM API. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: LSFL_CB_DELAY only for kernel lockspacesAlexander Aring2022-08-231-6/+7
| | | | | | | | | | | | | | | | | | This patch only set/clear the LSFL_CB_DELAY bit when it's actually a kernel lockspace signaled by if ls->ls_callback_wq is set or not set in this case. User lockspaces will never evaluate this flag. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: remove DLM_LSFL_FS from uapiAlexander Aring2022-08-233-7/+40
| | | | | | | | | | | | | | | | | | | | | | | | The DLM_LSFL_FS flag is set in lockspaces created directly for a kernel user, as opposed to those lockspaces created for user space applications. The user space libdlm allowed this flag to be set for lockspaces created from user space, but then used by a kernel user. No kernel user has ever used this method, so remove the ability to do it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: trace user space callbacksAlexander Aring2022-08-232-5/+26
| | | | | | | | | | | | | | | | | | | | | | | | This patch adds trace callbacks for user locks. Unfortenately user locks are handled in a different way than kernel locks in some cases. User locks never call the dlm_lock()/dlm_unlock() kernel API and use the next step internal API of dlm. Adding those traces from user API callers should make it possible for dlm trace system to see lock handling for user locks as well. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: change ls_clear_proc_locks to spinlockAlexander Aring2022-08-234-8/+8
| | | | | | | | | | | | | | | | | | | | This patch changes the ls_clear_proc_locks to a spinlock because there is no need to handle it as a mutex as there is no sleepable context when ls_clear_proc_locks is held. This allows us to call those functionality in non-sleepable contexts. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: remove dlm_del_ast prototypeAlexander Aring2022-08-231-1/+0
| | | | | | | | | | | | | | | | This patch removes dlm_del_ast() prototype which is not being used in the dlm subsystem because there is not implementation for it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: handle rcom in else if branchAlexander Aring2022-08-231-1/+4
| | | | | | | | | | | | | | | | | | | | Currently we handle in dlm_receive_buffer() everything else than a DLM_MSG type as DLM_RCOM message. Although a different message than DLM_MSG should be a DLM_RCOM we should explicit check on DLM_RCOM and drop a log_error() if we see something unexpected. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: allow lockspaces have zero lvblenAlexander Aring2022-08-231-1/+1
| | | | | | | | | | | | | | | | A dlm user may not use the DLM_LKF_VALBLK flag in the DLM API, so a zero lvblen should be allowed as a lockspace parameter. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: fix invalid derefence of sb_lvbptrAlexander Aring2022-08-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I experience issues when putting a lkbsb on the stack and have sb_lvbptr field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash with the following kernel message, the dangled pointer is here 0xdeadbeef as example: [ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef [ 102.749320] #PF: supervisor read access in kernel mode [ 102.749323] #PF: error_code(0x0000) - not-present page [ 102.749325] PGD 0 P4D 0 [ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI [ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565 [ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014 [ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10 [ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe [ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202 [ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040 [ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070 [ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001 [ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70 [ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001 [ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000 [ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0 [ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 102.749379] PKRU: 55555554 [ 102.749381] Call Trace: [ 102.749382] <TASK> [ 102.749383] ? send_args+0xb2/0xd0 [ 102.749389] send_common+0xb7/0xd0 [ 102.749395] _unlock_lock+0x2c/0x90 [ 102.749400] unlock_lock.isra.56+0x62/0xa0 [ 102.749405] dlm_unlock+0x21e/0x330 [ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture] [ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture] [ 102.749419] ? preempt_count_sub+0xba/0x100 [ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture] [ 102.786186] kthread+0x10a/0x130 [ 102.786581] ? kthread_complete_and_exit+0x20/0x20 [ 102.787156] ret_from_fork+0x22/0x30 [ 102.787588] </TASK> [ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw [ 102.792536] CR2: 00000000deadbeef [ 102.792930] ---[ end trace 0000000000000000 ]--- This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags is set when copying the lvbptr array instead of if it's just null which fixes for me the issue. I think this patch can fix other dlm users as well, depending how they handle the init, freeing memory handling of sb_lvbptr and don't set DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a hidden issue all the time. However with checking on DLM_LKF_VALBLK the user always need to provide a sb_lvbptr non-null value. There might be more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and non-null to report the user the way how DLM API is used is wrong but can be added for later, this will only fix the current behaviour. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | fs: dlm: handle -EINVAL as log_error()Alexander Aring2022-08-231-2/+30
| | | | | | | | | | | | | | | | | | If the user generates -EINVAL it's probably because they are using DLM incorrectly. Change the log level to make these errors more visible. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>