| Commit message (Collapse) | Author | Files | Lines |
|
There is no xas_pause(&xas) in collapse_file()'s main loop, at the points
where it does xas_unlock_irq(&xas) and then continues.
That would explain why, once two weeks ago and twice yesterday, I have
hit the VM_BUG_ON_PAGE(page != xas_load(&xas), page) since "mm/khugepaged:
fix iteration in collapse_file" removed the xas_set(&xas, index) just
before it: xas.xa_node could be left pointing to a stale node, if there
was concurrent activity on the file which transformed its xarray.
I tried inserting xas_pause()s, but then even bootup crashed on that
VM_BUG_ON_PAGE(): there appears to be a subtle "nextness" implicit in
xas_pause().
xas_next() and xas_pause() are good for use in simple loops, but not in
this one: xas_set() worked well until now, so use xas_set(&xas, index)
explicitly at the head of the loop; and change that VM_BUG_ON_PAGE() not
to need its own xas_set(), and not to interfere with the xa_state (which
would probably stop the crashes from xas_pause(), but I trust that less).
The user-visible effects of this bug (if VM_BUG_ONs are configured out)
would be data loss and data leak - potentially - though in practice I
expect it is more likely that a subsequent check (e.g. on mapping or on
nr_none) would notice an inconsistency, and just abandon the collapse.
Link: https://lore.kernel.org/linux-mm/f18e4b64-3f88-a8ab-56cc-d1f5f9c58d4@google.com/
Fixes: c8a8f3b4a95a ("mm/khugepaged: fix iteration in collapse_file")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Stevens <stevensd@chromium.org>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the new-and-improved attempt at avoiding huge memory load spikes
when the user space boot sequence tries to load hundreds (or even
thousands) of redundant duplicate modules in parallel.
See commit 9828ed3f695a ("module: error out early on concurrent load of
the same module file") for background and an earlier failed attempt that
was reverted.
That earlier attempt just said "concurrently loading the same module is
silly, just open the module file exclusively and return -ETXTBSY if
somebody else is already loading it".
While it is true that concurrent module loads of the same module is
silly, the reason that earlier attempt then failed was that the
concurrently loaded module would often be a prerequisite for another
module.
Thus failing to load the prerequisite would then cause cascading
failures of the other modules, rather than just short-circuiting that
one unnecessary module load.
At the same time, we still really don't want to load the contents of the
same module file hundreds of times, only to then wait for an eventually
successful load, and have everybody else return -EEXIST.
As a result, this takes another approach, and treats concurrent module
loads from the same file as "idempotent" in the inode. So if one module
load is ongoing, we don't start a new one, but instead just wait for the
first one to complete and return the same return value as it did.
So unlike the first attempt, this does not return early: the intent is
not to speed up the boot, but to avoid a thundering herd problem in
allocating memory (both physical and virtual) for a module more than
once.
Also note that this does change behavior: it used to be that when you
had concurrent loads, you'd have one "winner" that would return success,
and everybody else would return -EEXIST.
In contrast, this idempotent logic goes all Oprah on the problem, and
says "You are a winner! And you are a winner! We are ALL winners". But
since there's no possible actual real semantic difference between "you
loaded the module" and "somebody else already loaded the module", this
is more of a feel-good change than an actual honest-to-goodness semantic
change.
Of course, any true Johnny-come-latelies that don't get caught in the
concurrency filter will still return -EEXIST. It's no different from
not even getting a seat at an Oprah taping. That's life.
See the long thread on the kernel mailing list about this all, which
includes some numbers for memory use before and after the patch.
Link: https://lore.kernel.org/lkml/20230524213620.3509138-1-mcgrof@kernel.org/
Reviewed-by: Johan Hovold <johan@kernel.org>
Tested-by: Johan Hovold <johan@kernel.org>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Rudi Heitbaum <rudi@heitbaum..com>
Tested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This will simplify the next step, where we can then key off the inode to
do one idempotent module load.
Let's do the obvious re-organization in one step, and then the new code
in another.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The memory encryption initialization logic was moved from init/main.c
into arch_cpu_finalize_init() in commit 439e17576eb4 ("init, x86: Move
mem_encrypt_init() into arch_cpu_finalize_init()"), but a stale
declaration for the init function was left in <linux/init.h>.
And didn't cause any problems if you had X86_MEM_ENCRYPT enabled, which
apparently everybody involved did have. See also commit 0a9567ac5e6a
("x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build") in this whole
sad saga of conflicting declarations for different situations.
Reported-by: Matthew Wilcox <willy@infradead.org>
Fixes: 439e17576eb4 init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit ca5e863233e8 ("mm/gup: remove vmas parameter from
get_user_pages_remote()") removed the vma argument from GUP handling,
and instead added a helper function (get_user_page_vma_remote()) that
looks it up separately using 'vma_lookup()'. And then converted
existing users that needed a vma to use the helper instead.
However, the helper function intentionally acts exactly like the old
get_user_pages_remote() did, and only fills in 'vma' on successful page
lookup. Fine so far.
However, __access_remote_vm() wants the vma even for the unsuccessful
case, and used to do a
vma = vma_lookup(mm, addr);
explicitly to look it up when the get_user_page() failed.
However, that conversion commit incorrectly removed that vma lookup,
thinking that get_user_page_vma_remote() would have done it. Not so.
So add the vma_lookup() back in.
Fixes: ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()")
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit 6ebe94baa2b9ddf3ccbb7f94df6ab26234532734.
The patch "nios2: Convert __pte_free_tlb() to use ptdescs" was supposed
to go together with a patchset that Vishal Moola had planned taking it
through the mm tree. By just having this patch, all NIOS2 builds are
broken.
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Recently, our friends from bluetooth subsystem reported [1] that after
commit 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD") scm_recv()
helper become unusable in kernel modules (because it uses unexported
pidfd_prepare() API).
We were aware of this issue and workarounded it in a hard way
by commit 97154bcf4d1b ("af_unix: Kconfig: make CONFIG_UNIX bool").
But recently a new functionality was added in the scope of commit
817efd3cad74 ("Bluetooth: hci_sock: Forward credentials to monitor")
and after that bluetooth can't be compiled as a kernel module.
After some discussion in [1] we decided to split scm_recv() into
two helpers, one won't support SCM_PIDFD (used for unix sockets),
and another one will be completely the same as it was before commit
5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD").
Link: https://lore.kernel.org/lkml/CAJqdLrpFcga4n7wxBhsFqPQiN8PKFVr6U10fKcJ9W7AcZn+o6Q@mail.gmail.com/ [1]
Fixes: 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD")
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230627174314.67688-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzkaller hit a WARN_ON_ONCE(!scm->pid) in scm_pidfd_recv().
In unix_stream_read_generic(), if there is no skb in the queue, we could
bail out the do-while loop without calling scm_set_cred():
1. No skb in the queue
2. sk is non-blocking
or
shutdown(sk, RCV_SHUTDOWN) is called concurrently
or
peer calls close()
If the socket is configured with SO_PASSPIDFD, scm_pidfd_recv() would
populate cmsg with garbage emitting the warning.
Let's skip SCM_PIDFD if scm->pid is NULL in scm_pidfd_recv().
Note another way would be skip calling scm_recv() in such cases, but this
caused a regression resulting in commit 9d797ee2dce1 ("Revert "af_unix:
Call scm_recv() only after scm_set_cred()."").
WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_pidfd_recv include/net/scm.h:138 [inline]
WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177
Modules linked in:
CPU: 1 PID: 3245 Comm: syz-executor.1 Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:scm_pidfd_recv include/net/scm.h:138 [inline]
RIP: 0010:scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177
Code: 67 fd e9 55 fd ff ff e8 4a 70 67 fd e9 7f fd ff ff e8 40 70 67 fd e9 3e fb ff ff e8 36 70 67 fd e9 02 fd ff ff e8 8c 3a 20 fd <0f> 0b e9 fe fb ff ff e8 50 70 67 fd e9 2e f9 ff ff e8 46 70 67 fd
RSP: 0018:ffffc90009af7660 EFLAGS: 00010216
RAX: 00000000000000a1 RBX: ffff888041e58a80 RCX: ffffc90003852000
RDX: 0000000000040000 RSI: ffffffff842675b4 RDI: 0000000000000007
RBP: ffffc90009af7810 R08: 0000000000000007 R09: 0000000000000013
R10: 00000000000000f8 R11: 0000000000000001 R12: ffffc90009af7db0
R13: 0000000000000000 R14: ffff888041e58a88 R15: 1ffff9200135eecc
FS: 00007f6b7113f640(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b7111de38 CR3: 0000000012a6e002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
unix_stream_read_generic+0x5fe/0x1f50 net/unix/af_unix.c:2830
unix_stream_recvmsg+0x194/0x1c0 net/unix/af_unix.c:2880
sock_recvmsg_nosec net/socket.c:1019 [inline]
sock_recvmsg+0x188/0x1d0 net/socket.c:1040
____sys_recvmsg+0x210/0x610 net/socket.c:2712
___sys_recvmsg+0xff/0x190 net/socket.c:2754
do_recvmmsg+0x25d/0x6c0 net/socket.c:2848
__sys_recvmmsg net/socket.c:2927 [inline]
__do_sys_recvmmsg net/socket.c:2950 [inline]
__se_sys_recvmmsg net/socket.c:2943 [inline]
__x64_sys_recvmmsg+0x224/0x290 net/socket.c:2943
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3f/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f6b71da2e5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007f6b7113ecc8 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007f6b71da2e5d
RDX: 0000000000000007 RSI: 0000000020006600 RDI: 000000000000000b
RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000120 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f6b71e03530 R15: 0000000000000000
</TASK>
Fixes: 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230627174314.67688-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Simplify comparison, no functional changes.
Cc: Bryan Whitehead <bryan.whitehead@microchip.com>
Cc: UNGLinuxDriver@microchip.com
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Moritz Fischer <moritzf@google.com>
Link: https://lore.kernel.org/r/20230627035432.1296760-1-moritzf@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
It feels very unlikely that anybody would want to do a GUP in an
unmapped area under the stack pointer, but real users sometimes do some
really strange things. So add a (temporary) warning for the case where
a GUP fails and expanding the stack might have made it work.
It's trivial to do the expansion in the caller as part of getting the mm
lock in the first place - see __access_remote_vm() for ptrace, for
example - it's just that it's unnecessarily painful to do it deep in the
guts of the GUP lookup when we might have to drop and re-take the lock.
I doubt anybody actually does anything quite this strange, but let's be
proactive: adding these warnings is simple, and will make debugging it
much easier if they trigger.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
syzbot reported a warning in __local_bh_enable_ip(). [0]
Commit 8d61f926d420 ("netlink: fix potential deadlock in
netlink_set_err()") converted read_lock(&nl_table_lock) to
read_lock_irqsave() in __netlink_diag_dump() to prevent a deadlock.
However, __netlink_diag_dump() calls sock_i_ino() that uses
read_lock_bh() and read_unlock_bh(). If CONFIG_TRACE_IRQFLAGS=y,
read_unlock_bh() finally enables IRQ even though it should stay
disabled until the following read_unlock_irqrestore().
Using read_lock() in sock_i_ino() would trigger a lockdep splat
in another place that was fixed in commit f064af1e500a ("net: fix
a lockdep splat"), so let's add __sock_i_ino() that would be safe
to use under BH disabled.
[0]:
WARNING: CPU: 0 PID: 5012 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
Modules linked in:
CPU: 0 PID: 5012 Comm: syz-executor487 Not tainted 6.4.0-rc7-syzkaller-00202-g6f68fc395f49 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
Code: 45 bf 01 00 00 00 e8 91 5b 0a 00 e8 3c 15 3d 00 fb 65 8b 05 ec e9 b5 7e 85 c0 74 58 5b 5d c3 65 8b 05 b2 b6 b4 7e 85 c0 75 a2 <0f> 0b eb 9e e8 89 15 3d 00 eb 9f 48 89 ef e8 6f 49 18 00 eb a8 0f
RSP: 0018:ffffc90003a1f3d0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff1cf5996
RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffff8805c6f3
RBP: ffffffff8805c6f3 R08: 0000000000000001 R09: ffff8880152b03a3
R10: ffffed1002a56074 R11: 0000000000000005 R12: 00000000000073e4
R13: dffffc0000000000 R14: 0000000000000002 R15: 0000000000000000
FS: 0000555556726300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000045ad50 CR3: 000000007c646000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
sock_i_ino+0x83/0xa0 net/core/sock.c:2559
__netlink_diag_dump+0x45c/0x790 net/netlink/diag.c:171
netlink_diag_dump+0xd6/0x230 net/netlink/diag.c:207
netlink_dump+0x570/0xc50 net/netlink/af_netlink.c:2269
__netlink_dump_start+0x64b/0x910 net/netlink/af_netlink.c:2374
netlink_dump_start include/linux/netlink.h:329 [inline]
netlink_diag_handler_dump+0x1ae/0x250 net/netlink/diag.c:238
__sock_diag_cmd net/core/sock_diag.c:238 [inline]
sock_diag_rcv_msg+0x31e/0x440 net/core/sock_diag.c:269
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2547
sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1914
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x71c/0x900 net/socket.c:2503
___sys_sendmsg+0x110/0x1b0 net/socket.c:2557
__sys_sendmsg+0xf7/0x1c0 net/socket.c:2586
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f5303aaabb9
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc7506e548 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5303aaabb9
RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
RBP: 00007f5303a6ed60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f5303a6edf0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Fixes: 8d61f926d420 ("netlink: fix potential deadlock in netlink_set_err()")
Reported-by: syzbot+5da61cf6a9bc1902d422@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=5da61cf6a9bc1902d422
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230626164313.52528-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When using the felix driver (the only one which supports UC filtering
and MC filtering) as a DSA master for a random other DSA switch, one can
see the following stack trace when the downstream switch ports join a
VLAN-aware bridge:
=============================
WARNING: suspicious RCU usage
-----------------------------
net/8021q/vlan_core.c:238 suspicious rcu_dereference_protected() usage!
stack backtrace:
Workqueue: dsa_ordered dsa_slave_switchdev_event_work
Call trace:
lockdep_rcu_suspicious+0x170/0x210
vlan_for_each+0x8c/0x188
dsa_slave_sync_uc+0x128/0x178
__hw_addr_sync_dev+0x138/0x158
dsa_slave_set_rx_mode+0x58/0x70
__dev_set_rx_mode+0x88/0xa8
dev_uc_add+0x74/0xa0
dsa_port_bridge_host_fdb_add+0xec/0x180
dsa_slave_switchdev_event_work+0x7c/0x1c8
process_one_work+0x290/0x568
What it's saying is that vlan_for_each() expects rtnl_lock() context and
it's not getting it, when it's called from the DSA master's ndo_set_rx_mode().
The caller of that - dsa_slave_set_rx_mode() - is the slave DSA
interface's dsa_port_bridge_host_fdb_add() which comes from the deferred
dsa_slave_switchdev_event_work().
We went to great lengths to avoid the rtnl_lock() context in that call
path in commit 0faf890fc519 ("net: dsa: drop rtnl_lock from
dsa_slave_switchdev_event_work"), and calling rtnl_lock() is simply not
an option due to the possibility of deadlocking when calling
dsa_flush_workqueue() from the call paths that do hold rtnl_lock() -
basically all of them.
So, when the DSA master calls vlan_for_each() from its ndo_set_rx_mode(),
the state of the 8021q driver on this device is really not protected
from concurrent access by anything.
Looking at net/8021q/, I don't think that vlan_info->vid_list was
particularly designed with RCU traversal in mind, so introducing an RCU
read-side form of vlan_for_each() - vlan_for_each_rcu() - won't be so
easy, and it also wouldn't be exactly what we need anyway.
In general I believe that the solution isn't in net/8021q/ anyway;
vlan_for_each() is not cut out for this task. DSA doesn't need rtnl_lock()
to be held per se - since it's not a netdev state change that we're
blocking, but rather, just concurrent additions/removals to a VLAN list.
We don't even need sleepable context - the callback of vlan_for_each()
just schedules deferred work.
The proposed escape is to remove the dependency on vlan_for_each() and
to open-code a non-sleepable, rtnl-free alternative to that, based on
copies of the VLAN list modified from .ndo_vlan_rx_add_vid() and
.ndo_vlan_rx_kill_vid().
Fixes: 64fdc5f341db ("net: dsa: sync unicast and multicast addresses for VLAN filters too")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230626154402.3154454-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This reverts commit 3f5f118bb657f94641ea383c7c1b8c09a5d46ea2.
Konrad reported that desktop environment below cannot be reached after
commit 3f5f118bb657 ("af_unix: Call scm_recv() only after scm_set_cred().")
- postmarketOS (Alpine Linux w/ musl 1.2.4)
- busybox 1.36.1
- GNOME 44.1
- networkmanager 1.42.6
- openrc 0.47
Regarding to the warning of SO_PASSPIDFD, I'll post another patch to
suppress it by skipping SCM_PIDFD if scm->pid == NULL in scm_pidfd_recv().
Reported-by: Konrad Dybcio <konradybcio@kernel.org>
Link: https://lore.kernel.org/netdev/8c7f9abd-4f84-7296-2788-1e130d6304a0@kernel.org/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20230626205837.82086-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Stephen reports warnings when rendering phylink kdocs as HTML:
include/linux/phylink.h:110: ERROR: Unexpected indentation.
include/linux/phylink.h:111: WARNING: Block quote ends without a blank line; unexpected unindent.
include/linux/phylink.h:614: WARNING: Inline literal start-string without end-string.
include/linux/phylink.h:644: WARNING: Inline literal start-string without end-string.
Make phylink_pcs_neg_mode() use a proper list format to fix the first
two warnings.
The last two warnings, AFAICT, come from the use of shorthand like
phylink_mode_*(). Perhaps those should be special-cased at the Sphinx
level.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/all/20230626162908.2f149f98@canb.auug.org.au/
Link: https://lore.kernel.org/r/20230626214640.3142252-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix the mishandling of MSG_DONTWAIT and also reinstates the per-page
checking of the source pages (which might have come from a DIO write by
userspace) by partially reverting the changes to support MSG_SPLICE_PAGES
and doing things a little differently. In messenger_v1:
(1) The ceph_tcp_sendpage() is resurrected and the callers reverted to use
that.
(2) The callers now pass MSG_MORE unconditionally. Previously, they were
passing in MSG_MORE|MSG_SENDPAGE_NOTLAST and then degrading that to
just MSG_MORE on the last call to ->sendpage().
(3) Make ceph_tcp_sendpage() a wrapper around sendmsg() rather than
sendpage(), setting MSG_SPLICE_PAGES if sendpage_ok() returns true on
the page.
In messenger_v2:
(4) Bring back do_try_sendpage() and make the callers use that.
(5) Make do_try_sendpage() use sendmsg() for both cases and set
MSG_SPLICE_PAGES if sendpage_ok() is set.
Fixes: 40a8c17aa770 ("ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage")
Fixes: fa094ccae1e7 ("ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()")
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Link: https://lore.kernel.org/r/CAOi1vP9vjLfk3W+AJFeexC93jqPaPUn2dD_4NrzxwoZTbYfOnw@mail.gmail.com/
Link: https://lore.kernel.org/r/CAOi1vP_Bn918j24S94MuGyn+Gxk212btw7yWeDrRcW1U8pc_BA@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/3101881.1687801973@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/3111635.1687813501@warthog.procyon.org.uk/ # v2
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Link: https://lore.kernel.org/r/3199652.1687873788@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Two deadly typos break RX and TX traffic on the VSC8502 PHY using RGMII
if phy-mode = "rgmii-id" or "rgmii-txid", and no "tx-internal-delay-ps"
override exists. The negative error code from phy_get_internal_delay()
does not get overridden with the delay deduced from the phy-mode, and
later gets committed to hardware. Also, the rx_delay gets overridden by
what should have been the tx_delay.
Fixes: dbb050d2bfc8 ("phy: mscc: Add support for RGMII delay configuration")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Harini Katakam <harini.katakam@amd.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230627134235.3453358-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use vmalloc_array and vcalloc to protect against
multiplication overflows.
The changes were done using the following Coccinelle
semantic patch:
// <smpl>
@initialize:ocaml@
@@
let rename alloc =
match alloc with
"vmalloc" -> "vmalloc_array"
| "vzalloc" -> "vcalloc"
| _ -> failwith "unknown"
@@
size_t e1,e2;
constant C1, C2;
expression E1, E2, COUNT, x1, x2, x3;
typedef u8;
typedef __u8;
type t = {u8,__u8,char,unsigned char};
identifier alloc = {vmalloc,vzalloc};
fresh identifier realloc = script:ocaml(alloc) { rename alloc };
@@
(
alloc(x1*x2*x3)
|
alloc(C1 * C2)
|
alloc((sizeof(t)) * (COUNT), ...)
|
- alloc((e1) * (e2))
+ realloc(e1, e2)
|
- alloc((e1) * (COUNT))
+ realloc(COUNT, e1)
|
- alloc((E1) * (E2))
+ realloc(E1, E2)
)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20230627144339.144478-23-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use vmalloc_array and vcalloc to protect against
multiplication overflows.
The changes were done using the following Coccinelle
semantic patch:
// <smpl>
@initialize:ocaml@
@@
let rename alloc =
match alloc with
"vmalloc" -> "vmalloc_array"
| "vzalloc" -> "vcalloc"
| _ -> failwith "unknown"
@@
size_t e1,e2;
constant C1, C2;
expression E1, E2, COUNT, x1, x2, x3;
typedef u8;
typedef __u8;
type t = {u8,__u8,char,unsigned char};
identifier alloc = {vmalloc,vzalloc};
fresh identifier realloc = script:ocaml(alloc) { rename alloc };
@@
(
alloc(x1*x2*x3)
|
alloc(C1 * C2)
|
alloc((sizeof(t)) * (COUNT), ...)
|
- alloc((e1) * (e2))
+ realloc(e1, e2)
|
- alloc((e1) * (COUNT))
+ realloc(COUNT, e1)
|
- alloc((E1) * (E2))
+ realloc(E1, E2)
)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20230627144339.144478-19-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use vmalloc_array and vcalloc to protect against
multiplication overflows.
The changes were done using the following Coccinelle
semantic patch:
// <smpl>
@initialize:ocaml@
@@
let rename alloc =
match alloc with
"vmalloc" -> "vmalloc_array"
| "vzalloc" -> "vcalloc"
| _ -> failwith "unknown"
@@
size_t e1,e2;
constant C1, C2;
expression E1, E2, COUNT, x1, x2, x3;
typedef u8;
typedef __u8;
type t = {u8,__u8,char,unsigned char};
identifier alloc = {vmalloc,vzalloc};
fresh identifier realloc = script:ocaml(alloc) { rename alloc };
@@
(
alloc(x1*x2*x3)
|
alloc(C1 * C2)
|
alloc((sizeof(t)) * (COUNT), ...)
|
- alloc((e1) * (e2))
+ realloc(e1, e2)
|
- alloc((e1) * (COUNT))
+ realloc(COUNT, e1)
|
- alloc((E1) * (E2))
+ realloc(E1, E2)
)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20230627144339.144478-12-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use vmalloc_array and vcalloc to protect against
multiplication overflows.
The changes were done using the following Coccinelle
semantic patch:
// <smpl>
@initialize:ocaml@
@@
let rename alloc =
match alloc with
"vmalloc" -> "vmalloc_array"
| "vzalloc" -> "vcalloc"
| _ -> failwith "unknown"
@@
size_t e1,e2;
constant C1, C2;
expression E1, E2, COUNT, x1, x2, x3;
typedef u8;
typedef __u8;
type t = {u8,__u8,char,unsigned char};
identifier alloc = {vmalloc,vzalloc};
fresh identifier realloc = script:ocaml(alloc) { rename alloc };
@@
(
alloc(x1*x2*x3)
|
alloc(C1 * C2)
|
alloc((sizeof(t)) * (COUNT), ...)
|
- alloc((e1) * (e2))
+ realloc(e1, e2)
|
- alloc((e1) * (COUNT))
+ realloc(COUNT, e1)
|
- alloc((E1) * (E2))
+ realloc(E1, E2)
)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20230627144339.144478-10-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use vmalloc_array and vcalloc to protect against
multiplication overflows.
The changes were done using the following Coccinelle
semantic patch:
// <smpl>
@initialize:ocaml@
@@
let rename alloc =
match alloc with
"vmalloc" -> "vmalloc_array"
| "vzalloc" -> "vcalloc"
| _ -> failwith "unknown"
@@
size_t e1,e2;
constant C1, C2;
expression E1, E2, COUNT, x1, x2, x3;
typedef u8;
typedef __u8;
type t = {u8,__u8,char,unsigned char};
identifier alloc = {vmalloc,vzalloc};
fresh identifier realloc = script:ocaml(alloc) { rename alloc };
@@
(
alloc(x1*x2*x3)
|
alloc(C1 * C2)
|
alloc((sizeof(t)) * (COUNT), ...)
|
- alloc((e1) * (e2))
+ realloc(e1, e2)
|
- alloc((e1) * (COUNT))
+ realloc(COUNT, e1)
|
- alloc((E1) * (E2))
+ realloc(E1, E2)
)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20230627144339.144478-5-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use vmalloc_array and vcalloc to protect against
multiplication overflows.
The changes were done using the following Coccinelle
semantic patch:
// <smpl>
@initialize:ocaml@
@@
let rename alloc =
match alloc with
"vmalloc" -> "vmalloc_array"
| "vzalloc" -> "vcalloc"
| _ -> failwith "unknown"
@@
size_t e1,e2;
constant C1, C2;
expression E1, E2, COUNT, x1, x2, x3;
typedef u8;
typedef __u8;
type t = {u8,__u8,char,unsigned char};
identifier alloc = {vmalloc,vzalloc};
fresh identifier realloc = script:ocaml(alloc) { rename alloc };
@@
(
alloc(x1*x2*x3)
|
alloc(C1 * C2)
|
alloc((sizeof(t)) * (COUNT), ...)
|
- alloc((e1) * (e2))
+ realloc(e1, e2)
|
- alloc((e1) * (COUNT))
+ realloc(COUNT, e1)
|
- alloc((E1) * (E2))
+ realloc(E1, E2)
)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20230627144339.144478-3-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add RmNet support for LARA-R6 01B.
The new LARA-R6 product variant identified by the "01B" string can be
configured (by AT interface) in three different USB modes:
* Default mode (Vendor ID: 0x1546 Product ID: 0x1311) with 4 serial
interfaces
* RmNet mode (Vendor ID: 0x1546 Product ID: 0x1312) with 4 serial
interfaces and 1 RmNet virtual network interface
* CDC-ECM mode (Vendor ID: 0x1546 Product ID: 0x1313) with 4 serial
interface and 1 CDC-ECM virtual network interface
The first 4 interfaces of all the 3 configurations (default, RmNet, ECM)
are the same.
In RmNet mode LARA-R6 01B exposes the following interfaces:
If 0: Diagnostic
If 1: AT parser
If 2: AT parser
If 3: AT parset/alternative functions
If 4: RMNET interface
Signed-off-by: Davide Tronchin <davide.tronchin.94@gmail.com>
Link: https://lore.kernel.org/r/20230626125336.3127-1-davide.tronchin.94@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Our MPTCP CI and Stephen got this error:
In file included from builtin-trace.c:907:
trace/beauty/msg_flags.c: In function 'syscall_arg__scnprintf_msg_flags':
trace/beauty/msg_flags.c:28:21: error: 'MSG_SPLICE_PAGES' undeclared (first use in this function)
28 | if (flags & MSG_##n) { | ^~~~
trace/beauty/msg_flags.c:50:9: note: in expansion of macro 'P_MSG_FLAG'
50 | P_MSG_FLAG(SPLICE_PAGES);
| ^~~~~~~~~~
trace/beauty/msg_flags.c:28:21: note: each undeclared identifier is reported only once for each function it appears in
28 | if (flags & MSG_##n) { | ^~~~
trace/beauty/msg_flags.c:50:9: note: in expansion of macro 'P_MSG_FLAG'
50 | P_MSG_FLAG(SPLICE_PAGES);
| ^~~~~~~~~~
The fix is similar to what was done with MSG_FASTOPEN: the new macro is
defined if it is not defined in the system headers.
Fixes: b848b26c6672 ("net: Kill MSG_SENDPAGE_NOTLAST")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/r/20230626112847.2ef3d422@canb.auug.org.au/
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230626090239.899672-1-matthieu.baerts@tessares.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
ipvlan_queue_xmit() should return NET_XMIT_XXX, but
ipvlan_xmit_mode_l2/l3() returns rx_handler_result_t or NET_RX_XXX
in some cases. ipvlan_rcv_frame() will only return RX_HANDLER_CONSUMED
in ipvlan_xmit_mode_l2/l3() because 'local' is true. It's equal to
NET_XMIT_SUCCESS. But dev_forward_skb() can return NET_RX_SUCCESS or
NET_RX_DROP, and returning NET_RX_DROP(NET_XMIT_DROP) will increase
both ipvlan and ipvlan->phy_dev drops counter.
The skb to forward can be treated as xmitted successfully. This patch
makes ipvlan_queue_xmit() return NET_XMIT_SUCCESS for forward skb.
Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230626093347.7492-1-cambda@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This was noticed by a user who noticied that the mtime of a file
backing a loopback device was getting bumped when the loopback device
is mounted read/only. Note: This doesn't show up when doing a
loopback mount of a file directly, via "mount -o ro /tmp/foo.img
/mnt", since the loop device is set read-only when mount automatically
creates loop device. However, this is noticeable for a LUKS loop
device like this:
% cryptsetup luksOpen /tmp/foo.img test
% mount -o ro /dev/loop0 /mnt ; umount /mnt
or, if LUKS is not in use, if the user manually creates the loop
device like this:
% losetup /dev/loop0 /tmp/foo.img
% mount -o ro /dev/loop0 /mnt ; umount /mnt
The modified mtime causes rsync to do a rolling checksum scan of the
file on the local and remote side, incrementally increasing the time
to rsync the not-modified-but-touched image file.
Fixes: eee00237fa5e ("ext4: commit super block if fs record error when journal record without error")
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/ZIauBR7YiV3rVAHL@glitch
Reported-by: Sean Greenslade <sean@seangreenslade.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We got a NULL pointer dereference issue below while running generic/475
I/O failure pressure test.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
RIP: 0010:jbd2_journal_set_features+0x13d/0x430
...
Call Trace:
<TASK>
? __die+0x23/0x60
? page_fault_oops+0xa4/0x170
? exc_page_fault+0x67/0x170
? asm_exc_page_fault+0x26/0x30
? jbd2_journal_set_features+0x13d/0x430
jbd2_journal_revoke+0x47/0x1e0
__ext4_forget+0xc3/0x1b0
ext4_free_blocks+0x214/0x2f0
ext4_free_branches+0xeb/0x270
ext4_ind_truncate+0x2bf/0x320
ext4_truncate+0x1e4/0x490
ext4_handle_inode_extension+0x1bd/0x2a0
? iomap_dio_complete+0xaf/0x1d0
The root cause is the journal super block had been failed to write out
due to I/O fault injection, it's uptodate bit was cleared by
end_buffer_write_sync() and didn't reset yet in jbd2_write_superblock().
And it raced by journal_get_superblock()->bh_read(), unfortunately, the
read IO is also failed, so the error handling in
journal_fail_superblock() unexpectedly clear the journal->j_sb_buffer,
finally lead to above NULL pointer dereference issue.
If the journal super block had been read and verified, there is no need
to call bh_read() read it again even if it has been failed to written
out. So the fix could be simply move buffer_verified(bh) in front of
bh_read(). Also remove a stale comment left in
jbd2_journal_check_used_features().
Fixes: 51bacdba23d8 ("jbd2: factor out journal initialization from journal_get_superblock()")
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616015547.3155195-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
freeze_bdev() can fail due to a lot of reasons, it needs to check its
reason before later process.
Fixes: 783d94854499 ("ext4: add EXT4_IOC_GOINGDOWN ioctl")
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230606073203.1310389-1-chao@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Rename ext4_quota_off_umount() to ext4_quotas_off(), and add type
parameter to replace open code in ext4_enable_quotas().
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230327141630.156875-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Yi found during a review of the patch "ext4: don't BUG on inconsistent
journal feature" that when ext4_mark_recovery_complete() returns an error
value, the error handling path does not turn off the enabled quotas,
which triggers the following kmemleak:
================================================================
unreferenced object 0xffff8cf68678e7c0 (size 64):
comm "mount", pid 746, jiffies 4294871231 (age 11.540s)
hex dump (first 32 bytes):
00 90 ef 82 f6 8c ff ff 00 00 00 00 41 01 00 00 ............A...
c7 00 00 00 bd 00 00 00 0a 00 00 00 48 00 00 00 ............H...
backtrace:
[<00000000c561ef24>] __kmem_cache_alloc_node+0x4d4/0x880
[<00000000d4e621d7>] kmalloc_trace+0x39/0x140
[<00000000837eee74>] v2_read_file_info+0x18a/0x3a0
[<0000000088f6c877>] dquot_load_quota_sb+0x2ed/0x770
[<00000000340a4782>] dquot_load_quota_inode+0xc6/0x1c0
[<0000000089a18bd5>] ext4_enable_quotas+0x17e/0x3a0 [ext4]
[<000000003a0268fa>] __ext4_fill_super+0x3448/0x3910 [ext4]
[<00000000b0f2a8a8>] ext4_fill_super+0x13d/0x340 [ext4]
[<000000004a9489c4>] get_tree_bdev+0x1dc/0x370
[<000000006e723bf1>] ext4_get_tree+0x1d/0x30 [ext4]
[<00000000c7cb663d>] vfs_get_tree+0x31/0x160
[<00000000320e1bed>] do_new_mount+0x1d5/0x480
[<00000000c074654c>] path_mount+0x22e/0xbe0
[<0000000003e97a8e>] do_mount+0x95/0xc0
[<000000002f3d3736>] __x64_sys_mount+0xc4/0x160
[<0000000027d2140c>] do_syscall_64+0x3f/0x90
================================================================
To solve this problem, we add a "failed_mount10" tag, and call
ext4_quota_off_umount() in this tag to release the enabled qoutas.
Fixes: 11215630aada ("ext4: don't BUG on inconsistent journal feature")
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230327141630.156875-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Update the description in doc about the s_head parameter in the journal
superblock.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20230322013353.1843306-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Always enable 'JBD2_CYCLE_RECORD' journal option on ext4, letting the
jbd2 continue to record new journal transactions from the recovered
journal head or the checkpointed transactions in the previous mount.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230322013353.1843306-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
For a newly mounted file system, the journal committing thread always
record new transactions from the start of the journal area, no matter
whether the journal was clean or just has been recovered. So the logdump
code in debugfs cannot dump continuous logs between each mount, it is
disadvantageous to analysis corrupted file system image and locate the
file system inconsistency bugs.
If we get a corrupted file system in the running products and want to
find out what has happened, besides lookup the system log, one effective
way is to backtrack the journal log. But we may not always run e2fsck
before each mount and the default fsck -a mode also cannot always
checkout all inconsistencies, so it could left over some inconsistencies
into the next mount until we detect it. Finally, transactions in the
journal may probably discontinuous and some relatively new transactions
has been covered, it becomes hard to analyse. If we could record
transactions continuously between each mount, we could acquire more
useful info from the journal. Like this:
|Previous mount checkpointed/recovered logs|Current mount logs |
|{------}{---}{--------} ... {------}| ... |{======}{========}...000000|
And yes the journal area is limited and cannot record everything, the
problematic transaction may also be covered even if we do this, but
this is still useful for fuzzy tests and short-running products.
This patch save the head blocknr in the superblock after flushing the
journal or unmounting the file system, let the next mount could continue
to record new transaction behind it. This change is backward compatible
because the old kernel does not care about the head blocknr of the
journal. It is also fine if we mount a clean old image without valid
head blocknr, we fail back to set it to s_first just like before.
Finally, for the case of mount an unclean file system, we could also get
the journal head easily after scanning/replaying the journal, it will
continue to record new transaction after the recovered transactions.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230322013353.1843306-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
journal->j_format_version is no longer used, remove it.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-7-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Current journal_get_superblock() couple journal superblock checking and
partial journal initialization, factor out initialization part from it
to make things clear.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-6-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We should only check and set extented features if journal format version
is 2, and now we check the in memory copy of the superblock
'journal->j_format_version', which relys on the parameter initialization
sequence, switch to use the h_blocktype in superblock cloud be more
clear.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-5-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
JBD2_HAS_[IN|RO_]COMPAT_FEATURE macros are no longer used, just remove
them.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-4-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
As discussed in [1], 'sbi->s_journal_bdev != sb->s_bdev' will always
become true if sbi->s_journal_bdev exists. Filesystem block device and
journal block device are both opened with 'FMODE_EXCL' mode, so these
two devices can't be same one. Then we can remove the redundant checking
'sbi->s_journal_bdev != sb->s_bdev' if 'sbi->s_journal_bdev' exists.
[1] https://lore.kernel.org/lkml/f86584f6-3877-ff18-47a1-2efaa12d18b2@huawei.com/
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-3-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Following process makes ext4 load stale buffer heads from last failed
mounting in a new mounting operation:
mount_bdev
ext4_fill_super
| ext4_load_and_init_journal
| ext4_load_journal
| jbd2_journal_load
| load_superblock
| journal_get_superblock
| set_buffer_verified(bh) // buffer head is verified
| jbd2_journal_recover // failed caused by EIO
| goto failed_mount3a // skip 'sb->s_root' initialization
deactivate_locked_super
kill_block_super
generic_shutdown_super
if (sb->s_root)
// false, skip ext4_put_super->invalidate_bdev->
// invalidate_mapping_pages->mapping_evict_folio->
// filemap_release_folio->try_to_free_buffers, which
// cannot drop buffer head.
blkdev_put
blkdev_put_whole
if (atomic_dec_and_test(&bdev->bd_openers))
// false, systemd-udev happens to open the device. Then
// blkdev_flush_mapping->kill_bdev->truncate_inode_pages->
// truncate_inode_folio->truncate_cleanup_folio->
// folio_invalidate->block_invalidate_folio->
// filemap_release_folio->try_to_free_buffers will be skipped,
// dropping buffer head is missed again.
Second mount:
ext4_fill_super
ext4_load_and_init_journal
ext4_load_journal
ext4_get_journal
jbd2_journal_init_inode
journal_init_common
bh = getblk_unmovable
bh = __find_get_block // Found stale bh in last failed mounting
journal->j_sb_buffer = bh
jbd2_journal_load
load_superblock
journal_get_superblock
if (buffer_verified(bh))
// true, skip journal->j_format_version = 2, value is 0
jbd2_journal_recover
do_one_pass
next_log_block += count_tags(journal, bh)
// According to journal_tag_bytes(), 'tag_bytes' calculating is
// affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3()
// returns false because 'j->j_format_version >= 2' is not true,
// then we get wrong next_log_block. The do_one_pass may exit
// early whenoccuring non JBD2_MAGIC_NUMBER in 'next_log_block'.
The filesystem is corrupted here, journal is partially replayed, and
new journal sequence number actually is already used by last mounting.
The invalidate_bdev() can drop all buffer heads even racing with bare
reading block device(eg. systemd-udev), so we can fix it by invalidating
bdev in error handling path in __ext4_fill_super().
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171
Fixes: 25ed6e8a54df ("jbd2: enable journal clients to enable v2 checksumming")
Cc: stable@vger.kernel.org # v3.5
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-2-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We've had reports of significant performance regression of sub-block
(unaligned) direct writes due to the added exclusivity restrictions
in ext4. The purpose of the exclusivity requirement for unaligned
direct writes is to avoid data corruption caused by unserialized
partial block zeroing in the iomap dio layer across overlapping
writes.
XFS has similar requirements for the same underlying reasons, yet
doesn't suffer the extreme performance regression that ext4 does.
The reason for this is that XFS utilizes IOMAP_DIO_OVERWRITE_ONLY
mode, which allows for optimistic submission of concurrent unaligned
I/O and kicks back writes that require partial block zeroing such
that they can be submitted in a safe, exclusive context. Since ext4
already performs most of these checks pre-submission, it can support
something similar without necessarily relying on the iomap flag and
associated retry mechanism.
Update the dio write submission path to allow concurrent submission
of unaligned direct writes that are purely overwrite and so will not
require block zeroing. To improve readability of the various related
checks, move the unaligned I/O handling down into
ext4_dio_write_checks(), where the dio draining and force wait logic
can immediately follow the locking requirement checks. Finally, the
IOMAP_DIO_OVERWRITE_ONLY flag is set to enable a warning check as a
precaution should the ext4 overwrite logic ever become inconsistent
with the zeroing expectations of iomap dio.
The performance improvement of sub-block direct write I/O is shown
in the following fio test on a 64xcpu guest vm:
Test: fio --name=test --ioengine=libaio --direct=1 --group_reporting
--overwrite=1 --thread --size=10G --filename=/mnt/fio
--readwrite=write --ramp_time=10s --runtime=60s --numjobs=8
--blocksize=2k --iodepth=256 --allow_file_create=0
v6.2: write: IOPS=4328, BW=8724KiB/s
v6.2 (patched): write: IOPS=801k, BW=1565MiB/s
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230314130759.642710-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Line wrap and slightly clarify the comments describing mballoc's
cirtiera.
Define EXT4_MB_NUM_CRS as part of the enum, so that it will
automatically get updated when criteria is added or removed.
Also fix a potential unitialized use of 'cr' variable if
CONFIG_EXT4_DEBUG is enabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After ext4_es_insert_extent() returns void, the return value in
ext4_zeroout_es() is also unnecessary, so make it return void too.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now ext4_es_insert_extent() never return error, so make it return void.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now it never fails when inserting a delay extent, so the return value in
ext4_es_insert_delayed_block is no longer necessary, let it return void.
[ Fixed bug which caused system hangs during bigalloc test runs. See
https://lore.kernel.org/r/20230612030405.GH1436857@mit.edu for more
details. -- TYT ]
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now ext4_es_remove_extent() never fails, so make it return void.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Similar to in ext4_es_insert_delayed_block(), we use preallocations that
do not fail to avoid inconsistencies, but we do not care about es that are
not must be kept, and we return 0 even if such es memory allocation fails.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Similar to in ext4_es_remove_extent(), we use a no-fail preallocation
to avoid inconsistencies, except that here we may have to preallocate
two extent_status.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If __es_remove_extent() returns an error it means that when splitting
extent, allocating an extent that must be kept failed, where returning
an error directly would cause the extent tree to be inconsistent. So we
use GFP_NOFAIL to pre-allocate an extent_status and pass it to
__es_remove_extent() to avoid this problem.
In addition, since the allocated memory is outside the i_es_lock, the
extent_status tree may change and the pre-allocated extent_status is
no longer needed, so we release the pre-allocated extent_status when
es->es_len is not initialized.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When splitting extent, if the second extent can not be dropped, we return
-ENOMEM and use GFP_NOFAIL to preallocate an extent_status outside of
i_es_lock and pass it to __es_remove_extent() to be used as the second
extent. This ensures that __es_remove_extent() is executed successfully,
thus ensuring consistency in the extent status tree. If the second extent
is not undroppable, we simply drop it and return 0. Then retry is no longer
necessary, remove it.
Now, __es_remove_extent() will always remove what it should, maybe more.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|