| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Letting xenbus_grant_ring() tear down grants in the error case is
problematic, as the other side could already have used these grants.
Calling gnttab_end_foreign_access_ref() without checking success is
resulting in an unclear situation for any caller of xenbus_grant_ring()
as in the error case the memory pages of the ring page might be
partially mapped. Freeing them would risk unwanted foreign access to
them, while not freeing them would leak memory.
In order to remove the need to undo any gnttab_grant_foreign_access()
calls, use gnttab_alloc_grant_references() to make sure no further
error can occur in the loop granting access to the ring pages.
It should be noted that this way of handling removes leaking of
grant entries in the error case, too.
This is CVE-2022-23040 / part of XSA-396.
Reported-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
| |
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes for various problems that have user visible effects
or seem to be urgent:
- fix corruption when combining DIO and non-blocking io_uring over
multiple extents (seen on MariaDB)
- fix relocation crash due to premature return from commit
- fix quota deadlock between rescan and qgroup removal
- fix item data bounds checks in tree-checker (found on a fuzzed
image)
- fix fsync of prealloc extents after EOF
- add missing run of delayed items after unlink during log replay
- don't start relocation until snapshot drop is finished
- fix reversed condition for subpage writers locking
- fix warning on page error"
* tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fallback to blocking mode when doing async dio over multiple extents
btrfs: add missing run of delayed items after unlink during log replay
btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
btrfs: do not start relocation until in progress drops are done
btrfs: tree-checker: use u64 for item data end to avoid overflow
btrfs: do not WARN_ON() if we have PageError set
btrfs: fix lost prealloc extents beyond eof after full fsync
btrfs: subpage: fix a wrong check on subpage->writers
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Some users recently reported that MariaDB was getting a read corruption
when using io_uring on top of btrfs. This started to happen in 5.16,
after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
during direct IO reads and writes"). That changed btrfs to use the new
iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
corresponds to a memory mapped file region. That type of scenario is
exercised by test case generic/647 from fstests.
For this MariaDB scenario, we attempt to read 16K from file offset X
using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
with a size of 4K, and what happens is the following:
1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
2) iomap creates a struct iomap_dio object, its reference count is
initialized to 1 and its ->size field is initialized to 0;
3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
the first 4K extent, and setups an iomap for this extent consisting
of a single page;
4) At iomap_dio_bio_iter(), we are able to access the first page of the
buffer (struct iov_iter) with bio_iov_iter_get_pages() without
triggering a page fault;
5) iomap submits a bio for this 4K extent
(iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
the refcount on the struct iomap_dio object to 2; The ->size field
of the struct iomap_dio object is incremented to 4K;
6) iomap calls btrfs_iomap_begin() again, this time with a file
offset of X + 4K. There we setup an iomap for the next extent
that also has a size of 4K;
7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
which tries to access the next page (2nd page) of the buffer.
This triggers a page fault and returns -EFAULT;
8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
the struct iomap_dio object has a ->size value of 4K (we submitted
a bio for an extent already). The 'wait_for_completion' variable
is not set to true, because our iocb has IOCB_NOWAIT set;
9) At the bottom of __iomap_dio_rw(), we decrement the reference count
of the struct iomap_dio object from 2 to 1. Because we were not
the only ones holding a reference on it and 'wait_for_completion' is
set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
just returns it up the callchain, up to io_uring;
10) The bio submitted for the first extent (step 5) completes and its
bio endio function, iomap_dio_bio_end_io(), decrements the last
reference on the struct iomap_dio object, resulting in calling
iomap_dio_complete_work() -> iomap_dio_complete().
11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
and return 4K (the amount of io done) to iomap_dio_complete_work();
12) iomap_dio_complete_work() calls the iocb completion callback,
iocb->ki_complete() with a second argument value of 4K (total io
done) and the iocb with the adjust ki_pos of X + 4K. This results
in completing the read request for io_uring, leaving it with a
result of 4K bytes read, and only the first page of the buffer
filled in, while the remaining 3 pages, corresponding to the other
3 extents, were not filled;
13) For the application, the result is unexpected because if we ask
to read N bytes, it expects to get N bytes read as long as those
N bytes don't cross the EOF (i_size).
MariaDB reports this as an error, as it's not expecting a short read,
since it knows it's asking for read operations fully within the i_size
boundary. This is typical in many applications, but it may also be
questionable if they should react to such short reads by issuing more
read calls to get the remaining data. Nevertheless, the short read
happened due to a change in btrfs regarding how it deals with page
faults while in the middle of a read operation, and there's no reason
why btrfs can't have the previous behaviour of returning the whole data
that was requested by the application.
The problem can also be triggered with the following simple program:
/* Get O_DIRECT */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <liburing.h>
int main(int argc, char *argv[])
{
char *foo_path;
struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
struct iovec iovec;
int fd;
long pagesize;
void *write_buf;
void *read_buf;
ssize_t ret;
int i;
if (argc != 2) {
fprintf(stderr, "Use: %s <directory>\n", argv[0]);
return 1;
}
foo_path = malloc(strlen(argv[1]) + 5);
if (!foo_path) {
fprintf(stderr, "Failed to allocate memory for file path\n");
return 1;
}
strcpy(foo_path, argv[1]);
strcat(foo_path, "/foo");
/*
* Create file foo with 2 extents, each with a size matching
* the page size. Then allocate a buffer to read both extents
* with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
* the read with io_uring, access the first page of the buffer
* to fault it in, so that during the read we only trigger a
* page fault when accessing the second page of the buffer.
*/
fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
O_DIRECT, 0666);
if (fd == -1) {
fprintf(stderr,
"Failed to create file 'foo': %s (errno %d)",
strerror(errno), errno);
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate write buffer\n");
return 1;
}
memset(write_buf, 0xab, pagesize);
memset(write_buf + pagesize, 0xcd, pagesize);
/* Create 2 extents, each with a size matching page size. */
for (i = 0; i < 2; i++) {
ret = pwrite(fd, write_buf + i * pagesize, pagesize,
i * pagesize);
if (ret != pagesize) {
fprintf(stderr,
"Failed to write to file, ret = %ld errno %d (%s)\n",
ret, errno, strerror(errno));
return 1;
}
ret = fsync(fd);
if (ret != 0) {
fprintf(stderr, "Failed to fsync file\n");
return 1;
}
}
close(fd);
fd = open(foo_path, O_RDONLY | O_DIRECT);
if (fd == -1) {
fprintf(stderr,
"Failed to open file 'foo': %s (errno %d)",
strerror(errno), errno);
return 1;
}
ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate read buffer\n");
return 1;
}
/*
* Fault in only the first page of the read buffer.
* We want to trigger a page fault for the 2nd page of the
* read buffer during the read operation with io_uring
* (O_DIRECT and IOCB_NOWAIT).
*/
memset(read_buf, 0, 1);
ret = io_uring_queue_init(1, &ring, 0);
if (ret != 0) {
fprintf(stderr, "Failed to create io_uring queue\n");
return 1;
}
sqe = io_uring_get_sqe(&ring);
if (!sqe) {
fprintf(stderr, "Failed to get io_uring sqe\n");
return 1;
}
iovec.iov_base = read_buf;
iovec.iov_len = 2 * pagesize;
io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
ret = io_uring_submit_and_wait(&ring, 1);
if (ret != 1) {
fprintf(stderr,
"Failed at io_uring_submit_and_wait()\n");
return 1;
}
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
return 1;
}
printf("io_uring read result for file foo:\n\n");
printf(" cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize);
printf(" memcmp(read_buf, write_buf) == %d (expected 0)\n",
memcmp(read_buf, write_buf, 2 * pagesize));
io_uring_cqe_seen(&ring, cqe);
io_uring_queue_exit(&ring);
return 0;
}
When running it on an unpatched kernel:
$ gcc io_uring_test.c -luring
$ mkfs.btrfs -f /dev/sda
$ mount /dev/sda /mnt/sda
$ ./a.out /mnt/sda
io_uring read result for file foo:
cqe->res == 4096 (expected 8192)
memcmp(read_buf, write_buf) == -205 (expected 0)
After this patch, the read always returns 8192 bytes, with the buffer
filled with the correct data. Although that reproducer always triggers
the bug in my test vms, it's possible that it will not be so reliable
on other environments, as that can happen if the bio for the first
extent completes and decrements the reference on the struct iomap_dio
object before we do the atomic_dec_and_test() on the reference at
__iomap_dio_rw().
Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
set) over a range that spans multiple extents (or a mix of extents and
holes). This avoids returning success to the caller when we only did
partial IO, which is not optimal for writes and for reads it's actually
incorrect, as the caller doesn't expect to get less bytes read than it has
requested (unless EOF is crossed), as previously mentioned. This is also
the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
even though it doesn't use IOMAP_DIO_PARTIAL.
A test case for fstests will follow soon.
Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
During log replay, whenever we need to check if a name (dentry) exists in
a directory we do searches on the subvolume tree for inode references or
or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY
keys as well, before kernel 5.17). However when during log replay we
unlink a name, through btrfs_unlink_inode(), we may not delete inode
references and dir index keys from a subvolume tree and instead just add
the deletions to the delayed inode's delayed items, which will only be
run when we commit the transaction used for log replay. This means that
after an unlink operation during log replay, if we attempt to search for
the same name during log replay, we will not see that the name was already
deleted, since the deletion is recorded only on the delayed items.
We run delayed items after every unlink operation during log replay,
except at unlink_old_inode_refs() and at add_inode_ref(). This was due
to an overlook, as delayed items should be run after evert unlink, for
the reasons stated above.
So fix those two cases.
Fixes: 0d836392cadd5 ("Btrfs: fix mount failure after fsync due to hard link recreation")
Fixes: 1f250e929a9c9 ("Btrfs: fix log replay failure after unlink and link combination")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The commit e804861bd4e6 ("btrfs: fix deadlock between quota disable and
qgroup rescan worker") by Kawasaki resolves deadlock between quota
disable and qgroup rescan worker. But also there is a deadlock case like
it. It's about enabling or disabling quota and creating or removing
qgroup. It can be reproduced in simple script below.
for i in {1..100}
do
btrfs quota enable /mnt &
btrfs qgroup create 1/0 /mnt &
btrfs qgroup destroy 1/0 /mnt &
btrfs quota disable /mnt &
done
Here's why the deadlock happens:
1) The quota rescan task is running.
2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock
mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for
the quota rescan task to complete.
3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock
the qgroup_ioctl_lock mutex, because it's being held by task A. At that
point task B is holding a transaction handle for the current transaction.
4) The quota rescan task calls btrfs_commit_transaction(). This results
in it waiting for all other tasks to release their handles on the
transaction, but task B is blocked on the qgroup_ioctl_lock mutex
while holding a handle on the transaction, and that mutex is being held
by task A, which is waiting for the quota rescan task to complete,
resulting in a deadlock between these 3 tasks.
To resolve this issue, the thread disabling quota should unlock
qgroup_ioctl_lock before waiting rescan completion. Move
btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock.
Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
btrfs_commit_transaction()
We are seeing crashes similar to the following trace:
[38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
[38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54
[38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
[38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206
[38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14
[38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360
[38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000
[38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800
[38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360
[38.987146] FS: 00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000
[38.988662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0
[38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[38.992528] Call Trace:
[38.992854] <TASK>
[38.993148] btrfs_relocate_chunk+0x27/0xe0 [btrfs]
[38.993941] btrfs_balance+0x78e/0xea0 [btrfs]
[38.994801] ? vsnprintf+0x33c/0x520
[38.995368] ? __kmalloc_track_caller+0x351/0x440
[38.996198] btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs]
[38.997084] btrfs_ioctl+0x11b0/0x2da0 [btrfs]
[38.997867] ? mod_objcg_state+0xee/0x340
[38.998552] ? seq_release+0x24/0x30
[38.999184] ? proc_nr_files+0x30/0x30
[38.999654] ? call_rcu+0xc8/0x2f0
[39.000228] ? __x64_sys_ioctl+0x84/0xc0
[39.000872] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[39.001973] __x64_sys_ioctl+0x84/0xc0
[39.002566] do_syscall_64+0x3a/0x80
[39.003011] entry_SYSCALL_64_after_hwframe+0x44/0xae
[39.003735] RIP: 0033:0x7f11c166959b
[39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b
[39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003
[39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0
[39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3
[39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001
[39.015040] </TASK>
[39.015418] ---[ end trace 0000000000000000 ]---
[43.131559] ------------[ cut here ]------------
[43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717!
[43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G W 5.17.0-rc4 #54
[43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
[43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
[43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
[43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
[43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
[43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
[43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
[43.145686] FS: 00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000
[43.146808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0
[43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[43.150559] Call Trace:
[43.150904] <TASK>
[43.151253] btrfs_finish_extent_commit+0x88/0x290 [btrfs]
[43.152127] btrfs_commit_transaction+0x74f/0xaa0 [btrfs]
[43.152932] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
[43.153786] btrfs_ioctl+0x1edc/0x2da0 [btrfs]
[43.154475] ? __check_object_size+0x150/0x170
[43.155170] ? preempt_count_add+0x49/0xa0
[43.155753] ? __x64_sys_ioctl+0x84/0xc0
[43.156437] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[43.157456] __x64_sys_ioctl+0x84/0xc0
[43.157980] do_syscall_64+0x3a/0x80
[43.158543] entry_SYSCALL_64_after_hwframe+0x44/0xae
[43.159231] RIP: 0033:0x7f7657f1e59b
[43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b
[43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003
[43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800
[43.166863] </TASK>
[43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common
[43.169552] ---[ end trace 0000000000000000 ]---
[43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
[43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
[43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
[43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
[43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
[43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
[43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
[43.180071] FS: 00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000
[43.181073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0
[43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
We first hit the WARN_ON(rc->block_group->pinned > 0) in
btrfs_relocate_block_group() and then the BUG_ON(!cache) in
unpin_extent_range(). This tells us that we are exiting relocation and
removing the block group with bytes still pinned for that block group.
This is supposed to be impossible: the last thing relocate_block_group()
does is commit the transaction to get rid of pinned extents.
Commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") introduced an optimization so that
commits from fsync don't have to wait for the previous commit to unpin
extents. This was only intended to affect fsync, but it inadvertently
made it possible for any commit to skip waiting for the previous commit
to unpin. This is because if a call to btrfs_commit_transaction() finds
that another thread is already committing the transaction, it waits for
the other thread to complete the commit and then returns. If that other
thread was in fsync, then it completes the commit without completing the
previous commit. This makes the following sequence of events possible:
Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________
btrfs_commit_transaction(N) | |
btrfs_run_delayed_refs | |
pin extents | |
... | |
state = UNBLOCKED |btrfs_sync_file |
| btrfs_start_transaction(N + 1) |relocate_block_group
| | btrfs_join_transaction(N + 1)
| btrfs_commit_transaction(N + 1) |
... | trans->state = COMMIT_START |
| | btrfs_commit_transaction(N + 1)
| | wait_for_commit(N + 1, COMPLETED)
| wait_for_commit(N, SUPER_COMMITTED)|
state = SUPER_COMMITTED | ... |
btrfs_finish_extent_commit| |
unpin_extent_range() | trans->state = COMPLETED |
| | return
| |
... | |Thread 1 isn't done, so pinned > 0
| |and we WARN
| |
| |btrfs_remove_block_group
unpin_extent_range() | |
Thread 3 removed the | |
block group, so we BUG| |
There are other sequences involving SUPER_COMMITTED transactions that
can cause a similar outcome.
We could fix this by making relocation explicitly wait for unpinning,
but there may be other cases that need it. Josef mentioned ENOSPC
flushing and the free space cache inode as other potential victims.
Rather than playing whack-a-mole, this fix is conservative and makes all
commits not in fsync wait for all previous transactions, which is what
the optimization intended.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We hit a bug with a recovering relocation on mount for one of our file
systems in production. I reproduced this locally by injecting errors
into snapshot delete with balance running at the same time. This
presented as an error while looking up an extent item
WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
RIP: 0010:lookup_inline_extent_backref+0x647/0x680
RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
Call Trace:
<TASK>
insert_inline_extent_backref+0x46/0xd0
__btrfs_inc_extent_ref.isra.0+0x5f/0x200
? btrfs_merge_delayed_refs+0x164/0x190
__btrfs_run_delayed_refs+0x561/0xfa0
? btrfs_search_slot+0x7b4/0xb30
? btrfs_update_root+0x1a9/0x2c0
btrfs_run_delayed_refs+0x73/0x1f0
? btrfs_update_root+0x1a9/0x2c0
btrfs_commit_transaction+0x50/0xa50
? btrfs_update_reloc_root+0x122/0x220
prepare_to_merge+0x29f/0x320
relocate_block_group+0x2b8/0x550
btrfs_relocate_block_group+0x1a6/0x350
btrfs_relocate_chunk+0x27/0xe0
btrfs_balance+0x777/0xe60
balance_kthread+0x35/0x50
? btrfs_balance+0xe60/0xe60
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Normally snapshot deletion and relocation are excluded from running at
the same time by the fs_info->cleaner_mutex. However if we had a
pending balance waiting to get the ->cleaner_mutex, and a snapshot
deletion was running, and then the box crashed, we would come up in a
state where we have a half deleted snapshot.
Again, in the normal case the snapshot deletion needs to complete before
relocation can start, but in this case relocation could very well start
before the snapshot deletion completes, as we simply add the root to the
dead roots list and wait for the next time the cleaner runs to clean up
the snapshot.
Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
had a pending drop_progress key. If they do then we know we were in the
middle of the drop operation and set a flag on the fs_info. Then
balance can wait until this flag is cleared to start up again.
If there are DEAD_ROOT's that don't have a drop_progress set then we're
safe to start balance right away as we'll be properly protected by the
cleaner_mutex.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
User reported there is an array-index-out-of-bounds access while
mounting the crafted image:
[350.411942 ] loop0: detected capacity change from 0 to 262144
[350.427058 ] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (1044)
[350.428564 ] BTRFS info (device loop0): disk space caching is enabled
[350.428568 ] BTRFS info (device loop0): has skinny extents
[350.429589 ]
[350.429619 ] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1
[350.429636 ] index 1048096 is out of range for type 'page *[16]'
[350.429650 ] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4
[350.429652 ] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[350.429653 ] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
[350.429772 ] Call Trace:
[350.429774 ] <TASK>
[350.429776 ] dump_stack_lvl+0x47/0x5c
[350.429780 ] ubsan_epilogue+0x5/0x50
[350.429786 ] __ubsan_handle_out_of_bounds+0x66/0x70
[350.429791 ] btrfs_get_16+0xfd/0x120 [btrfs]
[350.429832 ] check_leaf+0x754/0x1a40 [btrfs]
[350.429874 ] ? filemap_read+0x34a/0x390
[350.429878 ] ? load_balance+0x175/0xfc0
[350.429881 ] validate_extent_buffer+0x244/0x310 [btrfs]
[350.429911 ] btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs]
[350.429935 ] end_bio_extent_readpage+0x3af/0x850 [btrfs]
[350.429969 ] ? newidle_balance+0x259/0x480
[350.429972 ] end_workqueue_fn+0x29/0x40 [btrfs]
[350.429995 ] btrfs_work_helper+0x71/0x330 [btrfs]
[350.430030 ] ? __schedule+0x2fb/0xa40
[350.430033 ] process_one_work+0x1f6/0x400
[350.430035 ] ? process_one_work+0x400/0x400
[350.430036 ] worker_thread+0x2d/0x3d0
[350.430037 ] ? process_one_work+0x400/0x400
[350.430038 ] kthread+0x165/0x190
[350.430041 ] ? set_kthread_struct+0x40/0x40
[350.430043 ] ret_from_fork+0x1f/0x30
[350.430047 ] </TASK>
[350.430047 ]
[350.430077 ] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2
btrfs check reports:
corrupt leaf: root=3 block=20975616 physical=20975616 slot=1, unexpected
item end, have 4294971193 expect 3897
The first slot item offset is 4293005033 and the size is 1966160.
In check_leaf, we use btrfs_item_end() to check item boundary versus
extent_buffer data size. However, return type of btrfs_item_end() is u32.
(u32)(4293005033 + 1966160) == 3897, overflow happens and the result 3897
equals to leaf data size reasonably.
Fix it by use u64 variable to store item data end in check_leaf() to
avoid u32 overflow.
This commit does solve the invalid memory access showed by the stack
trace. However, its metadata profile is DUP and another copy of the
leaf is fine. So the image can be mounted successfully. But when umount
is called, the ASSERT btrfs_mark_buffer_dirty() will be triggered
because the only node in extent tree has 0 item and invalid owner. It's
solved by another commit
"btrfs: check extent buffer owner against the owner rootid".
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Whenever we do any extent buffer operations we call
assert_eb_page_uptodate() to complain loudly if we're operating on an
non-uptodate page. Our overnight tests caught this warning earlier this
week
WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
Call Trace:
extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1f6/0x470
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x26d/0x580
? process_one_work+0x580/0x580
worker_thread+0x55/0x3b0
? process_one_work+0x580/0x580
kthread+0xf0/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer
uptodate when we fail to write it"), however all that fix did was keep
us from finding extent buffers after a failed writeout. It didn't keep
us from continuing to use a buffer that we already had found.
In this case we're searching the commit root to cache the block group,
so we can start committing the transaction and switch the commit root
and then start writing. After the switch we can look up an extent
buffer that hasn't been written yet and start processing that block
group. Then we fail to write that block out and clear Uptodate on the
page, and then we start spewing these errors.
Normally we're protected by the tree lock to a certain degree here. If
we read a block we have that block read locked, and we block the writer
from locking the block before we submit it for the write. However this
isn't necessarily fool proof because the read could happen before we do
the submit_bio and after we locked and unlocked the extent buffer.
Also in this particular case we have path->skip_locking set, so that
won't save us here. We'll simply get a block that was valid when we
read it, but became invalid while we were using it.
What we really want is to catch the case where we've "read" a block but
it's not marked Uptodate. On read we ClearPageError(), so if we're
!Uptodate and !Error we know we didn't do the right thing for reading
the page.
Fix this by checking !Uptodate && !Error, this way we will not complain
if our buffer gets invalidated while we're using it, and we'll maintain
the spirit of the check which is to make sure we have a fully in-cache
block while we're messing with it.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When doing a full fsync, if we have prealloc extents beyond (or at) eof,
and the leaves that contain them were not modified in the current
transaction, we end up not logging them. This results in losing those
extents when we replay the log after a power failure, since the inode is
truncated to the current value of the logged i_size.
Just like for the fast fsync path, we need to always log all prealloc
extents starting at or beyond i_size. The fast fsync case was fixed in
commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay") but it missed the full fsync path. The problem
exists since the very early days, when the log tree was added by
commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
synchronous operations").
Example reproducer:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with many file extent items, so that they span
# several leaves of metadata, even if the node/page size is 64K. Use
# direct IO and not fsync/O_SYNC because it's both faster and it avoids
# clearing the full sync flag from the inode - we want the fsync below
# to trigger the slow full sync code path.
$ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
# Now add two preallocated extents to our file without extending the
# file's size. One right at i_size, and another further beyond, leaving
# a gap between the two prealloc extents.
$ xfs_io -c "falloc -k 16M 1M" /mnt/foo
$ xfs_io -c "falloc -k 20M 1M" /mnt/foo
# Make sure everything is durably persisted and the transaction is
# committed. This makes all created extents to have a generation lower
# than the generation of the transaction used by the next write and
# fsync.
sync
# Now overwrite only the first extent, which will result in modifying
# only the first leaf of metadata for our inode. Then fsync it. This
# fsync will use the slow code path (inode full sync bit is set) because
# it's the first fsync since the inode was created/loaded.
$ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
# Extent list before power failure.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x0
3: [32768..34815]: 2172928..2174975 2048 0x800
4: [34816..40959]: hole 6144
5: [40960..43007]: 2174976..2177023 2048 0x801
<power fail>
# Mount fs again, trigger log replay.
$ mount /dev/sdc /mnt
# Extent list after power failure and log replay.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x1
# The prealloc extents at file offsets 16M and 20M are missing.
So fix this by calling btrfs_log_prealloc_extents() when we are doing a
full fsync, so that we always log all prealloc extents beyond eof.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
[BUG]
When looping btrfs/074 with 64K page size and 4K sectorsize, there is a
low chance (1/50~1/100) to crash with the following ASSERT() triggered
in btrfs_subpage_start_writer():
ret = atomic_add_return(nbits, &subpage->writers);
ASSERT(ret == nbits); <<< This one <<<
[CAUSE]
With more debugging output on the parameters of
btrfs_subpage_start_writer(), it shows a very concerning error:
ret=29 nbits=13 start=393216 len=53248
For @nbits it's correct, but @ret which is the returned value from
atomic_add_return(), it's not only larger than nbits, but also larger
than max sectors per page value (for 64K page size and 4K sector size,
it's 16).
This indicates that some call sites are not properly decreasing the value.
And that's exactly the case, in btrfs_page_unlock_writer(), due to the
fact that we can have page locked either by lock_page() or
process_one_page(), we have to check if the subpage has any writer.
If no writers, it's locked by lock_page() and we only need to unlock it.
But unfortunately the check for the writers are completely opposite:
if (atomic_read(&subpage->writers))
/* No writers, locked by plain lock_page() */
return unlock_page(page);
We directly unlock the page if it has writers, which is the completely
opposite what we want.
Thankfully the affected call site is only limited to
extent_write_locked_range(), so it's mostly affecting compressed write.
[FIX]
Just fix the wrong check condition to fix the bug.
Fixes: e55a0de18572 ("btrfs: rework page locking in __extent_writepage()")
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull kvm fixes from Paolo Bonzini:
"x86 guest:
- Tweaks to the paravirtualization code, to avoid using them when
they're pointless or harmful
x86 host:
- Fix for SRCU lockdep splat
- Brown paper bag fix for the propagation of errno"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
KVM: x86: Yield to IPI target vCPU only if it is busy
x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
x86/kvm: Don't waste memory if kvmclock is disabled
x86/kvm: Don't use PV TLB/yield when mwait is advertised
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
places, namely vcpu_run and post_kvm_run_save, and a third is actually
needed around the call to vcpu->arch.complete_userspace_io to avoid
the following splat:
WARNING: suspicious RCU usage
arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by CPU 28/KVM/370841:
#0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
Call Trace:
<TASK>
dump_stack_lvl+0x59/0x73
reprogram_fixed_counter+0x15d/0x1a0 [kvm]
kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
? free_moved_vector+0x1b4/0x1e0
complete_fast_pio_in+0x8a/0xd0 [kvm]
This splat is not at all unexpected, since complete_userspace_io callbacks
can execute similar code to vmexits. For example, SVM with nrips=false
will call into the emulator from svm_skip_emulated_instruction().
While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
practically speaking there's no penalty to acquiring kvm->srcu "early"
as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU. On
the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
and sync_regs() can theoretically reach code that might access
SRCU-protected data structures, e.g. sync_regs() can trigger forced
existing of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().
Reported-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Just like on the optional mmu_alloc_direct_roots() path, once shadow
path reaches "r = -EIO" somewhere, the caller needs to know the actual
state in order to enter error handling and avoid something worse.
Fixes: 4a38162ee9f1 ("KVM: MMU: load PDPTRs outside mmu_lock")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220301124941.48412-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When sending a call-function IPI-many to vCPUs, yield to the
IPI target vCPU which is marked as preempted.
but when emulating HLT, an idling vCPU will be voluntarily
scheduled out and mark as preempted from the guest kernel
perspective. yielding to idle vCPU is pointless and increase
unnecessary vmexit, maybe miss the true preempted vCPU
so yield to IPI target vCPU only if vCPU is busy and preempted
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1644380201-29423-1-git-send-email-lirongqing@baidu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When Linux runs as an Isolated VM on Hyper-V, it supports AMD SEV-SNP
but it's partially enlightened, i.e. cc_platform_has(
CC_ATTR_GUEST_MEM_ENCRYPT) is true but sev_active() is false.
Commit 4d96f9109109 per se is good, but with it now
kvm_setup_vsyscall_timeinfo() -> kvmclock_init_mem() calls
set_memory_decrypted(), and later gets stuck when trying to zere out
the pages pointed by 'hvclock_mem', if Linux runs as an Isolated VM on
Hyper-V. The cause is that here now the Linux VM should no longer access
the original guest physical addrss (GPA); instead the VM should do
memremap() and access the original GPA + ms_hyperv.shared_gpa_boundary:
see the example code in drivers/hv/connection.c: vmbus_connect() or
drivers/hv/ring_buffer.c: hv_ringbuffer_init(). If the VM tries to
access the original GPA, it keepts getting injected a fault by Hyper-V
and gets stuck there.
Here the issue happens only when the VM has >=65 vCPUs, because the
global static array hv_clock_boot[] can hold 64 "struct
pvclock_vsyscall_time_info" (the sizeof of the struct is 64 bytes), so
kvmclock_init_mem() only allocates memory in the case of vCPUs > 64.
Since the 'hvclock_mem' pages are only useful when the kvm clock is
supported by the underlying hypervisor, fix the issue by returning
early when Linux VM runs on Hyper-V, which doesn't support kvm clock.
Fixes: 4d96f9109109 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()")
Tested-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Message-Id: <20220225084600.17817-1-decui@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Even if "no-kvmclock" is passed in cmdline parameter, the guest kernel
still allocates hvclock_mem which is scaled by the number of vCPUs,
let's check kvmclock enable in advance to avoid this memory waste.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1645520523-30814-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
MWAIT is advertised in host is not overcommitted scenario, however, PV
TLB/sched yield should be enabled in host overcommitted scenario. Let's
add the MWAIT checking when enabling PV TLB/sched yield.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1645777780-2581-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set.
Thanks to Murilo Opsfelder Araujo, and Erhard F"
* tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The following build failure occurs when CONFIG_PPC_64S_HASH_MMU is not
set:
arch/powerpc/kernel/setup_64.c: In function ‘setup_per_cpu_areas’:
arch/powerpc/kernel/setup_64.c:811:21: error: ‘mmu_linear_psize’ undeclared (first use in this function); did you mean ‘mmu_virtual_psize’?
811 | if (mmu_linear_psize == MMU_PAGE_4K)
| ^~~~~~~~~~~~~~~~
| mmu_virtual_psize
arch/powerpc/kernel/setup_64.c:811:21: note: each undeclared identifier is reported only once for each function it appears in
Move the declaration of mmu_linear_psize outside of
CONFIG_PPC_64S_HASH_MMU ifdef.
After the above is fixed, it fails later with the following error:
ld: arch/powerpc/kexec/file_load_64.o: in function `.arch_kexec_kernel_image_probe':
file_load_64.c:(.text+0x1c1c): undefined reference to `.add_htab_mem_range'
Fix that, too, by conditioning add_htab_mem_range() symbol to
CONFIG_PPC_64S_HASH_MMU.
Fixes: 387e220a2e5e ("powerpc/64s: Move hash MMU support code under CONFIG_PPC_64S_HASH_MMU")
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215567
Link: https://lore.kernel.org/r/20220301204743.45133-1-muriloo@linux.ibm.com
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix sorting on old "cpu" value in histograms
- Fix return value of __setup() boot parameter handlers
* tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix return value of __setup handlers
tracing/histogram: Fix sorting on old "cpu" value
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
__setup() handlers should generally return 1 to indicate that the
boot options have been handled.
Using invalid option values causes the entire kernel boot option
string to be reported as Unknown and added to init's environment
strings, polluting it.
Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc6
kprobe_event=p,syscall_any,$arg1 trace_options=quiet
trace_clock=jiffies", will be passed to user space.
Run /sbin/init as init process
with arguments:
/sbin/init
with environment:
HOME=/
TERM=linux
BOOT_IMAGE=/boot/bzImage-517rc6
kprobe_event=p,syscall_any,$arg1
trace_options=quiet
trace_clock=jiffies
Return 1 from the __setup() handlers so that init's environment is not
polluted with kernel boot options.
Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Link: https://lkml.kernel.org/r/20220303031744.32356-1-rdunlap@infradead.org
Cc: stable@vger.kernel.org
Fixes: 7bcfaf54f591 ("tracing: Add trace_options kernel command line parameter")
Fixes: e1e232ca6b8f ("tracing: Add trace_clock=<clock> kernel parameter")
Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When trying to add a histogram against an event with the "cpu" field, it
was impossible due to "cpu" being a keyword to key off of the running CPU.
So to fix this, it was changed to "common_cpu" to match the other generic
fields (like "common_pid"). But since some scripts used "cpu" for keying
off of the CPU (for events that did not have "cpu" as a field, which is
most of them), a backward compatibility trick was added such that if "cpu"
was used as a key, and the event did not have "cpu" as a field name, then
it would fallback and switch over to "common_cpu".
This fix has a couple of subtle bugs. One was that when switching over to
"common_cpu", it did not change the field name, it just set a flag. But
the code still found a "cpu" field. The "cpu" field is used for filtering
and is returned when the event does not have a "cpu" field.
This was found by:
# cd /sys/kernel/tracing
# echo hist:key=cpu,pid:sort=cpu > events/sched/sched_wakeup/trigger
# cat events/sched/sched_wakeup/hist
Which showed the histogram unsorted:
{ cpu: 19, pid: 1175 } hitcount: 1
{ cpu: 6, pid: 239 } hitcount: 2
{ cpu: 23, pid: 1186 } hitcount: 14
{ cpu: 12, pid: 249 } hitcount: 2
{ cpu: 3, pid: 994 } hitcount: 5
Instead of hard coding the "cpu" checks, take advantage of the fact that
trace_event_field_field() returns a special field for "cpu" and "CPU" if
the event does not have "cpu" as a field. This special field has the
"filter_type" of "FILTER_CPU". Check that to test if the returned field is
of the CPU type instead of doing the string compare.
Also, fix the sorting bug by testing for the hist_field flag of
HIST_FIELD_FL_CPU when setting up the sort routine. Otherwise it will use
the special CPU field to know what compare routine to use, and since that
special field does not have a size, it returns tracing_map_cmp_none.
Cc: stable@vger.kernel.org
Fixes: 1e3bac71c505 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Reported-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- a fixup for Goodix touchscreen driver allowing it to work on certain
Cherry Trail devices
- a fix for imbalanced enable/disable regulator in Elam touchpad driver
that became apparent when used with Asus TF103C 2-in-1 dock
- a couple new input keycodes used on newer keyboards
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
HID: add mapping for KEY_ALL_APPLICATIONS
HID: add mapping for KEY_DICTATE
Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
Input: goodix - workaround Cherry Trail devices with a bogus ACPI Interrupt() resource
Input: goodix - use the new soc_intel_is_byt() helper
Input: samsung-keypad - properly state IOMEM dependency
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
This patch adds a new key definition for KEY_ALL_APPLICATIONS
and aliases KEY_DASHBOARD to it.
It also maps the 0x0c/0x2a2 usage code to KEY_ALL_APPLICATIONS.
Signed-off-by: William Mahon <wmahon@chromium.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220303035618.1.I3a7746ad05d270161a18334ae06e3b6db1a1d339@changeid
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Numerous keyboards are adding dictate keys which allows for text
messages to be dictated by a microphone.
This patch adds a new key definition KEY_DICTATE and maps 0x0c/0x0d8
usage code to this new keycode. Additionally hid-debug is adjusted to
recognize this new usage code as well.
Signed-off-by: William Mahon <wmahon@chromium.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220303021501.1.I5dbf50eb1a7a6734ee727bda4a8573358c6d3ec0@changeid
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Before these changes elan_suspend() would only disable the regulator
when device_may_wakeup() returns false; whereas elan_resume() would
unconditionally enable it, leading to an enable count imbalance when
device_may_wakeup() returns true.
This triggers the "WARN_ON(regulator->enable_count)" in regulator_put()
when the elan_i2c driver gets unbound, this happens e.g. with the
hot-plugable dock with Elan I2C touchpad for the Asus TF103C 2-in-1.
Fix this by making the regulator_enable() call also be conditional
on device_may_wakeup() returning false.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20220131135436.29638-2-hdegoede@redhat.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
elan_disable_power() is called conditionally on suspend, where as
elan_enable_power() is always called on resume. This leads to
an imbalance in the regulator's enable count.
Move the regulator_[en|dis]able() calls out of elan_[en|dis]able_power()
in preparation of fixing this.
No functional changes intended.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20220131135436.29638-1-hdegoede@redhat.com
[dtor: consolidate elan_[en|dis]able() into elan_set_power()]
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Interrupt() resource
ACPI/x86 devices with a Cherry Trail SoC should have a GpioInt + a regular
GPIO ACPI resource in their ACPI tables.
Some CHT devices have a bug, where the also is bogus interrupt resource
(likely copied from a previous Bay Trail based generation of the device).
The i2c-core-acpi code will assign the bogus, non-working, interrupt
resource to client->irq. Add a workaround to fix this up.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2043960
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20220228111613.363336-1-hdegoede@redhat.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Use the new soc_intel_is_byt() helper from linux/platform_data/x86/soc.h.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20220131143539.109142-5-hdegoede@redhat.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
| |\ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Merge with mainline to get the Intel ASoC generic helpers header and
other changes.
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Make the samsung-keypad driver explicitly depend on CONFIG_HAS_IOMEM, as it
calls devm_ioremap(). This prevents compile errors in some configs (e.g,
allyesconfig/randconfig under UML):
/usr/bin/ld: drivers/input/keyboard/samsung-keypad.o: in function `samsung_keypad_probe':
samsung-keypad.c:(.text+0xc60): undefined reference to `devm_ioremap'
Signed-off-by: David Gow <davidgow@google.com>
Acked-by: anton ivanov <anton.ivanov@cambridgegreys.com>
Link: https://lore.kernel.org/r/20220225041727.1902850-1-davidgow@google.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|\ \ \ \ \ \ \
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
Merge misc fixes from Andrew Morton:
"8 patches.
Subsystems affected by this patch series: mm (hugetlb, pagemap, and
userfaultfd), memfd, selftests, and kconfig"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
configs/debug: set CONFIG_DEBUG_INFO=y properly
proc: fix documentation and description of pagemap
kselftest/vm: fix tests build with old libc
memfd: fix F_SEAL_WRITE after shmem huge page allocated
mm: fix use-after-free when anon vma name is used after vma is freed
mm: prevent vm_area_struct::anon_name refcount saturation
mm: refactor vm_area_struct::anon_vma_name usage code
selftests/vm: cleanup hugetlb file after mremap test
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
CONFIG_DEBUG_INFO can't be set by user directly, so set
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y instead.
Otherwise, we end up with no debuginfo in vmlinux which is a big no-no
for kernel debugging.
Link: https://lkml.kernel.org/r/20220301202920.18488-1-quic_qiancai@quicinc.com
Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
Since bit 57 was exported for uffd-wp write-protected (commit
fb8e37f35a2f: "mm/pagemap: export uffd-wp protection information"),
fixing it can reduce some unnecessary confusion.
Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com
Fixes: fb8e37f35a2fe1 ("mm/pagemap: export uffd-wp protection information")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Cc: Florian Schmidt <florian.schmidt@nutanix.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Colin Cross <ccross@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
The error message when I build vm tests on debian10 (GLIBC 2.28):
userfaultfd.c: In function `userfaultfd_pagemap_test':
userfaultfd.c:1393:37: error: `MADV_PAGEOUT' undeclared (first use
in this function); did you mean `MADV_RANDOM'?
if (madvise(area_dst, test_pgsize, MADV_PAGEOUT))
^~~~~~~~~~~~
MADV_RANDOM
This patch includes these newer definitions from UAPI linux/mman.h, is
useful to fix tests build on systems without these definitions in glibc
sys/mman.h.
Link: https://lkml.kernel.org/r/20220227055330.43087-2-zhouchengming@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
Wangyong reports: after enabling tmpfs filesystem to support transparent
hugepage with the following command:
echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
the docker program tries to add F_SEAL_WRITE through the following
command, but it fails unexpectedly with errno EBUSY:
fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
That is because memfd_tag_pins() and memfd_wait_for_pins() were never
updated for shmem huge pages: checking page_mapcount() against
page_count() is hopeless on THP subpages - they need to check
total_mapcount() against page_count() on THP heads only.
Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins()
(compared != 1): either can be justified, but given the non-atomic
total_mapcount() calculation, it is better now to be strict. Bear in
mind that total_mapcount() itself scans all of the THP subpages, when
choosing to take an XA_CHECK_SCHED latency break.
Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
page has been swapped out since memfd_tag_pins(), then its refcount must
have fallen, and so it can safely be untagged.
Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reported-by: wangyong <wang.yong12@zte.com.cn>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: CGEL ZTE <cgel.zte@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
When adjacent vmas are being merged it can result in the vma that was
originally passed to madvise_update_vma being destroyed. In the current
implementation, the name parameter passed to madvise_update_vma points
directly to vma->anon_name and it is used after the call to vma_merge.
In the cases when vma_merge merges the original vma and destroys it,
this might result in UAF. For that the original vma would have to hold
the anon_vma_name with the last reference. The following vma would need
to contain a different anon_vma_name object with the same string. Such
scenario is shown below:
madvise_vma_behavior(vma)
madvise_update_vma(vma, ..., anon_name == vma->anon_name)
vma_merge(vma)
__vma_adjust(vma) <-- merges vma with adjacent one
vm_area_free(vma) <-- frees the original vma
replace_vma_anon_name(anon_name) <-- UAF of vma->anon_name
Fix this by raising the name refcount and stabilizing it.
Link: https://lkml.kernel.org/r/20220224231834.1481408-3-surenb@google.com
Link: https://lkml.kernel.org/r/20220223153613.835563-3-surenb@google.com
Fixes: 9a10064f5625 ("mm: add a field to store names for private anonymous memory")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+aa7b3d4b35f9dc46a366@syzkaller.appspotmail.com
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
A deep process chain with many vmas could grow really high. With
default sysctl_max_map_count (64k) and default pid_max (32k) the max
number of vmas in the system is 2147450880 and the refcounter has
headroom of 1073774592 before it reaches REFCOUNT_SATURATED
(3221225472).
Therefore it's unlikely that an anonymous name refcounter will overflow
with these defaults. Currently the max for pid_max is PID_MAX_LIMIT
(4194304) and for sysctl_max_map_count it's INT_MAX (2147483647). In
this configuration anon_vma_name refcount overflow becomes theoretically
possible (that still require heavy sharing of that anon_vma_name between
processes).
kref refcounting interface used in anon_vma_name structure will detect a
counter overflow when it reaches REFCOUNT_SATURATED value but will only
generate a warning and freeze the ref counter. This would lead to the
refcounted object never being freed. A determined attacker could leak
memory like that but it would be rather expensive and inefficient way to
do so.
To ensure anon_vma_name refcount does not overflow, stop anon_vma_name
sharing when the refcount reaches REFCOUNT_MAX (2147483647), which still
leaves INT_MAX/2 (1073741823) values before the counter reaches
REFCOUNT_SATURATED. This should provide enough headroom for raising the
refcounts temporarily.
Link: https://lkml.kernel.org/r/20220223153613.835563-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
Avoid mixing strings and their anon_vma_name referenced pointers by
using struct anon_vma_name whenever possible. This simplifies the code
and allows easier sharing of anon_vma_name structures when they
represent the same name.
[surenb@google.com: fix comment]
Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com
Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Colin Cross <ccross@google.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
The hugepage-mremap test will create a file in a hugetlb filesystem. In
a default 'run_vmtests' run, the file will contain all the hugetlb
pages. After the test, the file remains and there are no free hugetlb
pages for subsequent tests. This causes those hugetlb tests to fail.
Change hugepage-mremap to take the name of the hugetlb file as an
argument. Unlink the file within the test, and just to be sure remove
the file in the run_vmtests script.
Link: https://lkml.kernel.org/r/20220201033459.156944-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \ \ \ \ \ \
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix HAVE_DYNAMIC_FTRACE_WITH_ARGS implementation by providing correct
switching between ftrace_caller/ftrace_regs_caller and supplying
pt_regs only when ftrace_regs_caller is activated.
- Fix exception table sorting.
- Fix breakage of kdump tooling by preserving metadata it cannot
function without.
* tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/extable: fix exception table sorting
s390/ftrace: fix arch_ftrace_get_regs implementation
s390/ftrace: fix ftrace_caller/ftrace_regs_caller generation
s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
s390 has a swap_ex_entry_fixup function, however it is not being used
since common code expects a swap_ex_entry_fixup define. If it is not
defined the default implementation will be used. So fix this by adding
a proper define.
However also the implementation of the function must be fixed, since a
NULL value for handler has a special meaning and must not be adjusted.
Luckily all of this doesn't fix a real bug currently: the main extable
is correctly sorted during build time, and for runtime sorting there
is currently no case where the handler field is not NULL.
Fixes: 05a68e892e89 ("s390/kernel: expand exception table logic to allow new handling options")
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
arch_ftrace_get_regs is supposed to return a struct pt_regs pointer
only if the pt_regs structure contains all register contents, which
means it must have been populated when created via ftrace_regs_caller.
If it was populated via ftrace_caller the contents are not complete
(the psw mask part is missing), and therefore a NULL pointer needs be
returned.
The current code incorrectly always returns a struct pt_regs pointer.
Fix this by adding another pt_regs flag which indicates if the
contents are complete, and fix arch_ftrace_get_regs accordingly.
Fixes: 894979689d3a ("s390/ftrace: provide separate ftrace_caller/ftrace_regs_caller implementations")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reported-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
ftrace_caller was used for both ftrace_caller and ftrace_regs_caller,
which means that the target address of the hotpatch trampoline was
never updated.
With commit 894979689d3a ("s390/ftrace: provide separate
ftrace_caller/ftrace_regs_caller implementations") a separate
ftrace_regs_caller entry point was implemeted, however it was
forgotten to implement the necessary changes for ftrace_modify_call
and ftrace_make_call, where the branch target has to be modified
accordingly.
Therefore add the missing code now.
Fixes: 894979689d3a ("s390/ftrace: provide separate ftrace_caller/ftrace_regs_caller implementations")
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
We need to preserve the values at OLDMEM_BASE and OLDMEM_SIZE which are
used by zgetdump in case when kdump crashes. In that case zgetdump will
attempt to read OLDMEM_BASE and OLDMEM_SIZE in order to find out where
the memory range [0 - OLDMEM_SIZE] belonging to the production kernel is.
Fixes: f1a546947431 ("s390/setup: don't reserve memory that occupied decompressor's head")
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|\ \ \ \ \ \ \ \ \
| |_|/ / / / / / /
|/| | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
Pull block fix from Jens Axboe:
"Just a small UAF fix for blktrace"
* tag 'block-5.17-2022-03-04' of git://git.kernel.dk/linux-block:
blktrace: fix use after free for struct blk_trace
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
When tracing the whole disk, 'dropped' and 'msg' will be created
under 'q->debugfs_dir' and 'bt->dir' is NULL, thus blk_trace_free()
won't remove those files. What's worse, the following UAF can be
triggered because of accessing stale 'dropped' and 'msg':
==================================================================
BUG: KASAN: use-after-free in blk_dropped_read+0x89/0x100
Read of size 4 at addr ffff88816912f3d8 by task blktrace/1188
CPU: 27 PID: 1188 Comm: blktrace Not tainted 5.17.0-rc4-next-20220217+ #469
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-4
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_address_description.constprop.0.cold+0xab/0x381
? blk_dropped_read+0x89/0x100
? blk_dropped_read+0x89/0x100
kasan_report.cold+0x83/0xdf
? blk_dropped_read+0x89/0x100
kasan_check_range+0x140/0x1b0
blk_dropped_read+0x89/0x100
? blk_create_buf_file_callback+0x20/0x20
? kmem_cache_free+0xa1/0x500
? do_sys_openat2+0x258/0x460
full_proxy_read+0x8f/0xc0
vfs_read+0xc6/0x260
ksys_read+0xb9/0x150
? vfs_write+0x3d0/0x3d0
? fpregs_assert_state_consistent+0x55/0x60
? exit_to_user_mode_prepare+0x39/0x1e0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fbc080d92fd
Code: ce 20 00 00 75 10 b8 00 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 1
RSP: 002b:00007fbb95ff9cb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00007fbb95ff9dc0 RCX: 00007fbc080d92fd
RDX: 0000000000000100 RSI: 00007fbb95ff9cc0 RDI: 0000000000000045
RBP: 0000000000000045 R08: 0000000000406299 R09: 00000000fffffffd
R10: 000000000153afa0 R11: 0000000000000293 R12: 00007fbb780008c0
R13: 00007fbb78000938 R14: 0000000000608b30 R15: 00007fbb780029c8
</TASK>
Allocated by task 1050:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x81/0xa0
do_blk_trace_setup+0xcb/0x410
__blk_trace_setup+0xac/0x130
blk_trace_ioctl+0xe9/0x1c0
blkdev_ioctl+0xf1/0x390
__x64_sys_ioctl+0xa5/0xe0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 1050:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_set_free_info+0x20/0x30
__kasan_slab_free+0x103/0x180
kfree+0x9a/0x4c0
__blk_trace_remove+0x53/0x70
blk_trace_ioctl+0x199/0x1c0
blkdev_common_ioctl+0x5e9/0xb30
blkdev_ioctl+0x1a5/0x390
__x64_sys_ioctl+0xa5/0xe0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88816912f380
which belongs to the cache kmalloc-96 of size 96
The buggy address is located 88 bytes inside of
96-byte region [ffff88816912f380, ffff88816912f3e0)
The buggy address belongs to the page:
page:000000009a1b4e7c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0f
flags: 0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0000200 ffffea00044f1100 dead000000000002 ffff88810004c780
raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88816912f280: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff88816912f300: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff88816912f380: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff88816912f400: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff88816912f480: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
==================================================================
Fixes: c0ea57608b69 ("blktrace: remove debugfs file dentries from struct blk_trace")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20220228034354.4047385-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\ \ \ \ \ \ \ \ \
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- Fixes for a handful of KASAN-related crashes.
- A fix to avoid a crash during boot for SPARSEMEM &&
!SPARSEMEM_VMEMMAP configurations.
- A fix to stop reporting some incorrect errors under DEBUG_VIRTUAL.
- A fix for the K210's device tree to properly populate the interrupt
map, so hart1 will get interrupts again.
* tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: dts: k210: fix broken IRQs on hart1
riscv: Fix kasan pud population
riscv: Move high_memory initialization to setup_bootmem
riscv: Fix config KASAN && DEBUG_VIRTUAL
riscv: Fix DEBUG_VIRTUAL false warnings
riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
riscv: Fix is_linear_mapping with recent move of KASAN region
|