| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
This numbers can be analized by system automation similar to errors_count.
In ideal world it would be nice to have separate counters for different
log-levels, but this makes this patch too intrusive.
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/20200725123313.4467-1-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
| |
ext4 update feature functions do not exist now, remove these useless
function declarations.
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200724032954.22097-1-luoshijie1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently there is a problem with mount options that can be both set by
vfs using mount flags or by a string parsing in ext4.
i_version/iversion options gets lost after remount, for example
$ mount -o i_version /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,seclabel,i_version
$ mount -o remount,ro /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
nolazytime gets ignored by ext4 on remount, for example
$ mount -o lazytime /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
$ mount -o remount,nolazytime /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
Fix it by applying the SB_LAZYTIME and SB_I_VERSION flags from *flags to
s_flags before we parse the option and use the resulting state of the
same flags in *flags at the end of successful remount.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200723150526.19931-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
| |
Remove unnecessary blank.
Signed-off-by: Xianting Tian <xianting_tian@126.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1595077057-8048-1-git-send-email-xianting_tian@126.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For file systems where we can afford to keep the buddy bitmaps cached,
we can speed up initial writes to large file systems by starting to
load the block allocation bitmaps as soon as the file system is
mounted. This won't work well for _super_ large file systems, or
memory constrained systems, so we only enable this when it is
requested via a mount option.
Addresses-Google-Bug: 159488342
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
|
|
|
|
|
|
|
| |
Modify the ext4_read_block_bitmap_load tracepoint so that it tells us
whether a block bitmap is being prefetched.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
|
|
|
|
|
|
|
|
|
|
| |
Parameter gfp_mask in jbd2_journal_try_to_free_buffers() is no longer
used after commit <536fc240e7147> ("jbd2: clean up
jbd2_journal_try_to_free_buffers()"), so just remove it.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200620025427.1756360-6-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we free a metadata buffer which has been failed to async write out
in the background, the jbd2 checkpoint procedure will not detect this
failure in jbd2_log_do_checkpoint(), so it may lead to filesystem
inconsistency after cleanup journal tail. This patch abort the journal
if free a buffer has write_io_error flag to prevent potential further
inconsistency.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200620025427.1756360-5-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is a risk of filesystem inconsistency if we failed to async write
back metadata buffer in the background. Because of current buffer's end
io procedure is handled by end_buffer_async_write() in the block layer,
and it only clear the buffer's uptodate flag and mark the write_io_error
flag, so ext4 cannot detect such failure immediately. In most cases of
getting metadata buffer (e.g. ext4_read_inode_bitmap()), although the
buffer's data is actually uptodate, it may still read data from disk
because the buffer's uptodate flag has been cleared. Finally, it may
lead to on-disk filesystem inconsistency if reading old data from the
disk successfully and write them out again.
This patch detect bdev mapping->wb_err when getting journal's write
access and mark the filesystem error if bdev's mapping->wb_err was
increased, this could prevent further writing and potential
inconsistency.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200620025427.1756360-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
cr=0 is supposed to be an optimization to save CPU cycles, but if
buddy data (in memory) is not initialized then all this makes no sense
as we have to do sync IO taking a lot of cycles. Also, at cr=0
mballoc doesn't choose any available chunk. cr=1 also skips groups
using heuristic based on avg. fragment size. It's more useful to skip
such groups and switch to cr=2 where groups will be scanned for
available chunks. However, we always read the first block group in a
flex_bg so metadata blocks will get read into the first flex_bg if
possible.
Using sparse image and dm-slow virtual device of 120TB was
simulated, then the image was formatted and filled using debugfs to
mark ~85% of available space as busy. mount process w/o the patch
couldn't complete in half an hour (according to vmstat it would take
~10-11 hours). With the patch applied mount took ~20 seconds.
Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This should significantly improve bitmap loading, especially for flex
groups as it tries to load all bitmaps within a flex.group instead of
one by one synchronously.
Prefetching is done in 8 * flex_bg groups, so it should be 8
read-ahead reads for a single allocating thread. At the end of
allocation the thread waits for read-ahead completion and initializes
buddy information so that read-aheads are not lost in case of memory
pressure.
At cr=0 the number of prefetching IOs is limited per allocation
context to prevent a situation when mballoc loads thousands of bitmaps
looking for a perfect group and ignoring groups with good chunks.
Together with the patch "ext4: limit scanning of uninitialized groups"
the mount time (which includes few tiny allocations) of a 1PB
filesystem is reduced significantly:
0% full 50%-full unpatched patched
mount time 33s 9279s 563s
[ Restructured by tytso; removed the state flags in the allocation
context, so it can be used to lazily prefetch the allocation bitmaps
immediately after the file system is mounted. Skip prefetching
block groups which are uninitialized. Finally pass in the
REQ_RAHEAD flag to the block layer while prefetching. ]
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ext4 uses blkdev_get_by_dev() to get the block_device for journal device
which does check to see if the read-only block device was opened
read-only.
As a result ext4 will hapily proceed mounting the file system with
external journal on read-only device. This is bad as we would not be
able to use the journal leading to errors later on.
Instead of simply failing to mount file system in this case, treat it in
a similar way we treat internal journal on read-only device. Allow to
mount with -o noload in read-only mode.
This can be reproduced easily like this:
mke2fs -F -O journal_dev $JOURNAL_DEV 100M
mkfs.$FSTYPE -F -J device=$JOURNAL_DEV $FS_DEV
blockdev --setro $JOURNAL_DEV
mount $FS_DEV $MNT
touch $MNT/file
umount $MNT
leading to error like this
[ 1307.318713] ------------[ cut here ]------------
[ 1307.323362] generic_make_request: Trying to write to read-only block-device dm-2 (partno 0)
[ 1307.331741] WARNING: CPU: 36 PID: 3224 at block/blk-core.c:855 generic_make_request_checks+0x2c3/0x580
[ 1307.341041] Modules linked in: ext4 mbcache jbd2 rfkill intel_rapl_msr intel_rapl_common isst_if_commd
[ 1307.419445] CPU: 36 PID: 3224 Comm: jbd2/dm-2 Tainted: G W I 5.8.0-rc5 #2
[ 1307.427359] Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 2.3.10 08/15/2019
[ 1307.434932] RIP: 0010:generic_make_request_checks+0x2c3/0x580
[ 1307.440676] Code: 94 03 00 00 48 89 df 48 8d 74 24 08 c6 05 cf 2b 18 01 01 e8 7f a4 ff ff 48 c7 c7 50e
[ 1307.459420] RSP: 0018:ffffc0d70eb5fb48 EFLAGS: 00010286
[ 1307.464646] RAX: 0000000000000000 RBX: ffff9b33b2978300 RCX: 0000000000000000
[ 1307.471780] RDX: ffff9b33e12a81e0 RSI: ffff9b33e1298000 RDI: ffff9b33e1298000
[ 1307.478913] RBP: ffff9b7b9679e0c0 R08: 0000000000000837 R09: 0000000000000024
[ 1307.486044] R10: 0000000000000000 R11: ffffc0d70eb5f9f0 R12: 0000000000000400
[ 1307.493177] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[ 1307.500308] FS: 0000000000000000(0000) GS:ffff9b33e1280000(0000) knlGS:0000000000000000
[ 1307.508396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1307.514142] CR2: 000055eaf4109000 CR3: 0000003dee40a006 CR4: 00000000007606e0
[ 1307.521273] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1307.528407] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1307.535538] PKRU: 55555554
[ 1307.538250] Call Trace:
[ 1307.540708] generic_make_request+0x30/0x340
[ 1307.544985] submit_bio+0x43/0x190
[ 1307.548393] ? bio_add_page+0x62/0x90
[ 1307.552068] submit_bh_wbc+0x16a/0x190
[ 1307.555833] jbd2_write_superblock+0xec/0x200 [jbd2]
[ 1307.560803] jbd2_journal_update_sb_log_tail+0x65/0xc0 [jbd2]
[ 1307.566557] jbd2_journal_commit_transaction+0x2ae/0x1860 [jbd2]
[ 1307.572566] ? check_preempt_curr+0x7a/0x90
[ 1307.576756] ? update_curr+0xe1/0x1d0
[ 1307.580421] ? account_entity_dequeue+0x7b/0xb0
[ 1307.584955] ? newidle_balance+0x231/0x3d0
[ 1307.589056] ? __switch_to_asm+0x42/0x70
[ 1307.592986] ? __switch_to_asm+0x36/0x70
[ 1307.596918] ? lock_timer_base+0x67/0x80
[ 1307.600851] kjournald2+0xbd/0x270 [jbd2]
[ 1307.604873] ? finish_wait+0x80/0x80
[ 1307.608460] ? commit_timeout+0x10/0x10 [jbd2]
[ 1307.612915] kthread+0x114/0x130
[ 1307.616152] ? kthread_park+0x80/0x80
[ 1307.619816] ret_from_fork+0x22/0x30
[ 1307.623400] ---[ end trace 27490236265b1630 ]---
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200717090605.2612-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
| |
Fix spelling typos in ext4_mb_initialize_context.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/883b523c-58ec-7f38-0bb8-cd2ea4393684@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Don't define EXT4_IOC_* aliases to ioctls that already have a generic
FS_IOC_* name. These aliases are unnecessary, and they make it unclear
which ioctls are ext4-specific and which are generic.
Exception: leave EXT4_IOC_GETVERSION_OLD and EXT4_IOC_SETVERSION_OLD
as-is for now, since renaming them to FS_IOC_GETVERSION and
FS_IOC_SETVERSION would probably make them more likely to be confused
with EXT4_IOC_GETVERSION and EXT4_IOC_SETVERSION which also exist.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200714230909.56349-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Define the EXT4_FL_USER_* constants by OR-ing together the appropriate
flags, rather than hard-coding a numeric value. This makes it much
easier to see which flags are listed.
No change in the actual values.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200713031012.192440-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A customer has reported a BUG_ON in ext4_clear_journal_err() hitting
during an LTP testing. Either this has been caused by a test setup
issue where the filesystem was being overwritten while LTP was mounting
it or the journal replay has overwritten the superblock with invalid
data. In either case it is preferable we don't take the machine down
with a BUG_ON. So handle the situation of unexpectedly missing
has_journal feature more gracefully. We issue warning and fail the mount
in the cases where the race window is narrow and the failed check is
most likely a programming error. In cases where fs corruption is more
likely, we do full ext4_error() handling before failing mount / remount.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200710140759.18031-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since commit 378f32bab371 ("ext4: introduce direct I/O write using iomap
infrastructure") we don't properly bail out of RWF_NOWAIT direct IO
write if underlying blocks are not allocated. Also
ext4_dio_write_checks() does not honor RWF_NOWAIT when re-acquiring
i_rwsem. Fix both issues.
Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Cc: stable@kernel.org
Reported-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200708153516.9507-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Link: https://lore.kernel.org/r/20200706190339.20709-1-grandmaster@al2klimov.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
| |
If dquot_initialize() return non-zero and trace of ext4_unlink_enter/exit
enabled then the matching-pair of trace_exit will lost in log.
Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200629122621.129953-1-zhuangyi1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
| |
It should call trace exit in all return path for ext4_truncate.
Signed-off-by: zhengliang <zhengliang6@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200701083027.45996-1-zhengliang6@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
jbd2_write_superblock()
jbd2_write_superblock() is under the buffer lock of journal superblock
before ending that superblock write, so add a missing unlock_buffer() in
in the error path before submitting buffer.
Fixes: 742b06b5628f ("jbd2: check superblock mapped prior to committing")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200620061948.2049579-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If for any reason a directory passed to do_split() does not have enough
active entries to exceed half the size of the block, we can end up
iterating over all "count" entries without finding a split point.
In this case, count == move, and split will be zero, and we will
attempt a negative index into map[].
Guard against this by detecting this case, and falling back to
split-to-half-of-count instead; in this case we will still have
plenty of space (> half blocksize) in each split block.
Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Callers of __jbd2_journal_unfile_buffer() and
__jbd2_journal_refile_buffer() assume that the b_transaction is set. In
fact if it's not, we can end up with journal_head refcounting errors
leading to crash much later that might be very hard to track down. Add
asserts to make sure that is the case.
We also make sure that b_next_transaction is NULL in
__jbd2_journal_unfile_buffer() since the callers expect that as well and
we should not get into that stage in this state anyway, leading to
problems later on if we do.
Tested with fstests.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200617092549.6712-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
| |
Fixed a few coding style issues in file.c
Signed-off-by: Dio Putra <dioput12@gmail.com>
Link: https://lore.kernel.org/r/239fcd8f-d33f-8621-9e82-0416dd3f9c94@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The brelse() function tests whether its argument is NULL
and then returns immediately.
Thus remove the tests which are not needed around the shown calls.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/0d713702-072f-a89c-20ec-ca70aa83a432@web.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix spelling issues over the comments in the code.
requsted ==> requested
deterimined ==> determined
insde ==> inside
neet ==> need
somthing ==> something
Signed-off-by: Keyur Patel <iamkeyur96@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200611031947.165079-1-iamkeyur96@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Using a mutex for "print this warning only once" is so overdesigned as
to be actively offensive to my sensitive stomach.
Just use "pr_info_once()" that already does this, although in a
(harmlessly) racy manner that can in theory cause the message to be
printed twice if more than one CPU races on that "is this the first
time" test.
[ If somebody really cares about that harmless data race (which sounds
very unlikely indeed), that person can trivially fix printk_once() by
using a simple atomic access, preferably with an optimistic non-atomic
test first before even bothering to treat the pointless "make sure it
is _really_ just once" case.
A mutex is most definitely never the right primitive to use for
something like this. ]
Yes, this is a small and meaningless detail in a code path that hardly
matters. But let's keep some code quality standards here, and not
accept outrageously bad code.
Link: https://lore.kernel.org/lkml/CAHk-=wgV9toS7GU3KmNpj8hCS9SeF+A0voHS8F275_mgLhL4Lw@mail.gmail.com/
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A series of fixes for x86:
- Reset MXCSR in kernel_fpu_begin() to prevent using a stale user
space value.
- Prevent writing MSR_TEST_CTRL on CPUs which are not explicitly
whitelisted for split lock detection. Some CPUs which do not
support it crash even when the MSR is written to 0 which is the
default value.
- Fix the XEN PV fallout of the entry code rework
- Fix the 32bit fallout of the entry code rework
- Add more selftests to ensure that these entry problems don't come
back.
- Disable 16 bit segments on XEN PV. It's not supported because XEN
PV does not implement ESPFIX64"
* tag 'x86-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ldt: Disable 16-bit segments on Xen PV
x86/entry/32: Fix #MC and #DB wiring on x86_32
x86/entry/xen: Route #DB correctly on Xen PV
x86/entry, selftests: Further improve user entry sanity checks
x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
selftests/x86: Consolidate and fix get/set_eflags() helpers
selftests/x86/syscall_nt: Clear weird flags after each test
selftests/x86/syscall_nt: Add more flag combinations
x86/entry/64/compat: Fix Xen PV SYSENTER frame setup
x86/entry: Move SYSENTER's regs->sp and regs->flags fixups into C
x86/entry: Assert that syscalls are on the right stack
x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelisted
x86/fpu: Reset MXCSR to default in kernel_fpu_begin()
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Xen PV doesn't implement ESPFIX64, so they don't work right. Disable
them. Also print a warning the first time anyone tries to use a
16-bit segment on a Xen PV guest that would otherwise allow it
to help people diagnose this change in behavior.
This gets us closer to having all x86 selftests pass on Xen PV.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/92b2975459dfe5929ecf34c3896ad920bd9e3f2d.1593795633.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
DEFINE_IDTENTRY_MCE and DEFINE_IDTENTRY_DEBUG were wired up as non-RAW
on x86_32, but the code expected them to be RAW.
Get rid of all the macro indirection for them on 32-bit and just use
DECLARE_IDTENTRY_RAW and DEFINE_IDTENTRY_RAW directly.
Also add a warning to make sure that we only hit the _kernel paths
in kernel mode.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/9e90a7ee8e72fd757db6d92e1e5ff16339c1ecf9.1593795633.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
On Xen PV, #DB doesn't use IST. It still needs to be correctly routed
depending on whether it came from user or kernel mode.
Get rid of DECLARE/DEFINE_IDTENTRY_XEN -- it was too hard to follow the
logic. Instead, route #DB and NMI through DECLARE/DEFINE_IDTENTRY_RAW on
Xen, and do the right thing for #DB. Also add more warnings to the
exc_debug* handlers to make this type of failure more obvious.
This fixes various forms of corruption that happen when usermode
triggers #DB on Xen PV.
Fixes: 4c0dcd8350a0 ("x86/entry: Implement user mode C entry points for #DB and #MCE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/4163e733cce0b41658e252c6c6b3464f33fdff17.1593795633.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Chasing down a Xen bug caused me to realize that the new entry sanity
checks are still fairly weak. Add some more checks.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/881de09e786ab93ce56ee4a2437ba2c308afe7a9.1593795633.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Move the clearing of the high bits of RAX after Xen PV joins the SYSENTER
path so that Xen PV doesn't skip it.
Arguably this code should be deleted instead, but that would belong in the
merge window.
Fixes: ffae641f5747 ("x86/entry/64/compat: Fix Xen PV SYSENTER frame setup")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/9d33b3f3216dcab008070f1c28b6091ae7199969.1593795633.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There are several copies of get_eflags() and set_eflags() and they all are
buggy. Consolidate them and fix them. The fixes are:
Add memory clobbers. These are probably unnecessary but they make sure
that the compiler doesn't move something past one of these calls when it
shouldn't.
Respect the redzone on x86_64. There has no failure been observed related
to this, but it's definitely a bug.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/982ce58ae8dea2f1e57093ee894760e35267e751.1593191971.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Clear the weird flags before logging to improve strace output --
logging results while, say, TF is set does no one any favors.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/907bfa5a42d4475b8245e18b67a04b13ca51ffdb.1593191971.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add EFLAGS.AC to the mix.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/12924e2fe2c5826568b7fc9436d85ca7f5eb1743.1593191971.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The SYSENTER frame setup was nonsense. It worked by accident because the
normal code into which the Xen asm jumped (entry_SYSENTER_32/compat) threw
away SP without touching the stack. entry_SYSENTER_compat was recently
modified such that it relied on having a valid stack pointer, so now the
Xen asm needs to invoke it with a valid stack.
Fix it up like SYSCALL: use the Xen-provided frame and skip the bare
metal prologue.
Fixes: 1c3e5d3f60e2 ("x86/entry: Make entry_64_compat.S objtool clean")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lkml.kernel.org/r/947880c41ade688ff4836f665d0c9fcaa9bd1201.1593191971.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The SYSENTER asm (32-bit and compat) contains fixups for regs->sp and
regs->flags. Move the fixups into C and fix some comments while at it.
This is a valid cleanup all by itself, and it also simplifies the
subsequent patch that will fix Xen PV SYSENTER.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/fe62bef67eda7fac75b8f3dbafccf571dc4ece6b.1593191971.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Now that the entry stack is a full page, it's too easy to regress the
system call entry code and end up on the wrong stack without noticing.
Assert that all system calls (SYSCALL64, SYSCALL32, SYSENTER, and INT80)
are on the right stack and have pt_regs in the right place.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/52059e42bb0ab8551153d012d68f7be18d72ff8e.1593191971.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Choo! Choo! All aboard the Split Lock Express, with direct service to
Wreckage!
Skip split_lock_verify_msr() if the CPU isn't whitelisted as a possible
SLD-enabled CPU model to avoid writing MSR_TEST_CTRL. MSR_TEST_CTRL
exists, and is writable, on many generations of CPUs. Writing the MSR,
even with '0', can result in bizarre, undocumented behavior.
This fixes a crash on Haswell when resuming from suspend with a live KVM
guest. Because APs use the standard SMP boot flow for resume, they will
go through split_lock_init() and the subsequent RDMSR/WRMSR sequence,
which runs even when sld_state==sld_off to ensure SLD is disabled. On
Haswell (at least, my Haswell), writing MSR_TEST_CTRL with '0' will
succeed and _may_ take the SMT _sibling_ out of VMX root mode.
When KVM has an active guest, KVM performs VMXON as part of CPU onlining
(see kvm_starting_cpu()). Because SMP boot is serialized, the resulting
flow is effectively:
on_each_ap_cpu() {
WRMSR(MSR_TEST_CTRL, 0)
VMXON
}
As a result, the WRMSR can disable VMX on a different CPU that has
already done VMXON. This ultimately results in a #UD on VMPTRLD when
KVM regains control and attempt run its vCPUs.
The above voodoo was confirmed by reworking KVM's VMXON flow to write
MSR_TEST_CTRL prior to VMXON, and to serialize the sequence as above.
Further verification of the insanity was done by redoing VMXON on all
APs after the initial WRMSR->VMXON sequence. The additional VMXON,
which should VM-Fail, occasionally succeeded, and also eliminated the
unexpected #UD on VMPTRLD.
The damage done by writing MSR_TEST_CTRL doesn't appear to be limited
to VMX, e.g. after suspend with an active KVM guest, subsequent reboots
almost always hang (even when fudging VMXON), a #UD on a random Jcc was
observed, suspend/resume stability is qualitatively poor, and so on and
so forth.
kernel BUG at arch/x86/kvm/x86.c:386!
CPU: 1 PID: 2592 Comm: CPU 6/KVM Tainted: G D
Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
RIP: 0010:kvm_spurious_fault+0xf/0x20
Call Trace:
vmx_vcpu_load_vmcs+0x1fb/0x2b0
vmx_vcpu_load+0x3e/0x160
kvm_arch_vcpu_load+0x48/0x260
finish_task_switch+0x140/0x260
__schedule+0x460/0x720
_cond_resched+0x2d/0x40
kvm_arch_vcpu_ioctl_run+0x82e/0x1ca0
kvm_vcpu_ioctl+0x363/0x5c0
ksys_ioctl+0x88/0xa0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x4c/0x170
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: dbaba47085b0c ("x86/split_lock: Rework the initialization flow of split lock detection")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200605192605.7439-1-sean.j.christopherson@intel.com
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Previously, kernel floating point code would run with the MXCSR control
register value last set by userland code by the thread that was active
on the CPU core just before kernel call. This could affect calculation
results if rounding mode was changed, or a crash if a FPU/SIMD exception
was unmasked.
Restore MXCSR to the kernel's default value.
[ bp: Carve out from a bigger patch by Petteri, add feature check, add
FNINIT call too (amluto). ]
Signed-off-by: Petteri Aimonen <jpa@git.mail.kapsi.fi>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=207979
Link: https://lkml.kernel.org/r/20200624114646.28953-2-bp@alien8.de
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of interrupt chip driver fixes:
- Ensure the atomicity of affinity updates in the GIC driver
- Don't try to sleep in atomic context when waiting for the GICv4.1
to respond. Use polling instead.
- Typo fixes in Kconfig and warnings"
* tag 'irq-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic: Atomically update affinity
irqchip/riscv-intc: Fix a typo in a pr_warn()
irqchip/gic-v4.1: Use readx_poll_timeout_atomic() to fix sleep in atomic
irqchip/loongson-pci-msi: Fix a typo in Kconfig
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent
Pull irqchip fixes from Marc Zyngier:
- Fix atomicity of affinity update in the GIC driver
- Don't sleep in atomic when waiting for a GICv4.1 RD to respond
- Fix a couple of typos in user-visible messages
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The GIC driver uses a RMW sequence to update the affinity, and
relies on the gic_lock_irqsave/gic_unlock_irqrestore sequences
to update it atomically.
But these sequences only expand into anything meaningful if
the BL_SWITCHER option is selected, which almost never happens.
It also turns out that using a RMW and locks is just as silly,
as the GIC distributor supports byte accesses for the GICD_TARGETRn
registers, which when used make the update atomic by definition.
Drop the terminally broken code and replace it by a byte write.
Fixes: 04c8b0f82c7d ("irqchip/gic: Make locking a BL_SWITCHER only feature")
Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Anup originally re-spun his patch set to include this fix, but it was a bit too
late for my PR so I've split it out.
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200611175302.253540-1-palmer@dabbelt.com
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
readx_poll_timeout() can sleep if @sleep_us is specified by the caller,
and is therefore unsafe to be used inside the atomic context, which is
this case when we use it to poll the GICR_VPENDBASER.Dirty bit in
irq_set_vcpu_affinity() callback.
Let's convert to its atomic version instead which helps to get the v4.1
board back to life!
Fixes: 96806229ca03 ("irqchip/gic-v4.1: Add support for VPENDBASER's Dirty+Valid signaling")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200605052345.1494-1-yuzenghui@huawei.com
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
PCH MSI driver's menuconfig entry was wrong. Fix it.
Fixes: 632dcc2c75ef ("irqchip: Add Loongson PCH MSI controller")
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200530121113.1797678-2-jiaxun.yang@flygoat.com
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rcu fixlet from Thomas Gleixner:
"A single fix for a printk format warning in RCU"
* tag 'core-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcuperf: Fix printk format warning
|
| |\ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Pull RCU fix from Paul E. McKenney.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|