| Commit message (Collapse) | Author | Files | Lines |
|
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.
The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:
CPU1 CPU2
cgwb_bdi_unregister()
cgwb_kill(*slot);
cgwb_release()
queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
wb = list_first_entry(&bdi->wb_list, ...)
spin_unlock_irq(&cgwb_lock);
wb_shutdown(wb);
...
kfree_rcu(wb, rcu);
wb_shutdown(wb); -> oops use-after-free
We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
way we also no longer need synchronization using WB_shutting_down as the
mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
bdi_unregister().
Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove dependencies on HAS_DMA where a Kconfig symbol depends on another
symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST".
In most cases this other symbol is an architecture or platform specific
symbol, or PCI.
Generic symbols and drivers without platform dependencies keep their
dependencies on HAS_DMA, to prevent compiling subsystems or drivers that
cannot work anyway.
This simplifies the dependencies, and allows to improve compile-testing.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When rewriting swapper using nG mappings, we must performance cache
maintenance around each page table access in order to avoid coherency
problems with the host's cacheable alias under KVM. To ensure correct
ordering of the maintenance with respect to Device memory accesses made
with the Stage-1 MMU disabled, DMBs need to be added between the
maintenance and the corresponding memory access.
This patch adds a missing DMB between writing a new page table entry and
performing a clean+invalidate on the same line.
Fixes: f992b4dfd58b ("arm64: kpti: Add ->enable callback to remap swapper using nG mappings")
Cc: <stable@vger.kernel.org> # 4.16.x-
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
We inspect __kpti_forced early on as part of the cpufeature enable
callback which remaps the swapper page table using non-global entries.
Ensure that __kpti_forced has been updated to reflect the kpti=
command-line option before we start using it.
Fixes: ea1e3de85e94 ("arm64: entry: Add fake CPU feature for unmapping the kernel at EL0")
Cc: <stable@vger.kernel.org> # 4.16.x-
Reported-by: Wei Xu <xuwei5@hisilicon.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Wei Xu <xuwei5@hisilicon.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
For the common cases where 1000 is a multiple of HZ, or HZ is a multiple of
1000, jiffies_to_msecs() never returns zero when passed a non-zero time
period.
However, if HZ > 1000 and not an integer multiple of 1000 (e.g. 1024 or
1200, as used on alpha and DECstation), jiffies_to_msecs() may return zero
for small non-zero time periods. This may break code that relies on
receiving back a non-zero value.
jiffies_to_usecs() does not need such a fix: one jiffy can only be less
than one µs if HZ > 1000000, and such large values of HZ are already
rejected at build time, twice:
- include/linux/jiffies.h does #error if HZ >= 12288,
- kernel/time/time.c has BUILD_BUG_ON(HZ > USEC_PER_SEC).
Broken since forever.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: linux-alpha@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180622143357.7495-1-geert@linux-m68k.org
|
|
KVM_CAP_HYPERV_TLBFLUSH collided with KVM_CAP_S390_PSW-BPB, its paragraph
number should now be 8.18.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
This patch extends the checks done prior to a nested VM entry.
Specifically, it extends the check_vmentry_prereqs function with checks
for fields relevant to the VM-entry event injection information, as
described in the Intel SDM, volume 3.
This patch is motivated by a syzkaller bug, where a bad VM-entry
interruption information field is generated in the VMCS02, which causes
the nested VM launch to fail. Then, KVM fails to resume L1.
While KVM should be improved to correctly resume L1 execution after a
failed nested launch, this change is justified because the existing code
to resume L1 is flaky/ad-hoc and the test coverage for resuming L1 is
sparse.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Marc Orr <marcorr@google.com>
[Removed comment whose parts were describing previous revisions and the
rest was obvious from function/variable naming. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Function irq_desc_get_msi_desc() is not referenced in the kernel (and does
not seem to have been referenced since e39758e0ea76, 3 years ago), so
delete it.
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <marc.zyngier@arm.com>
Cc: <will.deacon@arm.com>
Cc: <kstewart@linuxfoundation.org>
Cc: <julien.thierry@arm.com>
Cc: <andrew@lunn.ch>
Cc: <trivial@kernel.org>
Link: https://lkml.kernel.org/r/1529667333-92959-1-git-send-email-john.garry@huawei.com
|
|
Enabling LPIs was made a lot stricter recently, by checking that they are
disabled before enabling them. By doing so, the CPU hotplug case was missed
altogether, which leaves LPIs enabled on hotplug off (expecting the CPU to
eventually come back), and won't write a different value anyway on hotplug
on.
So skip that check if that particular case is detected
Fixes: 6eb486b66a30 ("irqchip/gic-v3: Ensure GICR_CTLR.EnableLPI=0 is observed before enabling")
Reported-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sumit Garg <sumit.garg@linaro.org>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lkml.kernel.org/r/20180622095254.5906-8-marc.zyngier@arm.com
|
|
Similarily to the SYNC operation, it must be verified that the VPE
targetted by a VLPI is backed by a valid collection in the GIC driver data
structures.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Cc: Sumit Garg <sumit.garg@linaro.org>
Link: https://lkml.kernel.org/r/20180622095254.5906-7-marc.zyngier@arm.com
|
|
It is possible, under obscure circumstances, to convince the ITS driver to
emit a SYNC operation that targets a collection that is not bound to any
redistributor (and the target_address field is zero) because the
corresponding CPU has not been seen yet (the system has been booted with
max_cpus="something small").
If the ITS is using the linear CPU number as the target, this is not a big
deal, as we just end-up issuing a SYNC to CPU0. But if the ITS requires the
physical address of the redistributor (with GITS_TYPER.PTA==1), we end-up
asking the ITS to write to the physical address zero, which is not exactly
a good idea (there has been report of the ITS locking up). This should of
course never happen, but hey, this is SW...
In order to avoid the above disaster, let's track which collections have
been actually initialized, and let's not generate a SYNC if the collection
hasn't been properly bound to a redistributor. Take this opportunity to
spit our a warning, in the hope that someone may report the issue if it
arrises again.
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Sumit Garg <sumit.garg@linaro.org>
Link: https://lkml.kernel.org/r/20180622095254.5906-6-marc.zyngier@arm.com
|
|
On a NUMA system, if an ITS is local to an offline node, the ITS driver may
pick an offline CPU to bind the LPI. In this case, pick an online CPU (and
the first one will do).
But on some systems, binding an LPI to non-local node CPU may cause
deadlock (see Cavium erratum 23144). In this case, just fail the activate
and return an error code.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180622095254.5906-5-marc.zyngier@arm.com
|
|
On failing to allocate the required SPIs, the actual number of interrupts
should be freed and not its log2 value.
Fixes: de337ee30142 ("irqchip/gic-v2m: Add PCI Multi-MSI support")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Cc: Sumit Garg <sumit.garg@linaro.org>
Link: https://lkml.kernel.org/r/20180622095254.5906-4-marc.zyngier@arm.com
|
|
The ls-scfs-msi driver is not dealing with the effective affinity
as it should. Let's fix that, and make it clear that the effective
affinity is restricted to a single CPU. Also prevent the driver from
messing with the internals of the affinity setting infrastructure.
Reported-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Cc: Sumit Garg <sumit.garg@linaro.org>
Link: https://lkml.kernel.org/r/20180622095254.5906-3-marc.zyngier@arm.com
|
|
Debug is missing the IRQCHIP_SUPPORTS_LEVEL_MSI debug entry, making debugfs
slightly less useful.
Take this opportunity to also add a missing comment in the definition of
IRQCHIP_SUPPORTS_LEVEL_MSI.
Fixes: 6988e0e0d283 ("genirq/msi: Limit level-triggered MSI to platform devices")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Cc: Sumit Garg <sumit.garg@linaro.org>
Link: https://lkml.kernel.org/r/20180622095254.5906-2-marc.zyngier@arm.com
|
|
Commit 910f8befdf5b ("xen/pirq: fix error path cleanup when binding
MSIs") fixed a couple of errors in error cleanup path of
xen_bind_pirq_msi_to_irq(). This cleanup allowed a call to
__unbind_from_irq() with an unbound irq, which would result in
triggering the BUG_ON there.
Since there is really no reason for the BUG_ON (xen_free_irq() can
operate on unbound irqs) we can remove it.
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
For passing arbitrary data from user land to the Xen hypervisor the
Xen tools today are using mlock()ed buffers. Unfortunately the kernel
might change access rights of such buffers for brief periods of time
e.g. for page migration or compaction, leading to access faults in the
hypervisor, as the hypervisor can't use the locks of the kernel.
In order to solve this problem add a new device node to the Xen privcmd
driver to easily allocate hypercall buffers via mmap(). The memory is
allocated in the kernel and just mapped into user space. Marked as
VM_IO the user mapping will not be subject to page migration et al.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Non gcc-5 builds with CONFIG_STACK_VALIDATION=y and
SKIP_STACK_VALIDATION=1 fail.
Example output:
/bin/sh: init/.tmp_main.o: Permission denied
commit 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace"),
added a mismatched endif. This causes cmd_objtool to get mistakenly
set.
Relocate endif to balance the newly added -record-mcount check.
Link: http://lkml.kernel.org/r/20180608214746.136554-1-gthelen@google.com
Fixes: 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace")
Acked-by: Andi Kleen <ak@linux.intel.com>
Tested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Fix typos, inconsistencies in using quotes, incorrect section number,
etc. in the trace histogram documentation.
Link: http://lkml.kernel.org/r/20180614224859.55864-1-joel@joelfernandes.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Make use of the swap macro and remove unnecessary variable _buf_.
This makes the code easier to read and maintain. Also, reduces the
stack usage.
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20180209175316.GA18720@embeddedgus
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
I'm able to reproduce a lockdep splat with config options:
CONFIG_PROVE_LOCKING=y,
CONFIG_DEBUG_LOCK_ALLOC=y and
CONFIG_PREEMPTIRQ_EVENTS=y
$ echo 1 > /d/tracing/events/preemptirq/preempt_enable/enable
[ 26.112609] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
[ 26.112636] WARNING: CPU: 0 PID: 118 at kernel/locking/lockdep.c:3854
[...]
[ 26.144229] Call Trace:
[ 26.144926] <IRQ>
[ 26.145506] lock_acquire+0x55/0x1b0
[ 26.146499] ? __do_softirq+0x46f/0x4d9
[ 26.147571] ? __do_softirq+0x46f/0x4d9
[ 26.148646] trace_preempt_on+0x8f/0x240
[ 26.149744] ? trace_preempt_on+0x4d/0x240
[ 26.150862] ? __do_softirq+0x46f/0x4d9
[ 26.151930] preempt_count_sub+0x18a/0x1a0
[ 26.152985] __do_softirq+0x46f/0x4d9
[ 26.153937] irq_exit+0x68/0xe0
[ 26.154755] smp_apic_timer_interrupt+0x271/0x280
[ 26.156056] apic_timer_interrupt+0xf/0x20
[ 26.157105] </IRQ>
The issue was this:
preempt_count = 1 << SOFTIRQ_SHIFT
__local_bh_enable(cnt = 1 << SOFTIRQ_SHIFT) {
if (softirq_count() == (cnt && SOFTIRQ_MASK)) {
trace_softirqs_on() {
current->softirqs_enabled = 1;
}
}
preempt_count_sub(cnt) {
trace_preempt_on() {
tracepoint() {
rcu_read_lock_sched() {
// jumps into lockdep
Where preempt_count still has softirqs disabled, but
current->softirqs_enabled is true, and we get a splat.
Link: http://lkml.kernel.org/r/20180607201143.247775-1-joel@joelfernandes.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Glexiner <tglx@linutronix.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Erick Reyes <erickreyes@google.com>
Cc: Julia Cartwright <julia@ni.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: stable@vger.kernel.org
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fixes: d59158162e032 ("tracing: Add support for preempt and irq enable/disable events")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The syzkaller detected a out-of-bounds issue with the events filter code,
specifically here:
prog[N].pred = NULL; /* #13 */
prog[N].target = 1; /* TRUE */
prog[N+1].pred = NULL;
prog[N+1].target = 0; /* FALSE */
-> prog[N-1].target = N;
prog[N-1].when_to_branch = false;
As that's the first reference to a "N-1" index, it appears that the code got
here with N = 0, which means the filter parser found no filter to parse
(which shouldn't ever happen, but apparently it did).
Add a new error to the parsing code that will check to make sure that N is
not zero before going into this part of the code. If N = 0, then -EINVAL is
returned, and a error message is added to the filter.
Cc: stable@vger.kernel.org
Fixes: 80765597bc587 ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: air icy <icytxw@gmail.com>
bugzilla url: https://bugzilla.kernel.org/show_bug.cgi?id=200019
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
nvme requires an sg table allocation for each request. If the request
is large, then the allocation can become quite large. For instance,
with our default software settings of 1280KB IO size, we'll need
10248 bytes of sg table. That turns into a 2nd order allocation,
which we can't always guarantee. If we fail the allocation, blk-mq
will retry it later. But there's no guarantee that we'll EVER be
able to allocate that much contigious memory.
Limit the IO size such that we never need more than a single page
of memory. That's a lot faster and more reliable. Then back that
allocation with a mempool, so that we know we'll always be able
to succeed the allocation at some point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
There is very little point in trying to support the 32bit KVM/arm API
on arm64, and this was never an anticipated use case.
Let's make it clear by not selecting KVM_COMPAT.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
The current behaviour of the compat ioctls is a bit odd.
We provide a compat_ioctl method when KVM_COMPAT is set, and NULL
otherwise. But NULL means that the normal, non-compat ioctl should
be used directly for compat tasks, and there is no way to actually
prevent a compat task from issueing KVM ioctls.
This patch changes this behaviour, by always registering a compat_ioctl
method, even if KVM_COMPAT is not selected. In that case, the callback
will always return -EINVAL.
Fixes: de8e5d744051568c8aad ("KVM: Disable compat ioctl for s390")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
There is race between nvme_remove and nvme_reset_work that can
lead to io hang.
nvme_remove nvme_reset_work
-> nvme_remove_dead_ctrl
-> nvme_dev_disable
-> quiesce request_queue
-> queue remove_work
-> cancel_work_sync reset_work
-> nvme_remove_namespaces
-> splice ctrl->namespaces
nvme_remove_dead_ctrl_work
-> nvme_kill_queues
-> nvme_ns_remove do nothing
-> blk_cleanup_queue
-> blk_freeze_queue
Finally, the request_queue is quiesced state when wait freeze,
we will get io hang here. To fix it, move the nvme_kill_queues
from nvme_remove_dead_ctrl_work to nvme_remove_dead_ctrl.
Suggested-by: Keith Busch <keith.busch@linux.intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
unmap_stage2_range
There is a panic in armv8a server(QDF2400) under memory pressure tests
(start 20 guests and run memhog in the host).
---------------------------------begin--------------------------------
[35380.800950] BUG: Bad page state in process qemu-kvm pfn:dd0b6
[35380.805825] page:ffff7fe003742d80 count:-4871 mapcount:-2126053375
mapping: (null) index:0x0
[35380.815024] flags: 0x1fffc00000000000()
[35380.818845] raw: 1fffc00000000000 0000000000000000 0000000000000000
ffffecf981470000
[35380.826569] raw: dead000000000100 dead000000000200 ffff8017c001c000
0000000000000000
[35380.805825] page:ffff7fe003742d80 count:-4871 mapcount:-2126053375
mapping: (null) index:0x0
[35380.815024] flags: 0x1fffc00000000000()
[35380.818845] raw: 1fffc00000000000 0000000000000000 0000000000000000
ffffecf981470000
[35380.826569] raw: dead000000000100 dead000000000200 ffff8017c001c000
0000000000000000
[35380.834294] page dumped because: nonzero _refcount
[...]
--------------------------------end--------------------------------------
The root cause might be what was fixed at [1]. But from the KVM points of
view, it would be better if the issue was caught earlier.
If the size is not PAGE_SIZE aligned, unmap_stage2_range might unmap the
wrong(more or less) page range. Hence it caused the "BUG: Bad page
state"
Let's WARN in that case, so that the issue is obvious.
[1] https://lkml.org/lkml/2018/5/3/1042
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: jia.he@hxt-semitech.com
[maz: tidied up commit message]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Currently packed pixel modes in MHL2 can't be displayed. The device
automatically recognizes output format, so setting format other than
RGB causes failure. Fix it by writing proper values to registers.
Tested on MHL1 and MHL2 using various vendors' dongles both in
DVI and HDMI mode.
Signed-off-by: Maciej Purski <m.purski@samsung.com>
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1516706239-9104-1-git-send-email-m.purski@samsung.com
|
|
Commit e6b673b ("KVM: arm64: Optimise FPSIMD handling to reduce
guest/host thrashing") uses fpsimd_save() to save the FPSIMD state
for a vcpu when scheduling the vcpu out. However, currently
current's value of TIF_SVE is restored before calling fpsimd_save()
which means that fpsimd_save() may erroneously attempt to save SVE
state from the vcpu. This enables current's vector state to be
polluted with guest data. current->thread.sve_state may be
unallocated or not large enough, so this can also trigger a NULL
dereference or buffer overrun.
Instead of this, TIF_SVE should be configured properly for the
guest when calling fpsimd_save() with the vcpu context loaded.
This patch ensures this by delaying restoration of current's
TIF_SVE until after the call to fpsimd_save().
Fixes: e6b673b741ea ("KVM: arm64: Optimise FPSIMD handling to reduce guest/host thrashing")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Commit e6b673b ("KVM: arm64: Optimise FPSIMD handling to reduce
guest/host thrashing") attempts to restore the configuration of
userspace SVE trapping via a call to fpsimd_bind_task_to_cpu(), but
the logic for determining when to do this is not correct.
The patch makes the errnoenous assumption that the only task that
may try to enter userspace with the currently loaded FPSIMD/SVE
register content is current. This may not be the case however: if
some other user task T is scheduled on the CPU during the execution
of the KVM run loop, and the vcpu does not try to use the registers
in the meantime, then T's state may be left there intact. If T
happens to be the next task to enter userspace on this CPU then the
hooks for reloading the register state and configuring traps will
be skipped.
(Also, current never has SVE state at this point anyway and should
always have the trap enabled, as a side-effect of the ioctl()
syscall needed to reach the KVM run loop in the first place.)
This patch instead restores the state of the EL0 trap from the
state observed at the most recent vcpu_load(), ensuring that the
trap is set correctly for the loaded context (if any).
Fixes: e6b673b741ea ("KVM: arm64: Optimise FPSIMD handling to reduce guest/host thrashing")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Commit e6b673b ("KVM: arm64: Optimise FPSIMD handling to reduce
guest/host thrashing") introduces a specific helper
kvm_arch_vcpu_put_fp() for saving the vcpu FPSIMD state during
vcpu_put().
This function uses local_bh_disable()/_enable() to protect the
FPSIMD context manipulation from interruption by softirqs.
This approach is not correct, because vcpu_put() can be invoked
either from the KVM host vcpu thread (when exiting the vcpu run
loop), or via a preempt notifier. In the former case, only
preemption is disabled. In the latter case, the function is called
from inside __schedule(), which means that IRQs are disabled.
Use of local_bh_disable()/_enable() with IRQs disabled is considerd
an error, resulting in lockdep splats while running VMs if lockdep
is enabled.
This patch disables IRQs instead of attempting to disable softirqs,
avoiding the problem of calling local_bh_enable() with IRQs
disabled in the __schedule() path. This creates an additional
interrupt blackout during vcpu run loop exit, but this is the rare
case and the blackout latency is still less than that of
__schedule().
Fixes: e6b673b741ea ("KVM: arm64: Optimise FPSIMD handling to reduce guest/host thrashing")
Reported-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Currently we have a couple of helpers to manipulate bits in particular
sysregs:
* config_sctlr_el1(u32 clear, u32 set)
* change_cpacr(u64 val, u64 mask)
The parameters of these differ in naming convention, order, and size,
which is unfortunate. They also differ slightly in behaviour, as
change_cpacr() skips the sysreg write if the bits are unchanged, which
is a useful optimization when sysreg writes are expensive.
Before we gain yet another sysreg manipulation function, let's
unify these with a common helper, providing a consistent order for
clear/set operands, and the write skipping behaviour from
change_cpacr(). Code will be migrated to the new helper in subsequent
patches.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Dave Martin <dave.martin@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
When booting a 64 KB pages kernel on a ACPI GICv3 system that
implements support for v2 emulation, the following warning is
produced
GICV size 0x2000 not a multiple of page size 0x10000
and support for v2 emulation is disabled, preventing GICv2 VMs
from being able to run on such hosts.
The reason is that vgic_v3_probe() performs a sanity check on the
size of the window (it should be a multiple of the page size),
while the ACPI MADT parsing code hardcodes the size of the window
to 8 KB. This makes sense, considering that ACPI does not bother
to describe the size in the first place, under the assumption that
platforms implementing ACPI will follow the architecture and not
put anything else in the same 64 KB window.
So let's just drop the sanity check altogether, and assume that
the window is at least 64 KB in size.
Fixes: 909777324588 ("KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Rather than leaving io queues quiesced after tearing down an association,
restart them. This allows ios to be replayed, with fastfail ios terminating
and non-fastfail getting into loops of retry.
This follows rdma's lead.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimber.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
If NBD_DISCONNECT_ON_CLOSE is set on a device, then the driver will
issue a disconnect from nbd_release if the device has no remaining
bdev->bd_openers.
Fix ret val so reconfigure with only setting the flag succeeds.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fix a typo in the intel_pstate admin-guide documentation.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
resp->num is the number of tokens in resp->tok[]. It gets set in
response_parse(). So if n == resp->num then we're reading beyond the
end of the data.
Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Document the missing command line tokens in the help() function.
Signed-off-by: Nathan Ciobanu <nathan.d.ciobanu@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Improve the help() output by adding the single character
tokens (e.g -a).
Signed-off-by: Nathan Ciobanu <nathan.d.ciobanu@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Sort the command line arguments output of help() in
alphabetical order in line with other linux tools.
Signed-off-by: Nathan Ciobanu <nathan.d.ciobanu@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Running turbostat on machines that don't expose nodes
in sysfs (no /sys/bus/node) causes a segfault or a -nan
value diesplayed in the log. This is caused by
physical_node_id being reported as -1 and logical_node_id
being calculated as a negative number resulting in the new
GET_THREAD/GET_CORE returning an incorrect address.
Signed-off-by: Nathan Ciobanu <nathan.d.ciobanu@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Add APIC and X2APIC columns to the topology section.
They are disabled-by-default -- enable like so:
--debug
or
--enable APIC,X2APIC
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
eg. the "HT" here:
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM HT TM
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
The --show and --hide options failed on "Node", which was listed as "Node%".
The --show and --hide options were generally fouled-up do due to come
content merges that scrambled the list of column name indexes.
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
If rq_state == ARRAY_SIZE() then we read one element beyond the end of
the blk_mq_rq_state_name_array[] array.
Fixes: ec6dcf63c55c ("blk-mq-debugfs: Show more request state information")
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Controllers that are not yet enabled should not really enforce keep alive
timeouts, but we still want to track a timeout and cleanup in case a host
died before it enabled the controller. Hence, simply reset the keep
alive timer when the controller is enabled.
Suggested-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
That is user argument, and theoretically controller limits can change
over time (over reconnects/resets). Instead, use the sqsize controller
attribute to check queue depth boundaries and use it to the tagset
allocation.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The race is between completing the request at error recovery work and
rdma completions. If we cancel the request before getting the good
rdma completion we get a NULL deref of the request MR at
nvme_rdma_process_nvme_rsp().
When Canceling the request we return its mr to the mr pool (set mr to
NULL) and also unmap its data. Canceling the requests while the rdma
queues are active is not safe. Because rdma queues are active and we
get good rdma completions that can use the mr pointer which may be NULL.
Completing the request too soon may lead also to performing DMA to/from
user buffers which might have been already unmapped.
The commit fixes the race by draining the QP before starting the abort
commands mechanism.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
If nvme_rdma_configure_admin_queue fails before we allocated
the async event buffer, we will falsly free it because
nvme_rdma_free_queue is freeing it. Fix it by allocating the buffer right
after nvme_rdma_alloc_queue and free it right before nvme_rdma_queue_free
to maintain orderly reverse cleanup sequence.
Reported-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|