| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- tm: Block signal return from setting invalid MSR state from Michael
Neuling
- tm: Check for already reclaimed tasks from Michael Neuling
* tag 'powerpc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/tm: Check for already reclaimed tasks
powerpc/tm: Block signal return setting invalid MSR state
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently we can hit a scenario where we'll tm_reclaim() twice. This
results in a TM bad thing exception because the second reclaim occurs
when not in suspend mode.
The scenario in which this can happen is the following. We attempt to
deliver a signal to userspace. To do this we need obtain the stack
pointer to write the signal context. To get this stack pointer we
must tm_reclaim() in case we need to use the checkpointed stack
pointer (see get_tm_stackpointer()). Normally we'd then return
directly to userspace to deliver the signal without going through
__switch_to().
Unfortunatley, if at this point we get an error (such as a bad
userspace stack pointer), we need to exit the process. The exit will
result in a __switch_to(). __switch_to() will attempt to save the
process state which results in another tm_reclaim(). This
tm_reclaim() now causes a TM Bad Thing exception as this state has
already been saved and the processor is no longer in TM suspend mode.
Whee!
This patch checks the state of the MSR to ensure we are TM suspended
before we attempt the tm_reclaim(). If we've already saved the state
away, we should no longer be in TM suspend mode. This has the
additional advantage of checking for a potential TM Bad Thing
exception.
Found using syscall fuzzer.
Fixes: fb09692e71f1 ("powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes")
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently we allow both the MSR T and S bits to be set by userspace on
a signal return. Unfortunately this is a reserved configuration and
will cause a TM Bad Thing exception if attempted (via rfid).
This patch checks for this case in both the 32 and 64 bit signals
code. If both T and S are set, we mark the context as invalid.
Found using a syscall fuzzer.
Fixes: 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context")
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"A couple of fixes for sendfile lockups caught by Dmitry + a fix for
ancient sysvfs symlink breakage"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Avoid softlockups with sendfile(2)
vfs: Make sendfile(2) killable even better
fix sysvfs symlinks
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The following test program from Dmitry can cause softlockups or RCU
stalls as it copies 1GB from tmpfs into eventfd and we don't have any
scheduling point at that path in sendfile(2) implementation:
int r1 = eventfd(0, 0);
int r2 = memfd_create("", 0);
unsigned long n = 1<<30;
fallocate(r2, 0, 0, n);
sendfile(r1, r2, 0, n);
Add cond_resched() into __splice_from_pipe() to fix the problem.
CC: Dmitry Vyukov <dvyukov@google.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit 296291cdd162 (mm: make sendfile(2) killable) fixed an issue where
sendfile(2) was doing a lot of tiny writes into a filesystem and thus
was unkillable for a long time. However sendfile(2) can be (mis)used to
issue lots of writes into arbitrary file descriptor such as evenfd or
similar special file descriptors which never hit the standard filesystem
write path and thus are still unkillable. E.g. the following example
from Dmitry burns CPU for ~16s on my test system without possibility to
be killed:
int r1 = eventfd(0, 0);
int r2 = memfd_create("", 0);
unsigned long n = 1<<30;
fallocate(r2, 0, 0, n);
sendfile(r1, r2, 0, n);
There are actually quite a few tests for pending signals in sendfile
code however we data to write is always available none of them seems to
trigger. So fix the problem by adding a test for pending signal into
splice_from_pipe_next() also before the loop waiting for pipe buffers to
be available. This should fix all the lockup issues with sendfile of the
do-ton-of-tiny-writes nature.
CC: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The thing got broken back in 2002 - sysvfs does *not* have inline
symlinks; even short ones have bodies stored in the first block
of file. sysv_symlink() handles that correctly; unfortunately,
attempting to look an existing symlink up will end up confusing
them for inline symlinks, and interpret the block number containing
the body as the body itself.
Nobody has noticed until now, which says something about the level
of testing sysvfs gets ;-/
Cc: stable@vger.kernel.org # all of them, not that anyone cared
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull more block layer fixes from Jens Axboe:
"I wasn't going to send off a new pull before next week, but the blk
flush fix from Jan from the other day introduced a regression. It's
rare enough not to have hit during testing, since it requires both a
device that rejects the first flush, and bad timing while it does
that. But since someone did hit it, let's get the revert into 4.4-rc3
so we don't have a released rc with that known issue.
Apart from that revert, three other fixes:
- From Christoph, a fix for a missing unmap in NVMe request
preparation.
- An NVMe fix from Nishanth that fixes data corruption on powerpc.
- Also from Christoph, fix a list_del() attempt on blk-mq that didn't
have a matching list_add() at timer start"
* 'for-linus' of git://git.kernel.dk/linux-block:
Revert "blk-flush: Queue through IO scheduler when flush not required"
block: fix blk_abort_request for blk-mq drivers
nvme: add missing unmaps in nvme_queue_rq
NVMe: default to 4k device page size
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This reverts commit 1b2ff19e6a957b1ef0f365ad331b608af80e932e.
Jan writes:
--
Thanks for report! After some investigation I found out we allocate
elevator specific data in __get_request() only for non-flush requests. And
this is actually required since the flush machinery uses the space in
struct request for something else. Doh. So my patch is just wrong and not
easy to fix since at the time __get_request() is called we are not sure
whether the flush machinery will be used in the end. Jens, please revert
1b2ff19e6a957b1ef0f365ad331b608af80e932e. Thanks!
I'm somewhat surprised that you can reliably hit the race where flushing
gets disabled for the device just while the request is in flight. But I
guess during boot it makes some sense.
--
So let's just revert it, we can fix the queue run manually after the
fact. This race is rare enough that it didn't trigger in testing, it
requires the specific disable-while-in-flight scenario to trigger.
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We only added the request to the request list for the !blk-mq case,
so we should only delete it in that case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When we fail various metadata related operations in nvme_queue_rq we
need to unmap the data SGL.
Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).
The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).
In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures and in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.
With the functionally equivalent v3 of this patch, our hardware test
exerciser survives when using 32-bit DMA; without the patch, the kernel
will BUG within a few minutes.
Signed-off-by: Nishanth Aravamudan <nacc at linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|\ \ \
| |/ /
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull KVM fixes from Paolo Bonzini:
"Bug fixes for all architectures. Nothing really stands out"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: nVMX: remove incorrect vpid check in nested invvpid emulation
arm64: kvm: report original PAR_EL1 upon panic
arm64: kvm: avoid %p in __kvm_hyp_panic
KVM: arm/arm64: vgic: Trust the LR state for HW IRQs
KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active
KVM: arm/arm64: Fix preemptible timer active state crazyness
arm64: KVM: Add workaround for Cortex-A57 erratum 834220
arm64: KVM: Fix AArch32 to AArch64 register mapping
ARM/arm64: KVM: test properly for a PTE's uncachedness
KVM: s390: fix wrong lookup of VCPUs by array index
KVM: s390: avoid memory overwrites on emergency signal injection
KVM: Provide function for VCPU lookup by id
KVM: s390: fix pfmf intercept handler
KVM: s390: enable SIMD only when no VCPUs were created
KVM: x86: request interrupt window when IRQ chip is split
KVM: x86: set KVM_REQ_EVENT on local interrupt request from user space
KVM: x86: split kvm_vcpu_ready_for_interrupt_injection out of dm_request_for_irq_injection
KVM: x86: fix interrupt window handling in split IRQ chip case
MIPS: KVM: Uninit VCPU in vcpu_create error path
MIPS: KVM: Fix CACHE immediate offset sign extension
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch removes the vpid check when emulating nested invvpid
instruction of type all-contexts invalidation. The existing code is
incorrect because:
(1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
Translations Based on VPID", invvpid instruction does not check
vpid in the invvpid descriptor when its type is all-contexts
invalidation.
(2) According to the same document, invvpid of type all-contexts
invalidation does not require there is an active VMCS, so/and
get_vmcs12() in the existing code may result in a NULL-pointer
dereference. In practice, it can crash both KVM itself and L1
hypervisors that use invvpid (e.g. Xen).
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM Fixes for v4.4-rc3.
Includes some timer fixes, properly unmapping PTEs, an errata fix, and two
tweaks to the EL2 panic code.
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If we call __kvm_hyp_panic while a guest context is active, we call
__restore_sysregs before acquiring the system register values for the
panic, in the process throwing away the PAR_EL1 value at the point of
the panic.
This patch modifies __kvm_hyp_panic to stash the PAR_EL1 value prior to
restoring host register values, enabling us to report the original
values at the point of the panic.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Currently __kvm_hyp_panic uses %p for values which are not pointers,
such as the ESR value. This can confusingly lead to "(null)" being
printed for the value.
Use %x instead, and only use %p for host pointers.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We were probing the physial distributor state for the active state of a
HW virtual IRQ, because we had seen evidence that the LR state was not
cleared when the guest deactivated a virtual interrupted.
However, this issue turned out to be a software bug in the GIC, which
was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29
(KVM: arm/arm64: arch_timer: Preserve physical dist. active
state on LR.active, 2015-11-24)
Therefore, get rid of the complexities and just look at the LR.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We were incorrectly removing the active state from the physical
distributor on the timer interrupt when the timer output level was
deasserted. We shouldn't be doing this without considering the virtual
interrupt's active state, because the architecture requires that when an
LR has the HW bit set and the pending or active bits set, then the
physical interrupt must also have the corresponding bits set.
This addresses an issue where we have been observing an inconsistency
between the LR state and the physical distributor state where the LR
state was active and the physical distributor was not active, which
shouldn't happen.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We were setting the physical active state on the GIC distributor in a
preemptible section, which could cause us to set the active state on
different physical CPU from the one we were actually going to run on,
hacoc ensues.
Since we are no longer descheduling/scheduling soft timers in the
flush/sync timer functions, simply moving the timer flush into a
non-preemptible section.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Cortex-A57 parts up to r1p2 can misreport Stage 2 translation faults
when a Stage 1 permission fault or device alignment fault should
have been reported.
This patch implements the workaround (which is to validate that the
Stage-1 translation actually succeeds) by using code patching.
Cc: stable@vger.kernel.org
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When running a 32bit guest under a 64bit hypervisor, the ARMv8
architecture defines a mapping of the 32bit registers in the 64bit
space. This includes banked registers that are being demultiplexed
over the 64bit ones.
On exceptions caused by an operation involving a 32bit register, the
HW exposes the register number in the ESR_EL2 register. It was so
far understood that SW had to distinguish between AArch32 and AArch64
accesses (based on the current AArch32 mode and register number).
It turns out that I misinterpreted the ARM ARM, and the clue is in
D1.20.1: "For some exceptions, the exception syndrome given in the
ESR_ELx identifies one or more register numbers from the issued
instruction that generated the exception. Where the exception is
taken from an Exception level using AArch32 these register numbers
give the AArch64 view of the register."
Which means that the HW is already giving us the translated version,
and that we shouldn't try to interpret it at all (for example, doing
an MMIO operation from the IRQ mode using the LR register leads to
very unexpected behaviours).
The fix is thus not to perform a call to vcpu_reg32() at all from
vcpu_reg(), and use whatever register number is supplied directly.
The only case we need to find out about the mapping is when we
actively generate a register access, which only occurs when injecting
a fault in a guest.
Cc: stable@vger.kernel.org
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
|
| | |/
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The open coded tests for checking whether a PTE maps a page as
uncached use a flawed '(pte_val(xxx) & CONST) != CONST' pattern,
which is not guaranteed to work since the type of a mapping is
not a set of mutually exclusive bits
For HYP mappings, the type is an index into the MAIR table (i.e, the
index itself does not contain any information whatsoever about the
type of the mapping), and for stage-2 mappings it is a bit field where
normal memory and device types are defined as follows:
#define MT_S2_NORMAL 0xf
#define MT_S2_DEVICE_nGnRE 0x1
I.e., masking *and* comparing with the latter matches on the former,
and we have been getting lucky merely because the S2 device mappings
also have the PTE_UXN bit set, or we would misidentify memory mappings
as device mappings.
Since the unmap_range() code path (which contains one instance of the
flawed test) is used both for HYP mappings and stage-2 mappings, and
considering the difference between the two, it is non-trivial to fix
this by rewriting the tests in place, as it would involve passing
down the type of mapping through all the functions.
However, since HYP mappings and stage-2 mappings both deal with host
physical addresses, we can simply check whether the mapping is backed
by memory that is managed by the host kernel, and only perform the
D-cache maintenance if this is the case.
Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master
KVM: s390: Fixes for 4.4
1. disallow changing the SIMD mode when CPUs have been created.
it allowed userspace to corrupt kernel memory
2. Fix vCPU lookup. Until now the vCPU number equals the vCPU id. Some
kernel code places relied on that. This might
a: cause guest failures
b: allow userspace to corrupt kernel memory
3. Fencing of the PFMF instruction should use the guest facilities
and not the host facilities.
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
For now, VCPUs were always created sequentially with incrementing
VCPU ids. Therefore, the index in the VCPUs array matched the id.
As sequential creation might change with cpu hotplug, let's use
the correct lookup function to find a VCPU by id, not array index.
Let's also use kvm_lookup_vcpu() for validation of the sending VCPU
on external call injection.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # db27a7a KVM: Provide function for VCPU lookup by id
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Commit 383d0b050106 ("KVM: s390: handle pending local interrupts via
bitmap") introduced a possible memory overwrite from user space.
User space could pass an invalid emergency signal code (sending VCPU)
and therefore exceed the bitmap. Let's take care of this case and
check that the id is in the valid range.
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org # v3.19+ db27a7a KVM: Provide function for VCPU lookup by id
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Let's provide a function to lookup a VCPU by id.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[split patch from refactoring patch]
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The pfmf intercept handler should check if the EDAT 1 facility
is installed in the guest, not if it is installed in the host.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We should never allow to enable/disable any facilities for the guest
when other VCPUs were already created.
kvm_arch_vcpu_(load|put) relies on SIMD not changing during runtime.
If somebody would create and run VCPUs and then decides to enable
SIMD, undefined behaviour could be possible (e.g. vector save area
not being set up).
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # 4.1+
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Before this patch, we incorrectly enter the guest without requesting an
interrupt window if the IRQ chip is split between user space and the
kernel.
Because lapic_in_kernel no longer implies the PIC is in the kernel, this
patch tests pic_in_kernel to determining whether an interrupt window
should be requested when entering the guest.
If the APIC is in the kernel and we request an interrupt window the
guest will return immediately. If the APIC is masked the guest will not
not make forward progress and unmask it, leading to a loop when KVM
reenters and requests again. This patch adds a check to ensure the APIC
is ready to accept an interrupt before requesting a window.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
[Use the other newly introduced functions. - Paolo]
Fixes: 1c1a9ce973a7863dd46767226bce2a5f12d48bc6
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Set KVM_REQ_EVENT when a PIC in user space injects a local interrupt.
Currently a request is only made when neither the PIC nor the APIC is in
the kernel, which is not sufficient in the split IRQ chip case.
This addresses a problem in QEMU where interrupts are delayed until
another path invokes the event loop.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
Fixes: 1c1a9ce973a7863dd46767226bce2a5f12d48bc6
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
dm_request_for_irq_injection
This patch breaks out a new function kvm_vcpu_ready_for_interrupt_injection.
This routine encapsulates the logic required to determine whether a vcpu
is ready to accept an interrupt injection, which is now required on
multiple paths.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
Fixes: 1c1a9ce973a7863dd46767226bce2a5f12d48bc6
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This patch ensures that dm_request_for_irq_injection and
post_kvm_run_save are in sync, avoiding that an endless ping-pong
between userspace (who correctly notices that IF=0) and
the kernel (who insists that userspace handles its request
for the interrupt window).
To synchronize them, it also adds checks for kvm_arch_interrupt_allowed
and !kvm_event_needs_reinjection. These are always needed, not
just for in-kernel LAPIC.
Signed-off-by: Matt Gingell <gingell@google.com>
[A collage of two patches from Matt. - Paolo]
Fixes: 1c1a9ce973a7863dd46767226bce2a5f12d48bc6
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If either of the memory allocations in kvm_arch_vcpu_create() fail, the
vcpu which has been allocated and kvm_vcpu_init'd doesn't get uninit'd
in the error handling path. Add a call to kvm_vcpu_uninit() to fix this.
Fixes: 669e846e6c4e ("KVM/MIPS32: MIPS arch specific APIs for KVM")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The immediate field of the CACHE instruction is signed, so ensure that
it gets sign extended by casting it to an int16_t rather than just
masking the low 16 bits.
Fixes: e685c689f3a8 ("KVM/MIPS32: Privileged instruction/target branch emulation.")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
ASID restoration on guest resume should determine the guest execution
mode based on the guest Status register rather than bit 30 of the guest
PC.
Fix the two places in locore.S that do this, loading the guest status
from the cop0 area. Note, this assembly is specific to the trap &
emulate implementation of KVM, so it doesn't need to check the
supervisor bit as that mode is not implemented in the guest.
Fixes: b680f70fc111 ("KVM/MIPS32: Entry point for trampolining to...")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Two fixes for 4.4-rc1's DM ioctl changes that introduced the potential
for infinite recursion on ioctl (with DM multipath).
And four stable fixes:
- A DM thin-provisioning fix to restore 'error_if_no_space' setting
when a thin-pool is made writable again (after having been out of
space).
- A DM thin-provisioning fix to properly advertise discard support
for thin volumes that are stacked on a thin-pool whose underlying
data device doesn't support discards.
- A DM ioctl fix to allow ctrl-c to break out of an ioctl retry loop
when DM multipath is configured to 'queue_if_no_path'.
- A DM crypt fix for a possible hang on dm-crypt device removal"
* tag 'dm-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm thin: fix regression in advertised discard limits
dm crypt: fix a possible hang due to race condition on exit
dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path
dm: do not reuse dm_blk_ioctl block_device input as local variable
dm: fix ioctl retry termination with signal
dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When establishing a thin device's discard limits we cannot rely on the
underlying thin-pool device's discard capabilities (which are inherited
from the thin-pool's underlying data device) given that DM thin devices
must provide discard support even when the thin-pool's underlying data
device doesn't support discards.
Users were exposed to this thin device discard limits regression if
their thin-pool's underlying data device does _not_ support discards.
This regression caused all upper-layers that called the
blkdev_issue_discard() interface to not be able to issue discards to
thin devices (because discard_granularity was 0). This regression
wasn't caught earlier because the device-mapper-test-suite's extensive
'thin-provisioning' discard tests are only ever performed against
thin-pool's with data devices that support discards.
Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature
rather than inferring whether or not a thin device's discard support
should be enabled by looking at the thin-pool's discard_granularity.
Fixes: 216076705 ("dm thin: disable discard support for thin devices if pool's is disabled")
Reported-by: Mike Gerber <mike@sprachgewalt.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
__add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
It is possible that the processor reorders memory accesses so that
kthread_should_stop() is executed before __set_current_state(). If such
reordering happens, there is a possible race on thread termination:
CPU 0:
calls kthread_should_stop()
it tests KTHREAD_SHOULD_STOP bit, returns false
CPU 1:
calls kthread_stop(cc->write_thread)
sets the KTHREAD_SHOULD_STOP bit
calls wake_up_process on the kernel thread, that sets the thread
state to TASK_RUNNING
CPU 0:
sets __set_current_state(TASK_INTERRUPTIBLE)
spin_unlock_irq(&cc->write_thread_wait.lock)
schedule() - and the process is stuck and never terminates, because the
state is TASK_INTERRUPTIBLE and wake_up_process on CPU 1 already
terminated
Fix this race condition by using a new flag DM_CRYPT_EXIT_THREAD to
signal that the kernel thread should exit. The flag is set and tested
while holding cc->write_thread_wait.lock, so there is no possibility of
racy access to the flag.
Also, remove the unnecessary set_task_state(current, TASK_RUNNING)
following the schedule() call. When the process was woken up, its state
was already set to TASK_RUNNING. Other kernel code also doesn't set the
state to TASK_RUNNING following schedule() (for example,
do_wait_for_common in completion.c doesn't do it).
Fixes: dc2676210c42 ("dm crypt: offload writes to thread")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In multipath_prepare_ioctl(),
- pgpath is a path selected from available paths
- m->queue_io is true if we cannot send a request immediately to
paths, either because:
* there is no available path
* the path group needs activation (pg_init)
- pg_init is not started
- pg_init is still running
- m->queue_if_no_path is true if the device is configured to queue
I/O if there are no available paths
If !pgpath && !m->queue_if_no_path, the handler should return -EIO.
However in the course of refactoring the condition check has broken
and returns success in that case. Since bdev points to the dm device
itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and
recurses until crash.
You could reproduce the problem like this:
# dmsetup create mp --table '0 1024 multipath 0 0 0 0'
# sg_inq /dev/mapper/mp
<crash>
[ 172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268
[ 172.662843] PGD 19dd067 PUD 0
[ 172.666269] Thread overran stack, or stack corrupted
[ 172.671808] Oops: 0000 [#1] SMP
...
Fix the condition check with some clarifications.
Fixes: e56f81e0b01e ("dm: refactor ioctl handling")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
(Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for
targets' .prepare_ioctl to fail if they go on to check the bdev for
!NULL.
Fixes: e56f81e0b01e ("dm: refactor ioctl handling")
Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
dm-mpath retries ioctl, when no path is readily available and the device
is configured to queue I/O in such a case. If you want to stop the retry
before multipathd decides to turn off queueing mode, you could send
signal for the process to exit from the loop.
However the check of fatal signal has not carried along when commit
6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths") moved the
loop from dm-mpath to dm core. As a result, we can't terminate such
a process in the retry loop.
Easy reproducer of the situation is:
# dmsetup create mp --table '0 1024 multipath 0 0 0 0'
# dmsetup message mp 0 'queue_if_no_path'
# sg_inq /dev/mapper/mp
then you should be able to terminate sg_inq by pressing Ctrl+C.
Fixes: 6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
transition
A thin-pool that is in out-of-data-space (OODS) mode may transition back
to write mode -- without the admin adding more space to the thin-pool --
if/when blocks are released (either by deleting thin devices or
discarding provisioned blocks).
But as part of the thin-pool's earlier transition to out-of-data-space
mode the thin-pool may have set the 'error_if_no_space' flag to true if
the no_space_timeout expires without more space having been made
available. That implementation detail, of changing the pool's
error_if_no_space setting, needs to be reset back to the default that
the user specified when the thin-pool's table was loaded.
Otherwise we'll drop the user requested behaviour on the floor when this
out-of-data-space to write mode transition occurs.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Fixes: 2c43fd26e4 ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released")
Cc: stable@vger.kernel.org
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :
pid_nr_ns() was inlined, but apparently compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :
if (pid && ns->level <= pid->level) { // CRASH
Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(
get_task_pid() can benefit from same fix.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull block layer fixes from Jens Axboe:
"A round of fixes/updates for the current series.
This looks a little bigger than it is, but that's mainly because we
pushed the lightnvm enabled null_blk change out of the merge window so
it could be updated a bit. The rest of the volume is also mostly
lightnvm. In particular:
- Lightnvm. Various fixes, additions, updates from Matias and
Javier, as well as from Wenwei Tao.
- NVMe:
- Fix for potential arithmetic overflow from Keith.
- Also from Keith, ensure that we reap pending completions from
a completion queue before deleting it. Fixes kernel crashes
when resetting a device with IO pending.
- Various little lightnvm related tweaks from Matias.
- Fixup flushes to go through the IO scheduler, for the cases where a
flush is not required. Fixes a case in CFQ where we would be
idling and not see this request, hence not break the idling. From
Jan Kara.
- Use list_{first,prev,next} in elevator.c for cleaner code. From
Gelian Tang.
- Fix for a warning trigger on btrfs and raid on single queue blk-mq
devices, where we would flush plug callbacks with preemption
disabled. From me.
- A mac partition validation fix from Kees Cook.
- Two merge fixes from Ming, marked stable. A third part is adding a
new warning so we'll notice this quicker in the future, if we screw
up the accounting.
- Cleanup of thread name/creation in mtip32xx from Rasmus Villemoes"
* 'for-linus' of git://git.kernel.dk/linux-block: (32 commits)
blk-merge: warn if figured out segment number is bigger than nr_phys_segments
blk-merge: fix blk_bio_segment_split
block: fix segment split
blk-mq: fix calling unplug callbacks with preempt disabled
mac: validate mac_partition is within sector
mtip32xx: use formatting capability of kthread_create_on_node
NVMe: reap completion entries when deleting queue
lightnvm: add free and bad lun info to show luns
lightnvm: keep track of block counts
nvme: lightnvm: use admin queues for admin cmds
lightnvm: missing free on init error
lightnvm: wrong return value and redundant free
null_blk: do not del gendisk with lightnvm
null_blk: use device addressing mode
null_blk: use ppa_cache pool
NVMe: Fix possible arithmetic overflow for max segments
blk-flush: Queue through IO scheduler when flush not required
null_blk: register as a LightNVM device
elevator: use list_{first,prev,next}_entry
lightnvm: cleanup queue before target removal
...
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We had seen lots of reports of this kind issue, so add one
warnning in blk-merge, then it can be triggered easily and
avoid to depend on warning/bug from drivers.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Commit bdced438acd83a(block: setup bi_phys_segments after
splitting) introduces function of computing bio->bi_phys_segments
during bio splitting.
Unfortunately both bio->bi_seg_front_size and bio->bi_seg_back_size
arn't computed, so too many physical segments may be obtained
for one request since both the two are used to check if one segment
across two bios can be possible.
This patch fixes the issue by computing the two variables in
blk_bio_segment_split().
Fixes: bdced438acd83a(block: setup bi_phys_segments after splitting)
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Inside blk_bio_segment_split(), previous bvec pointer(bvprvp)
always points to the iterator local variable, which is obviously
wrong, so fix it by pointing to the local variable of 'bvprv'.
Fixes: 5014c311baa2b(block: fix bogus compiler warnings in blk-merge.c)
Cc: stable@kernel.org #4.3
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Liu reported that running certain parts of xfstests threw the
following error:
BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
3 locks held by kworker/u16:0/6:
#0: ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#2: (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60
CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-108)
ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
Call Trace:
[<ffffffff8130191b>] dump_stack+0x4f/0x74
[<ffffffff8108ed95>] ___might_sleep+0x185/0x240
[<ffffffff8108eea2>] __might_sleep+0x52/0x90
[<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410
[<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90
[<ffffffff8109a6d1>] ? local_clock+0x21/0x40
[<ffffffff810b9eb0>] ? __lock_release+0x420/0x510
[<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0
[<ffffffff811ca265>] alloc_pages_current+0xc5/0x210
[<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60
[<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs]
[<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs]
[<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs]
[<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs]
[<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0
[<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740
[<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0
[<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310
[<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310
[<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0
[<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0
[<ffffffff812d0ca0>] submit_bio+0x70/0x140
[<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs]
[<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs]
[<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs]
[<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs]
[<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs]
[<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
[<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs]
[<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs]
[<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs]
[<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
[<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
[<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs]
[<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
[<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs]
[<ffffffff81184df3>] do_writepages+0x23/0x40
[<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480
[<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480
[<ffffffff810b1397>] ? down_read_trylock+0x57/0x60
[<ffffffff811e6805>] ? trylock_super+0x25/0x60
[<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90
[<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0
[<ffffffff812130b5>] wb_writeback+0x2b5/0x500
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0
[<ffffffff81213362>] ? wb_do_writeback+0x62/0x310
[<ffffffff812133c1>] wb_do_writeback+0xc1/0x310
[<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90
[<ffffffff81213842>] wb_workfn+0x92/0x330
[<ffffffff8107f133>] process_one_work+0x223/0x730
[<ffffffff8107f083>] ? process_one_work+0x173/0x730
[<ffffffff8108035f>] ? worker_thread+0x18f/0x430
[<ffffffff810802ed>] worker_thread+0x11d/0x430
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810858df>] kthread+0xef/0x110
[<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816673bf>] ret_from_fork+0x3f/0x70
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
The issue is that we've got the software context pinned while
calling blk_flush_plug_list(), which flushes callbacks that
are allowed to sleep. btrfs and raid has such callbacks.
Flip the checks around a bit, so we can enable preempt a bit
earlier and flush plugs without having preempt disabled.
This only affects blk-mq driven devices, and only those that
register a single queue.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single
512 byte sector would be read (secsize / 512). However the partition
structure would be located past the end of the buffer (secsize % 512).
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|