|
Add an xs_reset_watches function to shut down watches left over from the
old kernel after a kexec boot. The old kernel does not unregister all
watches in its shutdown path; they remain active, and the new kernel
cannot detect the double registration. When those watches fire, unexpected
events arrive and the xenwatch thread crashes (it jumps to NULL). An
orderly reboot of an HVM guest destroys the entire guest with all its
resources (including the watches) before it is rebuilt from scratch, so
the missing unregister is not an issue in that case.
With this change xenstored is instructed to wipe all active watches for
the guest. However, a patch for xenstored is required so that it accepts
the XS_RESET_WATCHES request from a client (see changeset
23839:42a45baf037d in xen-unstable.hg). Without the xenstored patch the
registration of watches will fail and some features of a PVonHVM guest
will not be available. The guest is still able to boot, but repeated
kexec boots will fail.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
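A minimal sketch of what the request looks like on the wire, assuming the
xsd_sockmsg header from xen/interface/io/xs_wire.h; xb_write_request() is a
hypothetical helper standing in for the kernel's internal xenbus request path.

#include <xen/interface/io/xs_wire.h>

/* XS_RESET_WATCHES carries an empty payload; it simply asks xenstored to
 * drop every watch registered by this domain (e.g. by the pre-kexec kernel). */
static int send_reset_watches(void)
{
        struct xsd_sockmsg msg = {
                .type   = XS_RESET_WATCHES,
                .req_id = 0,
                .tx_id  = 0,
                .len    = 0,            /* no payload */
        };

        return xb_write_request(&msg, sizeof(msg));     /* hypothetical helper */
}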
|
|
When switching tasks in a Xen PV guest, avoid updating the TLS
descriptors if they haven't changed. This improves the speed of
context switches by almost 10% as much of the time the descriptors are
the same or only one is different.
The descriptors written into the GDT by Xen are modified from the
values passed in the update_descriptor hypercall so we keep shadow
copies of the three TLS descriptors to compare against.
lmbench3 test Before After Improvement
--------------------------------------------
lat_ctx -s 32 24 7.19 6.52 9%
lat_pipe 12.56 11.66 7%
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
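A minimal sketch of the shadow-compare idea, assuming a per-CPU shadow array
(shadow_tls, a name made up here) and the plain HYPERVISOR_update_descriptor
wrapper; the Xen pvops code does this work from its load_tls hook, so treat
this as an outline rather than the patch itself.

static DEFINE_PER_CPU(struct desc_struct, shadow_tls[GDT_ENTRY_TLS_ENTRIES]);

/* Cache what we last asked Xen to load and skip the hypercall when the
 * descriptor is unchanged (the GDT copy itself cannot be compared, since
 * Xen modifies the value it actually installs). */
static void maybe_update_tls(struct thread_struct *t, unsigned int cpu, int i)
{
        struct desc_struct *shadow = &per_cpu(shadow_tls, cpu)[i];
        xmaddr_t maddr = arbitrary_virt_to_machine(
                        &get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i]);

        if (!memcmp(shadow, &t->tls_array[i], sizeof(*shadow)))
                return;                         /* unchanged: no hypercall */

        *shadow = t->tls_array[i];
        HYPERVISOR_update_descriptor(maddr.maddr, *(u64 *)&t->tls_array[i]);
}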
|
|
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
[v1: Moving it to the Xen file]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
When constructing the initial page tables, if the MFN for a usable PFN
is missing in the p2m then that frame is initially ballooned out. In
this case, zero the PTE (as in decrease_reservation() in
drivers/xen/balloon.c).
This is obviously safe instead of having a valid PTE with an MFN of
INVALID_P2M_ENTRY (~0).
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
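A minimal sketch of the check, assuming the usual pfn_to_mfn()/INVALID_P2M_ENTRY
helpers from the Xen p2m code; the function name is illustrative.

/* While building the initial page tables: if the p2m has no machine frame
 * for this pfn, the frame is ballooned out, so emit an empty PTE instead of
 * a "valid" PTE whose MFN is INVALID_P2M_ENTRY (~0). */
static pte_t initial_pte_for(unsigned long pfn, pgprot_t prot)
{
        unsigned long mfn = pfn_to_mfn(pfn);

        if (mfn == INVALID_P2M_ENTRY)
                return __pte_ma(0);             /* ballooned out: clear PTE */

        return pfn_pte(pfn, prot);              /* normal mapping */
}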
|
|
In xen_set_pte() if batching is unavailable (because the caller is in
an interrupt context such as handling a page fault) it would fall back
to using native_set_pte() and trapping and emulating the PTE write.
On 32-bit guests this requires two traps for each PTE write (one for
each dword of the PTE). Instead, do one mmu_update hypercall
directly.
During construction of the initial page tables, continue to use
native_set_pte() because most of the PTEs being set are in writable
and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
using a hypercall for this is very expensive.
This significantly improves page fault performance in 32-bit PV
guests.
lmbench3 test Before After Improvement
----------------------------------------------
lat_pagefault 3.18 us 2.32 us 27%
lat_proc fork 356 us 313.3 us 11%
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
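A minimal sketch of the single-hypercall fallback, using the public mmu_update
interface; the function name is illustrative and error handling is reduced to
a BUG().

#include <xen/interface/xen.h>

/* One MMU_NORMAL_PT_UPDATE request writes the whole PTE in a single
 * hypercall, instead of trapping twice for the two dwords of a 32-bit PTE. */
static void set_pte_via_hypercall(pte_t *ptep, pte_t pteval)
{
        struct mmu_update u;

        u.ptr = virt_to_machine(ptep).maddr | MMU_NORMAL_PT_UPDATE;
        u.val = pte_val_ma(pteval);

        if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
                BUG();
}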
|
|
Coverity would complain about this - even though it looks OK.
CID 401957
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
Coverity points out that in one case we do not free
pr_backup - and sure enough we forgot.
Found by Coverity (CID 401970)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
If a driver leaves its poll method NULL, the device is assumed to
be both readable and writable without blocking.
This patch adds a .poll method to the Xen mcelog device driver, so that
when mcelog uses system calls like ppoll or select it blocks when no
data is available instead of spinning on the CPU.
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
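A minimal sketch of a chrdev .poll method of the kind described, assuming a
wait queue (mce_wait, a name used only for illustration) that the
record-arrival path wakes, and a hypothetical mcelog_has_data() test.

#include <linux/poll.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(mce_wait);       /* woken when a record arrives */

static unsigned int xen_mce_chrdev_poll(struct file *file, poll_table *wait)
{
        poll_wait(file, &mce_wait, wait);

        if (mcelog_has_data())                  /* hypothetical "data ready" test */
                return POLLIN | POLLRDNORM;

        return 0;                               /* caller sleeps until woken */
}

static const struct file_operations xen_mce_chrdev_ops = {
        .poll = xen_mce_chrdev_poll,
        /* .open, .read, ... omitted in this sketch */
};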
|
|
copy_to_user might sleep and print a stack trace if it is executed
in an atomic spinlock context. Like this:
(XEN) CMCI: send CMCI to DOM0 through virq
BUG: sleeping function called from invalid context at /home/konradinux/kernel.h:199
in_atomic(): 1, irqs_disabled(): 0, pid: 4581, name: mcelog
Pid: 4581, comm: mcelog Tainted: G O 3.5.0-rc1upstream-00003-g149000b-dirty #1
[<ffffffff8109ad9a>] __might_sleep+0xda/0x100
[<ffffffff81329b0b>] xen_mce_chrdev_read+0xab/0x140
[<ffffffff81148945>] vfs_read+0xc5/0x190
[<ffffffff81148b0c>] sys_read+0x4c/0x90
[<ffffffff815bd039>] system_call_fastpath+0x16
This patch schedules a workqueue from the IRQ handler to poll the data
and uses a mutex instead of a spinlock, so the copy_to_user sleep no
longer happens in atomic context.
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
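A minimal sketch of the deferral pattern described above; the buffer, lock,
and function names are illustrative rather than the driver's actual symbols.

#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>
#include <linux/uaccess.h>

static DEFINE_MUTEX(mce_lock);          /* protects the log; may sleep */
static char mce_log_buf[4096];          /* stand-in for the real record buffer */
static size_t mce_log_len;

/* Process context: safe to take the mutex and talk to the hypervisor. */
static void xen_mce_work_fn(struct work_struct *work)
{
        mutex_lock(&mce_lock);
        /* ... fetch new records from the hypervisor into mce_log_buf ... */
        mutex_unlock(&mce_lock);
}
static DECLARE_WORK(xen_mce_work, xen_mce_work_fn);

/* IRQ context: never take a sleeping lock here, just defer the work. */
static irqreturn_t xen_mce_interrupt(int irq, void *dev_id)
{
        schedule_work(&xen_mce_work);
        return IRQ_HANDLED;
}

static ssize_t xen_mce_chrdev_read(struct file *file, char __user *ubuf,
                                   size_t usize, loff_t *off)
{
        ssize_t ret;

        mutex_lock(&mce_lock);  /* a mutex, not a spinlock: copy_to_user may sleep */
        ret = min(usize, mce_log_len);
        if (copy_to_user(ubuf, mce_log_buf, ret))
                ret = -EFAULT;
        mutex_unlock(&mce_lock);
        return ret;
}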
|
|
This patch provides a sysfs interface for onlining/offlining Xen physical CPUs.
Users can use it for their own purposes, such as power saving: offlining
some CPUs under a light workload saves a significant amount of power.
The basic workflow is: the user onlines/offlines a CPU via the sysfs
interface, a hypercall asks Xen to carry it out, and once done Xen injects
a virq back into dom0, which then synchronizes the CPU status.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
When the Xen hypervisor injects a vMCE into a guest, use the native MCE
handler to handle it.
Signed-off-by: Ke, Liping <liping.ke@intel.com>
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
There are three functions that need to be _initcall'ed in a logical sequence:
1. xen_late_init_mcelog
2. mcheck_init_device
3. threshold_init_device
xen_late_init_mcelog must register xen_mce_chrdev_device before the
native mce_chrdev_device registration when running on the Xen platform;
mcheck_init_device should be initialized before threshold_init_device to
initialize mce_device, otherwise a NULL pointer dereference will cause a panic.
So we use the following _initcalls:
1. device_initcall(xen_late_init_mcelog);
2. device_initcall_sync(mcheck_init_device);
3. late_initcall(threshold_init_device);
When running under Xen the initcall order is 1, 2, 3;
on bare metal we skip 1 and only do 2 and 3.
Acked-and-tested-by: Borislav Petkov <bp@amd64.org>
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
When an MCA error occurs, it is handled by the Xen hypervisor first, and
the error information is then sent to the initial domain for logging.
This patch gets the error information from the Xen hypervisor and converts
the Xen-format error into Linux mcelog format. This logic is basically
self-contained and does not touch other kernel components.
Using tools like mcelog, users can read specific error information just
as they would under native Linux.
To test, follow the directions outlined in Documentation/acpi/apei/einj.txt
Acked-and-tested-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Ke, Liping <liping.ke@intel.com>
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
The definition of 32-bit values in the 64-bit tilegx architecture is that
they should be sign-extended regardless of whether they are considered
signed or unsigned by the compiler. Accordingly, we need to use an
"ld4s" rather than "ld4u" to load and sign-extend for get_user().
This fixes glibc bug 14238 (see http://sourceware.org/bugzilla),
introduced during the 3.5 merge window.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
|
|
Minchan Kim reports that when a system has many swap areas, and tmpfs
swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
back the page cannot locate it, and the read fails with -ENOMEM.
Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
swp_entry_to_pte()) technique for determining maximum usable swap
offset, without stopping to realize that that actually depends upon the
pte swap encoding shifting swap offset to the higher bits and truncating
it there. Whereas our radix_tree swap encoding leaves offset in the
lower bits: it's swap "type" (that is, index of swap area) that was
truncated.
Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().
This does not reduce the usable size of a swap area any further, it
leaves it as claimed when making the original commit: no change from 3.0
on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
per swapfile on i386 with PAE. It's not a change I would have risked
five years ago, but with x86_64 supported for ten years, I believe it's
appropriate now.
Hmm, and what if some architecture implements its swap pte with offset
encoded below type? That would equally break the maximum usable swap
offset check. Happily, they all follow the same tradition of encoding
offset above type, but I'll prepare a check on that for next.
Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Avoid a warning on 32-bit machines
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
gcc was giving an uninit variable warning here. Strictly
speaking we don't need to init it, but this will make things
much less error prone.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Update to the latest btrfs maintainer's email address and git repo.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We forgot to free the items of the delayed inodes; this patch fixes that.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Since we have two trees for recording pinned extents, we need to go through
both of them to make sure that we've cleaned everything up.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We've forgotten to clear extent states in the pinned tree, which results in
a space counter mismatch and a memory leak:
WARNING: at fs/btrfs/extent-tree.c:7537 btrfs_free_block_groups+0x1f3/0x2e0 [btrfs]()
...
space_info 2 has 8380416 free, is not full
space_info total=12582912, used=4096, pinned=4096, reserved=0, may_use=0, readonly=4194304
btrfs state leak: start 29364224 end 29376511 state 1 in tree ffff880075f20090 refs 1
...
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Seeding devices are not supposed to change any more.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
When we move a file into a directory with the compression flag set, we need
to inherit BTRFS_INODE_COMPRESS and clear BTRFS_INODE_NOCOMPRESS as well.
But if we move a file into a directory without the compression flag, we need
to clear both of them.
This is how our setflags ioctl deals with the compression flag, so keep
the same behaviour here.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
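A minimal sketch of the flag handling described, assuming the on-disk inode
flags live in BTRFS_I(inode)->flags; the function name is illustrative.

/* Called when 'inode' is moved into 'dir': mirror the directory's
 * compression preference, the same way the setflags ioctl does. */
static void inherit_compress_flag(struct inode *dir, struct inode *inode)
{
        if (BTRFS_I(dir)->flags & BTRFS_INODE_COMPRESS) {
                BTRFS_I(inode)->flags |= BTRFS_INODE_COMPRESS;
                BTRFS_I(inode)->flags &= ~BTRFS_INODE_NOCOMPRESS;
        } else {
                BTRFS_I(inode)->flags &=
                        ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
        }
}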
|
|
At present, hard_irq_disable() does nothing on powerpc because of
this code in include/linux/interrupt.h:
#ifndef hard_irq_disable
#define hard_irq_disable() do { } while(0)
#endif
So we need to make our hard_irq_disable be a macro. It was previously
a macro until commit 7230c56441 ("powerpc: Rework lazy-interrupt
handling") changed it to a static inline function.
Cc: stable@vger.kernel.org
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
--
arch/powerpc/include/asm/hw_irq.h | 3 +++
1 file changed, 3 insertions(+)
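A minimal sketch of the usual way to satisfy that #ifndef: keep the inline
implementation but also make the name visible to the preprocessor, so the
generic fallback is skipped. The body below is a placeholder, not the powerpc
implementation.

static inline void hard_irq_disable(void)
{
        /* arch-specific: really mask interrupts at the hardware level */
}

/* An object-like macro that expands to itself: its only job is to make
 * "#ifndef hard_irq_disable" in include/linux/interrupt.h evaluate false,
 * so the do-nothing fallback quoted above is never installed. */
#define hard_irq_disable        hard_irq_disable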
|
|
It's a bug, but it happens to work, as BTRFS_COMPRESS_LZO == 2, which
has only one bit set.
Signed-off-by: Li Zefan <lizefan@huawei.com>
|
|
If a file has 3 small extents:
| ext1 | ext2 | ext3 |
Running "btrfs fi defrag" will only defrag the last two extents, if those
extent mappings hasn't been read into memory from disk.
This bug was introduced by commit 17ce6ef8d731af5edac8c39e806db4c7e1f6956f
("Btrfs: add a check to decide if we should defrag the range")
The cause is, that commit looked into previous and next extents using
lookup_extent_mapping() only.
While at it, remove the code that checks the previous extent, since
it's sufficient to check the next extent.
Signed-off-by: Li Zefan <lizefan@huawei.com>
|
|
I removed this in an earlier commit and I was wrong. Because compression
can return from filemap_fdatawrite() without having actually set any of its
pages as writeback it can make filemap_fdatawait() do essentially nothing,
and then we won't find any ordered extents because they may not have been
created yet. So not only does this make fsync() completely useless, but it
will also screw up if you truncate on a non-page-aligned offset, since we
zero out the end, then wait on ordered extents, and then call drop caches.
We can drop the cache before the io completes, and then when we try to unpin
the extent we just wrote we won't find it and everything goes sideways. So fix
this by putting it back and put a giant comment there to keep me from trying
to remove it in the future. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
A user reported lots of problems using compression on the new code and it
turns out part of the problem was that igrab() was failing when we added a
new ordered extent. This is because when writing out an inode under
compression we immediately return without actually doing anything to the
pages, and then in another thread at some point down the line actually do
the ordered dance. The problem is between the point that we start writeback
and we actually add the ordered extent we could be trying to reclaim the
inode, which makes igrab() return NULL. So we need to do an igrab() when we
create the async extent and then drop it when we are done with it. This
makes sure we stay pinned in memory until the ordered extent can get a
reference on it and we are good to go. With this patch we no longer panic
in btrfs_finish_ordered_io(). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
Because btrfs can remove the device that was mounted we need to have a
->show_devname so that in this case we can print out some other device in
the file system to /proc/mounts. So if there are multiple devices in a btrfs
file system we will just print the device with the lowest devid that we can
find. This will make everything consistent and deal with device removal
properly. The drawback is if you mount with a device that is higher than
the lowest devid it won't show up as the mounted device in /proc/mounts,
but this is a small price to pay. This was inspired by Miao Xie's patch.
Thanks,
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory. Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock(). This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch. Thanks,
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
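A minimal sketch of an RCU-protected string of the kind described; the layout
mirrors the description, but treat the exact helpers as illustrative.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>

struct rcu_string {
        struct rcu_head rcu;    /* lets the old copy be freed after a grace period */
        char str[];             /* the NUL-terminated name itself */
};

static struct rcu_string *rcu_string_strdup(const char *src, gfp_t mask)
{
        size_t len = strlen(src) + 1;
        struct rcu_string *ret = kmalloc(sizeof(*ret) + len, mask);

        if (ret)
                memcpy(ret->str, src, len);
        return ret;
}

/* Readers dereference device->name under rcu_read_lock() and must not use
 * the pointer after rcu_read_unlock(); writers publish a new copy with
 * rcu_assign_pointer() and free the old one with kfree_rcu(), so a
 * concurrent printk("%s", name->str) never touches freed memory. */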
|
|
I was getting hung on umount when a transaction was aborted because a range
of one of the free space inodes was still locked. This is because the nocow
stuff doesn't unlock anything on error. This fixed the problem and I
verified that is what was happening. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
So we're forcing the ebs to have their ref count set to 1 so that invalidatepage
works, but this breaks lots of things, for example root nodes, and is just
plain wrong; we don't need to just evict all of this stuff. Also drop the
invalidatepage altogether and add a page_cache_release(). With this patch
we no longer hang when trying to access the root nodes after an aborted
transaction and we no longer leak memory. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
If a transaction commit fails we don't abort it so we don't set an error on
the file system. This patch fixes that by actually calling the abort stuff
and then adding a check for a fs error in the transaction start stuff to
make sure it is caught properly. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
I was getting lots of hung tasks and a NULL pointer dereference because we
are not cleaning up the transaction properly when it aborts. First we need
to reset the running_transaction to NULL so we don't get a bad dereference
for any start_transaction callers after this. Also we cannot rely on
waitqueue_active() since it's just a list_empty(), so just call wake_up()
directly since that will do the barrier for us and such. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
The transaction abort stuff was throwing warnings from the list debugging
code because we do a list_del_init outside of the delayed_refs spin lock.
The delayed refs locking makes baby Jesus cry so it's not hard to get wrong,
but we need to take the ref head mutex to make sure it's not being processed
currently, and so if it is we need to drop the spin lock and then take and
drop the mutex and do the search again. If we can take the mutex then we
can safely remove the head from the list and carry on. Now when the
transaction aborts I don't get the list debugging warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
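A generic sketch of the drop-and-retry locking pattern described above; the
structures and names are illustrative, not the actual btrfs delayed-ref code.

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>

struct example_ref_head {
        struct mutex mutex;             /* held while the head is processed */
        struct list_head cluster;
};

struct example_delayed_refs {
        spinlock_t lock;                /* protects the cluster list */
};

/* Only unlink the head while holding both the spinlock and the per-head
 * mutex. If the mutex is busy, someone is processing the head: drop the
 * spinlock, wait on the mutex, and try again. */
static void remove_ref_head(struct example_delayed_refs *delayed_refs,
                            struct example_ref_head *head)
{
again:
        spin_lock(&delayed_refs->lock);
        if (!mutex_trylock(&head->mutex)) {
                spin_unlock(&delayed_refs->lock);
                mutex_lock(&head->mutex);       /* wait for the current processor */
                mutex_unlock(&head->mutex);
                goto again;
        }
        list_del_init(&head->cluster);
        mutex_unlock(&head->mutex);
        spin_unlock(&delayed_refs->lock);
}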
|
|
While doing my enospc work I got a transaction abortion that resulted in a
panic when we tried to unlock_page() an already unlocked page. This is
because we aren't calling extent_clear_unlock_delalloc with the locked page
so it was unlocking all the pages in the range. This is wrong since
__extent_writepage expects to have the page locked still unless we return
*page_started as 1. This should keep us from panicking. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
TEAC's UD-H01 (and probably other devices) have a gap in the interface
number allocation of their descriptors:
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 220
bNumInterfaces 3
[...]
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
[...]
Interface Association:
bLength 8
bDescriptorType 11
bFirstInterface 2
bInterfaceCount 2
bFunctionClass 1 Audio
bFunctionSubClass 0
bFunctionProtocol 32
iFunction 4
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 2
bAlternateSetting 0
[...]
Once a configuration is selected, usb_set_configuration() walks the
known interfaces of a given configuration and calls find_iad() on
each of them to set the interface association pointer the interface
is included in.
The problem here is that the loop variable is used as the interface
number in the comparison logic that gathers the association, which is
fine as long as the descriptors are sane.
In the case above, however, the logic gets out of sync and the
interface association fields of all interfaces beyond the interface
number gap are wrong.
Fix this by passing the interface's bInterfaceNumber to find_iad()
instead.
Signed-off-by: Daniel Mack <zonque@gmail.com>
Reported-by: bEN <ml_all@circa.be>
Reported-by: Ivan Perrone <ivanperrone@hotmail.com>
Tested-by: ivan perrone <ivanperrone@hotmail.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
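A minimal sketch of the fix described: match on the descriptor's own
bInterfaceNumber rather than the loop index, assuming a find_iad(dev, config,
inum) helper with the semantics discussed above.

static void link_interface_associations(struct usb_device *dev,
                                        struct usb_host_config *cp, int nintf)
{
        int i;

        for (i = 0; i < nintf; ++i) {
                struct usb_interface *intf = cp->interface[i];

                /* was: find_iad(dev, cp, i) -- wrong as soon as interface
                 * numbers have gaps, as in the UD-H01 descriptors above */
                intf->intf_assoc = find_iad(dev, cp,
                                intf->altsetting[0].desc.bInterfaceNumber);
        }
}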
|
|
If platform_data is not set, pdata will hold an uninitialized value.
Since the driver has the following code, if the condition happens to be
true while pdata is uninitialized, the driver may jump to an illegal
phy_init():
if (pdata && pdata->phy_init)
pdata->phy_init();
This patch also fixes the following warning:
CC drivers/usb/host/ehci-hcd.o
drivers/usb/host/ehci-sh.c: In function ‘ehci_hcd_sh_probe’:
drivers/usb/host/ehci-sh.c:104: warning: ‘pdata’ may be used uninitialized in this function
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
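A minimal sketch of the fix for this class of warning: initialize the pointer
and only assign it when platform data is actually present. The platform-data
struct name follows the driver, but treat the probe body as illustrative.

static int ehci_hcd_sh_probe(struct platform_device *pdev)
{
        struct ehci_sh_platdata *pdata = NULL; /* was left uninitialized */

        if (pdev->dev.platform_data != NULL)
                pdata = pdev->dev.platform_data;

        /* ... later, this test is now well defined ... */
        if (pdata && pdata->phy_init)
                pdata->phy_init();

        return 0;       /* rest of the probe omitted in this sketch */
}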
|
|
Currently CDC-ACM devices stay throttled when their TTY is closed while
throttled, stalling further communication attempts after the next open.
Unthrottling during open/activate got lost starting with kernel
3.0.0 and this patch reintroduces it.
Signed-off-by: Otto Meta <otto.patches@sister-shadow.de>
Cc: stable <stable@vger.kernel.org>
Acked-by: Johan Hovold <jhovold@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When the frontend and the backend reside on the same domain, even if we
add pages to the m2p_override, these pages will never be returned by
mfn_to_pfn because the check "get_phys_to_machine(pfn) != mfn" will
always fail, so the pfn of the frontend will be returned instead
(resulting in a deadlock because the frontend pages are already locked).
INFO: task qemu-system-i38:1085 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
qemu-system-i38 D ffff8800cfc137c0 0 1085 1 0x00000000
ffff8800c47ed898 0000000000000282 ffff8800be4596b0 00000000000137c0
ffff8800c47edfd8 ffff8800c47ec010 00000000000137c0 00000000000137c0
ffff8800c47edfd8 00000000000137c0 ffffffff82213020 ffff8800be4596b0
Call Trace:
[<ffffffff81101ee0>] ? __lock_page+0x70/0x70
[<ffffffff81a0fdd9>] schedule+0x29/0x70
[<ffffffff81a0fe80>] io_schedule+0x60/0x80
[<ffffffff81101eee>] sleep_on_page+0xe/0x20
[<ffffffff81a0e1ca>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81101ed7>] __lock_page+0x67/0x70
[<ffffffff8106f750>] ? autoremove_wake_function+0x40/0x40
[<ffffffff811867e6>] ? bio_add_page+0x36/0x40
[<ffffffff8110b692>] set_page_dirty_lock+0x52/0x60
[<ffffffff81186021>] bio_set_pages_dirty+0x51/0x70
[<ffffffff8118c6b4>] do_blockdev_direct_IO+0xb24/0xeb0
[<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
[<ffffffff8118ca95>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
[<ffffffff811e91c8>] ext3_direct_IO+0xf8/0x390
[<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
[<ffffffff81004b60>] ? xen_mc_flush+0xb0/0x1b0
[<ffffffff81104027>] generic_file_aio_read+0x737/0x780
[<ffffffff813bedeb>] ? gnttab_map_refs+0x15b/0x1e0
[<ffffffff811038f0>] ? find_get_pages+0x150/0x150
[<ffffffff8119736c>] aio_rw_vect_retry+0x7c/0x1d0
[<ffffffff811972f0>] ? lookup_ioctx+0x90/0x90
[<ffffffff81198856>] aio_run_iocb+0x66/0x1a0
[<ffffffff811998b8>] do_io_submit+0x708/0xb90
[<ffffffff81199d50>] sys_io_submit+0x10/0x20
[<ffffffff81a18d69>] system_call_fastpath+0x16/0x1b
The explanation is in the comment within the code:
We need to do this because the pages shared by the frontend
(xen-blkfront) can be already locked (lock_page, called by
do_read_cache_page); when the userspace backend tries to use them
with direct_IO, mfn_to_pfn returns the pfn of the frontend, so
do_blockdev_direct_IO is going to try to lock the same pages
again resulting in a deadlock.
A simplified call graph looks like this:
pygrub                           QEMU
-----------------------------------------------
do_read_cache_page               io_submit
        |                            |
    lock_page                  ext3_direct_IO
                                     |
                               bio_add_page
                                     |
                                 lock_page
Internally the xen-blkback uses m2p_add_override to swizzle (temporarily)
a 'struct page' to have a different MFN (so that it can point to another
guest). It also can easily find out whether another pfn corresponding
to the mfn exists in the m2p, and can set the FOREIGN bit
in the p2m, making sure that mfn_to_pfn returns the pfn of the backend.
This allows the backend to perform direct_IO on these pages, but as a
side effect prevents the frontend from using get_user_pages_fast on
them while they are being shared with the backend.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
Most frequent symptom was a BUG triggering in expire_client, with the
server locking up shortly thereafter.
Introduced by 508dc6e110c6dbdc0bbe84298ccfe22de7538486 "nfsd41:
free_session/free_client must be called under the client_lock".
Cc: stable@kernel.org
Cc: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
When the mount namespace is destroyed on child reaper exit, nsproxy has
already been zeroed by that point, so dereferencing it is invalid.
This patch hard-codes "init_net" for all network namespace references in
the NFS callback services. This will be fixed with proper NFS callback
containerization.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
When adding to the tree modification log, we grab two locks at different
stages. We must not drop the outer lock until we're done with section
protected by the inner lock. This moves the unlock call for the outer lock
to the appropriate position.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
|
To make sense of the tree mod log, the backref walker not only needs
btrfs_search_old_slot, but it also called btrfs_next_leaf, which in turn was
calling btrfs_search_slot. This obviously didn't give the correct result.
This commit adds btrfs_next_old_leaf, a drop-in replacement for
btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly
like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot
with this time_seq parameter.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
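A minimal sketch of the drop-in relationship described (the existing entry
point becomes a time_seq == 0 wrapper):

/* New: honours a tree-mod-log sequence number; time_seq == 0 means
 * "current tree", i.e. the old btrfs_next_leaf() behaviour. */
int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
                        u64 time_seq);

/* Existing entry point, unchanged for all current callers: */
int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path)
{
        return btrfs_next_old_leaf(root, path, 0);
}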
|
|
In __tree_mod_log_oldest_root() we must return the found operation even if
it's not a ROOT_REPLACE operation. Otherwise, the caller assumes that there
are no operations to be rewound and returns immediately.
The code in the caller is modified to improve readability.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
|
get_old_root could race with root node updates because we weren't locking
the node early enough. Use btrfs_read_lock_root_node to grab the root locked
in the very beginning and release the lock as soon as possible (just like
btrfs_search_slot does).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
|
When resolving indirect refs, we used to call btrfs_next_leaf in case we
didn't find an exact match. While we should find exact matches most of the
time, in case we don't, we must continue searching. Treating those matches
differently depending on the level we're searching doesn't make sense.
Even worse, we might end up searching for a key larger than the largest, in
which case there is no next_leaf and subsequent jobs would fail. This commit
drops the bogus lines.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
|
Commit 0a2b9a6ea93 ("X86: integrate CMA with DMA-mapping subsystem")
broke memory allocation with a dma_mask. This patch fixes a possible kernel
oops caused by not resetting the page variable when jumping to the 'again' label.
Reported-by: Konrad Rzeszutek Wilk <konrad@darnok.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
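A generic sketch of the bug class being fixed: a retry label that can reuse a
stale pointer from the previous attempt. This illustrates the pattern only; it
is not the actual dma_generic_alloc_coherent() code, and
alloc_from_contiguous_area() is a hypothetical stand-in for the CMA path.

#include <linux/gfp.h>
#include <linux/io.h>
#include <linux/mm.h>

struct page *alloc_from_contiguous_area(unsigned int order);   /* hypothetical */

static struct page *alloc_below_mask(bool use_cma, unsigned int order,
                                     u64 dma_mask)
{
        struct page *page;

again:
        page = NULL;    /* the missing reset: without it, after freeing a CMA
                         * page that exceeded dma_mask and retrying, the stale
                         * pointer skips the fallback allocation below and the
                         * freed page gets handed back */
        if (use_cma)
                page = alloc_from_contiguous_area(order);
        if (!page)
                page = alloc_pages(GFP_KERNEL, order);
        if (!page)
                return NULL;

        if (page_to_phys(page) + (PAGE_SIZE << order) > dma_mask) {
                __free_pages(page, order);
                if (use_cma) {
                        use_cma = false;        /* fall back to the page allocator */
                        goto again;
                }
                return NULL;
        }
        return page;
}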
|
|
A bunch of bugzillas have complained about how noisy the nmi_watchdog
is during boot-up, especially with its expected failure cases (like
virt and BIOS resource contention).
This is my attempt to quiet them down and keep it less confusing
for the end user. What I did is print the message for cpu0 and
save it for future comparisons. If future cpus have an
identical message as cpu0, then don't print the redundant info.
However, if a future cpu has a different message, happily print
that loudly.
Before the change, you would see something like:
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping 0a
Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
... version: 2
... bit width: 40
... generic registers: 2
... value mask: 000000ffffffffff
... max period: 000000007fffffff
... fixed-purpose events: 3
... event mask: 0000000700000003
NMI watchdog enabled, takes one hw-pmu counter.
Booting Node 0, Processors #1
NMI watchdog enabled, takes one hw-pmu counter.
#2
NMI watchdog enabled, takes one hw-pmu counter.
#3 Ok.
NMI watchdog enabled, takes one hw-pmu counter.
Brought up 4 CPUs
Total of 4 processors activated (22607.24 BogoMIPS).
After the change, it is simplified to:
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping 0a
Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
... version: 2
... bit width: 40
... generic registers: 2
... value mask: 000000ffffffffff
... max period: 000000007fffffff
... fixed-purpose events: 3
... event mask: 0000000700000003
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
Booting Node 0, Processors #1 #2 #3 Ok.
Brought up 4 CPUs
V2: little changes based on Joe Perches' feedback
V3: printk cleanup based on Ingo's feedback; checkpatch fix
V4: keep printk as one long line
V5: Ingo fix ups
Reported-and-tested-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: nzimmer@sgi.com
Cc: joe@perches.com
Link: http://lkml.kernel.org/r/1339594548-17227-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
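A generic sketch of the "print CPU0's message, then only print differences"
pattern described above; the buffer size and function name are illustrative,
not the perf/NMI-watchdog code itself.

#include <linux/kernel.h>
#include <linux/string.h>

static char cpu0_msg[128];                      /* what CPU0 reported */

static void report_watchdog_msg(int cpu, const char *msg)
{
        if (cpu == 0) {
                snprintf(cpu0_msg, sizeof(cpu0_msg), "%s", msg);
                pr_info("NMI watchdog: %s\n", msg);     /* always print for CPU0 */
        } else if (strcmp(cpu0_msg, msg) != 0) {
                /* a different message on a later CPU is printed loudly */
                pr_info("NMI watchdog: CPU%d: %s\n", cpu, msg);
        }
}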
|