| Commit message | Author | Age | Files | Lines |
| |
Add xen_irq_set_priority() to set an event channel's priority. This function
will only work with event channel ABIs that support priority (i.e., the
FIFO-based ABI).
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
Add the hypercall sub-ops and the structures for the shared data used
in the FIFO-based event channel ABI.
The design document for this new ABI is available here:
http://xenbits.xen.org/people/dvrabel/event-channels-H.pdf
In summary, events are reported using a per-domain shared event array
of event words. Each event word has PENDING, LINKED and MASKED bits
and a LINK field for pointing to the next event in the event queue.
There are 16 event queues (with different priorities) per-VCPU.
Key advantages of this new ABI include:
- Support for over 100,000 events (2^17).
- 16 different event priorities.
- Improved fairness in event latency through the use of FIFOs.
The ABI is available in Xen 4.4 and later.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
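As a rough illustration of the event word described above, the sketch below packs the PENDING, MASKED and LINKED flags and a 17-bit LINK field into one 32-bit word. The bit positions and every `evt_*` name are assumptions made for this example; the design document and Xen's public headers are authoritative. Real code would also update the word atomically, which this sketch does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
#define EVT_PENDING_BIT 31u
#define EVT_MASKED_BIT  30u
#define EVT_LINKED_BIT  29u
#define EVT_LINK_BITS   17u                       /* 2^17 = 131072 ports */
#define EVT_LINK_MASK   ((1u << EVT_LINK_BITS) - 1u)

typedef uint32_t event_word_t;

/* Test one flag bit of an event word. */
static inline bool evt_test(event_word_t w, unsigned int bit)
{
    return (w >> bit) & 1u;
}

/* Return a copy of the word with one flag bit set (non-atomic). */
static inline event_word_t evt_set(event_word_t w, unsigned int bit)
{
    return w | (1u << bit);
}

/* Port number of the next event in the queue (meaningful only if LINKED). */
static inline uint32_t evt_link(event_word_t w)
{
    return w & EVT_LINK_MASK;
}

/* Store a new LINK value, leaving the flag bits untouched. */
static inline event_word_t evt_set_link(event_word_t w, uint32_t next_port)
{
    assert(next_port <= EVT_LINK_MASK);
    return (w & ~EVT_LINK_MASK) | next_port;
}

/* An event can be delivered when it is pending and not masked. */
static inline bool evt_deliverable(event_word_t w)
{
    return evt_test(w, EVT_PENDING_BIT) && !evt_test(w, EVT_MASKED_BIT);
}
```

The 17-bit LINK field is what gives the "over 100,000 events" figure: a link can name any of 2^17 ports.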
| |
Remove the check during unbind for NR_EVENT_CHANNELS as this limits
support to less than 4096 ports.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
Refactor the static evtchn_to_irq array to be dynamically allocated by
implementing get and set functions for accesses to the array.
Two new port ops are added: max_channels (maximum supported number of
event channels) and nr_channels (number of currently usable event
channels). For the 2-level ABI, these numbers are both the same as
the shared data structure is a fixed size. For the FIFO ABI, these
will be different as the event array is expanded dynamically.
This allows more than 65000 event channels so an unsigned short is no
longer sufficient for an event channel port number and unsigned int is
used instead.
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
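One way to picture a port-to-IRQ table behind get and set accessors is a two-level array whose rows are allocated on first use. This is a userspace sketch under assumed names and sizes (`port_to_irq`, `ROW_SIZE`, `MAX_PORTS`, `get_port_irq()`, `set_port_irq()` are all hypothetical), not the kernel's actual code. Ports are `unsigned int` because, as the commit notes, an unsigned short cannot cover the larger port range.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sizes chosen for the example. */
#define ROW_SIZE  1024u               /* entries per lazily allocated row */
#define MAX_PORTS (1u << 17)          /* upper bound on port numbers */
#define NR_ROWS   (MAX_PORTS / ROW_SIZE)

static int *port_to_irq[NR_ROWS];     /* rows start out unallocated (NULL) */

/* Return the IRQ recorded for a port, or -1 if there is none. */
static int get_port_irq(unsigned int port)
{
    if (port >= MAX_PORTS || !port_to_irq[port / ROW_SIZE])
        return -1;
    return port_to_irq[port / ROW_SIZE][port % ROW_SIZE];
}

/*
 * Record irq for port, allocating the row on first use.
 * Returns 0 on success, -1 on a bad port or allocation failure.
 */
static int set_port_irq(unsigned int port, int irq)
{
    unsigned int row = port / ROW_SIZE;

    if (port >= MAX_PORTS)
        return -1;

    if (!port_to_irq[row]) {
        if (irq == -1)
            return 0;                 /* clearing an entry in an empty row */

        int *r = malloc(ROW_SIZE * sizeof(*r));
        if (!r)
            return -1;
        for (unsigned int i = 0; i < ROW_SIZE; i++)
            r[i] = -1;                /* every port starts unbound */
        port_to_irq[row] = r;
    }

    port_to_irq[row][port % ROW_SIZE] = irq;
    return 0;
}
```

Because callers only go through the two accessors, the backing storage can grow with the number of usable channels instead of being sized for the maximum up front.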
| |
Add a hook for port-specific setup and call it from
xen_irq_info_common_setup().
The FIFO-based ABIs may need to perform additional setup (expanding
the event array) before a bound event channel can start to receive
events.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
The FIFO-based event ABI requires additional setup of newly bound
events (it may need to expand the event array) and this setup may
fail.
xen_irq_info_common_init() is a useful place to put this setup so
allow this call to fail. This call and the other similar calls are
renamed to be *_setup() to reflect that they may now fail.
This failure can only occur with new event channels, not on rebind.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
evtchn_ops contains the low-level operations that access the shared
data structures. This allows alternate ABIs to be supported.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
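The idea of an ops table can be sketched as a struct of function pointers with one instance per ABI and a pointer selecting the active one. Everything below is an assumption for illustration: the struct layout, the `twol_*` and `fifo_*` names, the channel counts, and the use of -1 as an error value are not taken from the kernel source. An optional hook left as NULL models an operation, such as setting a priority, that only some ABIs support.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table of low-level, ABI-specific operations. */
struct port_ops {
    unsigned int (*max_channels)(void);  /* most ports the ABI can support */
    unsigned int (*nr_channels)(void);   /* ports usable right now */
    int (*set_priority)(unsigned int port, unsigned int prio); /* optional */
};

/* 2-level ABI: fixed-size shared structure, so both counts are equal. */
static unsigned int twol_max(void) { return 4096; }

static const struct port_ops twol_ops = {
    .max_channels = twol_max,
    .nr_channels  = twol_max,
    .set_priority = NULL,                /* no priority support */
};

/* FIFO ABI: the event array grows, so usable < maximum is possible. */
static unsigned int fifo_usable = 1024;

static unsigned int fifo_max(void) { return 1u << 17; }
static unsigned int fifo_nr(void)  { return fifo_usable; }

static int fifo_set_priority(unsigned int port, unsigned int prio)
{
    /* 16 priority levels, numbered 0..15 in this sketch. */
    return (port < fifo_usable && prio < 16) ? 0 : -1;
}

static const struct port_ops fifo_ops = {
    .max_channels = fifo_max,
    .nr_channels  = fifo_nr,
    .set_priority = fifo_set_priority,
};

/* The ABI in use; generic code only ever calls through this pointer. */
static const struct port_ops *active_ops = &twol_ops;

/* Generic wrapper: fails cleanly when the active ABI has no such hook. */
static int port_set_priority(unsigned int port, unsigned int prio)
{
    if (!active_ops->set_priority)
        return -1;
    return active_ops->set_priority(port, prio);
}
```

Switching ABIs is then a single pointer assignment, and callers never touch the shared data structures directly.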
| |
In preparation for alternative event channel ABIs, move all the
functions accessing the shared data structures into their own file.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
events.c will be split into multiple files so move it into its own
directory.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
In preparation for adding event channel port ops, use set_evtchn()
instead of sync_set_bit().
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
In preparation for adding event channel port ops, add
test_and_set_mask().
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
Because the guest-side binding of an event to a VCPU (i.e., setting
the local per-cpu masks) is always explicitly done after an event
channel is bound to a port, there is no need to initialize all
possible events as bound to VCPU 0 at start of day or after a resume.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
These two functions did the same thing with different parameters, so put
the common bits in retrigger_evtchn().
This changes the return value of resend_irq_on_evtchn() but the only
caller (in arch/ia64/xen/irq_xen.c) ignored the return value so this
is fine.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
| |
We can't include <asm/pci_x86.h> if this isn't x86, and we only need
it if CONFIG_PCI_MMCONFIG is enabled.
Fixes: 8deb3eb1461e ('xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved for MCFG areas.')
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
| |
Replace the logic in balloon_first_page() with a direct call to
list_first_entry_or_null(). Since there is only one user of that
routine, we can just remove it.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
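For readers unfamiliar with the helper, the pattern behind list_first_entry_or_null() can be reproduced in userspace. The sketch below is a stand-in, not the kernel's list.h: `first_entry_or_null`, `list_init()`, `list_add_tail_()` and `struct page_stub` are hypothetical names for this example. It returns the first element of a circular doubly linked list, or NULL when the list is empty, which is exactly the check a hand-written wrapper would otherwise do.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's style. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its list member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* First entry of the list, or NULL if the list is empty. */
#define first_entry_or_null(head, type, member) \
    ((head)->next != (head) ? container_of((head)->next, type, member) : NULL)

/* An empty list points at itself in both directions. */
static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

/* Insert n just before the head, i.e. at the tail of the list. */
static void list_add_tail_(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Hypothetical stand-in for a ballooned page on a list. */
struct page_stub {
    int id;
    struct list_head lru;
};
```

With a helper like this, the caller's empty-list test and first-entry lookup collapse into one expression.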
| |
Since c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of
existing RAM pages" the balloon driver appears to work fine on ARM as far as I
can tell. Prior to that commit it was broken because on ARM RAM doesn't
typically start at zero, effectively leaving a big MMIO hole at the start.
This would cause the balloon driver to give away all of RAM at start of day,
which is rather inconvenient.
It was already enabled (or rather not excluded) on ARM64. Commit
c1d15f5c8bc1170dafe16e988e55437245966dfe
"xen/balloon: Seperate the auto-translate logic properly (v2)"
added the proper plumbing to work with ARM and PVH type guests.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v2: Added the bit about PVH]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| |
The user has the option of disabling the platform driver:
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
which is used to unplug the emulated drivers (IDE, Realtek 8169, etc)
and allow the PV drivers to take over. If the user wishes
to disable that they can set:
xen_platform_pci=0
(in the guest config file)
or
xen_emul_unplug=never
(on the Linux command line)
except it does not work properly. The PV drivers still try to
load and since the Xen platform driver is not run - and it
has not initialized the grant tables, most of the PV drivers
stumble upon:
input: Xen Virtual Keyboard as /devices/virtual/input/input5
input: Xen Virtual Pointer as /devices/virtual/input/input6M
------------[ cut here ]------------
kernel BUG at /home/konrad/ssd/konrad/linux/drivers/xen/grant-table.c:1206!
invalid opcode: 0000 [#1] SMP
Modules linked in: xen_kbdfront(+) xenfs xen_privcmd
CPU: 6 PID: 1389 Comm: modprobe Not tainted 3.13.0-rc1upstream-00021-ga6c892b-dirty #1
Hardware name: Xen HVM domU, BIOS 4.4-unstable 11/26/2013
RIP: 0010:[<ffffffff813ddc40>] [<ffffffff813ddc40>] get_free_entries+0x2e0/0x300
Call Trace:
[<ffffffff8150d9a3>] ? evdev_connect+0x1e3/0x240
[<ffffffff813ddd0e>] gnttab_grant_foreign_access+0x2e/0x70
[<ffffffffa0010081>] xenkbd_connect_backend+0x41/0x290 [xen_kbdfront]
[<ffffffffa0010a12>] xenkbd_probe+0x2f2/0x324 [xen_kbdfront]
[<ffffffff813e5757>] xenbus_dev_probe+0x77/0x130
[<ffffffff813e7217>] xenbus_frontend_dev_probe+0x47/0x50
[<ffffffff8145e9a9>] driver_probe_device+0x89/0x230
[<ffffffff8145ebeb>] __driver_attach+0x9b/0xa0
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145cf1c>] bus_for_each_dev+0x8c/0xb0
[<ffffffff8145e7d9>] driver_attach+0x19/0x20
[<ffffffff8145e260>] bus_add_driver+0x1a0/0x220
[<ffffffff8145f1ff>] driver_register+0x5f/0xf0
[<ffffffff813e55c5>] xenbus_register_driver_common+0x15/0x20
[<ffffffff813e76b3>] xenbus_register_frontend+0x23/0x40
[<ffffffffa0015000>] ? 0xffffffffa0014fff
[<ffffffffa001502b>] xenkbd_init+0x2b/0x1000 [xen_kbdfront]
[<ffffffff81002049>] do_one_initcall+0x49/0x170
.. snip..
which is hardly nice. This patch fixes this by having each
PV driver check for:
- if running in PV, then it is fine to execute (as that is their
native environment).
- if running in HVM, check if user wanted 'xen_emul_unplug=never',
in which case bail out and don't load any PV drivers.
- if running in HVM, and if PCI device 5853:0001 (xen_platform_pci)
does not exist, then bail out and do not load PV drivers.
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=ide-disks',
then bail out for all PV devices _except_ the block one.
Ditto for the network one ('nics').
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=unnecessary'
then load block PV driver, and also setup the legacy IDE paths.
In (v3) make it actually load PV drivers.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Reported-and-Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Add extra logic to handle the myriad ways 'xen_emul_unplug'
can be used per Ian and Stefano's suggestion]
[v3: Make the unnecessary case work properly]
[v4: s/disks/ide-disks/ spotted by Fabio]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [for PCI parts]
CC: stable@vger.kernel.org
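The checks listed in the commit above can be condensed into a single decision function. This is only a sketch: the enums, the function name `should_load_pv_driver()`, and the order in which the HVM checks run are assumptions, and it models one `xen_emul_unplug` value at a time rather than the full option syntax.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for the example. */
enum domain_type { DOMAIN_PV, DOMAIN_HVM };
enum pv_device   { DEV_BLOCK, DEV_NET, DEV_OTHER };
enum unplug_opt {
    UNPLUG_DEFAULT,      /* option not given */
    UNPLUG_NEVER,        /* xen_emul_unplug=never */
    UNPLUG_IDE_DISKS,    /* xen_emul_unplug=ide-disks */
    UNPLUG_NICS,         /* xen_emul_unplug=nics */
    UNPLUG_UNNECESSARY,  /* xen_emul_unplug=unnecessary */
};

/* Decide whether a PV frontend driver for dev should load. */
static bool should_load_pv_driver(enum domain_type dom,
                                  bool platform_dev_present,
                                  enum unplug_opt opt,
                                  enum pv_device dev)
{
    /* PV guests: PV drivers are the native environment. */
    if (dom == DOMAIN_PV)
        return true;

    /* HVM: the user asked that nothing be unplugged. */
    if (opt == UNPLUG_NEVER)
        return false;

    /* HVM: no platform device (5853:0001), so nothing set up for us. */
    if (!platform_dev_present)
        return false;

    switch (opt) {
    case UNPLUG_IDE_DISKS:
        return dev == DEV_BLOCK;   /* only the block frontend */
    case UNPLUG_NICS:
        return dev == DEV_NET;     /* only the network frontend */
    default:
        return true;               /* includes 'unnecessary' */
    }
}
```

Each frontend would consult such a check at probe or init time and bail out early instead of reaching the grant-table code uninitialized.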
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
"There is a small EFI fix and a big power regression fix in this batch.
My queue also had a fix for downing a CPU when there is an insufficient
number of IRQ vectors available, but I'm holding that one for now due
to recent bug reports"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Don't select EFI from certain special ACPI drivers
x86 idle: Repair large-server 50-watt idle-power regression
| | |
Commit 7ea6c6c1 ("Move cper.c from drivers/acpi/apei to
drivers/firmware/efi") results in CONFIG_EFI being enabled even
when the user doesn't want this. Since ACPI APEI used to build
fine without UEFI (and as far as I know also has no functional
dependency on it), at least in that case using a reverse dependency
is wrong (and a straight one isn't needed).
Whether the same is true for ACPI_EXTLOG I don't know - if there
is a functional dependency, it should depend on EFI rather than
selecting it. It certainly has (currently) no build dependency.
Adjust Kconfig and build logic so that the bad dependency gets
avoided.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/52AF1EBC020000780010DBF9@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | |
Linux 3.10 changed the timing of how thread_info->flags is touched:
x86: Use generic idle loop
(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.
Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.
Here we extend an erratum workaround from the Core2 EX "Dunnington"
to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.
While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.
Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com
Signed-off-by: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes and new device IDs from Rafael Wysocki:
- Fix for a cpufreq regression causing stale sysfs files to be left
behind during system resume if cpufreq_add_dev() fails for one or
more CPUs from Viresh Kumar.
- Fix for a bug in cpufreq causing CONFIG_CPU_FREQ_DEFAULT_* to be
ignored when the intel_pstate driver is used from Jason Baron.
- System suspend fix for a memory leak in pm_vt_switch_unregister()
that forgot to release objects after removing them from
pm_vt_switch_list. From Masami Ichikawa.
- Intel Valley View device ID and energy unit encoding update for the
(recently added) Intel RAPL (Running Average Power Limit) driver from
Jacob Pan.
- Intel Bay Trail SoC GPIO and ACPI device IDs for the Low Power
Subsystem (LPSS) ACPI driver from Paul Drews.
* tag 'pm+acpi-3.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
powercap / RAPL: add support for ValleyView Soc
PM / sleep: Fix memory leak in pm_vt_switch_unregister().
cpufreq: Use CONFIG_CPU_FREQ_DEFAULT_* to set initial policy for setpolicy drivers
cpufreq: remove sysfs files for CPUs which failed to come back after resume
ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs
| | | | | |
* powercap:
powercap / RAPL: add support for ValleyView Soc
* acpi-lpss:
ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs
| | | | | |
This adds the new ACPI ID (INT33FC) for the BayTrail GPIO
banks as seen on a BayTrail M System-On-Chip platform. This
ACPI ID is used by the BayTrail GPIO (pinctrl) driver to
manage the Low Power Subsystem (LPSS).
Signed-off-by: Paul Drews <paul.drews@intel.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | | | |
This patch adds support for RAPL on Intel ValleyView based SoC
platforms, such as Baytrail.
Besides adding CPU ID, special energy unit encoding is handled
for ValleyView.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | | | | |
* pm-cpufreq:
cpufreq: Use CONFIG_CPU_FREQ_DEFAULT_* to set initial policy for setpolicy drivers
cpufreq: remove sysfs files for CPUs which failed to come back after resume
* pm-sleep:
PM / sleep: Fix memory leak in pm_vt_switch_unregister().
| | | | | |
drivers
When configuring a default governor (via CONFIG_CPU_FREQ_DEFAULT_*) with the
intel_pstate driver, the desired default policy is not properly set. For
example, setting 'CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE' ends up with the
'powersave' policy being set.
Fix by configuring the correct default policy, if either 'powersave' or
'performance' is requested. Otherwise, fall back to what the driver originally
set via its 'init' routine.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
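The selection rule described in the fix can be sketched as a small function. The enum, the function name `initial_policy()`, and passing the configured default governor as a string are assumptions for this example, not the cpufreq core's interface.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy values for a setpolicy-style driver. */
enum policy { POLICY_POWERSAVE, POLICY_PERFORMANCE };

/*
 * Pick the initial policy: honour a configured default governor of
 * "performance" or "powersave"; for anything else (or no default),
 * keep whatever the driver's init routine chose.
 */
static enum policy initial_policy(const char *default_gov,
                                  enum policy driver_init_policy)
{
    if (default_gov && strcmp(default_gov, "performance") == 0)
        return POLICY_PERFORMANCE;
    if (default_gov && strcmp(default_gov, "powersave") == 0)
        return POLICY_POWERSAVE;
    return driver_init_policy;
}
```

Under this rule a "performance" default overrides a driver that initialised itself to powersave, which is the case the commit describes as broken.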
| | | | |
There are cases where cpufreq_add_dev() may fail for some CPUs
during system resume. With the current code we will still have
sysfs cpufreq files for those CPUs and struct cpufreq_policy
would be already freed for them. Hence any operation on those
sysfs files would result in kernel warnings.
Example of problems resulting from resume errors (from Bjørn Mork):
WARNING: CPU: 0 PID: 6055 at fs/sysfs/file.c:343 sysfs_open_file+0x77/0x212()
missing sysfs attribute operations for kobject: (null)
Modules linked in: [stripped as irrelevant]
CPU: 0 PID: 6055 Comm: grep Tainted: G D 3.13.0-rc2 #153
Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
0000000000000009 ffff8802327ebb78 ffffffff81380b0e 0000000000000006
ffff8802327ebbc8 ffff8802327ebbb8 ffffffff81038635 0000000000000000
ffffffff811823c7 ffff88021a19e688 ffff88021a19e688 ffff8802302f9310
Call Trace:
[<ffffffff81380b0e>] dump_stack+0x55/0x76
[<ffffffff81038635>] warn_slowpath_common+0x7c/0x96
[<ffffffff811823c7>] ? sysfs_open_file+0x77/0x212
[<ffffffff810386e3>] warn_slowpath_fmt+0x41/0x43
[<ffffffff81182dec>] ? sysfs_get_active+0x6b/0x82
[<ffffffff81182382>] ? sysfs_open_file+0x32/0x212
[<ffffffff811823c7>] sysfs_open_file+0x77/0x212
[<ffffffff81182350>] ? sysfs_schedule_callback+0x1ac/0x1ac
[<ffffffff81122562>] do_dentry_open+0x17c/0x257
[<ffffffff8112267e>] finish_open+0x41/0x4f
[<ffffffff81130225>] do_last+0x80c/0x9ba
[<ffffffff8112dbbd>] ? inode_permission+0x40/0x42
[<ffffffff81130606>] path_openat+0x233/0x4a1
[<ffffffff81130b7e>] do_filp_open+0x35/0x85
[<ffffffff8113b787>] ? __alloc_fd+0x172/0x184
[<ffffffff811232ea>] do_sys_open+0x6b/0xfa
[<ffffffff811233a7>] SyS_openat+0xf/0x11
[<ffffffff8138c812>] system_call_fastpath+0x16/0x1b
To fix this, remove those sysfs files or put the associated kobject
in case of such errors. Also, to make it simple, remove the cpufreq
sysfs links from all the CPUs (except for the policy->cpu) during
suspend, as that operation won't result in a loss of sysfs file
permissions and we can create those links during resume just fine.
Fixes: 5302c3fb2e62 ("cpufreq: Perform light-weight init/teardown during suspend/resume")
Reported-and-tested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 3.12+ <stable@vger.kernel.org> # 3.12+
[rjw: Changelog]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | | | | |
Pull block fixes from Jens Axboe:
- fix for a memory leak on certain unplug events
- a collection of bcache fixes from Kent and Nicolas
- a few null_blk fixes and updates from Matias
- marking of functions as static in the stec PCIe driver
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: support submit_queues on use_per_node_hctx
null_blk: set use_per_node_hctx param to false
null_blk: corrections to documentation
null_blk: warning on ignored submit_queues param
null_blk: refactor init and init errors code paths
null_blk: documentation
null_blk: mem garbage on NUMA systems during init
drivers: block: Mark the functions as static in skd_main.c
bcache: New writeback PD controller
bcache: bugfix for race between moving_gc and bucket_invalidate
bcache: fix for gc and writeback race
bcache: bugfix - moving_gc now moves only correct buckets
bcache: fix for gc crashing when no sectors are used
bcache: Fix heap_peek() macro
bcache: Fix for can_attach_cache()
bcache: Fix dirty_data accounting
bcache: Use uninterruptible sleep in writeback
bcache: kthread don't set writeback task to INTERUPTIBLE
block: fix memory leaks on unplugging block device
bcache: fix sparse non static symbol warning
| | | | | |
When both the submit_queues param and the use_per_node_hctx param are
used, we limit the number of submit_queues to the number of online
nodes.
If submit_queues is a multiple of nr_online_nodes, it's trivial. Simply map
them to the nodes. For example: 8 submit queues are mapped as node0[0,1],
node1[2,3], ...
If uneven, we are left with an uneven number of submit_queues that must be
mapped. These are mapped toward the first node and onward. E.g. 5
submit queues mapped onto 4 nodes are mapped as node0[0,1], node1[2], ...
Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
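The mapping in the two examples above can be reproduced with a simple split: every node gets the integer share of queues, and the remainder is handed out one queue at a time starting from the first node. The function name `map_queues_to_nodes()` and its array-based interface are hypothetical; this is a sketch of the arithmetic, not the driver's code.

```c
#include <assert.h>

/*
 * Fill node_of[q] with the node index for each of nr_queues queues.
 * Queues are assigned in order; the first (nr_queues % nr_nodes)
 * nodes each receive one extra queue.
 */
static void map_queues_to_nodes(unsigned int nr_queues,
                                unsigned int nr_nodes,
                                unsigned int *node_of)
{
    assert(nr_nodes > 0);

    unsigned int base  = nr_queues / nr_nodes;  /* share of every node */
    unsigned int extra = nr_queues % nr_nodes;  /* leftover queues */
    unsigned int q = 0;

    for (unsigned int node = 0; node < nr_nodes; node++) {
        unsigned int count = base + (node < extra ? 1u : 0u);

        for (unsigned int i = 0; i < count; i++)
            node_of[q++] = node;
    }
}
```

For 8 queues on 4 nodes this gives two queues per node; for 5 queues on 4 nodes, node 0 takes queues 0 and 1 and the remaining nodes take one queue each.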
| | | | | |
The default for the module is to instantiate itself with blk-mq and a
submit queue for each CPU node in the system.
To save resources, initialize instead with a single submit queue.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | | | | |
Let the user know when the number of submission queues is being
ignored.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | | | | |
Simplify the initialization logic of the three block-layers.
- The queue initialization is split into two parts. This allows reuse of
code when initializing the sq-, bio- and mq-based layers.
- Set submit_queues default value to 0 and always set it at init time.
- Simplify the init error code paths.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | | | | |
On NUMA systems, when initializing the blk-mq layer with per-node hctx,
we initialize submit queues to 1, while blk-mq nr_hw_queues is
initialized to the number of NUMA nodes.
This makes the null_init_hctx function overwrite memory outside of what
it allocated. In my case it led to writing garbage into struct
request_queue's mq_map.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | | | | |
Mark functions skd_skmsg_state_to_str() and skd_skreq_state_to_str() as
static in skd_main.c because they are not used outside this file.
This eliminates the following warnings in skd_main.c:
drivers/block/skd_main.c:5272:13: warning: no previous prototype for ‘skd_skmsg_state_to_str’ [-Wmissing-prototypes]
drivers/block/skd_main.c:5284:13: warning: no previous prototype for ‘skd_skreq_state_to_str’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | | | | | |
into for-linus
Kent writes:
Jens - small pile of bcache fixes. I've been slacking on the writeback
fixes but those definitely need to get into 3.13.
| | | | | | |
The old writeback PD controller could get into states where it had throttled all
the way down and take way too long to recover - it was too complicated to really
understand what it was doing.
This rewrites a good chunk of it to hopefully be simpler and make more sense,
and it also pays more attention to units which should make the behaviour a bit
easier to understand.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
There is a possibility for a bucket to be invalidated by the allocator
while moving_gc was copying its contents to another bucket, if the
bucket only held cached data. To prevent this, moving_gc checks for
a stale ptr (to an invalidated bucket) before and after reads.
If it finds one, it simply skips moving that data. This only
affects bcache if moving_gc was turned on; note that it's
off by default.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
The garbage collector needs to check keys in the writeback keybuf to
make sure it's not invalidating buckets to which the writeback
keys point.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
Removed gc_move_threshold because picking buckets only by
threshold could lead to moving extra buckets (i.e., if there are
buckets at the threshold that aren't supposed to be moved
due to space considerations).
This is replaced by a GC_MOVE bit in the gc_mark bitmask.
Now only marked buckets get moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.
NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there are other accounting fixes
necessary that got squashed in with other patches; the full patch against 3.10
is 408cc2f47eeac93a, available at:
git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a46036..4a12b2f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree *b, struct bkey *insert,
if (KEY_START(k) > KEY_START(insert) + sectors_found)
goto check_failed;
- if (KEY_PTRS(replace_key) != KEY_PTRS(k))
+ if (KEY_PTRS(k) != KEY_PTRS(replace_key) ||
+ KEY_DIRTY(k) != KEY_DIRTY(replace_key))
goto check_failed;
/* skip past gen */
| | | | | | |
We're just waiting on kthread_should_stop(), nothing else, so
interruptible sleep was wrong here.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
at the beginning (schedule_timeout_interruptible) and others
do this on their own
This prevents wrong load average calculation (load of 1 per thread)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | |
Fixes the following sparse warning:
drivers/md/bcache/btree.c:2220:5: warning:
symbol 'btree_insert_fn' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
"There's one interesting commit - "libata, freezer: avoid block device
removal while system is frozen". It's an ugly hack working around a
deadlock condition between driver core resume and block layer device
removal paths through freezer which was made more reproducible by
writeback being converted to workqueue some releases ago. The bug has
nothing to do with libata but it's just a workaround which is easy to
backport. After discussion, Rafael and I seem to agree that we don't
really need kernel freezables - both kthread and workqueue. There are
a few specific workqueues which constitute PM operations and require
freezing, which will be converted to use workqueue_set_max_active()
instead. All other kernel freezer uses are planned to be removed,
followed by the removal of kthread and workqueue freezer support,
hopefully.
Others are device-specific fixes. The most notable is the addition of
NO_NCQ_TRIM which is used to disable queued TRIM commands to Micron
M500 SSDs which otherwise suffer data corruption"
* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
libata, freezer: avoid block device removal while system is frozen
libata: implement ATA_HORKAGE_NO_NCQ_TRIM and apply it to Micro M500 SSDs
libata: disable a disk via libata.force params
ahci: bail out on ICH6 before using AHCI BAR
ahci: imx: Explicitly clear IMX6Q_GPR13_SATA_MPLL_CLK_EN
libata: add ATA_HORKAGE_BROKEN_FPDMA_AA quirk for Seagate Momentus SpinPoint M8
|
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.
During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.
839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.
I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(
v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.
v3: Comment updated and polling interval changed to 10ms as suggested
by Rafael.
v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
defined when FREEZER is not configured thus breaking build.
Reported by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Cc: kbuild test robot <fengguang.wu@intel.com>
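
The workaround described above (hold off block device hotplug while the system is frozen, polling every 10ms) can be sketched as a userspace model. This is only an illustration under stated assumptions: model_frozen_polls_left, model_pm_freezing() and model_hotplug() are hypothetical names standing in for the kernel's pm_freezing flag and libata's hotplug path, and the sleep is accounted for rather than performed.

```c
#include <stdbool.h>

/* Simulated freezer state: the system counts as frozen for this many
 * more polls. Hypothetical stand-in for the kernel's pm_freezing. */
int model_frozen_polls_left;

bool model_pm_freezing(void)
{
	if (model_frozen_polls_left > 0) {
		model_frozen_polls_left--;
		return true;
	}
	return false;
}

/* Defer the hotplug work until the system is no longer frozen.
 * Returns the simulated time spent waiting, in milliseconds. */
unsigned model_hotplug(bool *did_hotplug)
{
	unsigned waited_ms = 0;

	/* In the kernel this loop would sit under #ifdef CONFIG_FREEZER
	 * and sleep 10ms per iteration; the model only adds up the time. */
	while (model_pm_freezing())
		waited_ms += 10;

	*did_hotplug = true;  /* the device add/remove work would run here */
	return waited_ms;
}
```

The point of the design is that the hotplug work never starts while freezable kthreads and workqueues are still frozen, so block device removal cannot end up waiting on them during resume.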
|
Certain drives cannot handle queued TRIM commands properly, even
though support is indicated in the IDENTIFY DEVICE buffer. This patch
allows for disabling the commands for the affected drives and applies it
to the Micron/Crucial M500 SSDs which exhibit incorrect protocol
behavior when issued queued TRIM commands, which could lead to silent
data corruption.
tj: Merged two unnecessarily split patches and made minor edits
including shortening horkage name.
Signed-off-by: Marc Carino <marc.ceeeee@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/1387246554-7311-1-git-send-email-marc.ceeeee@gmail.com
Cc: stable@vger.kernel.org # 3.12+
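
The quirk mechanism described above can be sketched as a lookup table consulted before queued TRIM is used. This is a minimal userspace model, not libata's implementation: the flag value, the model_quirk struct, both function names, and the model-string prefix are hypothetical, and the real blacklist entries and matching rules may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_HORKAGE_NO_NCQ_TRIM (1u << 0)  /* hypothetical flag value */

struct model_quirk {
	const char *model_prefix;  /* matched against the start of the model string */
	unsigned flags;
};

/* Illustrative entry only; the real blacklist strings may differ. */
static const struct model_quirk model_quirks[] = {
	{ "Micron_M500", MODEL_HORKAGE_NO_NCQ_TRIM },
};

/* Returns the quirk flags for a drive model, or 0 if it is not listed. */
unsigned model_lookup_quirks(const char *model)
{
	size_t n = sizeof(model_quirks) / sizeof(model_quirks[0]);

	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(model_quirks[i].model_prefix);

		if (strncmp(model, model_quirks[i].model_prefix, len) == 0)
			return model_quirks[i].flags;
	}
	return 0;
}

/* Queued TRIM is used only if the drive advertises it in its IDENTIFY
 * data and the quirk table does not forbid it. */
bool model_use_queued_trim(const char *model, bool identify_says_supported)
{
	if (!identify_says_supported)
		return false;
	return !(model_lookup_quirks(model) & MODEL_HORKAGE_NO_NCQ_TRIM);
}
```

A listed drive falls back to non-queued TRIM even though it advertises support, while unlisted drives keep the behaviour their IDENTIFY data indicates.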
|