summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'dm-3.16-changes' of ↵Linus Torvalds2014-06-129-117/+230
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: "This pull request is later than I'd have liked because I was waiting for some performance data to help finally justify sending the long-standing dm-crypt cpu scalability improvements upstream. Unfortunately we came up short, so those dm-crypt changes will continue to wait, but it seems we're not far off. . Add dm_accept_partial_bio interface to DM core to allow DM targets to only process a portion of a bio, the remainder being sent in the next bio. This enables the old dm snapshot-origin target to only split write bios on chunk boundaries, read bios are now sent to the origin device unchanged. . Add DM core support for disabling WRITE SAME if the underlying SCSI layer disables it due to command failure. . Reduce lock contention in DM's bio-prison. . A few small cleanups and fixes to dm-thin and dm-era" * tag 'dm-3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm thin: update discard_granularity to reflect the thin-pool blocksize dm bio prison: implement per bucket locking in the dm_bio_prison hash table dm: remove symbol export for dm_set_device_limits dm: disable WRITE SAME if it fails dm era: check for a non-NULL metadata object before closing it dm thin: return ENOSPC instead of EIO when error_if_no_space enabled dm thin: cleanup noflush_work to use a proper completion dm snapshot: do not split read bios sent to snapshot-origin target dm snapshot: allocate a per-target structure for snapshot-origin target dm: introduce dm_accept_partial_bio dm: change sector_count member in clone_info from sector_t to unsigned
| * dm thin: update discard_granularity to reflect the thin-pool blocksizeLukas Czerner2014-06-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DM thinp already checks whether the discard_granularity of the data device is a factor of the thin-pool block size. But when using the dm-thin-pool's discard passdown support, DM thinp was not selecting the max of the underlying data device's discard_granularity and the thin-pool's block size. Update set_discard_limits() to set discard_granularity to the max of these values. This enables blkdev_issue_discard() to properly align the discards that are sent to the DM thin device on a full block boundary. As such each discard will now cover an entire DM thin-pool block and the block will be reclaimed. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
| * dm bio prison: implement per bucket locking in the dm_bio_prison hash tableHeinz Mauelshagen2014-06-111-27/+39
| | | | | | | | | | | | | | | | | | | | Split the single per bio-prison lock by using per bucket locking. Per bucket locking benefits both dm-thin and dm-cache targets by reducing bio-prison lock contention. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: remove symbol export for dm_set_device_limitsMike Snitzer2014-06-043-11/+3
| | | | | | | | | | | | | | There is no need for code other than DM core to use dm_set_device_limits so remove its EXPORT_SYMBOL_GPL. Also, cleanup a couple whitespace nits. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: disable WRITE SAME if it failsMike Snitzer2014-06-042-10/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add DM core support for disabling WRITE SAME on first failure to both request-based and bio-based targets. The need to disable WRITE SAME stems from SCSI enabling it by default but then disabling it when it fails. When SCSI does this it returns "permanent target failure, do not retry" using -EREMOTEIO. Update DM core to only disable WRITE SAME on failure if the returned error is -EREMOTEIO. Commit f84cb8a4 ("dm mpath: disable WRITE SAME if it fails") implemented multipath specific disabling of WRITE SAME if it fails. However, as that commit detailed, the multipath-only solution doesn't go far enough if bio-based DM targets are stacked ontop of the request-based dm-multipath target (as is commonly done using dm-linear to support partitions on multipath devices, via kpartx). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Alex Chen <alex.chen@huawei.com>
| * dm era: check for a non-NULL metadata object before closing itJoe Thornber2014-06-031-1/+2
| | | | | | | | | | | | | | | | | | | | era_ctr() may call era_destroy() before era->md is initialized so era_destory() must only close the metadata object if it is not NULL. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Naohiro Aota <naota@elisp.net> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.15+
| * dm thin: return ENOSPC instead of EIO when error_if_no_space enabledMike Snitzer2014-06-033-17/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | Update the DM thin provisioning target's allocation failure error to be consistent with commit a9d6ceb8 ("[SCSI] return ENOSPC on thin provisioning failure"). The DM thin target now returns -ENOSPC rather than -EIO when block allocation fails due to the pool being out of data space (and the 'error_if_no_space' thin-pool feature is enabled). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-By: Joe Thornber <ejt@redhat.com>
| * dm thin: cleanup noflush_work to use a proper completionJoe Thornber2014-06-031-18/+34
| | | | | | | | | | | | | | | | | | | | Factor out a pool_work interface that noflush_work makes use of to wait for and complete work items (in terms of a proper completion struct). Allows discontinuing the use of a custom completion in terms of atomic_t and wait_event. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm snapshot: do not split read bios sent to snapshot-origin targetMikulas Patocka2014-06-031-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | Change the snapshot-origin target so that only write bios are split on chunk boundary. Read bios are passed unchanged to the underlying device, so they don't have to be split. Later, we could change the target so that it accepts a larger write bio if it spans an area that is completely covered by snapshot exceptions. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm snapshot: allocate a per-target structure for snapshot-origin targetMikulas Patocka2014-06-031-18/+35
| | | | | | | | | | | | | | | | | | Allocate a per-target dm_origin structure. This is a prerequisite for the next commit ("dm snapshot: do not split read bios sent to snapshot-origin target") which adds a new member to this structure. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: introduce dm_accept_partial_bioMikulas Patocka2014-06-032-8/+53
| | | | | | | | | | | | | | | | | | | | The function dm_accept_partial_bio allows the target to specify how many sectors of the current bio it will process. If the target only wants to accept part of the bio, it calls dm_accept_partial_bio and the DM core sends the rest of the data in next bio. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: change sector_count member in clone_info from sector_t to unsignedMikulas Patocka2014-06-031-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | It is impossible to create bios with 2^23 or more sectors (the size is stored as a 32-bit byte count in the bio). So we convert some sector_t values to unsigned integers. This is needed for the next commit ("dm: introduce dm_accept_partial_bio") that replaces integer value arguments with pointers, so the size of the integer must match. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | Merge tag 'pci-v3.16-changes-2' of ↵Linus Torvalds2014-06-1266-1039/+1065
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull more PCI updates from Bjorn Helgaas: "Here are some more things I'd like to see in v3.16-rc1: - DMA alias iterator, part of some work to fix IOMMU issues - MVEBU, Tegra, DesignWare changes that I forgot to include before - Some whitespace code cleanup Details: IOMMU - Add DMA alias iterator (Alex Williamson) - Add DMA alias quirks for ASMedia, ITE, Tundra bridges (Alex Williamson) - Add DMA alias quirks for Marvell, Ricoh devices (Alex Williamson) - Add DMA alias quirk for HighPoint devices (Jérôme Carretero) MSI - Fix leak in free_msi_irqs() (Alexei Starovoitov) Marvell MVEBU - Remove unnecessary use of 'conf_lock' spinlock (Andrew Murray) - Avoid setting an undefined window size (Jason Gunthorpe) - Allow several windows with the same target/attribute (Thomas Petazzoni) - Split PCIe BARs into multiple MBus windows when needed (Thomas Petazzoni) - Fix off-by-one in the computed size of the mbus windows (Willy Tarreau) NVIDIA Tegra - Use new OF interrupt mapping when possible (Lucas Stach) Synopsys DesignWare - Remove unnecessary use of 'conf_lock' spinlock (Andrew Murray) - Use new OF interrupt mapping when possible (Lucas Stach) - Split Exynos and i.MX bindings (Lucas Stach) - Fix comment for setting number of lanes (Mohit Kumar) - Fix iATU programming for cfg1, io and mem viewport (Mohit Kumar) Miscellaneous - EXPORT_SYMBOL cleanup (Ryan Desfosses) - Whitespace cleanup (Ryan Desfosses) - Merge multi-line quoted strings (Ryan Desfosses)" * tag 'pci-v3.16-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (21 commits) PCI: Add function 1 DMA alias quirk for HighPoint RocketRaid 642L PCI/MSI: Fix memory leak in free_msi_irqs() PCI: Merge multi-line quoted strings PCI: Whitespace cleanup PCI: Move EXPORT_SYMBOL so it immediately follows function/variable PCI: Add bridge DMA alias quirk for ITE bridge PCI: designware: Split Exynos and i.MX bindings PCI: Add bridge DMA alias quirk for ASMedia and Tundra bridges PCI: Add support for PCIe-to-PCI bridge DMA alias quirks PCI: Add function 1 DMA alias quirk for Marvell devices PCI: Add function 0 DMA alias quirk for Ricoh devices PCI: Add support for DMA alias quirks PCI: Convert pci_dev_flags definitions to bit shifts PCI: Add DMA alias iterator PCI: mvebu: Use '%pa' for printing 'phys_addr_t' type PCI: mvebu: Remove unnecessary use of 'conf_lock' spinlock PCI: designware: Remove unnecessary use of 'conf_lock' spinlock PCI: designware: Use new OF interrupt mapping when possible PCI: designware: Fix iATU programming for cfg1, io and mem viewport PCI: designware: Fix comment for setting number of lanes ...
| | \
| | \
| | \
| | \
| *---. \ Merge branches 'pci/msi', 'pci/iommu' and 'pci/cleanup' into nextBjorn Helgaas2014-06-1158-951/+780
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * pci/msi: PCI/MSI: Fix memory leak in free_msi_irqs() * pci/iommu: PCI: Add function 1 DMA alias quirk for HighPoint RocketRaid 642L PCI: Add bridge DMA alias quirk for ITE bridge * pci/cleanup: PCI: Merge multi-line quoted strings PCI: Whitespace cleanup PCI: Move EXPORT_SYMBOL so it immediately follows function/variable
| | | | * | PCI: Merge multi-line quoted stringsRyan Desfosses2014-06-1128-249/+169
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge quoted strings that are broken across lines into a single entity. The compiler merges them anyway, but checkpatch complains about it, and merging them makes it easier to grep for strings. No functional change. [bhelgaas: changelog, do the same for everything under drivers/pci] Signed-off-by: Ryan Desfosses <ryan@desfo.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | | | * | PCI: Whitespace cleanupRyan Desfosses2014-06-1145-588/+517
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix various whitespace errors. No functional change. [bhelgaas: fix other similar problems] Signed-off-by: Ryan Desfosses <ryan@desfo.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | | | * | PCI: Move EXPORT_SYMBOL so it immediately follows function/variableRyan Desfosses2014-06-1012-113/+89
| | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move EXPORT_SYMBOL so it immediately follows the function or variable. No functional change. [bhelgaas: squash similar changes, fix hotplug, probe, rom, search, too] Signed-off-by: Ryan Desfosses <ryan@desfo.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | | * | PCI: Add function 1 DMA alias quirk for HighPoint RocketRaid 642LJérôme Carretero2014-06-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This device uses function 1 as the PCIe requester ID. This vendor has similar boards based on the same Marvell 88SE9235 chipset, but this patch was only tested with the 642L. Tested on ASUS Sabertooth 990FX (AMD). Link: https://bugzilla.kernel.org/show_bug.cgi?id=42679 Signed-off-by: Jérôme Carretero <cJ-ko@zougloub.eu> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Williamson <alex.williamson@redhat.com>
| | | * | PCI: Add bridge DMA alias quirk for ITE bridgeAlex Williamson2014-06-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ITE 8892 is a PCIe-to-PCI bridge but doesn't have a PCIe capability. Quirk it so we can figure out the DMA alias for devices below the bridge, so they work correctly with an IOMMU. [bhelgaas: add changelog] Link: https://bugzilla.kernel.org/show_bug.cgi?id=73551 Reported-by: Ronald <rwarsow@gmx.de> Tested-by: Ronald <rwarsow@gmx.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | PCI/MSI: Fix memory leak in free_msi_irqs()Alexei Starovoitov2014-06-111-1/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | free_msi_irqs() is leaking memory, since list_for_each_entry(entry, &dev->msi_list, list) {...} is never executed, because dev->msi_list is made empty by the loop just above this one. Fix it by relying on zero termination of attribute array like populate_msi_sysfs() does. Fixes: 1c51b50c2995 ("PCI/MSI: Export MSI mode using attributes, not kobjects") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: stable@vger.kernel.org # v3.14+
| | | |
| | \ \
| | \ \
| | \ \
| | \ \
| | \ \
| *-----. \ \ Merge branches 'pci/host-designware', 'pci/host-imx6', 'pci/host-mvebu' and ↵Bjorn Helgaas2014-06-0310-115/+222
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'pci/host-tegra' into next * pci/host-designware: PCI: designware: Remove unnecessary use of 'conf_lock' spinlock PCI: designware: Use new OF interrupt mapping when possible PCI: designware: Fix iATU programming for cfg1, io and mem viewport PCI: designware: Fix comment for setting number of lanes * pci/host-imx6: PCI: designware: Split Exynos and i.MX bindings * pci/host-mvebu: PCI: mvebu: Use '%pa' for printing 'phys_addr_t' type PCI: mvebu: Remove unnecessary use of 'conf_lock' spinlock PCI: mvebu: split PCIe BARs into multiple MBus windows when needed bus: mvebu-mbus: allow several windows with the same target/attribute bus: mvebu-mbus: Avoid setting an undefined window size PCI: mvebu: fix off-by-one in the computed size of the mbus windows * pci/host-tegra: PCI: tegra: Use new OF interrupt mapping when possible
| | | | | * | | PCI: tegra: Use new OF interrupt mapping when possibleLucas Stach2014-04-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use new OF interrupt mapping (of_irq_parse_and_map_pci()) when possible. This is the recommended method of doing the IRQ mapping. For old devicetrees we fall back to the previous practice. This allows interrupts to be remapped across bridges. Tested-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
| | | | * | | | PCI: mvebu: Use '%pa' for printing 'phys_addr_t' typeFabio Estevam2014-04-291-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following build warning that happens when building multi_v7_defconfig with CONFIG_ARM_LPAE=y: drivers/pci/host/pci-mvebu.c:334:5: warning: format '%x' expects argument of type 'unsigned int', but argument 3 has type 'phys_addr_t' [-Wformat=] Fix the warning by using '%pa' to printing 'phys_addr_t' type. While at it, also use the more standard notation [mem 0x-0x] for memory region. [bhelgaas: make end address inclusive, remove extra spaces] Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Jason Cooper <jason@lakedaemon.net> Reviewed-by: Jingoo Han <jg1.han@samsung.com>
| | | | * | | | PCI: mvebu: Remove unnecessary use of 'conf_lock' spinlockAndrew Murray2014-04-291-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Serialization of configuration accesses is provided by 'pci_lock' in drivers/pci/access.c thus making the driver's 'conf_lock' superfluous. Signed-off-by: Andrew Murray <amurray@embedded-bits.co.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Jason Cooper <jason@lakedaemon.net>
| | | | * | | | Merge tag 'tags/mvebu-mbus_pci-fixes-3.15' into pci/host-mvebuBjorn Helgaas2014-04-292-22/+92
| | | | |\ \ \ \ | | | | | |/ / / | | | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull mvebu fixes from Jason Cooper: "mvebu drivers (mbus and pci) fixes for v3.15 - pci - fix off-by-one for mbus window size - split BARs into multiple mbus windows when needed - mbus - avoid setting undefined window size - allow several windows with the same target/attr"
| | | * | | | | PCI: designware: Split Exynos and i.MX bindingsLucas Stach2014-06-033-68/+109
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The glue around the core designware IP is significantly different between the Exynos and i.MX implementation, which is reflected in the DT bindings. This changes the i.MX6 binding to reuse as much as possible from the common designware binding and removes old cruft. I removed the optional GPIOs with the following reasoning: - disable-gpio: endpoint specific GPIO, not currently wired up in any code. Should be handled by the PCI device driver, not the host controller driver. - wake-up-gpio: same as above. - power-on-gpio: No user in any upstream DT. This should be handled by a regulator which shouldn't be controlled by the host driver, but rather by the PCI device driver. [bhelgaas: whitespace fixes] Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | | PCI: designware: Remove unnecessary use of 'conf_lock' spinlockAndrew Murray2014-04-164-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Serialization of configuration accesses is provided by 'pci_lock' in drivers/pci/access.c thus making the driver's 'conf_lock' superfluous. Signed-off-by: Andrew Murray <amurray@embedded-bits.co.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Jingoo Han <jg1.han@samsung.com> Acked-by: Richard Zhu <r65037@freescale.com>
| | * | | | | | PCI: designware: Use new OF interrupt mapping when possibleLucas Stach2014-04-161-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use new OF interrupt mapping (of_irq_parse_and_map_pci()) when possible. This is the recommended method of doing the IRQ mapping. For old devicetrees we fall back to the previous practice. This makes INTB, INTC, and INTD work on i.MX. Tested-by: Tim Harvey <tharvey@gateworks.com> Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Marek Vasut <marex@denx.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jingoo Han <jg1.han@samsung.com>
| | * | | | | | PCI: designware: Fix iATU programming for cfg1, io and mem viewportMohit Kumar2014-04-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch corrects iATU programming for cfg1, io and mem viewport. Enable ATU only after configuring it. Signed-off-by: Mohit Kumar <mohit.kumar@st.com> Signed-off-by: Ajay Khandelwal <ajay.khandelwal@st.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Jingoo Han <jg1.han@samsung.com> Cc: stable@vger.kernel.org
| | * | | | | | PCI: designware: Fix comment for setting number of lanesMohit Kumar2014-04-161-1/+1
| | | |/ / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Corrects comment for setting number of lanes. Signed-off-by: Mohit Kumar <mohit.kumar@st.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Jingoo Han <jg1.han@samsung.com>
| * | | | | | Merge branch 'pci/iommu' into nextBjorn Helgaas2014-06-033-4/+175
| |\ \ \ \ \ \ | | | |_|_|/ / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * pci/iommu: PCI: Add bridge DMA alias quirk for ASMedia and Tundra bridges PCI: Add support for PCIe-to-PCI bridge DMA alias quirks PCI: Add function 1 DMA alias quirk for Marvell devices PCI: Add function 0 DMA alias quirk for Ricoh devices PCI: Add support for DMA alias quirks PCI: Convert pci_dev_flags definitions to bit shifts PCI: Add DMA alias iterator
| | * | | | | PCI: Add bridge DMA alias quirk for ASMedia and Tundra bridgesAlex Williamson2014-05-281-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The quirk is intended to be extremely generic, but we only apply it to known offending devices. Link: https://bugzilla.kernel.org/show_bug.cgi?id=44881 Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | PCI: Add support for PCIe-to-PCI bridge DMA alias quirksAlex Williamson2014-05-282-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several PCIe-to-PCI bridges fail to provide a PCIe capability, causing us to handle them as conventional PCI devices when they really use the requester ID of the secondary bus. We need to differentiate these from PCIe-to-PCI bridges that actually use the conventional PCI ID when a PCIe capability is not present, such as those found on the root complex of may Intel chipsets. Add a dev_flag bit to identify devices to be handled as standard PCIe-to-PCI bridges. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | PCI: Add function 1 DMA alias quirk for Marvell devicesAlex Williamson2014-05-281-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several Marvell devices and a JMicron device have a similar DMA requester ID problem to Ricoh, except they use function 1 as the PCIe requester ID. Add a quirk for these to populate the DMA alias with the correct devfn. Link: https://bugzilla.kernel.org/show_bug.cgi?id=42679 Tested-by: George Spelvin <linux@horizon.com> Tested-by: Andreas Schrägle <ajs124.ajs124@gmail.com> Tested-by: Tobias N <qemu@suppser.de> Tested-by: <daxcore@online.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | PCI: Add function 0 DMA alias quirk for Ricoh devicesAlex Williamson2014-05-281-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing quirk for these devices (pci_get_dma_source()) doesn't really solve the problem; re-implement it using the DMA alias iterator. We'll come back later and remove the existing quirk and dma_source interface. Note that device ID 0xe822 is typically function 0 and 0xe230 has been tested to not need the quirk and are therefore removed versus the equivalent dma_source quirk. If there exist in other configurations we can re-add them. Link: https://bugzilla.redhat.com/show_bug.cgi?id=605888 Tested-by: Pat Erley <pat-lkml@erley.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | PCI: Add support for DMA alias quirksAlex Williamson2014-05-282-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some devices are broken and use a requester ID other than their physical devfn. Add a byte, using an existing gap in the pci_dev structure, to store an alternate "alias" devfn. A bit in the dev_flags tells us when this is valid. We then add the alias as one more step in the pci_for_each_dma_alias() iterator. Tested-by: George Spelvin <linux@horizon.com> Tested-by: Pat Erley <pat-lkml@erley.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | PCI: Convert pci_dev_flags definitions to bit shiftsAlex Williamson2014-05-281-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert the pci_dev_flags definitions from decimal constants to bit shifts. We're only a few entries away from where using the decimal value becomes cumbersome. No functional change. Tested-by: George Spelvin <linux@horizon.com> Tested-by: Pat Erley <pat-lkml@erley.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | PCI: Add DMA alias iteratorAlex Williamson2014-05-282-0/+74
| | |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a mixed PCI/PCI-X/PCIe topology, bridges can take ownership of transactions, replacing the original requester ID with their own. Sometimes we just want to know the resulting device or resulting alias; other times we want each step in the chain. This iterator allows either usage. When an endpoint is connected via an unbroken chain of PCIe switches and root ports, it has no alias and its requester ID is visible to the root bus. When PCI/X get in the way, we pick up aliases for bridges. The reason why we potentially care about each step in the path is because of PCI-X. PCI-X has the concept of a requester ID, but bridges may or may not take ownership of various types of transactions. We therefore leave it to the consumer of this function to prune out what they don't care about rather than attempt to flatten the alias ourselves. Tested-by: George Spelvin <linux@horizon.com> Tested-by: Pat Erley <pat-lkml@erley.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
* | | | | | Merge tag 'pm+acpi-3.16-rc1-2' of ↵Linus Torvalds2014-06-1224-113/+437
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI and power management updates from Rafael Wysocki: "These are fixups on top of the previous PM+ACPI pull request, regression fixes (ACPI hotplug, cpufreq ppc-corenet), other bug fixes (ACPI reset, cpufreq), new PM trace points for system suspend profiling and a copyright notice update. Specifics: - I didn't remember correctly that the Hans de Goede's ACPI video patches actually didn't flip the video.use_native_backlight default, although we had discussed that and decided to do that. Since I said we would do that in the previous PM+ACPI pull request, make that change for real now. - ACPI bus check notifications for PCI host bridges don't cause the bus below the host bridge to be checked for changes as they should because of a mistake in the ACPI-based PCI hotplug (ACPIPHP) subsystem that forgets to add hotplug contexts to PCI host bridge ACPI device objects. Create hotplug contexts for PCI host bridges too as appropriate. - Revert recent cpufreq commit related to the big.LITTLE cpufreq driver that breaks arm64 builds. - Fix for a regression in the ppc-corenet cpufreq driver introduced during the 3.15 cycle and causing the driver to use the remainder from do_div instead of the quotient. From Ed Swarthout. - Resets triggered by panic activate a BUG_ON() in vmalloc.c on systems where the ACPI reset register is located in memory address space. Fix from Randy Wright. - Fix for a problem with cpufreq governors that decisions made by them may be suboptimal due to the fact that deferrable timers are used by them for CPU load sampling. From Srivatsa S Bhat. - Fix for a problem with the Tegra cpufreq driver where the CPU frequency is temporarily switched to a "stable" level that is different from both the initial and target frequencies during transitions which causes udelay() to expire earlier than it should sometimes. From Viresh Kumar. - New trace points and rework of some existing trace points for system suspend/resume profiling from Todd Brandt. - Assorted cpufreq fixes and cleanups from Stratos Karafotis and Viresh Kumar. - Copyright notice update for suspend-and-cpuhotplug.txt from Srivatsa S Bhat" * tag 'pm+acpi-3.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / hotplug / PCI: Add hotplug contexts to PCI host bridges PM / sleep: trace events for device PM callbacks cpufreq: cpufreq-cpu0: remove dependency on THERMAL and REGULATOR cpufreq: tegra: update comment for clarity cpufreq: intel_pstate: Remove duplicate CPU ID check cpufreq: Mark CPU0 driver with CPUFREQ_NEED_INITIAL_FREQ_CHECK flag PM / Documentation: Update copyright in suspend-and-cpuhotplug.txt cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info' cpufreq: governor: Be friendly towards latency-sensitive bursty workloads PM / sleep: trace events for suspend/resume cpufreq: ppc-corenet-cpu-freq: do_div use quotient Revert "cpufreq: Enable big.LITTLE cpufreq driver on arm64" cpufreq: Tegra: implement intermediate frequency callbacks cpufreq: add support for intermediate (stable) frequencies ACPI / video: Change the default for video.use_native_backlight to 1 ACPI: Fix bug when ACPI reset register is implemented in system memory
| * \ \ \ \ \ Merge branch 'pm-sleep'Rafael J. Wysocki2014-06-128-32/+115
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * pm-sleep: PM / sleep: trace events for device PM callbacks PM / sleep: trace events for suspend/resume
| | * | | | | | PM / sleep: trace events for device PM callbacksTodd E Brandt2014-06-112-22/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds two trace events which supply the same info that initcall_debug provides, but via ftrace instead of dmesg. The existing initcall_debug calls require the pm_print_times_enabled var to be set (either via sysfs or via the kernel cmd line). The new trace events provide all the same info as the initcall_debug prints but with less overhead, and also with coverage of device prepare and complete device callbacks. These events replace the device_pm_report_time event (which has been removed). device_pm_callback_start is called first and provides the device and callback info. device_pm_callback_end is called after with the device name and error info. The time and pid are gathered from the trace data headers. Signed-off-by: Todd Brandt <todd.e.brandt@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | * | | | | | PM / sleep: trace events for suspend/resumeTodd E Brandt2014-06-078-19/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds trace events that give finer resolution into suspend/resume. These events are graphed in the timelines generated by the analyze_suspend.py script. They represent large areas of time consumed that are typical to suspend and resume. The event is triggered by calling the function "trace_suspend_resume" with three arguments: a string (the name of the event to be displayed in the timeline), an integer (case specific number, such as the power state or cpu number), and a boolean (where true is used to denote the start of the timeline event, and false to denote the end). The suspend_resume trace event reproduces the data that the machine_suspend trace event did, so the latter has been removed. Signed-off-by: Todd Brandt <todd.e.brandt@intel.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | * | | | | | Merge branch 'acpi-pm' into pm-sleepRafael J. Wysocki2014-06-07488-3282/+5238
| | |\ \ \ \ \ \
| * | \ \ \ \ \ \ Merge branch 'pm-cpufreq'Rafael J. Wysocki2014-06-1211-61/+256
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * pm-cpufreq: cpufreq: cpufreq-cpu0: remove dependency on THERMAL and REGULATOR cpufreq: tegra: update comment for clarity cpufreq: intel_pstate: Remove duplicate CPU ID check cpufreq: Mark CPU0 driver with CPUFREQ_NEED_INITIAL_FREQ_CHECK flag cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info' cpufreq: governor: Be friendly towards latency-sensitive bursty workloads cpufreq: ppc-corenet-cpu-freq: do_div use quotient Revert "cpufreq: Enable big.LITTLE cpufreq driver on arm64" cpufreq: Tegra: implement intermediate frequency callbacks cpufreq: add support for intermediate (stable) frequencies
| | * | | | | | | | cpufreq: cpufreq-cpu0: remove dependency on THERMAL and REGULATORViresh Kumar2014-06-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpufreq-cpu0 uses thermal framework to register a cooling device, but doesn't depend on it as there are dummy calls provided by thermal layer when CONFIG_THERMAL=n. And when these calls fail, the driver is still usable. Similar explanation is valid for regulators as well. We do have dummy calls available for regulator APIs and the driver can work even when those calls fail. So, we don't really need to mention thermal and regulators as a dependency for cpufreq-cpu0 in Kconfig as platforms without support for thermal/regulator can also use this driver. Remove this dependency. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | * | | | | | | | cpufreq: tegra: update comment for clarityViresh Kumar2014-06-101-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tegra's driver got updated a bit (00917dd cpufreq: Tegra: implement intermediate frequency callbacks) and implements new 'intermediate freq' infrastructure of core. Above commit updated comments about when to call clk_prepare_enable(pll_x_clk) and Doug wasn't satisfied with those comments and said this: > The "Though when target-freq is intermediate freq, we don't need to > take this reference." makes me think that this function is actually > called when target-freq is intermediate freq.  I don't think it is, > right? For better clarity just make that comment more explicit about when we call tegra_target_intermediate(). Reviewed-by: Stephen Warren <swarren@nvidia.com> Reported-and-reviewed-by: Doug Anderson <dianders@chromium.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | * | | | | | | | cpufreq: intel_pstate: Remove duplicate CPU ID checkStratos Karafotis2014-06-101-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We check the CPU ID during driver init. There is no need to do it again per logical CPU initialization. So, remove the duplicate check. Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | * | | | | | | | cpufreq: Mark CPU0 driver with CPUFREQ_NEED_INITIAL_FREQ_CHECK flagViresh Kumar2014-06-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sometimes boot loaders set CPU frequency to a value outside of frequency table present with cpufreq core. In such cases CPU might be unstable if it has to run on that frequency for long duration of time and so its better to set it to a frequency which is specified in frequency table. Sachin recently found this problem with cpufreq-cpu0 driver when he was testing it for Exynos. Set this flag for cpufreq-cpu0 driver. Reported-and-tested-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | * | | | | | | | cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info'Viresh Kumar2014-06-092-9/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'copy_prev_load' was recently added by commit: 18b46ab (cpufreq: governor: Be friendly towards latency-sensitive bursty workloads). It actually is a bit redundant as we also have 'prev_load' which can store any integer value and can be used instead of 'copy_prev_load' by setting it zero. True load can also turn out to be zero during long idle intervals (and hence the actual value of 'prev_load' and the overloaded value can clash). However this is not a problem because, if the true load was really zero in the previous interval, it makes sense to evaluate the load afresh for the current interval rather than copying the previous load. So, drop 'copy_prev_load' and use 'prev_load' instead. Update comments as well to make it more clear. There is another change here which was probably missed by Srivatsa during the last version of updates he made. The unlikely in the 'if' statement was covering only half of the condition and the whole line should actually come under it. Also checkpatch is made more silent as it was reporting this (--strict option): CHECK: Alignment should match open parenthesis + if (unlikely(wall_time > (2 * sampling_rate) && + j_cdbs->prev_load)) { Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | * | | | | | | | cpufreq: governor: Be friendly towards latency-sensitive bursty workloadsSrivatsa S. Bhat2014-06-072-3/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cpufreq governors like the ondemand governor calculate the load on the CPU periodically by employing deferrable timers. A deferrable timer won't fire if the CPU is completely idle (and there are no other timers to be run), in order to avoid unnecessary wakeups and thus save CPU power. However, the load calculation logic is agnostic to all this, and this can lead to the problem described below. Time (ms) CPU 1 100 Task-A running 110 Governor's timer fires, finds load as 100% in the last 10ms interval and increases the CPU frequency. 110.5 Task-A running 120 Governor's timer fires, finds load as 100% in the last 10ms interval and increases the CPU frequency. 125 Task-A went to sleep. With nothing else to do, CPU 1 went completely idle. 200 Task-A woke up and started running again. 200.5 Governor's deferred timer (which was originally programmed to fire at time 130) fires now. It calculates load for the time period 120 to 200.5, and finds the load is almost zero. Hence it decreases the CPU frequency to the minimum. 210 Governor's timer fires, finds load as 100% in the last 10ms interval and increases the CPU frequency. So, after the workload woke up and started running, the frequency was suddenly dropped to absolute minimum, and after that, there was an unnecessary delay of 10ms (sampling period) to increase the CPU frequency back to a reasonable value. And this pattern repeats for every wake-up-from-cpu-idle for that workload. This can be quite undesirable for latency- or response-time sensitive bursty workloads. So we need to fix the governor's logic to detect such wake-up-from- cpu-idle scenarios and start the workload at a reasonably high CPU frequency. One extreme solution would be to fake a load of 100% in such scenarios. But that might lead to undesirable side-effects such as frequency spikes (which might also need voltage changes) especially if the previous frequency happened to be very low. We just want to avoid the stupidity of dropping down the frequency to a minimum and then enduring a needless (and long) delay before ramping it up back again. So, let us simply carry forward the previous load - that is, let us just pretend that the 'load' for the current time-window is the same as the load for the previous window. That way, the frequency and voltage will continue to be set to whatever values they were set at previously. This means that bursty workloads will get a chance to influence the CPU frequency at which they wake up from cpu-idle, based on their past execution history. Thus, they might be able to avoid suffering from slow wakeups and long response-times. However, we should take care not to over-do this. For example, such a "copy previous load" logic will benefit cases like this: (where # represents busy and . represents idle) ##########.........#########.........###########...........##########........ but it will be detrimental in cases like the one shown below, because it will retain the high frequency (copied from the previous interval) even in a mostly idle system: ##########.........#.................#.....................#............... (i.e., the workload finished and the remaining tasks are such that their busy periods are smaller than the sampling interval, which causes the timer to always get deferred. So, this will make the copy-previous-load logic copy the initial high load to subsequent idle periods over and over again, thus keeping the frequency high unnecessarily). So, we modify this copy-previous-load logic such that it is used only once upon every wakeup-from-idle. Thus if we have 2 consecutive idle periods, the previous load won't get blindly copied over; cpufreq will freshly evaluate the load in the second idle interval, thus ensuring that the system comes back to its normal state. [ The right way to solve this whole problem is to teach the CPU frequency governors to also track load on a per-task basis, not just a per-CPU basis, and then use both the data sources intelligently to set the appropriate frequency on the CPUs. But that involves redesigning the cpufreq subsystem, so this patch should make the situation bearable until then. ] Experimental results: +-------------------+ I ran a modified version of ebizzy (called 'sleeping-ebizzy') that sleeps in between its execution such that its total utilization can be a user-defined value, say 10% or 20% (higher the utilization specified, lesser the amount of sleeps injected). This ebizzy was run with a single-thread, tied to CPU 8. Behavior observed with tracing (sample taken from 40% utilization runs): ------------------------------------------------------------------------ Without patch: ~~~~~~~~~~~~~~ kworker/8:2-12137 416.335742: cpu_frequency: state=2061000 cpu_id=8 kworker/8:2-12137 416.335744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40753 416.345741: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-12137 416.345744: cpu_frequency: state=4123000 cpu_id=8 kworker/8:2-12137 416.345746: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40753 416.355738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 <snip> --------------------------------------------------------------------- <snip> <...>-40753 416.402202: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8 <idle>-0 416.502130: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy <...>-40753 416.505738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-12137 416.505739: cpu_frequency: state=2061000 cpu_id=8 kworker/8:2-12137 416.505741: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40753 416.515739: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-12137 416.515742: cpu_frequency: state=4123000 cpu_id=8 kworker/8:2-12137 416.515744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy Observation: Ebizzy went idle at 416.402202, and started running again at 416.502130. But cpufreq noticed the long idle period, and dropped the frequency at 416.505739, only to increase it back again at 416.515742, realizing that the workload is in-fact CPU bound. Thus ebizzy needlessly ran at the lowest frequency for almost 13 milliseconds (almost 1 full sample period), and this pattern repeats on every sleep-wakeup. This could hurt latency-sensitive workloads quite a lot. With patch: ~~~~~~~~~~~ kworker/8:2-29802 464.832535: cpu_frequency: state=2061000 cpu_id=8 <snip> --------------------------------------------------------------------- <snip> kworker/8:2-29802 464.962538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 464.972533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-29802 464.972536: cpu_frequency: state=4123000 cpu_id=8 kworker/8:2-29802 464.972538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 464.982531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 <snip> --------------------------------------------------------------------- <snip> kworker/8:2-29802 465.022533: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 465.032531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-29802 465.032532: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 465.035797: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8 <idle>-0 465.240178: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy <...>-40738 465.242533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-29802 465.242535: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 465.252531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 Observation: Ebizzy went idle at 465.035797, and started running again at 465.240178. Since ebizzy was the only real workload running on this CPU, cpufreq retained the frequency at 4.1Ghz throughout the run of ebizzy, no matter how many times ebizzy slept and woke-up in-between. Thus, ebizzy got the 10ms worth of 4.1 Ghz benefit during every sleep-wakeup (as compared to the run without the patch) and this boost gave a modest improvement in total throughput, as shown below. Sleeping-ebizzy records-per-second: ----------------------------------- Utilization Without patch With patch Difference (Absolute and % values) 10% 274767 277046 + 2279 (+0.829%) 20% 543429 553484 + 10055 (+1.850%) 40% 1090744 1107959 + 17215 (+1.578%) 60% 1634908 1662018 + 27110 (+1.658%) A rudimentary and somewhat approximately latency-sensitive workload such as sleeping-ebizzy itself showed a consistent, noticeable performance improvement with this patch. Hence, workloads that are truly latency-sensitive will benefit quite a bit from this change. Moreover, this is an overall win-win since this patch does not hurt power-savings at all (because, this patch does not reduce the idle time or idle residency; and the high frequency of the CPU when it goes to cpu-idle does not affect/hurt the power-savings of deep idle states). Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>