summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'pci/enumeration'Bjorn Helgaas2018-06-069-220/+76
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - neaten pci=earlydump output (Andy Shevchenko) - avoid errors when extended config space inaccessible (Gilles Buloz) - prevent sysfs disable of device while driver attached (Christoph Hellwig) - use core interface to report PCIe link properties in bnx2x, bnxt_en, cxgb4, ixgbe (Bjorn Helgaas) - remove unused pcie_get_minimum_link() (Bjorn Helgaas) * pci/enumeration: PCI: Remove unused pcie_get_minimum_link() ixgbe: Report PCIe link properties with pcie_print_link_status() cxgb4: Report PCIe link properties with pcie_print_link_status() bnxt_en: Report PCIe link properties with pcie_print_link_status() bnx2x: Report PCIe link properties with pcie_print_link_status() PCI: Prevent sysfs disable of device while driver is attached PCI: Check whether bridges allow access to extended config space x86/PCI: Make pci=earlydump output neat
| * PCI: Remove unused pcie_get_minimum_link()Bjorn Helgaas2018-05-262-45/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In some cases pcie_get_minimum_link() returned misleading information because it found the slowest link and the narrowest link without considering the total bandwidth of the link. For example, consider a path with these two links: - 16.0 GT/s x1 link (16.0 * 10^9 * 128 / 130) * 1 / 8 = 1969 MB/s - 2.5 GT/s x16 link ( 2.5 * 10^9 * 8 / 10) * 16 / 8 = 4000 MB/s The available bandwidth of the path is limited by the 16 GT/s link to about 1969 MB/s, but pcie_get_minimum_link() returned 2.5 GT/s x1, which corresponds to only 250 MB/s. Callers should use pcie_print_link_status() instead, or pcie_bandwidth_available() if they need more detailed information. Remove pcie_get_minimum_link() since there are no callers left. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * ixgbe: Report PCIe link properties with pcie_print_link_status()Bjorn Helgaas2018-05-261-46/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the driver used pcie_get_minimum_link() to warn when the NIC is in a slot that can't supply as much bandwidth as the NIC could use. pcie_get_minimum_link() can be misleading because it finds the slowest link and the narrowest link (which may be different links) without considering the total bandwidth of each link. For a path with a 16 GT/s x1 link and a 2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of bandwidth, not the true available bandwidth of about 1969 MB/s for a 16 GT/s x1 link. Use pcie_print_link_status() to report PCIe link speed and possible limitations instead of implementing this in the driver itself. This finds the slowest link in the path to the device by computing the total bandwidth of each link and compares that with the capabilities of the device. The dmesg change is: - PCI Express bandwidth of %dGT/s available - (Speed:%s, Width: x%d, Encoding Loss:%s) + %u.%03u Gb/s available PCIe bandwidth (%s x%d link) or, if the device is capable of better performance than is available in the current slot: - This is not sufficient for optimal performance of this card. - For optimal performance, at least %dGT/s of bandwidth is required. - A slot with more lanes and/or higher speed is suggested. + %u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link) Note that the driver previously used dev_warn() to suggest using a different slot, but pcie_print_link_status() uses dev_info() because if the platform has no faster slot available, the user can't do anything about the warning and may not want to be bothered with it. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * cxgb4: Report PCIe link properties with pcie_print_link_status()Bjorn Helgaas2018-05-261-74/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the driver used pcie_get_minimum_link() to warn when the NIC is in a slot that can't supply as much bandwidth as the NIC could use. pcie_get_minimum_link() can be misleading because it finds the slowest link and the narrowest link (which may be different links) without considering the total bandwidth of each link. For a path with a 16 GT/s x1 link and a 2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of bandwidth, not the true available bandwidth of about 1969 MB/s for a 16 GT/s x1 link. Use pcie_print_link_status() to report PCIe link speed and possible limitations instead of implementing this in the driver itself. This finds the slowest link in the path to the device by computing the total bandwidth of each link and compares that with the capabilities of the device. The dmesg change is: - PCIe link speed is %s, device supports %s - PCIe link width is x%d, device supports x%d + %u.%03u Gb/s available PCIe bandwidth (%s x%d link) or, if the device is capable of better performance than is available in the current slot: - A slot with more lanes and/or higher speed is suggested for optimal performance. + %u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link) Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * bnxt_en: Report PCIe link properties with pcie_print_link_status()Bjorn Helgaas2018-05-261-18/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the driver used pcie_get_minimum_link() to warn when the NIC is in a slot that can't supply as much bandwidth as the NIC could use. pcie_get_minimum_link() can be misleading because it finds the slowest link and the narrowest link (which may be different links) without considering the total bandwidth of each link. For a path with a 16 GT/s x1 link and a 2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of bandwidth, not the true available bandwidth of about 1969 MB/s for a 16 GT/s x1 link. Use pcie_print_link_status() to report PCIe link speed and possible limitations instead of implementing this in the driver itself. This finds the slowest link in the path to the device by computing the total bandwidth of each link and compares that with the capabilities of the device. The dmesg change is: - PCIe: Speed %s Width x%d + %u.%03u Gb/s available PCIe bandwidth (%s x%d link) Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * bnx2x: Report PCIe link properties with pcie_print_link_status()Bjorn Helgaas2018-05-261-17/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the driver used pcie_get_minimum_link() to warn when the NIC is in a slot that can't supply as much bandwidth as the NIC could use. pcie_get_minimum_link() can be misleading because it finds the slowest link and the narrowest link (which may be different links) without considering the total bandwidth of each link. For a path with a 16 GT/s x1 link and a 2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of bandwidth, not the true available bandwidth of about 1969 MB/s for a 16 GT/s x1 link. Use pcie_print_link_status() to report PCIe link speed and possible limitations instead of implementing this in the driver itself. This finds the slowest link in the path to the device by computing the total bandwidth of each link and compares that with the capabilities of the device. The dmesg change is: - %s (%c%d) PCI-E x%d %s found at mem %lx, IRQ %d, node addr %pM + %s (%c%d) PCI-E found at mem %lx, IRQ %d, node addr %pM + %u.%03u Gb/s available PCIe bandwidth (%s x%d link) Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI: Prevent sysfs disable of device while driver is attachedChristoph Hellwig2018-05-261-6/+9
| | | | | | | | | | | | | | | | | | Manipulating the enable_cnt behind the back of the driver will wreak complete havoc with the kernel state, so disallow it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Keith Busch <keith.busch@intel.com>
| * PCI: Check whether bridges allow access to extended config spaceGilles Buloz2018-05-072-0/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even if a device supports extended config space, i.e., it is a PCI-X Mode 2 or a PCI Express device, the extended space may not be accessible if there's a conventional PCI bus in the path to it. We currently figure that out in pci_cfg_space_size() by reading the first dword of extended config space. On most platforms that returns ~0 data if the space is inaccessible, but it may set error bits in PCI status registers, and on some platforms it causes exceptions that we currently don't recover from. For example, a PCIe-to-conventional PCI bridge treats config transactions with a non-zero Extended Register Address as an Unsupported Request on PCIe and a received Master-Abort on the destination bus (see PCI Express to PCI/PCI-X Bridge spec, r1.0, sec 4.1.3). A sample case is a LS1043A CPU (NXP QorIQ Layerscape) platform with the following bus topology: LS1043 PCIe Root Port -> PEX8112 PCIe-to-PCI bridge (doesn't support ext cfg on PCI side) -> PMC slot connector (for legacy PMC modules) With a PMC module topology as follows: PMC connector -> PCI-to-PCIe bridge -> PCIe switch (4 ports) -> 4 PCIe devices (one on each port) The PCIe devices on the PMC module support extended config space, but we can't reach it because the PEX8112 can't generate accesses to the extended space on its secondary bus. Attempts to access it cause Unsupported Request errors, which result in synchronous aborts on this platform. To avoid these errors, check whether bridges are capable of generating extended config space addresses on their secondary interfaces. If they can't, we restrict devices below the bridge to only the 256-byte PCI-compatible config space. Signed-off-by: Gilles Buloz <gilles.buloz@kontron.com> [bhelgaas: changelog, rework patch so bus_flags testing is all in pci_bridge_child_ext_cfg_accessible()] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * x86/PCI: Make pci=earlydump output neatAndy Shevchenko2018-04-271-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the early dump of PCI configuration space looks quite unhelpful, e.g. [ 0.000000] 60: [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] 00 [ 0.000000] which makes really hard to get anything out of this. Convert the function to use print_hex_dump() to make output neat. In the result we will have [ 0.000000] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 which is much, much better. Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'pci/dpc'Bjorn Helgaas2018-06-061-1/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | - clear interrupt status in top half to avoid interrupt storm (Oza Pawandeep) * pci/dpc: PCI/DPC: Clear interrupt status in interrupt handler top half
| * | PCI/DPC: Clear interrupt status in interrupt handler top halfOza Pawandeep2018-05-161-1/+4
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The generic IRQ handling code ensures that an interrupt handler runs with its interrupt masked or disabled. If the interrupt is level-triggered, the interrupt handler must tell its device to stop asserting the interrupt before returning. If it doesn't, we will immediately take the interrupt again when the handler returns and the generic code unmasks the interrupt. The driver doesn't know whether its interrupt is edge- or level-triggered, so it must clear its interrupt source directly in its interrupt handler. Previously we cleared the DPC interrupt status in the bottom half, i.e., in deferred work, which can cause an interrupt storm if the DPC interrupt happens to be level-triggered, e.g., if we're using INTx instead of MSI. Clear the DPC interrupt status bit in the interrupt handler, not in the deferred work. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas <helgaas@kernel.org> Reviewed-by: Keith Busch <keith.busch@intel.com>
* | Merge branch 'pci/aspm'Bjorn Helgaas2018-06-065-1/+23
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | - disable ASPM L1.2 substate if we don't have LTR (Bjorn Helgaas) - respect platform ownership of LTR (Bjorn Helgaas) * pci/aspm: PCI/ACPI: Request LTR control from platform before using it PCI/ASPM: Disable ASPM L1.2 Substate if we don't have LTR
| * | PCI/ACPI: Request LTR control from platform before using itBjorn Helgaas2018-04-234-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per the PCI Firmware spec r3.2, sec 4.5, an ACPI-based OS should use _OSC to request control of Latency Tolerance Reporting (LTR) before using it. Request control of LTR, and if the platform does not grant control, don't use it. N.B. If the hardware supports LTR and the ASPM L1.2 substate but the BIOS doesn't support LTR in _OSC, we previously would enable ASPM L1.2. This patch will prevent us from enabling ASPM L1.2 in that case. It does not prevent us from enabling PCI-PM L1.2, since that doesn't depend on LTR. See PCIe r40, sec 5.5.1, for the L1 PM substate entry conditions. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| * | PCI/ASPM: Disable ASPM L1.2 Substate if we don't have LTRBjorn Helgaas2018-04-181-0/+9
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | When in the ASPM L1.0 state (but not the PCI-PM L1.0 state), the most recent LTR value and the LTR_L1.2_THRESHOLD determines whether the link enters the L1.2 substate. If we don't have LTR enabled, prevent the use of ASPM L1.2. PCI-PM L1.2 may still be used because it doesn't depend on LTR_L1.2_THRESHOLD (see PCIe r4.0, sec 5.5.1). Tested-by: Srinath Mannam <srinath.mannam@broadcom.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | Merge branch 'pci/aer'Bjorn Helgaas2018-06-0618-476/+650
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - unify AER decoding for native and ACPI CPER sources (Alexandru Gagniuc) - add TLP header info to AER tracepoint (Thomas Tai) - add generic pcie_wait_for_link() interface (Oza Pawandeep) - handle AER ERR_FATAL by removing and re-enumerating devices, as Downstream Port Containment does (Oza Pawandeep) - factor out common code between AER and DPC recovery (Oza Pawandeep) - stop triggering DPC for ERR_NONFATAL errors (Oza Pawandeep) - share ERR_FATAL recovery path between AER and DPC (Oza Pawandeep) * pci/aer: PCI/AER: Replace struct pcie_device with pci_dev PCI/AER: Remove unused parameters PCI/AER: Decode Error Source Requester ID PCI/AER: Remove aer_recover_work_func() forward declaration PCI/DPC: Use the generic pcie_do_fatal_recovery() path PCI/AER: Pass service type to pcie_do_fatal_recovery() PCI/DPC: Disable ERR_NONFATAL handling by DPC PCI/portdrv: Add generic pcie_port_find_device() PCI/portdrv: Add generic pcie_port_find_service() PCI/AER: Factor out error reporting to drivers/pci/pcie/err.c PCI/AER: Rename error recovery interfaces to generic PCI naming PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices PCI: Add generic pcie_wait_for_link() interface PCI/AER: Add TLP header information to tracepoint PCI/AER: Unify error bit printing for native and CPER reporting
| * PCI/AER: Replace struct pcie_device with pci_devKeith Busch2018-06-063-14/+12
| | | | | | | | | | | | | | | | The AER driver only needed the pcie_device to get to the port pci_dev. Save the pci_dev pointer directly in struct aer_rpc and remove the unnecessary indirection. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/AER: Remove unused parametersKeith Busch2018-06-061-9/+5
| | | | | | | | | | | | | | Remove unused "struct pcie_device *" parameters to handle_error_source() and aer_process_err_devices(). No functional change intended. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/AER: Decode Error Source Requester IDBjorn Helgaas2018-06-031-7/+11
| | | | | | | | | | | | | | | | Decode the Requester ID from the AER Error Source Register into domain/ bus/device/function format to match other logging. In cases where the ID matches the device used for pci_err(), drop the extra ID completely so we don't print it twice. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/AER: Remove aer_recover_work_func() forward declarationBorislav Petkov2018-06-031-24/+24
| | | | | | | | | | | | | | | | | | Just move the actual function up so that it is visible to its user aer_recover_queue(). No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/DPC: Use the generic pcie_do_fatal_recovery() pathOza Pawandeep2018-06-032-20/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our goal is to handle ERR_FATAL errors similarly, whether they are reported via AER or via DPC. A previous commit changed AER so it handles ERR_FATAL by calling driver .remove() methods and resetting the Link. DPC already does that (although the Link reset is done automatically by hardware and happens before we call the driver .remove() methods). Restructure the DPC code so it calls the same pcie_do_fatal_recovery() interface used by AER. This makes it clearer that we want to use the same path. Implement the .reset_link() method used by pcie_do_fatal_recovery(). For DPC, the actual reset is done automatically by hardware, so we really only have to wait for the Link to be inactive, then release the Port from DPC. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: changelog, DPC_FATAL is not a bitfield, can be sequential] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/AER: Pass service type to pcie_do_fatal_recovery()Oza Pawandeep2018-06-033-8/+9
| | | | | | | | | | | | | | | | | | | | Pass the service type to pcie_do_fatal_recovery() instead of assuming AER. We will make DPC also use pcie_do_fatal_recovery(), and it needs to do things a little differently for AER and DPC. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: split to separate patch] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/DPC: Disable ERR_NONFATAL handling by DPCOza Pawandeep2018-06-032-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PCIe ERR_NONFATAL errors mean a particular transaction is unreliable but the Link is otherwise fully functional (PCIe r4.0, sec 6.2.2). The AER driver handles these by logging the error details and calling driver-supplied pci_error_handlers callbacks. It does not reset downstream devices, does not remove them from the PCI subsystem, does not re-enumerate them, and does not call their driver .remove() or .probe() methods. But DPC driver previously enabled DPC on ERR_NONFATAL, so if the hardware supports DPC, these errors caused a Link reset (performed automatically by the hardware), followed by the DPC driver removing affected devices (which calls their .remove() methods), bringing the Link back up, and re-enumerating (which calls driver .probe() methods). Disable ERR_NONFATAL DPC triggering so these errors will only be handled by AER. This means drivers won't have to deal with different usage of their pci_error_handlers callbacks and .probe() and .remove() methods based on whether the platform has DPC support. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/portdrv: Add generic pcie_port_find_device()Oza Pawandeep2018-06-032-0/+24
| | | | | | | | | | | | | | | | Add generic pcie_port_find_device() routine. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com>
| * PCI/portdrv: Add generic pcie_port_find_service()Oza Pawandeep2018-05-174-30/+49
| | | | | | | | | | | | | | | | Add generic pcie_port_find_service() routine. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com>
| * PCI/AER: Factor out error reporting to drivers/pci/pcie/err.cOza Pawandeep2018-05-178-366/+398
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the error reporting callbacks from aerdrv_core.c to err.c, where they can be used by DPC in addition to AER. As part of aerdrv_core.c, these callbacks were built under CONFIG_PCIEAER. Moving them to the new err.c means they will now be built under CONFIG_PCIEPORTBUS, so adjust the definition of pci_uevent_ers() to match. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: in reset_link(), initialize "driver" even if CONFIG_PCIEAER is unset, update pci_uevent_ers() #ifdef wrapper] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/AER: Rename error recovery interfaces to generic PCI namingOza Pawandeep2018-05-171-8/+8
| | | | | | | | | | | | | | | | | | | | Rename error recovery interfaces with "pcie_" prefix so they can be made non-static. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: move declaration to later patch, leave functions static] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com>
| * PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devicesOza Pawandeep2018-05-173-33/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PCIe ERR_FATAL errors mean the Link is unreliable. Components on the Link may need to be reset to return to reliable operation (PCIe r4.0, sec 6.2.2). We previously handled these errors much differently depending on whether the platform supports Downstream Port Containment (DPC) (PCIe r4.0, sec 6.2.10) or not. The AER driver has historically logged the error details, called driver-supplied pci_error_handlers callbacks, and reset the Link. This reset downstream devices, but did not remove them from the PCI subsystem, re-enumerate them, or call their driver .remove() or .probe() methods. DPC is different because the hardware automatically disables the Link when it detects ERR_FATAL, which resets downstream devices. There's no opportunity for pci_error_handlers callbacks before resetting the Link. The DPC driver removes affected devices (which calls their driver .remove() methods), brings the Link back up, and re-enumerates (which calls driver .probe() methods). Align AER ERR_FATAL handling with DPC by resetting the Link in software, skipping the driver pci_error_handlers callbacks, removing the devices from the PCI subsystem, and re-enumerating. The idea is that drivers and devices should see the same behavior for ERR_FATAL events, regardless of whether they're handled by AER or DPC. Here are the basic ERR_FATAL recovery steps, showing the previous AER behavior, the AER behavior after this patch, and the DPC behavior: AER AER DPC previous new behavior -------- --- -------- Log error yes yes yes (minimal) drv.error_detected() yes no no Reset Link yes yes yes drv.mmio_enabled() yes no no drv.slot_reset() yes no no drv.resume() yes no no Remove PCI devices no yes yes (calls drv.remove()) Re-enumerate no yes yes (calls drv.probe()) N.B. With DPC, the Link reset happens before the driver .remove() calls, while with AER, the reset happens *after* the .remove() calls. The goal is to eventually do the reset before .remove() for AER as well. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: changelog, squash doc patch into this, remove unused "result_data"] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com>
| * PCI: Add generic pcie_wait_for_link() interfaceOza Pawandeep2018-05-174-28/+34
| | | | | | | | | | | | | | | | | | | | | | | | Clients such as hotplug and Downstream Port Containment (DPC) both need to wait until a link becomes active or inactive. Add a generic pcie_wait_link_active() interface and use it instead of duplicating the code. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com>
| * PCI/AER: Add TLP header information to tracepointThomas Tai2018-05-102-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a PCIe AER error occurs, the TLP header information is printed in the kernel message but it is missing from the tracepoint. A userspace program can use this information in the tracepoint to better analyze problems. To enable the tracepoint: echo 1 > /sys/kernel/debug/tracing/events/ras/aer_event/enable Example tracepoint output: $ cat /sys/kernel/debug/tracing/trace aer_event: 0000:01:00.0 PCIe Bus Error: severity=Uncorrected, non-fatal, Completer Abort TLP Header={0x0,0x1,0x2,0x3} Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * PCI/AER: Unify error bit printing for native and CPER reportingAlexandru Gagniuc2018-05-081-7/+9
|/ | | | | | | | | | | | | | | | AER errors can be reported natively (Linux AER driver fields interrupts and reads error state directly from hardware) or via the ACPI/APEI/GHES/CPER path (platform firmware reads error state from hardware and sends it to Linux via ACPI interfaces). Previously the same error would produce different output depending on whether it was reported natively or via ACPI. The CPER path resulted in hard-to-understand messages, without a prefix. Instead use __aer_print_error() for both native AER and CPER to provide a more consistent log format. Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
* Linux 4.17-rc1v4.17-rc1Linus Torvalds2018-04-161-2/+2
|
* Merge tag 'for-4.17-part2-tag' of ↵Linus Torvalds2018-04-1694-1234/+277
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs updates from David Sterba: "We have queued a few more fixes (error handling, log replay, softlockup) and the rest is SPDX updates that touche almost all files so the diffstat is long" * tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: Only check first key for committed tree blocks btrfs: add SPDX header to Kconfig btrfs: replace GPL boilerplate by SPDX -- sources btrfs: replace GPL boilerplate by SPDX -- headers Btrfs: fix loss of prealloc extents past i_size after fsync log replay Btrfs: clean up resources during umount after trans is aborted btrfs: Fix possible softlock on single core machines Btrfs: bail out on error during replay_dir_deletes Btrfs: fix NULL pointer dereference in log_dir_items
| * btrfs: Only check first key for committed tree blocksQu Wenruo2018-04-131-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When looping btrfs/074 with many cpus (>= 8), it's possible to trigger kernel warning due to first key verification: [ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210 [ 4239.523830] Modules linked in: [ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210 [ 4239.527101] Call Trace: [ 4239.527251] read_tree_block+0x42/0x70 [ 4239.527434] read_node_slot+0xd2/0x110 [ 4239.527632] push_leaf_right+0xad/0x1b0 [ 4239.527809] split_leaf+0x4ea/0x700 [ 4239.527988] ? leaf_space_used+0xbc/0xe0 [ 4239.528192] ? btrfs_set_lock_blocking_rw+0x99/0xb0 [ 4239.528416] btrfs_search_slot+0x8cc/0xa40 [ 4239.528605] btrfs_insert_empty_items+0x71/0xc0 [ 4239.528798] __btrfs_run_delayed_refs+0xa98/0x1680 [ 4239.529013] btrfs_run_delayed_refs+0x10b/0x1b0 [ 4239.529205] btrfs_commit_transaction+0x33/0xaf0 [ 4239.529445] ? start_transaction+0xa8/0x4f0 [ 4239.529630] btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0 [ 4239.529833] btrfs_check_data_free_space+0x54/0xa0 [ 4239.530045] btrfs_delalloc_reserve_space+0x25/0x70 [ 4239.531907] btrfs_direct_IO+0x233/0x3d0 [ 4239.532098] generic_file_direct_write+0xcb/0x170 [ 4239.532296] btrfs_file_write_iter+0x2bb/0x5f4 [ 4239.532491] aio_write+0xe2/0x180 [ 4239.532669] ? lock_acquire+0xac/0x1e0 [ 4239.532839] ? __might_fault+0x3e/0x90 [ 4239.533032] do_io_submit+0x594/0x860 [ 4239.533223] ? do_io_submit+0x594/0x860 [ 4239.533398] SyS_io_submit+0x10/0x20 [ 4239.533560] ? SyS_io_submit+0x10/0x20 [ 4239.533729] do_syscall_64+0x75/0x1d0 [ 4239.533979] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 4239.534182] RIP: 0033:0x7f8519741697 The problem here is, at btree_read_extent_buffer_pages() we don't have acquired read/write lock on that extent buffer, only basic info like level/bytenr is reliable. So race condition leads to such false alert. However in current call site, it's impossible to acquire proper lock without race window. To fix the problem, we only verify first key for committed tree blocks (whose generation is no larger than fs_info->last_trans_committed), so the content of such tree blocks will not change and there is no need to get read/write lock. Reported-by: Nikolay Borisov <nborisov@suse.com> Fixes: 581c1760415c ("btrfs: Validate child tree block's level and first key") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add SPDX header to KconfigDavid Sterba2018-04-121-0/+2
| | | | | | | | Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: replace GPL boilerplate by SPDX -- sourcesDavid Sterba2018-04-1258-750/+65
| | | | | | | | | | | | | | | | Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: replace GPL boilerplate by SPDX -- headersDavid Sterba2018-04-1235-475/+133
| | | | | | | | | | | | | | | | | | | | Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Unify the include protection macros to match the file names. Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix loss of prealloc extents past i_size after fsync log replayFilipe Manana2018-04-121-5/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently if we allocate extents beyond an inode's i_size (through the fallocate system call) and then fsync the file, we log the extents but after a power failure we replay them and then immediately drop them. This behaviour happens since about 2009, commit c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log"), because it marks the inode as an orphan instead of dropping any extents beyond i_size before replaying logged extents, so after the log replay, and while the mount operation is still ongoing, we find the inode marked as an orphan and then perform a truncation (drop extents beyond the inode's i_size). Because the processing of orphan inodes is still done right after replaying the log and before the mount operation finishes, the intention of that commit does not make any sense (at least as of today). However reverting that behaviour is not enough, because we can not simply discard all extents beyond i_size and then replay logged extents, because we risk dropping extents beyond i_size created in past transactions, for example: add prealloc extent beyond i_size fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode transaction commit add another prealloc extent beyond i_size fsync - triggers the fast fsync path power failure In that scenario, we would drop the first extent and then replay the second one. To fix this just make sure that all prealloc extents beyond i_size are logged, and if we find too many (which is far from a common case), fallback to a full transaction commit (like we do when logging regular extents in the fast fsync path). Trivial reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo $ sync $ xfs_io -c "falloc -k 256K 1M" /mnt/foo $ xfs_io -c "fsync" /mnt/foo <power failure> # mount to replay log $ mount /dev/sdb /mnt # at this point the file only has one extent, at offset 0, size 256K A test case for fstests follows soon, covering multiple scenarios that involve adding prealloc extents with previous shrinking truncates and without such truncates. Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: clean up resources during umount after trans is abortedLiu Bo2018-04-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently if some fatal errors occur, like all IO get -EIO, resources would be cleaned up when a) transaction is being committed or b) BTRFS_FS_STATE_ERROR is set However, in some rare cases, resources may be left alone after transaction gets aborted and umount may run into some ASSERT(), e.g. ASSERT(list_empty(&block_group->dirty_list)); For case a), in btrfs_commit_transaciton(), there're several places at the beginning where we just call btrfs_end_transaction() without cleaning up resources. For case b), it is possible that the trans handle doesn't have any dirty stuff, then only trans hanlde is marked as aborted while BTRFS_FS_STATE_ERROR is not set, so resources remain in memory. This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that all resources won't stay in memory after umount. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Fix possible softlock on single core machinesNikolay Borisov2018-04-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_chunk_alloc implements a loop checking whether there is a pending chunk allocation and if so causes the caller do loop. Generally this loop is executed only once, however testing with btrfs/072 on a single core vm machines uncovered an extreme case where the system could loop indefinitely. This is due to a missing cond_resched when loop which doesn't give a chance to the previous chunk allocator finish its job. The fix is to simply add the missing cond_resched. Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: bail out on error during replay_dir_deletesLiu Bo2018-04-051-1/+3
| | | | | | | | | | | | | | | | | | | | | | If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs to bail out, otherwise @ret would be forced to be 0 after 'break;' and the caller won't be aware of it. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix NULL pointer dereference in log_dir_itemsLiu Bo2018-04-051-2/+5
| | | | | | | | | | | | | | | | | | | | | | 0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is returned, path->nodes[0] could be NULL, log_dir_items lacks such a check for <0 and we may run into a null pointer dereference panic. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2018-04-1613-71/+240
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull cifs fixes from Steve French: "SMB3 fixes, a few for stable, and some important cleanup work from Ronnie of the smb3 transport code" * tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: change validate_buf to validate_iov cifs: remove rfc1002 hardcoded constants from cifs_discard_remaining_data() cifs: Change SMB2_open to return an iov for the error parameter cifs: add resp_buf_size to the mid_q_entry structure smb3.11: replace a 4 with server->vals->header_preamble_size cifs: replace a 4 with server->vals->header_preamble_size cifs: add pdu_size to the TCP_Server_Info structure SMB311: Improve checking of negotiate security contexts SMB3: Fix length checking of SMB3.11 negotiate request CIFS: add ONCE flag for cifs_dbg type cifs: Use ULL suffix for 64-bit constant SMB3: Log at least once if tree connect fails during reconnect cifs: smb2pdu: Fix potential NULL pointer dereference
| * | cifs: change validate_buf to validate_iovRonnie Sahlberg2018-04-131-18/+21
| | | | | | | | | | | | | | | | | | Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
| * | cifs: remove rfc1002 hardcoded constants from cifs_discard_remaining_data()Ronnie Sahlberg2018-04-131-2/+3
| | | | | | | | | | | | | | | | | | Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
| * | cifs: Change SMB2_open to return an iov for the error parameterRonnie Sahlberg2018-04-133-9/+13
| | | | | | | | | | | | | | | | | | Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
| * | cifs: add resp_buf_size to the mid_q_entry structureRonnie Sahlberg2018-04-134-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | and get rid of some more calls to get_rfc1002_length() Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
| * | smb3.11: replace a 4 with server->vals->header_preamble_sizeSteve French2018-04-132-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | More cleanup of use of hardcoded 4 byte RFC1001 field size Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
| * | cifs: replace a 4 with server->vals->header_preamble_sizeRonnie Sahlberg2018-04-131-1/+1
| | | | | | | | | | | | | | | | | | Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
| * | cifs: add pdu_size to the TCP_Server_Info structureRonnie Sahlberg2018-04-134-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | and get rid of some get_rfc1002_length() in smb2 Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
| * | SMB311: Improve checking of negotiate security contextsSteve French2018-04-123-0/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SMB3.11 crypto and hash contexts were not being checked strictly enough. Add parsing and validity checking for the security contexts in the SMB3.11 negotiate response. Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>