summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-linus-20181201' of git://git.kernel.dk/linux-blockLinus Torvalds2018-12-016-7/+14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer fixes from Jens Axboe: - Single range elevator discard merge fix, that caused crashes (Ming) - Fix for a regression in O_DIRECT, where we could potentially lose the error value (Maximilian Heyne) - NVMe pull request from Christoph, with little fixes all over the map for NVMe. * tag 'for-linus-20181201' of git://git.kernel.dk/linux-block: block: fix single range discard merge nvme-rdma: fix double freeing of async event data nvme: flush namespace scanning work just before removing namespaces nvme: warn when finding multi-port subsystems without multipathing enabled fs: fix lost error code in dio_complete nvme-pci: fix surprise removal nvme-fc: initialize nvme_req(rq)->ctrl after calling __nvme_fc_init_request() nvme: Free ctrl device name on init failure
| * Merge branch 'nvme-4.20' of git://git.infradead.org/nvme into for-linusJens Axboe2018-11-304-4/+11
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe fixes from Christoph: "Various fixlets all over." * 'nvme-4.20' of git://git.infradead.org/nvme: nvme-rdma: fix double freeing of async event data nvme: flush namespace scanning work just before removing namespaces nvme: warn when finding multi-port subsystems without multipathing enabled nvme-pci: fix surprise removal nvme-fc: initialize nvme_req(rq)->ctrl after calling __nvme_fc_init_request() nvme: Free ctrl device name on init failure
| | * nvme-rdma: fix double freeing of async event dataPrabhath Sajeepa2018-11-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some error paths in configuration of admin queue free data buffer associated with async request SQE without resetting the data buffer pointer to NULL, This buffer is also freed up again if the controller is shutdown or reset. Signed-off-by: Prabhath Sajeepa <psajeepa@purestorage.com> Reviewed-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * nvme: flush namespace scanning work just before removing namespacesSagi Grimberg2018-11-301-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nvme_stop_ctrl can be called also for reset flow and there is no need to flush the scan_work as namespaces are not being removed. This can cause deadlock in rdma, fc and loop drivers since nvme_stop_ctrl barriers before controller teardown (and specifically I/O cancellation of the scan_work itself) takes place, but the scan_work will be blocked anyways so there is no need to flush it. Instead, move scan_work flush to nvme_remove_namespaces() where it really needs to flush. Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed by: James Smart <jsmart2021@gmail.com> Tested-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * nvme: warn when finding multi-port subsystems without multipathing enabledChristoph Hellwig2018-11-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | Without CONFIG_NVME_MULTIPATH enabled a multi-port subsystem might show up as invididual devices and cause problems, warn about it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme-pci: fix surprise removalIgor Konopko2018-11-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a PCIe NVMe device is not present, nvme_dev_remove_admin() calls blk_cleanup_queue() on the admin queue, which frees the hctx for that queue. Moments later, on the same path nvme_kill_queues() calls blk_mq_unquiesce_queue() on admin queue and tries to access hctx of it, which leads to following OOPS: Oops: 0000 [#1] SMP PTI RIP: 0010:sbitmap_any_bit_set+0xb/0x40 Call Trace: blk_mq_run_hw_queue+0xd5/0x150 blk_mq_run_hw_queues+0x3a/0x50 nvme_kill_queues+0x26/0x50 nvme_remove_namespaces+0xb2/0xc0 nvme_remove+0x60/0x140 pci_device_remove+0x3b/0xb0 Fixes: cb4bfda62afa2 ("nvme-pci: fix hot removal during error handling") Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * nvme-fc: initialize nvme_req(rq)->ctrl after calling __nvme_fc_init_request()Ewan D. Milne2018-11-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __nvme_fc_init_request() invokes memset() on the nvme_fcp_op_w_sgl structure, which NULLed-out the nvme_req(req)->ctrl field previously set by nvme_fc_init_request(). This apparently was not referenced until commit faf4a44fff ("nvme: support traffic based keep-alive") which now results in a crash in nvme_complete_rq(): [ 8386.897130] RIP: 0010:panic+0x220/0x26c [ 8386.901406] Code: 83 3d 6f ee 72 01 00 74 05 e8 e8 54 02 00 48 c7 c6 40 fd 5b b4 48 c7 c7 d8 8d c6 b3 31e [ 8386.922359] RSP: 0018:ffff99650019fc40 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 [ 8386.930804] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000006 [ 8386.938764] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8e325f8168b0 [ 8386.946725] RBP: ffff99650019fcb0 R08: 0000000000000000 R09: 00000000000004f8 [ 8386.954687] R10: 0000000000000000 R11: ffff99650019f9b8 R12: ffffffffb3c55f3c [ 8386.962648] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [ 8386.970613] oops_end+0xd1/0xe0 [ 8386.974116] no_context+0x1b2/0x3c0 [ 8386.978006] do_page_fault+0x32/0x140 [ 8386.982090] page_fault+0x1e/0x30 [ 8386.985786] RIP: 0010:nvme_complete_rq+0x65/0x1d0 [nvme_core] [ 8386.992195] Code: 41 bc 03 00 00 00 74 16 0f 86 c3 00 00 00 66 3d 83 00 41 bc 06 00 00 00 0f 85 e7 00 000 [ 8387.013147] RSP: 0018:ffff99650019fe18 EFLAGS: 00010246 [ 8387.018973] RAX: 0000000000000000 RBX: ffff8e322ae51280 RCX: 0000000000000001 [ 8387.026935] RDX: 0000000000000400 RSI: 0000000000000001 RDI: ffff8e322ae51280 [ 8387.034897] RBP: ffff8e322ae51280 R08: 0000000000000000 R09: ffffffffb2f0b890 [ 8387.042859] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 8387.050821] R13: 0000000000000100 R14: 0000000000000004 R15: ffff8e2b0446d990 [ 8387.058782] ? swiotlb_unmap_page+0x40/0x40 [ 8387.063448] nvme_fc_complete_rq+0x2d/0x70 [nvme_fc] [ 8387.068986] blk_done_softirq+0xa1/0xd0 [ 8387.073264] __do_softirq+0xd6/0x2a9 [ 8387.077251] run_ksoftirqd+0x26/0x40 [ 8387.081238] smpboot_thread_fn+0x10e/0x160 [ 8387.085807] kthread+0xf8/0x130 [ 8387.089309] ? sort_range+0x20/0x20 [ 8387.093198] ? kthread_stop+0x110/0x110 [ 8387.097475] ret_from_fork+0x35/0x40 [ 8387.101462] ---[ end trace 7106b0adf5e422f8 ]--- Fixes: faf4a44fff ("nvme: support traffic based keep-alive") Signed-off-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * nvme: Free ctrl device name on init failureKeith Busch2018-11-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Free the kobject name that was allocated for the controller device on failure rather than its parent. Fixes: d22524a4782a9 ("nvme: switch controller refcounting to use struct device") Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | block: fix single range discard mergeMing Lei2018-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are actually two kinds of discard merge: - one is the normal discard merge, just like normal read/write request, and call it single-range discard - another is the multi-range discard, queue_max_discard_segments(rq->q) > 1 For the former case, queue_max_discard_segments(rq->q) is 1, and we should handle this kind of discard merge like the normal read/write request. This patch fixes the following kernel panic issue[1], which is caused by not removing the single-range discard request from elevator queue. Guangwu has one raid discard test case, in which this issue is a bit easier to trigger, and I verified that this patch can fix the kernel panic issue in Guangwu's test case. [1] kernel panic log from Jens's report BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 PGD 0 P4D 0. Oops: 0000 [#1] SMP PTI CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \ 4.20.0-rc3-00649-ge64d9a554a91-dirty #14 Hardware name: Wiwynn \ Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 Workqueue: kblockd \ blk_mq_run_work_fn RIP: \ 0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 \ 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 \ 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 \ f6 87 b0 00 00 00 02 RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 \ RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8 RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000 R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300 R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000 FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: blk_mq_dispatch_rq_list+0xec/0x480 ? elv_rb_del+0x11/0x30 blk_mq_do_dispatch_sched+0x6e/0xf0 blk_mq_sched_dispatch_requests+0xfa/0x170 __blk_mq_run_hw_queue+0x5f/0xe0 process_one_work+0x154/0x350 worker_thread+0x46/0x3c0 kthread+0xf5/0x130 ? process_one_work+0x350/0x350 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x1f/0x30 Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel \ kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii \ cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq \ button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme \ nvme_core fuse sg loop efivarfs autofs4 CR2: 0000000000000148 \ ---[ end trace 340a1fb996df1b9b ]--- RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 \ 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 \ 20 72 37 f6 87 b0 00 00 00 02 Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached") Reported-by: Jens Axboe <axboe@kernel.dk> Cc: Guangwu Zhang <guazhang@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | fs: fix lost error code in dio_completeMaximilian Heyne2018-11-301-2/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e259221763a40403d5bb232209998e8c45804ab8 ("fs: simplify the generic_write_sync prototype") reworked callers of generic_write_sync(), and ended up dropping the error return for the directio path. Prior to that commit, in dio_complete(), an error would be bubbled up the stack, but after that commit, errors passed on to dio_complete were eaten up. This was reported on the list earlier, and a fix was proposed in https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/, but never followed up with. We recently hit this bug in our testing where fencing io errors, which were previously erroring out with EIO, were being returned as success operations after this commit. The fix proposed on the list earlier was a little short -- it would have still called generic_write_sync() in case `ret` already contained an error. This fix ensures generic_write_sync() is only called when there's no pending error in the write. Additionally, transferred is replaced with ret to bring this code in line with other callers. Fixes: e259221763a4 ("fs: simplify the generic_write_sync prototype") Reported-by: Ravi Nankani <rnankani@amazon.com> Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: Torsten Mehlan <tomeh@amazon.de> CC: Uwe Dannowski <uwed@amazon.de> CC: Amit Shah <aams@amazon.de> CC: David Woodhouse <dwmw@amazon.co.uk> CC: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'pci-v4.20-fixes-2' of ↵Linus Torvalds2018-12-014-24/+13
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fixes from Bjorn Helgaas: - Fix a link speed checking interface that broke PCIe gen3 cards in gen1 slots (Mikulas Patocka) - Fix an imx6 link training error (Trent Piepho) - Fix a layerscape outbound window accessor calling error (Hou Zhiqiang) - Fix a DesignWare endpoint MSI-X address calculation error (Gustavo Pimentel) * tag 'pci-v4.20-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: Fix incorrect value returned from pcie_get_speed_cap() PCI: dwc: Fix MSI-X EP framework address calculation bug PCI: layerscape: Fix wrong invocation of outbound window disable accessor PCI: imx6: Fix link training status detection in link up check
| * \ Merge remote-tracking branch 'lorenzo/pci/controller-fixes' into for-linusBjorn Helgaas2018-12-013-11/+2
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Fix DesignWare endpoint MSI-X address calculation bug (Gustavo Pimentel) - Fix Layerscape outbound window disable usage (Hou Zhiqiang) - Fix imx6 link up detection (Trent Piepho) * lorenzo/pci/controller-fixes: PCI: dwc: Fix MSI-X EP framework address calculation bug PCI: layerscape: Fix wrong invocation of outbound window disable accessor PCI: imx6: Fix link training status detection in link up check
| | * | PCI: dwc: Fix MSI-X EP framework address calculation bugGustavo Pimentel2018-11-271-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix an error caused by 3-bit right rotation on offset address calculation of MSI-X table in dw_pcie_ep_raise_msix_irq(). The initial testing code was setting by default the offset address of MSI-X table to zero, so that even with a 3-bit right rotation the computed result would still be zero and valid, therefore this bug went unnoticed. Fixes: beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler") Signed-off-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com> [lorenzo.pieralisi@arm.com: updated commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: stable@vger.kernel.org
| | * | PCI: layerscape: Fix wrong invocation of outbound window disable accessorHou Zhiqiang2018-11-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The order of parameters is not correct when invoking the outbound window disable routine. Fix it. Fixes: 4a2745d760fa ("PCI: layerscape: Disable outbound windows configured by bootloader") Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@nxp.com> [lorenzo.pieralisi@arm.com: commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: stable@vger.kernel.org
| | * | PCI: imx6: Fix link training status detection in link up checkTrent Piepho2018-11-201-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This bug was introduced in the interaction for two commits on either branch of the merge commit 562df5c8521e ("Merge branch 'pci/host-designware' into next"). Commit 4d107d3b5a68 ("PCI: imx6: Move link up check into imx6_pcie_wait_for_link()"), changed imx6_pcie_wait_for_link() to poll the link status register directly, checking for link up and not training, and made imx6_pcie_link_up() only check the link up bit (once, not a polling loop). While commit 886bc5ceb5cc ("PCI: designware: Add generic dw_pcie_wait_for_link()"), replaced the loop in imx6_pcie_wait_for_link() with a call to a new dwc core function, which polled imx6_pcie_link_up(), which still checked both link up and not training in a loop. When these two commits were merged, the version of imx6_pcie_wait_for_link() from 886bc5ceb5cc was kept, which eliminated the link training check placed there by 4d107d3b5a68. However, the version of imx6_pcie_link_up() from 4d107d3b5a68 was kept, which eliminated the link training check that had been there and was moved to imx6_pcie_wait_for_link(). The result was the link training check got lost for the imx6 driver. Eliminate imx6_pcie_link_up() so that the default handler, dw_pcie_link_up(), is used instead. The default handler has the correct code, which checks for link up and also that it still is not training, fixing the regression. Fixes: 562df5c8521e ("Merge branch 'pci/host-designware' into next") Signed-off-by: Trent Piepho <tpiepho@impinj.com> [lorenzo.pieralisi@arm.com: rewrote the commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Lucas Stach <l.stach@pengutronix.de> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Joao Pinto <Joao.Pinto@synopsys.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Richard Zhu <hongxing.zhu@nxp.com>
| * | | PCI: Fix incorrect value returned from pcie_get_speed_cap()Mikulas Patocka2018-12-011-13/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask the register and compare it against them. This fixes errors like this: amdgpu: [powerplay] failed to send message 261 ret is 0 when a PCIe-v3 card is plugged into a PCIe-v1 slot, because the slot is being incorrectly reported as PCIe-v3 capable. 6cf57be0f78e, which appeared in v4.17, added pcie_get_speed_cap() with the incorrect test of PCI_EXP_LNKCAP_SLS as a bitmask. 5d9a63304032, which appeared in v4.19, changed amdgpu to use pcie_get_speed_cap(), so the amdgpu bug reports below are regressions in v4.19. Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed") Fixes: 5d9a63304032 ("drm/amdgpu: use pcie functions for link width and speed") Link: https://bugs.freedesktop.org/show_bug.cgi?id=108704 Link: https://bugs.freedesktop.org/show_bug.cgi?id=108778 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> [bhelgaas: update comment, remove use of PCI_EXP_LNKCAP_SLS_8_0GB and PCI_EXP_LNKCAP_SLS_16_0GB since those should be covered by PCI_EXP_LNKCAP2, remove test of PCI_EXP_LNKCAP for zero, since that register is required] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # v4.17+
* | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2018-12-0125-177/+319
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc fixes from Andrew Morton: "31 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits) ocfs2: fix potential use after free mm/khugepaged: fix the xas_create_range() error path mm/khugepaged: collapse_shmem() do not crash on Compound mm/khugepaged: collapse_shmem() without freezing new_page mm/khugepaged: minor reorderings in collapse_shmem() mm/khugepaged: collapse_shmem() remember to clear holes mm/khugepaged: fix crashes due to misaccounted holes mm/khugepaged: collapse_shmem() stop if punched or truncated mm/huge_memory: fix lockdep complaint on 32-bit i_size_read() mm/huge_memory: splitting set mapping+index before unfreeze mm/huge_memory: rename freeze_page() to unmap_page() initramfs: clean old path before creating a hardlink kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace psi: make disabling/enabling easier for vendor kernels proc: fixup map_files test on arm debugobjects: avoid recursive calls with kmemleak userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set userfaultfd: shmem: add i_size checks userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem ...
| * | | | ocfs2: fix potential use after freePan Bian2018-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ocfs2_get_dentry() calls iput(inode) to drop the reference count of inode, and if the reference count hits 0, inode is freed. However, in this function, it then reads inode->i_generation, which may result in a use after free bug. Move the put operation later. Link: http://lkml.kernel.org/r/1543109237-110227-1-git-send-email-bianpan2016@163.com Fixes: 781f200cb7a("ocfs2: Remove masklog ML_EXPORT.") Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/khugepaged: fix the xas_create_range() error pathHugh Dickins2018-11-301-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | collapse_shmem()'s xas_nomem() is very unlikely to fail, but it is rightly given a failure path, so move the whole xas_create_range() block up before __SetPageLocked(new_page): so that it does not need to remember to unlock_page(new_page). Add the missing mem_cgroup_cancel_charge(), and set (currently unused) result to SCAN_FAIL rather than SCAN_SUCCEED. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261531200.2275@eggly.anvils Fixes: 77da9389b9d5 ("mm: Convert collapse_shmem to XArray") Signed-off-by: Hugh Dickins <hughd@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/khugepaged: collapse_shmem() do not crash on CompoundHugh Dickins2018-11-301-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before it holds page lock of the first page, racing truncation then extension might conceivably have inserted a hugepage there already. Fail with the SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or otherwise mishandling the unexpected hugepage - though later we might code up a more constructive way of handling it, with SCAN_SUCCESS. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/khugepaged: collapse_shmem() without freezing new_pageHugh Dickins2018-11-301-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | khugepaged's collapse_shmem() does almost all of its work, to assemble the huge new_page from 512 scattered old pages, with the new_page's refcount frozen to 0 (and refcounts of all old pages so far also frozen to 0). Including shmem_getpage() to read in any which were out on swap, memory reclaim if necessary to allocate their intermediate pages, and copying over all the data from old to new. Imagine the frozen refcount as a spinlock held, but without any lock debugging to highlight the abuse: it's not good, and under serious load heads into lockups - speculative getters of the page are not expecting to spin while khugepaged is rescheduled. One can get a little further under load by hacking around elsewhere; but fortunately, freezing the new_page turns out to have been entirely unnecessary, with no hacks needed elsewhere. The huge new_page lock is already held throughout, and guards all its subpages as they are brought one by one into the page cache tree; and anything reading the data in that page, without the lock, before it has been marked PageUptodate, would already be in the wrong. So simply eliminate the freezing of the new_page. Each of the old pages remains frozen with refcount 0 after it has been replaced by a new_page subpage in the page cache tree, until they are all unfrozen on success or failure: just as before. They could be unfrozen sooner, but cause no problem once no longer visible to find_get_entry(), filemap_map_pages() and other speculative lookups. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/khugepaged: minor reorderings in collapse_shmem()Hugh Dickins2018-11-301-40/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several cleanups in collapse_shmem(): most of which probably do not really matter, beyond doing things in a more familiar and reassuring order. Simplify the failure gotos in the main loop, and on success update stats while interrupts still disabled from the last iteration. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/khugepaged: collapse_shmem() remember to clear holesHugh Dickins2018-11-301-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp flags khugepaged uses to allocate a huge page - in all common cases it would just be a waste of effort - so collapse_shmem() must remember to clear out any holes that it instantiates. The obvious place to do so, where they are put into the page cache tree, is not a good choice: because interrupts are disabled there. Leave it until further down, once success is assured, where the other pages are copied (before setting PageUptodate). Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/khugepaged: fix crashes due to misaccounted holesHugh Dickins2018-11-302-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later closed and unlinked. khugepaged's collapse_shmem() was forgetting to update mapping->nrpages on the rollback path, after it had added but then needs to undo some holes. There is indeed an irritating asymmetry between shmem_charge(), whose callers want it to increment nrpages after successfully accounting blocks, and shmem_uncharge(), when __delete_from_page_cache() already decremented nrpages itself: oh well, just add a comment on that to them both. And shmem_recalc_inode() is supposed to be called when the accounting is expected to be in balance (so it can deduce from imbalance that reclaim discarded some pages): so change shmem_charge() to update nrpages earlier (though it's rare for the difference to matter at all). Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils Fixes: 800d8c63b2e98 ("shmem: add huge pages support") Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/khugepaged: collapse_shmem() stop if punched or truncatedHugh Dickins2018-11-301-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Huge tmpfs testing showed that although collapse_shmem() recognizes a concurrently truncated or hole-punched page correctly, its handling of holes was liable to refill an emptied extent. Add check to stop that. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()Hugh Dickins2018-11-301-6/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that __split_huge_page() was using i_size_read() while holding the irq-safe lru_lock and page tree lock, but the 32-bit i_size_read() uses an irq-unsafe seqlock which should not be nested inside them. Instead, read the i_size earlier in split_huge_page_to_list(), and pass the end offset down to __split_huge_page(): all while holding head page lock, which is enough to prevent truncation of that extent before the page tree lock has been taken. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils Fixes: baa355fd33142 ("thp: file pages support for split_huge_page()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/huge_memory: splitting set mapping+index before unfreezeHugh Dickins2018-11-301-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page). Move the setting of mapping and index up before the page_ref_unfreeze() in __split_huge_page_tail() to fix this: so that a page cache lookup cannot get a reference while the tail's mapping and index are unstable. In fact, might as well move them up before the smp_wmb(): I don't see an actual need for that, but if I'm missing something, this way round is safer than the other, and no less efficient. You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is misplaced, and should be left until after the trylock_page(); but left as is has not crashed since, and gives more stringent assurance. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()") Requires: 605ca5ede764 ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/huge_memory: rename freeze_page() to unmap_page()Hugh Dickins2018-11-302-16/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The term "freeze" is used in several ways in the kernel, and in mm it has the particular meaning of forcing page refcount temporarily to 0. freeze_page() is just too confusing a name for a function that unmaps a page: rename it unmap_page(), and rename unfreeze_page() remap_page(). Went to change the mention of freeze_page() added later in mm/rmap.c, but found it to be incorrect: ordinary page reclaim reaches there too; but the substance of the comment still seems correct, so edit it down. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | initramfs: clean old path before creating a hardlinkLi Zhijian2018-11-301-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sys_link() can fail due to the new path already existing. This case ofen occurs when we use a concated initrd, for example: 1) prepare a basic rootfs, it contains a regular files rc.local lizhijian@:~/yocto-tiny-i386-2016-04-22$ cat etc/rc.local #!/bin/sh echo "Running /etc/rc.local..." yocto-tiny-i386-2016-04-22$ find . | sed 's,^\./,,' | cpio -o -H newc | gzip -n -9 >../rootfs.cgz 2) create a extra initrd which also includes a etc/rc.local lizhijian@:~/lkp-x86_64/etc$ echo "append initrd" >rc.local lizhijian@:~/lkp/lkp-x86_64/etc$ cat rc.local append initrd lizhijian@:~/lkp/lkp-x86_64/etc$ ln rc.local rc.local.hardlink append initrd lizhijian@:~/lkp/lkp-x86_64/etc$ stat rc.local rc.local.hardlink File: 'rc.local' Size: 14 Blocks: 8 IO Block: 4096 regular file Device: 801h/2049d Inode: 11296086 Links: 2 Access: (0664/-rw-rw-r--) Uid: ( 1002/lizhijian) Gid: ( 1002/lizhijian) Access: 2018-11-15 16:08:28.654464815 +0800 Modify: 2018-11-15 16:07:57.514903210 +0800 Change: 2018-11-15 16:08:24.180228872 +0800 Birth: - File: 'rc.local.hardlink' Size: 14 Blocks: 8 IO Block: 4096 regular file Device: 801h/2049d Inode: 11296086 Links: 2 Access: (0664/-rw-rw-r--) Uid: ( 1002/lizhijian) Gid: ( 1002/lizhijian) Access: 2018-11-15 16:08:28.654464815 +0800 Modify: 2018-11-15 16:07:57.514903210 +0800 Change: 2018-11-15 16:08:24.180228872 +0800 Birth: - lizhijian@:~/lkp/lkp-x86_64$ find . | sed 's,^\./,,' | cpio -o -H newc | gzip -n -9 >../rc-local.cgz lizhijian@:~/lkp/lkp-x86_64$ gzip -dc ../rc-local.cgz | cpio -t . etc etc/rc.local.hardlink <<< it will be extracted first at this initrd etc/rc.local 3) concate 2 initrds and boot lizhijian@:~/lkp$ cat rootfs.cgz rc-local.cgz >concate-initrd.cgz lizhijian@:~/lkp$ qemu-system-x86_64 -nographic -enable-kvm -cpu host -smp 1 -m 1024 -kernel ~/lkp/linux/arch/x86/boot/bzImage -append "console=ttyS0 earlyprint=ttyS0 ignore_loglevel" -initrd ./concate-initr.cgz -serial stdio -nodefaults In this case, sys_link(2) will fail and return -EEXIST, so we can only get the rc.local at rootfs.cgz instead of rc-local.cgz [akpm@linux-foundation.org: move code to avoid forward declaration] Link: http://lkml.kernel.org/r/1542352368-13299-1-git-send-email-lizhijian@cn.fujitsu.com Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Cc: Philip Li <philip.li@intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Li Zhijian <zhijianx.li@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notraceAnders Roxell2018-11-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since __sanitizer_cov_trace_pc() is marked as notrace, function calls in __sanitizer_cov_trace_pc() shouldn't be traced either. ftrace_graph_caller() gets called for each function that isn't marked 'notrace', like canonicalize_ip(). This is the call trace from a run: [ 139.644550] ftrace_graph_caller+0x1c/0x24 [ 139.648352] canonicalize_ip+0x18/0x28 [ 139.652313] __sanitizer_cov_trace_pc+0x14/0x58 [ 139.656184] sched_clock+0x34/0x1e8 [ 139.659759] trace_clock_local+0x40/0x88 [ 139.663722] ftrace_push_return_trace+0x8c/0x1f0 [ 139.667767] prepare_ftrace_return+0xa8/0x100 [ 139.671709] ftrace_graph_caller+0x1c/0x24 Rework so that check_kcov_mode() and canonicalize_ip() that are called from __sanitizer_cov_trace_pc() are also marked as notrace. Link: http://lkml.kernel.org/r/20181128081239.18317-1-anders.roxell@linaro.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signen-off-by: Anders Roxell <anders.roxell@linaro.org> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | psi: make disabling/enabling easier for vendor kernelsJohannes Weiner2018-11-305-14/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mel Gorman reports a hackbench regression with psi that would prohibit shipping the suse kernel with it default-enabled, but he'd still like users to be able to opt in at little to no cost to others. With the current combination of CONFIG_PSI and the psi_disabled bool set from the commandline, this is a challenge. Do the following things to make it easier: 1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros to enable CONFIG_PSI in their kernel but leave the feature disabled unless a user requests it at boot-time. To avoid double negatives, rename psi_disabled= to psi=. 2. Make psi_disabled a static branch to eliminate any branch costs when the feature is disabled. In terms of numbers before and after this patch, Mel says: : The following is a comparision using CONFIG_PSI=n as a baseline against : your patch and a vanilla kernel : : 4.20.0-rc4 4.20.0-rc4 4.20.0-rc4 : kconfigdisable-v1r1 vanilla psidisable-v1r1 : Amean 1 1.3100 ( 0.00%) 1.3923 ( -6.28%) 1.3427 ( -2.49%) : Amean 3 3.8860 ( 0.00%) 4.1230 * -6.10%* 3.8860 ( -0.00%) : Amean 5 6.8847 ( 0.00%) 8.0390 * -16.77%* 6.7727 ( 1.63%) : Amean 7 9.9310 ( 0.00%) 10.8367 * -9.12%* 9.9910 ( -0.60%) : Amean 12 16.6577 ( 0.00%) 18.2363 * -9.48%* 17.1083 ( -2.71%) : Amean 18 26.5133 ( 0.00%) 27.8833 * -5.17%* 25.7663 ( 2.82%) : Amean 24 34.3003 ( 0.00%) 34.6830 ( -1.12%) 32.0450 ( 6.58%) : Amean 30 40.0063 ( 0.00%) 40.5800 ( -1.43%) 41.5087 ( -3.76%) : Amean 32 40.1407 ( 0.00%) 41.2273 ( -2.71%) 39.9417 ( 0.50%) : : It's showing that the vanilla kernel takes a hit (as the bisection : indicated it would) and that disabling PSI by default is reasonably : close in terms of performance for this particular workload on this : particular machine so; Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | proc: fixup map_files test on armAlexey Dobriyan2018-11-301-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://bugs.linaro.org/show_bug.cgi?id=3782 Turns out arm doesn't permit mapping address 0, so try minimum virtual address instead. Link: http://lkml.kernel.org/r/20181113165446.GA28157@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Rafael David Tinoco <rafael.tinoco@linaro.org> Tested-by: Rafael David Tinoco <rafael.tinoco@linaro.org> Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | debugobjects: avoid recursive calls with kmemleakQian Cai2018-11-301-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CONFIG_DEBUG_OBJECTS_RCU_HEAD does not play well with kmemleak due to recursive calls. fill_pool kmemleak_ignore make_black_object put_object __call_rcu (kernel/rcu/tree.c) debug_rcu_head_queue debug_object_activate debug_object_init fill_pool kmemleak_ignore make_black_object ... So add SLAB_NOLEAKTRACE to kmem_cache_create() to not register newly allocated debug objects at all. Link: http://lkml.kernel.org/r/20181126165343.2339-1-cai@gmx.us Signed-off-by: Qian Cai <cai@gmx.us> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not setAndrea Arcangeli2018-11-301-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set the page dirty if VM_WRITE is not set because in such case the pte won't be marked dirty and the page would be reclaimed without writepage (i.e. swapout in the shmem case). This was found by source review. Most apps (certainly including QEMU) only use UFFDIO_COPY on PROT_READ|PROT_WRITE mappings or the app can't modify the memory in the first place. This is for correctness and it could help the non cooperative use case to avoid unexpected data loss. Link: http://lkml.kernel.org/r/20181126173452.26955-6-aarcange@redhat.com Reviewed-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | userfaultfd: shmem: add i_size checksAndrea Arcangeli2018-11-302-4/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With MAP_SHARED: recheck the i_size after taking the PT lock, to serialize against truncate with the PT lock. Delete the page from the pagecache if the i_size_read check fails. With MAP_PRIVATE: check the i_size after the PT lock before mapping anonymous memory or zeropages into the MAP_PRIVATE shmem mapping. A mostly irrelevant cleanup: like we do the delete_from_page_cache() pagecache removal after dropping the PT lock, the PT lock is a spinlock so drop it before the sleepable page lock. Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Hugh Dickins <hughd@google.com> Reported-by: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmasAndrea Arcangeli2018-11-302-9/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the VMA to register the uffd onto is found, check that it has VM_MAYWRITE set before allowing registration. This way we inherit all common code checks before allowing to fill file holes in shmem and hugetlbfs with UFFDIO_COPY. The userfaultfd memory model is not applicable for readonly files unless it's a MAP_PRIVATE. Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com Fixes: ff62a3421044 ("hugetlb: implement memfd sealing") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Hugh Dickins <hughd@google.com> Reported-by: Jann Horn <jannh@google.com> Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Cc: <stable@vger.kernel.org> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmemAndrea Arcangeli2018-11-301-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Userfaultfd did not create private memory when UFFDIO_COPY was invoked on a MAP_PRIVATE shmem mapping. Instead it wrote to the shmem file, even when that had not been opened for writing. Though, fortunately, that could only happen where there was a hole in the file. Fix the shmem-backed implementation of UFFDIO_COPY to create private memory for MAP_PRIVATE mappings. The hugetlbfs-backed implementation was already correct. This change is visible to userland, if userfaultfd has been used in unintended ways: so it introduces a small risk of incompatibility, but is necessary in order to respect file permissions. An app that uses UFFDIO_COPY for anything like postcopy live migration won't notice the difference, and in fact it'll run faster because there will be no copy-on-write and memory waste in the tmpfs pagecache anymore. Userfaults on MAP_PRIVATE shmem keep triggering only on file holes like before. The real zeropage can also be built on a MAP_PRIVATE shmem mapping through UFFDIO_ZEROPAGE and that's safe because the zeropage pte is never dirty, in turn even an mprotect upgrading the vma permission from PROT_READ to PROT_READ|PROT_WRITE won't make the zeropage pte writable. Link: http://lkml.kernel.org/r/20181126173452.26955-3-aarcange@redhat.com Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | userfaultfd: use ENOENT instead of EFAULT if the atomic copy user failsAndrea Arcangeli2018-11-303-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "userfaultfd shmem updates". Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the lack of the VM_MAYWRITE check and the lack of i_size checks. Then looking into the above we also fixed the MAP_PRIVATE case. Hugh by source review also found a data loss source if UFFDIO_COPY is used on shmem MAP_SHARED PROT_READ mappings (the production usages incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't happen in those production usages like with QEMU). The whole patchset is marked for stable. We verified QEMU postcopy live migration with guest running on shmem MAP_PRIVATE run as well as before after the fix of shmem MAP_PRIVATE. Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU unconditionally invokes a punch hole if the guest mapping is filebacked and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and for the anon backend). This patch (of 5): We internally used EFAULT to communicate with the caller, switch to ENOENT, so EFAULT can be used as a non internal retval. Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: <stable@vger.kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | lib/test_kmod.c: fix rmmod double freeLuis Chamberlain2018-11-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We free the misc device string twice on rmmod; fix this. Without this we cannot remove the module without crashing. Link: http://lkml.kernel.org/r/20181124050500.5257-1-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reported-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> [4.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | hfsplus: do not free node before usingPan Bian2018-11-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hfs_bmap_free() frees node via hfs_bnode_put(node). However it then reads node->this when dumping error message on an error path, which may result in a use-after-free bug. This patch frees node only when it is never used. Link: http://lkml.kernel.org/r/1543053441-66942-1-git-send-email-bianpan2016@163.com Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | hfs: do not free node before usingPan Bian2018-11-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hfs_bmap_free() frees the node via hfs_bnode_put(node). However, it then reads node->this when dumping error message on an error path, which may result in a use-after-free bug. This patch frees the node only when it is never again used. Link: http://lkml.kernel.org/r/1542963889-128825-1-git-send-email-bianpan2016@163.com Fixes: a1185ffa2fc ("HFS rewrite") Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joe Perches <joe@perches.com> Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | proc: update MAINTAINERS with proc.txtAlexey Dobriyan2018-11-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Turns out that /proc has official documentation and people even trying to keep it uptodate. Link: http://lkml.kernel.org/r/20181116134630.GA8004@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/page_alloc.c: fix calculation of pgdat->nr_zonesWei Yang2018-11-301-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | init_currently_empty_zone() will adjust pgdat->nr_zones and set it to 'zone_idx(zone) + 1' unconditionally. This is correct in the normal case, while not exact in hot-plug situation. This function is used in two places: * free_area_init_core() * move_pfn_range_to_zone() In the first case, we are sure zone index increase monotonically. While in the second one, this is under users control. One way to reproduce this is: ---------------------------- 1. create a virtual machine with empty node1 -m 4G,slots=32,maxmem=32G \ -smp 4,maxcpus=8 \ -numa node,nodeid=0,mem=4G,cpus=0-3 \ -numa node,nodeid=1,mem=0G,cpus=4-7 2. hot-add cpu 3-7 cpu-add [3-7] 2. hot-add memory to nod1 object_add memory-backend-ram,id=ram0,size=1G device_add pc-dimm,id=dimm0,memdev=ram0,node=1 3. online memory with following order echo online_movable > memory47/state echo online > memory40/state After this, node1 will have its nr_zones equals to (ZONE_NORMAL + 1) instead of (ZONE_MOVABLE + 1). Michal said: "Having an incorrect nr_zones might result in all sorts of problems which would be quite hard to debug (e.g. reclaim not considering the movable zone). I do not expect many users would suffer from this it but still this is trivial and obviously right thing to do so backporting to the stable tree shouldn't be harmful (last famous words)" Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: use swp_offset as key in shmem_replace_page()Yu Zhao2018-11-301-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We changed the key of swap cache tree from swp_entry_t.val to swp_offset. We need to do so in shmem_replace_page() as well. Hugh said: "shmem_replace_page() has been wrong since the day I wrote it: good enough to work on swap "type" 0, which is all most people ever use (especially those few who need shmem_replace_page() at all), but broken once there are any non-0 swp_type bits set in the higher order bits" Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com Fixes: f6ab1f7f6b2d ("mm, swap: use offset of swap entry as key of swap cache") Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: cleancache: fix corruption on missed inode invalidationPavel Tikhomirov2018-11-301-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If all pages are deleted from the mapping by memory reclaim and also moved to the cleancache: __delete_from_page_cache (no shadow case) unaccount_page_cache_page cleancache_put_page page_cache_delete mapping->nrpages -= nr (nrpages becomes 0) We don't clean the cleancache for an inode after final file truncation (removal). truncate_inode_pages_final check (nrpages || nrexceptional) is false no truncate_inode_pages no cleancache_invalidate_inode(mapping) These way when reading the new file created with same inode we may get these trash leftover pages from cleancache and see wrong data instead of the contents of the new file. Fix it by always doing truncate_inode_pages which is already ready for nrpages == 0 && nrexceptional == 0 case and just invalidates inode. [akpm@linux-foundation.org: add comment, per Jan] Link: http://lkml.kernel.org/r/20181112095734.17979-1-ptikhomirov@virtuozzo.com Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | ocfs2: fix deadlock caused by ocfs2_defrag_extent()Larry Chen2018-11-301-21/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ocfs2_defrag_extent may fall into deadlock. ocfs2_ioctl_move_extents ocfs2_ioctl_move_extents ocfs2_move_extents ocfs2_defrag_extent ocfs2_lock_allocators_move_extents ocfs2_reserve_clusters inode_lock GLOBAL_BITMAP_SYSTEM_INODE __ocfs2_flush_truncate_log inode_lock GLOBAL_BITMAP_SYSTEM_INODE As backtrace shows above, ocfs2_reserve_clusters() will call inode_lock against the global bitmap if local allocator has not sufficient cluters. Once global bitmap could meet the demand, ocfs2_reserve_cluster will return success with global bitmap locked. After ocfs2_reserve_cluster(), if truncate log is full, __ocfs2_flush_truncate_log() will definitely fall into deadlock because it needs to inode_lock global bitmap, which has already been locked. To fix this bug, we could remove from ocfs2_lock_allocators_move_extents() the code which intends to lock global allocator, and put the removed code after __ocfs2_flush_truncate_log(). ocfs2_lock_allocators_move_extents() is referred by 2 places, one is here, the other does not need the data allocator context, which means this patch does not affect the caller so far. Link: http://lkml.kernel.org/r/20181101071422.14470-1-lchen@suse.com Signed-off-by: Larry Chen <lchen@suse.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/gup: finish consolidating error handlingJohn Hubbard2018-11-301-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit df06b37ffe5a ("mm/gup: cache dev_pagemap while pinning pages") attempted to operate on each page that get_user_pages had retrieved. In order to do that, it created a common exit point from the routine. However, one case was missed, which this patch fixes up. Also, there was still an unnecessary shadow declaration (with a different type) of the "ret" variable, which this patch removes. Keith's description of the situation is: This also fixes a potentially leaked dev_pagemap reference count if a failure occurs when an iteration crosses a vma boundary. I don't think it's normal to have different vma's on a users mapped zone device memory, but good to fix anyway. I actually thought that this code: /* first iteration or cross vma bound */ if (!vma || start >= vma->vm_end) { vma = find_extend_vma(mm, start); if (!vma && in_gate_area(mm, start)) { ret = get_gate_page(mm, start & PAGE_MASK, gup_flags, &vma, pages ? &pages[i] : NULL); if (ret) goto out; dealt with the "you're trying to pin the gate page, as part of this call", rather than the generic case of crossing a vma boundary. (I think there's a fine point that I must be overlooking.) But it's still a valid case, either way. Link: http://lkml.kernel.org/r/20181121081402.29641-2-jhubbard@nvidia.com Fixes: df06b37ffe5a4 ("mm/gup: cache dev_pagemap while pinning pages") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | MAINTAINERS: name change for LuisLuis Chamberlain2018-11-301-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My name has changed, works better than Global Entry I tell ya. Link: http://lkml.kernel.org/r/20181122003138.7752-1-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge tag 'mips_fixes_4.20_4' of ↵Linus Torvalds2018-12-013-25/+25
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull few more MIPS fixes from Paul Burton: - Fix mips_get_syscall_arg() to operate on the task specified when detecting o32 tasks running on MIPS64 kernels. - Fix some incorrect GPIO pin muxing for the MT7620 SoC. - Update the linux-mips mailing list address. * tag 'mips_fixes_4.20_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: MAINTAINERS: Update linux-mips mailing list address MIPS: ralink: Fix mt7620 nd_sd pinmux mips: fix mips_get_syscall_arg o32 check
| * | | | | MAINTAINERS: Update linux-mips mailing list addressPaul Burton2018-12-011-23/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The linux-mips.org infrastructure has been unreliable recently & nobody with sufficient access to fix it is around to do so. As a result we're moving away from it, and part of this is migrating our mailing list to kernel.org. Replace all instances of linux-mips@linux-mips.org in MAINTAINERS with the shiny new linux-mips@vger.kernel.org address. The new list is now being archived on kernel.org at https://lore.kernel.org/linux-mips/ which also holds the history of the old linux-mips.org list. Signed-off-by: Paul Burton <paul.burton@mips.com> Cc: linux-mips@vger.kernel.org Cc: linux-mips@linux-mips.org