summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* libata: add refcounting to ata_hostTaras Kondratiuk2018-03-134-8/+43
| | | | | | | | | | | | | | | | | | | | | | After commit 9a6d6a2ddabb ("ata: make ata port as parent device of scsi host") manual driver unbind/remove causes use-after-free. Unbind unconditionally invokes devres_release_all() which calls ata_host_release() and frees ata_host/ata_port memory while it is still being referenced as a parent of SCSI host. When SCSI host is finally released scsi_host_dev_release() calls put_device(parent) and accesses freed ata_port memory. Add reference counting to make sure that ata_host lives long enough. Bug report: https://lkml.org/lkml/2017/11/1/945 Fixes: 9a6d6a2ddabb ("ata: make ata port as parent device of scsi host") Cc: Tejun Heo <tj@kernel.org> Cc: Lin Ming <minggr@gmail.com> Cc: linux-ide@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Taras Kondratiuk <takondra@cisco.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* pata_bk3710: clarify license version and use SPDX headerBartlomiej Zolnierkiewicz2018-03-011-5/+3
| | | | | | | | | - clarify license version (it should be GPL 2.0) - use SPDX header Acked-by: Sekhar Nori <nsekhar@ti.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* pata_falcon: clarify license version and use SPDX headerBartlomiej Zolnierkiewicz2018-03-011-5/+3
| | | | | | | | | | - clarify license version (it should be GPL 2.0) - use SPDX header Cc: Michael Schmitz <schmitzmic@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* pata_it821x: Delete an error message for a failed memory allocation in ↵Markus Elfring2018-02-181-3/+3
| | | | | | | | | | | it821x_firmware_command() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Tejun Heo <tj@kernel.org>
* pata_macio: Delete an error message for a failed memory allocation in two ↵Markus Elfring2018-02-181-8/+4
| | | | | | | | | | | functions Omit an extra message for a memory allocation failure in these functions. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Tejun Heo <tj@kernel.org>
* pata_mpc52xx: Delete an error message for a failed memory allocation in ↵Markus Elfring2018-02-181-1/+0
| | | | | | | | | | | mpc52xx_ata_probe() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Tejun Heo <tj@kernel.org>
* sata_dwc_460ex: Delete an error message for a failed memory allocation in ↵Markus Elfring2018-02-181-1/+0
| | | | | | | | | | | | sata_dwc_port_start() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* pata_samsung_cf: Delete an error message for a failed memory allocation in ↵Markus Elfring2018-02-181-3/+1
| | | | | | | | | | | pata_s3c_probe() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Tejun Heo <tj@kernel.org>
* pata_arasan_cf: Delete an unnecessary variable initialisation in ↵Markus Elfring2018-02-181-1/+1
| | | | | | | | | | arasan_cf_probe() The local variable "ret" will eventually be set to an appropriate value a bit later. Thus omit the explicit initialisation at the beginning. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Tejun Heo <tj@kernel.org>
* pata_arasan_cf: Delete an error message for a failed memory allocation in ↵Markus Elfring2018-02-181-3/+1
| | | | | | | | | | | arasan_cf_probe() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Tejun Heo <tj@kernel.org>
* libata: update documentation for sysfs interfacesAishwarya Pant2018-02-132-0/+147
| | | | | | | | | | | | | | | | Dcoumentation has been added by parsing through git commit history and reading code. This might be useful for scripting and tracking changes in the ABI. I do not have complete descriptions for the following 3 attributes; they have been annotated with the comment [to be documented] - /sys/class/scsi_host/hostX/ahci_port_cmd /sys/class/scsi_host/hostX/ahci_host_caps /sys/class/scsi_host/hostX/ahci_host_cap2 Signed-off-by: Aishwarya Pant <aishpant@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* ata: sata_rcar: Remove unused variable in sata_rcar_init_controller()Geert Uytterhoeven2018-02-131-1/+0
| | | | | | | | | drivers/ata/sata_rcar.c: In function 'sata_rcar_init_controller': drivers/ata/sata_rcar.c:821:8: warning: unused variable 'base' [-Wunused-variable] Fixes: da77d76b95a0e894 ("sata_rcar: Reset SATA PHY when Salvator-X board resumes") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Tejun Heo <tj@kernel.org>
* libata: transport: cleanup documentation of sysfs interfaceAishwarya Pant2018-02-131-71/+100
| | | | | | | | | | | | | Clean-up the documentation of sysfs interfaces to be in the same format as described in Documentation/ABI/README. This will be useful for tracking changes in the ABI. Attributes are grouped by function (device, link or port) and then by date added. This patch also adds documentation for one attribute - /sys/class/ata_port/ataX/port_no Signed-off-by: Aishwarya Pant <aishpant@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* sata_rcar: Reset SATA PHY when Salvator-X board resumesKhiem Nguyen2018-02-121-23/+40
| | | | | | | | | | | | | | Because power of Salvator-X board is cut off in suspend, it needs to reset SATA PHY state in resume. Otherwise, SATA partition could not be accessed anymore. Signed-off-by: Khiem Nguyen <khiem.nguyen.xt@rvc.renesas.com> Signed-off-by: Hien Dang <hien.dang.eb@rvc.renesas.com> [reinit phy in sata_rcar_resume() function on R-Car Gen3 only] [factor out SATA module init sequence] [fixed the prefix for the subject] Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* libata: don't try to pass through NCQ commands to non-NCQ devicesEric Biggers2018-02-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzkaller hit a WARN() in ata_bmdma_qc_issue() when writing to /dev/sg0. This happened because it issued an ATA pass-through command (ATA_16) where the protocol field indicated that NCQ should be used -- but the device did not support NCQ. We could just remove the WARN() from libata-sff.c, but the real problem seems to be that the SCSI -> ATA translation code passes through NCQ commands without verifying that the device actually supports NCQ. Fix this by adding the appropriate check to ata_scsi_pass_thru(). Here's reproducer that works in QEMU when /dev/sg0 refers to a disk of the default type ("82371SB PIIX3 IDE"): #include <fcntl.h> #include <unistd.h> int main() { char buf[53] = { 0 }; buf[36] = 0x85; /* ATA_16 */ buf[37] = (12 << 1); /* FPDMA */ buf[38] = 0x1; /* Has data */ buf[51] = 0xC8; /* ATA_CMD_READ */ write(open("/dev/sg0", O_RDWR), buf, sizeof(buf)); } Fixes: ee7fb331c3ac ("libata: add support for NCQ commands for SG interface") Reported-by: syzbot+2f69ca28df61bdfc77cd36af2e789850355a221e@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> # v4.4+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* libata: remove WARN() for DMA or PIO command without dataEric Biggers2018-02-121-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | syzkaller hit a WARN() in ata_qc_issue() when writing to /dev/sg0. This happened because it issued a READ_6 command with no data buffer. Just remove the WARN(), as it doesn't appear indicate a kernel bug. The expected behavior is to fail the command, which the code does. Here's a reproducer that works in QEMU when /dev/sg0 refers to a disk of the default type ("82371SB PIIX3 IDE"): #include <fcntl.h> #include <unistd.h> int main() { char buf[42] = { [36] = 0x8 /* READ_6 */ }; write(open("/dev/sg0", O_RDWR), buf, sizeof(buf)); } Fixes: f92a26365a72 ("libata: change ATA_QCFLAG_DMAMAP semantics") Reported-by: syzbot+f7b556d1766502a69d85071d2ff08bd87be53d0f@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> # v2.6.25+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* libata: fix length validation of ATAPI-relayed SCSI commandsEric Biggers2018-02-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzkaller reported a crash in ata_bmdma_fill_sg() when writing to /dev/sg1. The immediate cause was that the ATA command's scatterlist was not DMA-mapped, which causes 'pi - 1' to underflow, resulting in a write to 'qc->ap->bmdma_prd[0xffffffff]'. Strangely though, the flag ATA_QCFLAG_DMAMAP was set in qc->flags. The root cause is that when __ata_scsi_queuecmd() is preparing to relay a SCSI command to an ATAPI device, it doesn't correctly validate the CDB length before copying it into the 16-byte buffer 'cdb' in 'struct ata_queued_cmd'. Namely, it validates the fixed CDB length expected based on the SCSI opcode but not the actual CDB length, which can be larger due to the use of the SG_NEXT_CMD_LEN ioctl. Since 'flags' is the next member in ata_queued_cmd, a buffer overflow corrupts it. Fix it by requiring that the actual CDB length be <= 16 (ATAPI_CDB_LEN). [Really it seems the length should be required to be <= dev->cdb_len, but the current behavior seems to have been intentionally introduced by commit 607126c2a21c ("libata-scsi: be tolerant of 12-byte ATAPI commands in 16-byte CDBs") to work around a userspace bug in mplayer. Probably the workaround is no longer needed (mplayer was fixed in 2007), but continuing to allow lengths to up 16 appears harmless for now.] Here's a reproducer that works in QEMU when /dev/sg1 refers to the CD-ROM drive that qemu-system-x86_64 creates by default: #include <fcntl.h> #include <sys/ioctl.h> #include <unistd.h> #define SG_NEXT_CMD_LEN 0x2283 int main() { char buf[53] = { [36] = 0x7e, [52] = 0x02 }; int fd = open("/dev/sg1", O_RDWR); ioctl(fd, SG_NEXT_CMD_LEN, &(int){ 17 }); write(fd, buf, sizeof(buf)); } The crash was: BUG: unable to handle kernel paging request at ffff8cb97db37ffc IP: ata_bmdma_fill_sg drivers/ata/libata-sff.c:2623 [inline] IP: ata_bmdma_qc_prep+0xa4/0xc0 drivers/ata/libata-sff.c:2727 PGD fb6c067 P4D fb6c067 PUD 0 Oops: 0002 [#1] SMP CPU: 1 PID: 150 Comm: syz_ata_bmdma_q Not tainted 4.15.0-next-20180202 #99 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 [...] Call Trace: ata_qc_issue+0x100/0x1d0 drivers/ata/libata-core.c:5421 ata_scsi_translate+0xc9/0x1a0 drivers/ata/libata-scsi.c:2024 __ata_scsi_queuecmd drivers/ata/libata-scsi.c:4326 [inline] ata_scsi_queuecmd+0x8c/0x210 drivers/ata/libata-scsi.c:4375 scsi_dispatch_cmd+0xa2/0xe0 drivers/scsi/scsi_lib.c:1727 scsi_request_fn+0x24c/0x530 drivers/scsi/scsi_lib.c:1865 __blk_run_queue_uncond block/blk-core.c:412 [inline] __blk_run_queue+0x3a/0x60 block/blk-core.c:432 blk_execute_rq_nowait+0x93/0xc0 block/blk-exec.c:78 sg_common_write.isra.7+0x272/0x5a0 drivers/scsi/sg.c:806 sg_write+0x1ef/0x340 drivers/scsi/sg.c:677 __vfs_write+0x31/0x160 fs/read_write.c:480 vfs_write+0xa7/0x160 fs/read_write.c:544 SYSC_write fs/read_write.c:589 [inline] SyS_write+0x4d/0xc0 fs/read_write.c:581 do_syscall_64+0x5e/0x110 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x21/0x86 Fixes: 607126c2a21c ("libata-scsi: be tolerant of 12-byte ATAPI commands in 16-byte CDBs") Reported-by: syzbot+1ff6f9fcc3c35f1c72a95e26528c8e7e3276e4da@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> # v2.6.24+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* ata: libahci: fix comment indentationBaruch Siach2018-02-121-1/+1
| | | | | | | | Indent the numbered item with one space like all other items in the same list. Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Tejun Heo <tj@kernel.org>
* ahci: Add check for device presence (PCIe hot unplug) in ahci_stop_engine()Stefan Roese2018-02-121-0/+10
| | | | | | | | | | Exit directly with ENODEV, if the AHCI controller is not available anymore. Otherwise a delay of 500ms for each port is added to the remove function while trying to issue a command on the non-existent controller. Signed-off-by: Stefan Roese <sr@denx.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* libata: Fix compile warning with ATA_DEBUG enabledDong Bo2018-02-121-1/+1
| | | | | | | | | | | | | | | This fixs the following comile warnings with ATA_DEBUG enabled, which detected by Linaro GCC 5.2-2015.11: drivers/ata/libata-scsi.c: In function 'ata_scsi_dump_cdb': ./include/linux/kern_levels.h:5:18: warning: format '%d' expects argument of type 'int', but argument 6 has type 'u64 {aka long long unsigned int}' [-Wformat=] tj: Patch hand-applied and description trimmed. Signed-off-by: Dong Bo <dongbo4@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* Linux 4.16-rc1v4.16-rc1Linus Torvalds2018-02-121-2/+2
|
* unify {de,}mangle_poll(), get rid of kernel-side POLL...Al Viro2018-02-118-142/+47
| | | | | | | | | | | | | | | | | | | | | | except, again, POLLFREE and POLL_BUSY_LOOP. With this, we finally get to the promised end result: - POLL{IN,OUT,...} are plain integers and *not* in __poll_t, so any stray instances of ->poll() still using those will be caught by sparse. - eventpoll.c and select.c warning-free wrt __poll_t - no more kernel-side definitions of POLL... - userland ones are visible through the entire kernel (and used pretty much only for mangle/demangle) - same behavior as after the first series (i.e. sparc et.al. epoll(2) working correctly). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: do bulk POLL* -> EPOLL* replacementLinus Torvalds2018-02-11297-913/+913
| | | | | | | | | | | | | | | | | | | | | | | This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'work.poll2' of ↵Linus Torvalds2018-02-118-40/+47
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more poll annotation updates from Al Viro: "This is preparation to solving the problems you've mentioned in the original poll series. After this series, the kernel is ready for running for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done as a for bulk search-and-replace. After that, the kernel is ready to apply the patch to unify {de,}mangle_poll(), and then get rid of kernel-side POLL... uses entirely, and we should be all done with that stuff. Basically, that's what you suggested wrt KPOLL..., except that we can use EPOLL... instead - they already are arch-independent (and equal to what is currently kernel-side POLL...). After the preparations (in this series) switch to returning EPOLL... from ->poll() instances is completely mechanical and kernel-side POLL... can go away. The last step (killing kernel-side POLL... and unifying {de,}mangle_poll() has to be done after the search-and-replace job, since we need userland-side POLL... for unified {de,}mangle_poll(), thus the cherry-pick at the last step. After that we will have: - POLL{IN,OUT,...} *not* in __poll_t, so any stray instances of ->poll() still using those will be caught by sparse. - eventpoll.c and select.c warning-free wrt __poll_t - no more kernel-side definitions of POLL... - userland ones are visible through the entire kernel (and used pretty much only for mangle/demangle) - same behavior as after the first series (i.e. sparc et.al. epoll(2) working correctly)" * 'work.poll2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: annotate ep_scan_ready_list() ep_send_events_proc(): return result via esed->res preparation to switching ->poll() to returning EPOLL... add EPOLLNVAL, annotate EPOLL... and event_poll->event use linux/poll.h instead of asm/poll.h xen: fix poll misannotation smc: missing poll annotations
| * annotate ep_scan_ready_list()Al Viro2018-02-011-11/+13
| | | | | | | | | | | | make it always return __poll_t and have its callbacks do the same Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ep_send_events_proc(): return result via esed->resAl Viro2018-02-011-7/+10
| | | | | | | | | | | | preparations for not mixing __poll_t and int in ep_scan_ready_list() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * preparation to switching ->poll() to returning EPOLL...Al Viro2018-02-011-1/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * add EPOLLNVAL, annotate EPOLL... and event_poll->eventAl Viro2018-02-011-16/+17
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * use linux/poll.h instead of asm/poll.hAl Viro2018-02-013-3/+3
| | | | | | | | | | | | | | | | | | The only place that has any business including asm/poll.h is linux/poll.h. Fortunately, asm/poll.h had only been included in 3 places beyond that one, and all of them are trivial to switch to using linux/poll.h. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * xen: fix poll misannotationAl Viro2018-02-011-1/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * smc: missing poll annotationsAl Viro2018-02-011-1/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'xtensa-20180211' of git://github.com/jcmvbkbc/linux-xtensaLinus Torvalds2018-02-111-0/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | Pull xtense fix from Max Filippov: "Build fix for xtensa architecture with KASAN enabled" * tag 'xtensa-20180211' of git://github.com/jcmvbkbc/linux-xtensa: xtensa: fix build with KASAN
| * | xtensa: fix build with KASANMax Filippov2018-02-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 917538e212a2 ("kasan: clean up KASAN_SHADOW_SCALE_SHIFT usage") removed KASAN_SHADOW_SCALE_SHIFT definition from include/linux/kasan.h and added it to architecture-specific headers, except for xtensa. This broke the xtensa build with KASAN enabled. Define KASAN_SHADOW_SCALE_SHIFT in arch/xtensa/include/asm/kasan.h Reported by: kbuild test robot <fengguang.wu@intel.com> Fixes: 917538e212a2 ("kasan: clean up KASAN_SHADOW_SCALE_SHIFT usage") Acked-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
* | | Merge tag 'nios2-v4.16-rc1' of ↵Linus Torvalds2018-02-113-10/+8
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 Pull nios2 update from Ley Foon Tan: - clean up old Kconfig options from defconfig - remove leading 0x and 0s from bindings notation in dts files * tag 'nios2-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2: nios2: defconfig: Cleanup from old Kconfig options nios2: dts: Remove leading 0x and 0s from bindings notation
| * | nios2: defconfig: Cleanup from old Kconfig optionsKrzysztof Kozlowski2018-02-112-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | Remove old, dead Kconfig option INET_LRO. It is gone since commit 7bbf3cae65b6 ("ipv4: Remove inet_lro library"). Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Ley Foon Tan <ley.foon.tan@intel.com>
| * | nios2: dts: Remove leading 0x and 0s from bindings notationMathieu Malaterre2018-02-111-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve the DTS files by removing all the leading "0x" and zeros to fix the following dtc warnings: Warning (unit_address_format): Node /XXX unit name should not have leading "0x" and Warning (unit_address_format): Node /XXX unit name should not have leading 0s Converted using the following command: find . -type f \( -iname *.dts -o -iname *.dtsi \) -exec sed -E -i -e "s/@0x([0-9a-fA-F\.]+)\s?\{/@\L\1 \{/g" -e "s/@0+([0-9a-fA-F\.]+)\s?\{/@\L\1 \{/g" {} + For simplicity, two sed expressions were used to solve each warnings separately. To make the regex expression more robust a few other issues were resolved, namely setting unit-address to lower case, and adding a whitespace before the the opening curly brace: https://elinux.org/Device_Tree_Linux#Linux_conventions This is a follow up to commit 4c9847b7375a ("dt-bindings: Remove leading 0x from bindings notation") Reported-by: David Daney <ddaney@caviumnetworks.com> Suggested-by: Rob Herring <robh@kernel.org> Signed-off-by: Mathieu Malaterre <malat@debian.org> Acked-by: Ley Foon Tan <ley.foon.tan@intel.com>
* | | Merge tag 'pci-v4.16-fixes-1' of ↵Linus Torvalds2018-02-101-2/+3
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fix from Bjorn Helgaas: "Fix a POWER9/powernv INTx regression from the merge window (Alexey Kardashevskiy)" * tag 'pci-v4.16-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: powerpc/pci: Fix broken INTx configuration via OF
| * | | powerpc/pci: Fix broken INTx configuration via OFAlexey Kardashevskiy2018-02-101-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 59f47eff03a0 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper") replaced of_irq_parse_pci() + irq_create_of_mapping() with of_irq_parse_and_map_pci(), but neglected to capture the virq returned by irq_create_of_mapping(), so virq remained zero, which caused INTx configuration to fail. Save the virq value returned by of_irq_parse_and_map_pci() and correct the virq declaration to match the of_irq_parse_and_map_pci() signature. Fixes: 59f47eff03a0 "powerpc/pci: Use of_irq_parse_and_map_pci() helper" Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
* | | | Merge tag 'for-linus-20180210' of git://git.kernel.dk/linux-blockLinus Torvalds2018-02-1012-63/+212
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A few fixes to round off the merge window on the block side: - a set of bcache fixes by way of Michael Lyle, from the usual bcache suspects. - add a simple-to-hook-into function for bpf EIO error injection. - fix blk-wbt that mischarectized flushes as reads. Improve the logic so that flushes and writes are accounted as writes, and only reads as reads. From me. - fix requeue crash in BFQ, from Paolo" * tag 'for-linus-20180210' of git://git.kernel.dk/linux-block: block, bfq: add requeue-request hook bcache: fix for data collapse after re-attaching an attached device bcache: return attach error when no cache set exist bcache: set writeback_rate_update_seconds in range [1, 60] seconds bcache: fix for allocator and register thread race bcache: set error_limit correctly bcache: properly set task state in bch_writeback_thread() bcache: fix high CPU occupancy during journal bcache: add journal statistic block: Add should_fail_bio() for bpf error injection blk-wbt: account flush requests correctly
| * \ \ \ Merge branch 'for-linus' into testJens Axboe2018-02-0712-63/+212
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * for-linus: block, bfq: add requeue-request hook bcache: fix for data collapse after re-attaching an attached device bcache: return attach error when no cache set exist bcache: set writeback_rate_update_seconds in range [1, 60] seconds bcache: fix for allocator and register thread race bcache: set error_limit correctly bcache: properly set task state in bch_writeback_thread() bcache: fix high CPU occupancy during journal bcache: add journal statistic block: Add should_fail_bio() for bpf error injection blk-wbt: account flush requests correctly
| | * | | | block, bfq: add requeue-request hookPaolo Valente2018-02-071-25/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device be re-inserted into the active I/O scheduler for that device. As a consequence, I/O schedulers may get the same request inserted again, even several times, without a finish_request invoked on that request before each re-insertion. This fact is the cause of the failure reported in [1]. For an I/O scheduler, every re-insertion of the same re-prepared request is equivalent to the insertion of a new request. For schedulers like mq-deadline or kyber, this fact causes no harm. In contrast, it confuses a stateful scheduler like BFQ, which keeps state for an I/O request, until the finish_request hook is invoked on the request. In particular, BFQ may get stuck, waiting forever for the number of request dispatches, of the same request, to be balanced by an equal number of request completions (while there will be one completion for that request). In this state, BFQ may refuse to serve I/O requests from other bfq_queues. The hang reported in [1] then follows. However, the above re-prepared requests undergo a requeue, thus the requeue_request hook of the active elevator is invoked for these requests, if set. This commit then addresses the above issue by properly implementing the hook requeue_request in BFQ. [1] https://marc.info/?l=linux-block&m=151211117608676 Reported-by: Ivan Kozik <ivan@ludios.org> Reported-by: Alban Browaeys <alban.browaeys@gmail.com> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | bcache: fix for data collapse after re-attaching an attached deviceTang Junhui2018-02-073-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | back-end device sdm has already attached a cache_set with ID f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with another cache set, and it returns with an error: [root]# cd /sys/block/sdm/bcache [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach -bash: echo: write error: Invalid argument After that, execute a command to modify the label of bcache device: [root]# echo data_disk1 > label Then we reboot the system, when the system power on, the back-end device can not attach to cache_set, a messages show in the log: Feb 5 12:05:52 ceph152 kernel: [922385.508498] bcache: bch_cached_dev_attach() couldn't find uuid for sdm in set In sysfs_attach(), dc->sb.set_uuid was assigned to the value which input through sysfs, no matter whether it is success or not in bch_cached_dev_attach(). For example, If the back-end device has already attached to an cache set, bch_cached_dev_attach() would fail, but dc->sb.set_uuid was changed. Then modify the label of bcache device, it will call bch_write_bdev_super(), which would write the dc->sb.set_uuid to the super block, so we record a wrong cache set ID in the super block, after the system reboot, the cache set couldn't find the uuid of the back-end device, so the bcache device couldn't exist and use any more. In this patch, we don't assigned cache set ID to dc->sb.set_uuid in sysfs_attach() directly, but input it into bch_cached_dev_attach(), and assigned dc->sb.set_uuid to the cache set ID after the back-end device attached to the cache set successful. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | bcache: return attach error when no cache set existTang Junhui2018-02-071-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I attach a back-end device to a cache set, and the cache set is not registered yet, this back-end device did not attach successfully, and no error returned: [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach [root]# In sysfs_attach(), the return value "v" is initialized to "size" in the beginning, and if no cache set exist in bch_cache_sets, the "v" value would not change any more, and return to sysfs, sysfs regard it as success since the "size" is a positive number. This patch fixes this issue by assigning "v" with "-ENOENT" in the initialization. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | bcache: set writeback_rate_update_seconds in range [1, 60] secondsColy Li2018-02-073-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dc->writeback_rate_update_seconds can be set via sysfs and its value can be set to [1, ULONG_MAX]. It does not make sense to set such a large value, 60 seconds is long enough value considering the default 5 seconds works well for long time. Because dc->writeback_rate_update is a special delayed work, it re-arms itself inside the delayed work routine update_writeback_rate(). When stopping it by cancel_delayed_work_sync(), there should be a timeout to wait and make sure the re-armed delayed work is stopped too. A small max value of dc->writeback_rate_update_seconds is also helpful to decide a reasonable small timeout. This patch limits sysfs interface to set dc->writeback_rate_update_seconds in range of [1, 60] seconds, and replaces the hand-coded number by macros. Changelog: v2: fix a rebase typo in v4, which is pointed out by Michael Lyle. v1: initial version. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | bcache: fix for allocator and register thread raceTang Junhui2018-02-072-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After long time running of random small IO writing, I reboot the machine, and after the machine power on, I found bcache got stuck, the stack is: [root@ceph153 ~]# cat /proc/2510/task/*/stack [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache] [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache] [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache] [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache] [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache] [<ffffffff810a631f>] kthread+0xcf/0xe0 [<ffffffff8164c318>] ret_from_fork+0x58/0x90 [<ffffffffffffffff>] 0xffffffffffffffff [root@ceph153 ~]# cat /proc/2038/task/*/stack [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache] [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache] [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache] [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache] [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache] [<ffffffff812f702f>] kobj_attr_store+0xf/0x20 [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140 [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0 [<ffffffff811e069f>] SyS_write+0x7f/0xe0 [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1 The stack shows the register thread and allocator thread were getting stuck when registering cache device. I reboot the machine several times, the issue always exsit in this machine. I debug the code, and found the call trace as bellow: register_bcache() ==>run_cache_set() ==>bch_journal_replay() ==>bch_btree_insert() ==>__bch_btree_map_nodes() ==>btree_insert_fn() ==>btree_split() //node need split ==>btree_check_reserve() In btree_check_reserve(), It will check if there is enough buckets of RESERVE_BTREE type, since allocator thread did not work yet, so no buckets of RESERVE_BTREE type allocated, so the register thread waits on c->btree_cache_wait, and goes to sleep. Then the allocator thread initialized, the call trace is bellow: bch_allocator_thread() ==>bch_prio_write() ==>bch_journal_meta() ==>bch_journal() ==>journal_wait_for_write() In journal_wait_for_write(), It will check if journal is full by journal_full(), but the long time random small IO writing causes the exhaustion of journal buckets(journal.blocks_free=0), In order to release the journal buckets, the allocator calls btree_flush_write() to flush keys to btree nodes, and waits on c->journal.wait until btree nodes writing over or there has already some journal buckets space, then the allocator thread goes to sleep. but in btree_flush_write(), since bch_journal_replay() is not finished, so no btree nodes have journal (condition "if (btree_current_write(b)->journal)" never satisfied), so we got no btree node to flush, no journal bucket released, and allocator sleep all the times. Through the above analysis, we can see that: 1) Register thread wait for allocator thread to allocate buckets of RESERVE_BTREE type; 2) Alloctor thread wait for register thread to replay journal, so it can flush btree nodes and get journal bucket. then they are all got stuck by waiting for each other. Hua Rui provided a patch for me, by allocating some buckets of RESERVE_BTREE type in advance, so the register thread can get bucket when btree node splitting and no need to waiting for the allocator thread. I tested it, it has effect, and register thread run a step forward, but finally are still got stuck, the reason is only 8 bucket of RESERVE_BTREE type were allocated, and in bch_journal_replay(), after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left, then btree_check_reserve() is not satisfied anymore, so it goes to sleep again, and in the same time, alloctor thread did not flush enough btree nodes to release a journal bucket, so they all got stuck again. So we need to allocate more buckets of RESERVE_BTREE type in advance, but how much is enough? By experience and test, I think it should be as much as journal buckets. Then I modify the code as this patch, and test in the machine, and it works. This patch modified base on Hua Rui’s patch, and allocate more buckets of RESERVE_BTREE type in advance to avoid register thread and allocate thread going to wait for each other. [patch v2] ca->sb.njournal_buckets would be 0 in the first time after cache creation, and no journal exists, so just 8 btree buckets is OK. Signed-off-by: Hua Rui <huarui.dev@gmail.com> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | bcache: set error_limit correctlyColy Li2018-02-073-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Struct cache uses io_errors for two purposes, - Error decay: when cache set error_decay is set, io_errors is used to generate a small piece of delay when I/O error happens. - I/O errors counter: in order to generate big enough value for error decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a IO_ERROR_SHIFT). In function bch_count_io_errors(), if I/O errors counter reaches cache set error limit, bch_cache_set_error() will be called to retire the whold cache set. But current code is problematic when checking the error limit, see the following code piece from bch_count_io_errors(), 90 if (error) { 91 char buf[BDEVNAME_SIZE]; 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT, 93 &ca->io_errors); 94 errors >>= IO_ERROR_SHIFT; 95 96 if (errors < ca->set->error_limit) 97 pr_err("%s: IO error on %s, recovering", 98 bdevname(ca->bdev, buf), m); 99 else 100 bch_cache_set_error(ca->set, 101 "%s: too many IO errors %s", 102 bdevname(ca->bdev, buf), m); 103 } At line 94, errors is right shifting IO_ERROR_SHIFT bits, now it is real errors counter to compare at line 96. But ca->set->error_limit is initia- lized with an amplified value in bch_cache_set_alloc(), 1545 c->error_limit = 8 << IO_ERROR_SHIFT; It means by default, in bch_count_io_errors(), before 8<<20 errors happened bch_cache_set_error() won't be called to retire the problematic cache device. If the average request size is 64KB, it means bcache won't handle failed device until 512GB data is requested. This is too large to be an I/O threashold. So I believe the correct error limit should be much less. This patch sets default cache set error limit to 8, then in bch_count_io_errors() when errors counter reaches 8 (if it is default value), function bch_cache_set_error() will be called to retire the whole cache set. This patch also removes bits shifting when store or show io_error_limit value via sysfs interface. Nowadays most of SSDs handle internal flash failure automatically by LBA address re-indirect mapping. If an I/O error can be observed by upper layer code, it will be a notable error because that SSD can not re-indirect map the problematic LBA address to an available flash block. This situation indicates the whole SSD will be failed very soon. Therefore setting 8 as the default io error limit value makes sense, it is enough for most of cache devices. Changelog: v2: add reviewed-by from Hannes. v1: initial version for review. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | bcache: properly set task state in bch_writeback_thread()Coly Li2018-02-072-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel thread routine bch_writeback_thread() has the following code block, 447 down_write(&dc->writeback_lock); 448~450 if (check conditions) { 451 up_write(&dc->writeback_lock); 452 set_current_state(TASK_INTERRUPTIBLE); 453 454 if (kthread_should_stop()) 455 return 0; 456 457 schedule(); 458 continue; 459 } If condition check is true, its task state is set to TASK_INTERRUPTIBLE and call schedule() to wait for others to wake up it. There are 2 issues in current code, 1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if another process changes the condition and call wake_up_process(dc-> writeback_thread), then at line 452 task state is set back to TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be waken up. 2, At line 454 if kthread_should_stop() is true, writeback kernel thread will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and call do_exit(). It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a warning message is reported by __might_sleep(): "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]". For the first issue, task state should be set before condition checks. Ineed because dc->writeback_lock is required when modifying all the conditions, calling set_current_state() inside code block where dc-> writeback_lock is hold is safe. But this is quite implicit, so I still move set_current_state() before all the condition checks. For the second issue, frankley speaking it does not hurt when kernel thread exits with TASK_INTERRUPTIBLE state, but this warning message scares users, makes them feel there might be something risky with bcache and hurt their data. Setting task state to TASK_RUNNING before returning fixes this problem. In alloc.c:allocator_wait(), there is also a similar issue, and is also fixed in this patch. Changelog: v3: merge two similar fixes into one patch v2: fix the race issue in v1 patch. v1: initial buggy fix. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | bcache: fix high CPU occupancy during journalTang Junhui2018-02-073-15/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After long time small writing I/O running, we found the occupancy of CPU is very high and I/O performance has been reduced by about half: [root@ceph151 internal]# top top - 15:51:05 up 1 day,2:43, 4 users, load average: 16.89, 15.15, 16.53 Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache KiB Swap: 65667068 total, 65667068 free, 0 used. 25136812 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2023 root 20 0 0 0 0 S 55.1 0.0 0:04.42 kworker/11:191 14126 root 20 0 0 0 0 S 42.9 0.0 0:08.72 kworker/10:3 9292 root 20 0 0 0 0 S 30.4 0.0 1:10.99 kworker/6:1 8553 ceph 20 0 4242492 1.805g 18804 S 30.0 2.9 410:07.04 ceph-osd 12287 root 20 0 0 0 0 S 26.7 0.0 0:28.13 kworker/7:85 31019 root 20 0 0 0 0 S 26.1 0.0 1:30.79 kworker/22:1 1787 root 20 0 0 0 0 R 25.7 0.0 5:18.45 kworker/8:7 32169 root 20 0 0 0 0 S 14.5 0.0 1:01.92 kworker/23:1 21476 root 20 0 0 0 0 S 13.9 0.0 0:05.09 kworker/1:54 2204 root 20 0 0 0 0 S 12.5 0.0 1:25.17 kworker/9:10 16994 root 20 0 0 0 0 S 12.2 0.0 0:06.27 kworker/5:106 15714 root 20 0 0 0 0 R 10.9 0.0 0:01.85 kworker/19:2 9661 ceph 20 0 4246876 1.731g 18800 S 10.6 2.8 403:00.80 ceph-osd 11460 ceph 20 0 4164692 2.206g 18876 S 10.6 3.5 360:27.19 ceph-osd 9960 root 20 0 0 0 0 S 10.2 0.0 0:02.75 kworker/2:139 11699 ceph 20 0 4169244 1.920g 18920 S 10.2 3.1 355:23.67 ceph-osd 6843 ceph 20 0 4197632 1.810g 18900 S 9.6 2.9 380:08.30 ceph-osd The kernel work consumed a lot of CPU, and I found they are running journal work, The journal is reclaiming source and flush btree node with surprising frequency. Through further analysis, we found that in btree_flush_write(), we try to get a btree node with the smallest fifo idex to flush by traverse all the btree nodein c->bucket_hash, after we getting it, since no locker protects it, this btree node may have been written to cache device by other works, and if this occurred, we retry to traverse in c->bucket_hash and get another btree node. When the problem occurrd, the retry times is very high, and we consume a lot of CPU in looking for a appropriate btree node. In this patch, we try to record 128 btree nodes with the smallest fifo idex in heap, and pop one by one when we need to flush btree node. It greatly reduces the time for the loop to find the appropriate BTREE node, and also reduce the occupancy of CPU. [note by mpl: this triggers a checkpatch error because of adjacent, pre-existing style violations] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | bcache: add journal statisticTang Junhui2018-02-073-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sometimes, Journal takes up a lot of CPU, we need statistics to know what's the journal is doing. So this patch provide some journal statistics: 1) reclaim: how many times the journal try to reclaim resource, usually the journal bucket or/and the pin are exhausted. 2) flush_write: how many times the journal try to flush btree node to cache device, usually the journal bucket are exhausted. 3) retry_flush_write: how many times the journal retry to flush the next btree node, usually the previous tree node have been flushed by other thread. we show these statistic by sysfs interface. Through these statistics We can totally see the status of journal module when the CPU is too high. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | | block: Add should_fail_bio() for bpf error injectionHoward McLauchlan2018-02-061-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The classic error injection mechanism, should_fail_request() does not support use cases where more information is required (from the entire struct bio, for example). To that end, this patch introduces should_fail_bio(), which calls should_fail_request() under the hood but provides a convenient place for kprobes to hook into if they require the entire struct bio. This patch also replaces some existing calls to should_fail_request() with should_fail_bio() with no degradation in performance. Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>