* null_blk: fix wrong capacity when bs is not 512 bytes  (Matias Bjørling, 2015-09-03; 1 file, -2/+1)

      set_capacity() sets a device's capacity in 512-byte sectors. null_blk
      computed the number of sectors as size / bs and passed that straight
      to set_capacity(), so it exposed the wrong number of sectors whenever
      bs was not 512 bytes.

      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

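      A minimal sketch of the corrected conversion, assuming `size` holds
      the device size in bytes (the variable names are illustrative, not
      the driver's actual ones):

          /* set_capacity() always takes 512-byte sectors, independent of
           * the logical block size (bs) the device advertises. */
          sector_t sectors = size >> 9;       /* bytes -> 512-byte sectors */
          set_capacity(disk, sectors);

          /* The buggy version divided by the logical block size instead:
           *     set_capacity(disk, size / bs);   -- wrong unless bs == 512
           */
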
* null_blk: fix memory leak on cleanup  (Matias Bjørling, 2015-09-03; 1 file, -16/+17)

      The driver was not freeing the memory allocated for the internal
      nullb queues. This patch frees that memory during driver unload.

      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Signed-off-by: Jens Axboe <axboe@fb.com>

* block: fix bogus compiler warnings in blk-merge.c  (Jens Axboe, 2015-09-03; 1 file, -8/+7)

      The compiler can't figure out that bvprv is initialized whenever
      'prev' is set to 1. Use a pointer to bvprv instead, initially NULL,
      and get rid of the 'prev' tracking altogether. This dumbs the flow
      down enough that gcc is happy.

      Signed-off-by: Jens Axboe <axboe@fb.com>

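      A hedged sketch of the pattern in generic C, not the literal
      blk-merge.c diff (segments_mergeable(), vecs and nr are illustrative
      names):

          /* Track "previous element" with a pointer that starts out NULL
           * instead of a separate 'prev' flag; the compiler can then see
           * that the storage is only read after it has been written. */
          struct bio_vec bvprv_storage, *bvprv = NULL;
          int i;

          for (i = 0; i < nr; i++) {
                  if (bvprv && segments_mergeable(bvprv, &vecs[i]))
                          continue;           /* merged into previous segment */

                  bvprv_storage = vecs[i];
                  bvprv = &bvprv_storage;     /* replaces the old prev = 1 */
          }
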
* Merge branch 'for-4.3/sg' of git://git.kernel.dk/linux-block  (Linus Torvalds, 2015-09-02; 21 files, -81/+234)

      Pull SG updates from Jens Axboe:
       "This contains a set of scatter-gather related changes/fixes for 4.3:

        - Add support for limited chaining of sg tables even for
          architectures that do not set ARCH_HAS_SG_CHAIN. From Christoph.

        - Add sg chain support to target_rd. From Christoph.

        - Fixup open coded sg->page_link in crypto/omap-sham. From
          Christoph.

        - Fixup open coded crypto ->page_link manipulation. From Dan.

        - Also from Dan, automated fixup of manual sg_unmark_end()
          manipulations.

        - Also from Dan, automated fixup of open coded sg_phys()
          implementations.

        - From Robert Jarzmik, addition of an sg table splitting helper
          that drivers can use"

      * 'for-4.3/sg' of git://git.kernel.dk/linux-block:
        lib: scatterlist: add sg splitting function
        scatterlist: use sg_phys()
        crypto/omap-sham: remove an open coded access to ->page_link
        scatterlist: remove open coded sg_unmark_end instances
        crypto: replace scatterwalk_sg_chain with sg_chain
        target/rd: always chain S/G list
        scatterlist: allow limited chaining without ARCH_HAS_SG_CHAIN

| * lib: scatterlist: add sg splitting function  (Robert Jarzmik, 2015-08-24; 4 files, -0/+215)

      Sometimes a scatter-gather list has to be split into several chunks,
      or sub scatter lists. This happens, for example, if a scatter list
      will be handled by multiple DMA channels, each one filling a part of
      it.

      A concrete example comes with the media V4L2 API, where the scatter
      list is allocated from userspace to hold an image, without knowing
      how many DMAs will fill it:
       - in a simple RGB565 case, one DMA will pump data from the camera
         ISP to memory
       - in the trickier YUV422 case, 3 DMAs will pump data from the camera
         ISP pipes, one for pipe Y, one for pipe U and one for pipe V

      For these cases, it is necessary to split the original scatter list
      into multiple scatter lists, which is the purpose of this patch.

      The guarantees required of this patch are:
       - the intersection of the spans of any two resulting scatter lists
         is empty.
       - the union of the spans of all resulting scatter lists is a
         subrange of the span of the original scatter list.
       - streaming DMA API operations (mapping, unmapping) must not happen
         on both the resulting and the original scatter list. It's either
         the former or the latter.
       - the caller is responsible for calling kfree() on the resulting
         scatter lists.

      Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
      Signed-off-by: Jens Axboe <axboe@fb.com>

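      A hedged usage sketch for the YUV422 case above. The sg_split()
      signature is quoted from memory of lib/sg_split.c and should be
      treated as an assumption, as should the plane sizes and variable
      names:

          size_t split_sizes[3] = { y_bytes, u_bytes, v_bytes }; /* Y, U, V planes */
          struct scatterlist *out_sgs[3];
          int out_nents[3];
          int ret;

          /* Split the userspace image sglist into one sub-list per DMA
           * channel; nents is the entry count of the original list. */
          ret = sg_split(image_sgl, 0 /* not DMA-mapped */, 0 /* skip */,
                         3 /* nb_splits */, split_sizes,
                         out_sgs, out_nents, GFP_KERNEL);
          if (ret)
                  return ret;

          /* ... program one DMA channel per out_sgs[i] ... */

          kfree(out_sgs[0]);   /* the caller owns the resulting lists */
          kfree(out_sgs[1]);
          kfree(out_sgs[2]);
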
| * scatterlist: use sg_phys()  (Dan Williams, 2015-08-17; 5 files, -8/+7)

      Coccinelle cleanup to replace open coded sg to physical address
      translations. This is in preparation for introducing scatterlists
      that reference __pfn_t.

      // sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
      // usage: make coccicheck COCCI=sg_phys.cocci MODE=patch

      virtual patch

      @@
      struct scatterlist *sg;
      @@
      - page_to_phys(sg_page(sg)) + sg->offset
      + sg_phys(sg)

      @@
      struct scatterlist *sg;
      @@
      - page_to_phys(sg_page(sg))
      + sg_phys(sg) & PAGE_MASK

      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * crypto/omap-sham: remove an open coded access to ->page_link  (Christoph Hellwig, 2015-08-17; 1 file, -1/+1)

      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      [hch: split from a larger patch by Dan]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * scatterlist: remove open coded sg_unmark_end instances  (Dan Williams, 2015-08-17; 2 files, -3/+3)

      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      [hch: split from a larger patch by Dan]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * crypto: replace scatterwalk_sg_chain with sg_chain  (Dan Williams, 2015-08-17; 8 files, -21/+12)

      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      [hch: split from a larger patch by Dan]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * target/rd: always chain S/G list  (Christoph Hellwig, 2015-08-17; 1 file, -44/+0)

      The rd sg lists are never passed to hardware, so use S/G chaining
      unconditionally.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * scatterlist: allow limited chaining without ARCH_HAS_SG_CHAIN  (Christoph Hellwig, 2015-08-17; 2 files, -8/+0)

      There are a couple of uses of struct scatterlist that never go
      through the dma_map_sg() helper and thus don't care about
      ARCH_HAS_SG_CHAIN, which indicates that we can map chained S/G lists.
      The most important one is the crypto code, which currently has to
      open code a few helpers to always allow chaining.

      This patch removes a few #ifdef ARCH_HAS_SG_CHAIN statements so that
      we can switch the crypto code to these common helpers.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

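      A hedged sketch of the kind of chaining these helpers allow (the
      table sizes are illustrative):

          struct scatterlist first[8], second[8];

          sg_init_table(first, 8);
          sg_init_table(second, 8);

          /* Link 'first' to 'second'. The chain link consumes the last
           * entry of 'first', so it carries 7 data entries; afterwards
           * sg_next() walks transparently across both tables. */
          sg_chain(first, 8, second);
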
* | Merge branch 'for-4.3/drivers' of git://git.kernel.dk/linux-block  (Linus Torvalds, 2015-09-02; 4 files, -146/+506)

      Pull block driver updates from Jens Axboe:
       "On top of the 4.3 core block IO changes, here are the driver
        related changes for 4.3. Basically just NVMe and nbd this time
        around:

        - NVMe:
           - PRACT PI improvement from Alok Pandey.
           - Cleanups and improvements on submission queue entry writing
             and doorbell ringing, using the CMB if available. From Jon
             Derrick.
           - From Keith, support for setting queue maximum segments, and
             reset support.
           - Also from Jon, fixup of a u64 division issue on 32-bit archs
             and wiring up of the reset support through an ioctl.
           - Two small cleanups from Matias and Sunad

        - Various code cleanups and fixes from Markus Pargmann"

      * 'for-4.3/drivers' of git://git.kernel.dk/linux-block:
        NVMe: Using PRACT bit to generate and verify PI by controller
        NVMe: Remove unreachable code in nvme_abort_req
        NVMe: Add nvme subsystem reset IOCTL
        NVMe: Add nvme subsystem reset support
        NVMe: removed unused nn var from nvme_dev_add
        NVMe: Set queue max segments
        nbd: flags is a u32 variable
        nbd: Rename functions for clearness of recv/send path
        nbd: Change 'disconnect' to be boolean
        nbd: Add debugfs entries
        nbd: Remove variable 'pid'
        nbd: Move clear queue debug message
        nbd: Remove 'harderror' and propagate error properly
        nbd: restructure sock_shutdown
        nbd: sock_shutdown, remove conditional lock
        nbd: Fix timeout detection
        nvme: Fixes u64 division which breaks i386 builds
        NVMe: Use CMB for the IO SQes if available
        NVMe: Unify SQ entry writing and doorbell ringing

| * | NVMe: Using PRACT bit to generate and verify PI by controller  (Alok Pandey, 2015-08-26; 1 file, -5/+8)

      This patch enables PRCHK and reftag support when the PRACT bit is set
      and block layer integrity is disabled.

      Signed-off-by: Alok Pandey <pandey.alok@samsung.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | NVMe: Remove unreachable code in nvme_abort_req  (Sunad Bhandary, 2015-08-19; 1 file, -15/+9)

      Remove unreachable code from nvme_abort_req, as nvme_submit_cmd has
      no failure status to return.

      Signed-off-by: Sunad Bhandary <sunad.s@samsung.com>
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | NVMe: Add nvme subsystem reset IOCTL  (Jon Derrick, 2015-08-18; 3 files, -1/+13)

      Controllers can perform optional subsystem resets as introduced in
      NVMe 1.1. This patch adds an IOCTL to trigger the subsystem reset by
      writing "NVMe" to the NSSR register.

      Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

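      A hedged illustration of the trigger described above (register value
      per the NVMe 1.1 spec; the bar pointer layout is illustrative):

          /* NSSR takes the ASCII string "NVMe" as its reset magic:
           * 'N'=0x4E 'V'=0x56 'M'=0x4D 'e'=0x65 -> 0x4E564D65 */
          writel(0x4E564D65, &dev->bar->nssr);
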
| * | NVMe: Add nvme subsystem reset support  (Keith Busch, 2015-08-18; 2 files, -1/+13)

      Controllers that are part of an NVMe subsystem may be reset by any
      other controller in the subsystem. If the device is capable of
      subsystem resets, this patch adds detection for such events and
      performs the appropriate controller initialization when a subsystem
      reset is detected.

      The register bit is a RW1C type, so the driver needs to write a 1 to
      the status bit to clear the "subsystem reset occurred" bit during
      initialization.

      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | NVMe: removed unused nn var from nvme_dev_add  (Matias Bjørling, 2015-08-18; 1 file, -2/+0)

      The logic in nvme_dev_add to enumerate namespaces was moved to
      nvme_dev_scan. After the move, the nn variable is no longer used.
      This patch removes it.

      Fixes: a5768aa ("NVMe: Automatic namespace rescan")
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | NVMe: Set queue max segments  (Keith Busch, 2015-08-17; 1 file, -1/+4)

      This sets the queue's max segment size to match the device's
      capabilities. The default of 128 is usable until a device's transfer
      capability exceeds 512k, assuming a device page size of 4k. Many nvme
      devices exceed that transfer limit, so this lets the block layer know
      what kinds of commands it may form rather than unnecessarily
      splitting them.

      One additional segment is added to account for a transfer that may
      start in the middle of a page.

      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

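      A hedged sketch of the arithmetic (the field and variable names are
      illustrative; the real limit derives from the device's reported
      maximum transfer size):

          /* Maximum transfer in bytes divided into page-sized segments,
           * plus one segment for a transfer that starts mid-page.
           * 128 segments * 4k pages = 512k, hence the old default's limit. */
          u32 max_bytes = dev->max_hw_sectors << 9;
          u16 max_segs  = max_bytes / dev->page_size + 1;

          blk_queue_max_segments(ns->queue, max_segs);
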
| * | nbd: flags is a u32 variable  (Markus Pargmann, 2015-08-17; 1 file, -1/+1)

      The flags variable is used as a u32. This patch changes its declared
      type to match.

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: Rename functions for clearness of recv/send path  (Markus Pargmann, 2015-08-17; 1 file, -6/+6)

      This patch renames functions so that it is clear what each one does;
      it is otherwise not directly understandable what, for example,
      'do_it' means.

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: Change 'disconnect' to be boolean  (Markus Pargmann, 2015-08-17; 1 file, -4/+4)

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: Add debugfs entries  (Markus Pargmann, 2015-08-17; 1 file, -1/+177)

      Add some debugfs files that help to understand the internal state of
      NBD. This exports the different sizes, flags, tasks and so on.

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: Remove variable 'pid'  (Markus Pargmann, 2015-08-17; 1 file, -10/+9)

      This patch uses nbd->task_recv to determine the value for sysfs that
      was previously held in the variable 'pid'.

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: Move clear queue debug message  (Markus Pargmann, 2015-08-17; 1 file, -1/+1)

      This message was a warning without a clear reason. This patch moves
      it into nbd_clear_que and turns it into a debug message.

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: Remove 'harderror' and propagate error properly  (Markus Pargmann, 2015-08-17; 1 file, -14/+14)

      Instead of a 'harderror' variable, we can simply propagate errors to
      userspace correctly. This patch removes the harderror variable and
      passes errors back to userspace through error pointers and
      nbd_do_it.

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: restructure sock_shutdown  (Markus Pargmann, 2015-08-17; 1 file, -6/+7)

      This patch restructures sock_shutdown to avoid having the main code
      path in an if block.

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: sock_shutdown, remove conditional lock  (Markus Pargmann, 2015-08-17; 1 file, -8/+8)

      Move the conditional lock from sock_shutdown into the surrounding
      code.

      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | nbd: Fix timeout detection  (Markus Pargmann, 2015-08-17; 1 file, -28/+70)

      At the moment the nbd timeout only detects hanging tcp operations.
      This is not enough to detect a hanging or bad connection, as expected
      of a timeout. This patch redesigns the timeout detection to cover
      more cases.

      The timeout is now defined relative to replies from the server. If
      the server does not send replies within the timeout, the connection
      is shut down.

      The patch adds a continuous timer 'timeout_timer' that is set up in
      one of two cases:
       - The request list is empty and we are sending the first request
         out to the server. We want a reply within the given timeout,
         otherwise we consider the connection to be dead.
       - A server response was received. This means the server is still
         communicating with us. The timer is reset to the timeout value.

      The timer is not stopped if the list becomes empty. It will just
      trigger a timeout which directly leaves the handling routine again,
      as the request list is empty.

      The whole patch does not use any additional explicit locking. The
      list_empty() calls are safe to use concurrently. The timer is locked
      internally as we just use mod_timer() and del_timer_sync().

      The patch is based on an idea of Michal Belczyk, with a previous,
      different implementation.

      Cc: Michal Belczyk <belczyk@bsd.krakow.pl>
      Cc: Hermann Lauer <Hermann.Lauer@iwr.uni-heidelberg.de>
      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Tested-by: Hermann Lauer <Hermann.Lauer@iwr.uni-heidelberg.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

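      A hedged sketch of the timer handling described above; the field
      names follow the commit text and are illustrative rather than the
      driver's exact layout:

          /* Called when the first request goes out on an empty list, and
           * again on every reply received from the server: (re)arm the
           * reply timeout. */
          static void nbd_arm_timeout(struct nbd_device *nbd)
          {
                  mod_timer(&nbd->timeout_timer, jiffies + nbd->xmit_timeout);
          }

          /* Timer callback: no reply arrived within the window. An empty
           * request list simply falls through, so no extra locking is
           * needed. */
          static void nbd_xmit_timeout(unsigned long arg)
          {
                  struct nbd_device *nbd = (struct nbd_device *)arg;

                  if (list_empty(&nbd->queue_head))
                          return;         /* nothing outstanding, ignore */

                  nbd->disconnect = true; /* connection considered dead */
                  /* ... shut down the socket ... */
          }

          /* Teardown stops the timer synchronously:
           *     del_timer_sync(&nbd->timeout_timer);
           */
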
| * | nvme: Fixes u64 division which breaks i386 builds  (Jon Derrick, 2015-07-21; 1 file, -2/+3)

      Use div_u64 for u64 division, and round_down, a bitwise operation,
      instead of rounddown, which uses a modulus.

      Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | NVMe: Use CMB for the IO SQes if available  (Jon Derrick, 2015-07-21; 2 files, -5/+134)

      Some controllers have a controller-side memory buffer (CMB) available
      for use for submissions, completions, lists, or data. If a CMB is
      available, the driver ioremaps the entire CMB and attempts to map the
      IO SQes onto it, shrinking the queues as needed. The CMB will not be
      used if the queue depth would be shrunk below a threshold where it
      may perform worse than a larger queue in system memory.

      Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | NVMe: Unify SQ entry writing and doorbell ringing  (Jon Derrick, 2015-07-21; 1 file, -45/+35)

      This patch changes sq_cmd writers to instead create their command on
      the stack. __nvme_submit_cmd copies the sq entry to the queue and
      writes the doorbell.

      Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

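      A hedged sketch of the unified submission helper; close to, but not
      guaranteed to be, the exact code of the patch:

          static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
                                        struct nvme_command *cmd)
          {
                  u16 tail = nvmeq->sq_tail;

                  /* Copy the command built on the caller's stack into the
                   * submission queue slot (which may live in the CMB). */
                  memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

                  if (++tail == nvmeq->q_depth)
                          tail = 0;
                  writel(tail, nvmeq->q_db);   /* ring the SQ doorbell */
                  nvmeq->sq_tail = tail;
          }
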
* | | Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block  (Linus Torvalds, 2015-09-02; 132 files, -2130/+1199)

      Pull core block updates from Jens Axboe:
       "This first core part of the block IO changes contains:

        - Cleanup of the bio IO error signaling from Christoph. We used to
          rely on the uptodate bit and passing around of an error, now we
          store the error in the bio itself.

        - Improvement of the above from myself, by shrinking the bio size
          down again to fit in two cachelines on x86-64.

        - Revert of the max_hw_sectors cap removal from a revision again,
          from Jeff Moyer. This caused performance regressions in various
          tests. Reinstate the limit, bump it to a more reasonable size
          instead.

        - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
          Most devices have huge trim limits, which can cause nasty
          latencies when deleting files. Enable the admin to configure the
          size down. We will look into having a more sane default instead
          of UINT_MAX sectors.

        - Improvement of the SG gaps logic from Keith Busch.

        - Enable the block core to handle arbitrarily sized bios, which
          enables a nice simplification of bio_add_page() (which is an IO
          hot path). From Kent.

        - Improvements to the partition io stats accounting, making it
          faster. From Ming Lei.

        - Also from Ming Lei, a basic fixup for overflow of the sysfs
          pending file in blk-mq, as well as a fix for a blk-mq timeout
          race condition.

        - Ming Lin has been carrying Kent's above mentioned patches forward
          for a while, and testing them. Ming also did a few fixes around
          that.

        - Sasha Levin found and fixed a use-after-free problem introduced
          by the bio->bi_error changes from Christoph.

        - Small blk cgroup cleanup from Viresh Kumar"

      * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
        blk: Fix bio_io_vec index when checking bvec gaps
        block: Replace SG_GAPS with new queue limits mask
        block: bump BLK_DEF_MAX_SECTORS to 2560
        Revert "block: remove artifical max_hw_sectors cap"
        blk-mq: fix race between timeout and freeing request
        blk-mq: fix buffer overflow when reading sysfs file of 'pending'
        Documentation: update notes in biovecs about arbitrarily sized bios
        block: remove bio_get_nr_vecs()
        fs: use helper bio_add_page() instead of open coding on bi_io_vec
        block: kill merge_bvec_fn() completely
        md/raid5: get rid of bio_fits_rdev()
        md/raid5: split bio for chunk_aligned_read
        block: remove split code in blkdev_issue_{discard,write_same}
        btrfs: remove bio splitting and merge_bvec_fn() calls
        bcache: remove driver private bio splitting code
        block: simplify bio_add_page()
        block: make generic_make_request handle arbitrarily sized bios
        blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
        block: don't access bio->bi_error after bio_put()
        block: shrink struct bio down to 2 cache lines again
        ...

| * | | blk: Fix bio_io_vec index when checking bvec gaps  (Keith Busch, 2015-09-01; 1 file, -1/+1)

      Corrects a coding error from an earlier patch.

      Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Fixes: 03100aada96f ("block: Replace SG_GAPS with new queue limits mask")
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | | block: Replace SG_GAPS with new queue limits mask  (Keith Busch, 2015-08-19; 7 files, -42/+43)

      The SG_GAPS queue flag caused checks for bio vector alignment against
      PAGE_SIZE, but the device may have different constraints. This patch
      adds a queue limit so a driver with such constraints can set it to
      allow requests that would previously have been unnecessarily split.
      The new gaps check takes the request_queue as a parameter to simplify
      the logic around invoking this function.

      This new limit makes the queue flag redundant, so remove it and all
      its usage. Device-mappers will inherit the correct settings through
      blk_stack_limits().

      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>

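      A hedged sketch of the mask-based gap check that replaces the
      PAGE_SIZE assumption; close to the helper this patch introduces, but
      quoted from memory:

          /* Two bvecs may only be merged if the second starts where the
           * first ended, relative to the device's virtual boundary mask. */
          static bool bvec_gap_to_prev(struct request_queue *q,
                                       struct bio_vec *prev, unsigned int offset)
          {
                  if (!queue_virt_boundary(q))
                          return false;   /* device has no gap constraint */
                  return offset ||
                         ((prev->bv_offset + prev->bv_len) & queue_virt_boundary(q));
          }
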
| * | | block: bump BLK_DEF_MAX_SECTORS to 2560  (Jeff Moyer, 2015-08-18; 1 file, -1/+1)

      A value of 2560 (1280k) will accommodate a 10-data-disk stripe write
      with a chunk size of 128k. In the testing I've done using iozone,
      fio, and aio-stress across a number of different storage devices, a
      value of 1280 does not show a big performance difference from 512,
      but will hopefully help software RAID setups using SATA disks, as
      reported by Christoph.

      NOTE: drivers/block/aoe/aoeblk.c sets its own max_hw_sectors_kb to
      BLK_DEF_MAX_SECTORS. So, this patch essentially changes aoeblk to use
      a larger maximum sector size, and I did not test this.

      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

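      The arithmetic behind the chosen value, spelled out as a hedged
      sketch (where exactly the constant is defined is an assumption):

          /* 2560 sectors * 512 bytes = 1310720 bytes = 1280 KiB,
           * i.e. exactly 10 data disks * 128 KiB chunk size. */
          enum { BLK_DEF_MAX_SECTORS = 2560 };
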
| * | | Revert "block: remove artifical max_hw_sectors cap"  (Jeff Moyer, 2015-08-18; 3 files, -2/+5)

      This reverts commit 34b48db66e08ca1c1bc07cf305d672ac940268dc.

      That commit caused performance regressions for streaming I/O
      workloads on a number of different storage devices, from SATA disks
      to external RAID arrays. It also managed to trip up some buggy
      firmware in at least one drive, causing data corruption.

      The next patch will bump the default max_sectors_kb value to 1280,
      which will accommodate a 10-data-disk stripe write with chunk size
      128k. In the testing I've done using iozone, fio, and aio-stress, a
      value of 1280 does not show a big performance difference from 512.
      This will hopefully still help the software RAID setup that Christoph
      saw the original performance gains with, while still not regressing
      other storage configurations.

      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | blk-mq: fix race between timeout and freeing request  (Ming Lei, 2015-08-15; 5 files, -18/+35)

      Inside the timeout handler, blk_mq_tag_to_rq() is called to retrieve
      the request from a tag. This is wrong because the request can be
      freed at any time and some fields of the request can't be trusted,
      so a kernel oops can be triggered [1].

      Currently, with respect to blk_mq_tag_to_rq(), the only special case
      is that a flush request can share a tag with the request it was
      cloned from, and the two requests can't be active at the same time.
      So this patch fixes the above issue by updating tags->rqs[tag] with
      the active request (either the flush rq or the request it was cloned
      from) for the tag. blk_mq_tag_to_rq() also gets much simplified by
      this patch.

      Given blk_mq_tag_to_rq() is mainly for drivers, and the caller must
      make sure the request can't be freed, in bt_for_each() this helper is
      replaced with tags->rqs[tag].

      [1] kernel oops log
      [  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158
      [  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0
      [  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [  439.700653] Dumping ftrace buffer:
      [  439.700653]    (ftrace buffer empty)
      [  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
      [  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265
      [  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000
      [  439.730500] RIP: 0010:[<ffffffff812d89ba>] [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283
      [  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002
      [  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
      [  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000
      [  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202
      [  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00
      [  439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000
      [  439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0
      [  439.730500] Stack:
      [  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104
      [  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80
      [  439.755663] Call Trace:
      [  439.755663]  <IRQ>
      [  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a
      [  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38
      [  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4
      [  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284
      [  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38
      [  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8
      [  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4
      [  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94
      [  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e
      [  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90
      [  439.755663]  <EOI>
      [  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a
      [  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163
      [  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163
      [  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd
      [  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a
      [  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a
      [  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47
      [  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f
      [  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74
      [  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6
      [  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f
      [  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89 f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b 53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
      [  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.790911]  RSP <ffff880819203da0>
      [  439.790911] CR2: 0000000000000158
      [  439.790911] ---[ end trace d40af58949325661 ]---

      Cc: <stable@vger.kernel.org>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

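      A hedged sketch of the fix's core idea (simplified, not the full
      patch):

          /* Whenever a request becomes the active user of a tag - including
           * a flush request sharing the tag of the request it was cloned
           * from - publish it, so concurrent readers always see the live
           * request:
           *
           *     hctx->tags->rqs[rq->tag] = rq;
           */

          /* blk_mq_tag_to_rq() then reduces to a plain lookup: */
          struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
          {
                  return tags->rqs[tag];
          }
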
| * | blk-mq: fix buffer overflow when reading sysfs file of 'pending'  (Ming Lei, 2015-08-15; 1 file, -7/+18)

      There may be so many pending requests that the PAGE_SIZE buffer can't
      hold them all. One typical example is scsi-mq: the queue depth
      (.can_queue) of the scsi_host and blk-mq is quite big, but the
      scsi_device's queue depth (.cmd_per_lun) is fairly small, so it is
      quite easy to have lots of pending requests in the hw queue.

      This patch fixes the following warning and the related memory
      corruption.

      [  359.025101] fill_read_buffer: blk_mq_hw_sysfs_show+0x0/0x7d returned bad count
      [  359.055595] irq event stamp: 15537
      [  359.055606] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [  359.055614] Dumping ftrace buffer:
      [  359.055660]    (ftrace buffer empty)
      [  359.055672] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
      [  359.055678] CPU: 4 PID: 21631 Comm: stress-ng-sysfs Not tainted 4.2.0-rc5-next-20150805 #434
      [  359.055679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [  359.055682] task: ffff8802161cc000 ti: ffff88021b4a8000 task.ti: ffff88021b4a8000
      [  359.055693] RIP: 0010:[<ffffffff811541c5>] [<ffffffff811541c5>] __kmalloc+0xe8/0x152

      Cc: <stable@vger.kernel.org>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | Documentation: update notes in biovecs about arbitrarily sized bios  (Dongsu Park, 2015-08-13; 1 file, -1/+9)

      Update block/biovecs.txt so that it includes a note on what kind of
      effects arbitrarily sized bios would bring to the block layer. Also
      fix a trivial typo, bio_iter_iovec.

      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: linux-doc@vger.kernel.org
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | block: remove bio_get_nr_vecs()  (Kent Overstreet, 2015-08-13; 16 files, -74/+18)

      We can always fill up the bio now, no need to estimate the possible
      size based on queue parameters.

      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [hch: rebased and wrote a changelog]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | fs: use helper bio_add_page() instead of open coding on bi_io_vec  (Kent Overstreet, 2015-08-13; 3 files, -20/+9)

      Call the predefined helper bio_add_page() instead of open coding
      iteration through bi_io_vec[]. Doing so makes some parts of the
      filesystems and mm/page_io.c simpler than before.

      Acked-by: Dave Kleikamp <shaggy@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: add more description in commit message]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | block: kill merge_bvec_fn() completely  (Kent Overstreet, 2015-08-13; 32 files, -859/+9)

      As generic_make_request() is now able to handle arbitrarily sized
      bios, it's no longer necessary for each individual block driver to
      define its own ->merge_bvec_fn() callback. Remove every invocation
      completely.

      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | md/raid5: get rid of bio_fits_rdev()  (Kent Overstreet, 2015-08-13; 1 file, -22/+1)

      Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now
      performed by blk_queue_split() in md_make_request().

      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: add more description in commit message]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | md/raid5: split bio for chunk_aligned_read  (Ming Lin, 2015-08-13; 1 file, -5/+32)

      If a read request fits entirely in a chunk, it will be passed
      directly to the underlying device (provided it hasn't failed, of
      course). If it doesn't fit, the slightly less efficient path that
      uses the stripe_cache is used. Requests that get to the stripe cache
      are always completely split up as necessary.

      So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop
      working, but could cause it to take the less efficient path more
      often.

      All that is needed to manage this is for 'chunk_aligned_read' to do
      some bio splitting, much like the RAID0 code does.

      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | block: remove split code in blkdev_issue_{discard,write_same}  (Ming Lin, 2015-08-13; 1 file, -36/+11)

      The split code in blkdev_issue_{discard,write_same} can go away now
      that any driver that cares does the split. We have to make sure the
      bio size doesn't overflow.

      For discard, we set the max discard sectors to (1 << 31) >> 9 to
      ensure it doesn't overflow bi_size, and hopefully it is of the proper
      granularity as long as the granularity is a power of two.

      Acked-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

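      A hedged arithmetic sketch of the overflow cap; the helper call is
      illustrative of where such a limit is applied:

          /* bio->bi_size is a 32-bit byte count. Capping discards at
           * (1 << 31) >> 9 sectors keeps the byte count at 2^31, safely
           * inside u32, and stays a power of two so any power-of-two
           * discard granularity still divides it evenly. */
          blk_queue_max_discard_sectors(q, (1U << 31) >> 9);
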
| * | btrfs: remove bio splitting and merge_bvec_fn() calls  (Kent Overstreet, 2015-08-13; 1 file, -72/+0)

      Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
      device limits as well as calling ->merge_bvec_fn() etc. That is not
      necessary any more, because generic_make_request() is now able to
      handle arbitrarily sized bios. So clean up the unnecessary code
      paths.

      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: linux-btrfs@vger.kernel.org
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      [dpark: add more description in commit message]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | bcache: remove driver private bio splitting code  (Kent Overstreet, 2015-08-13; 7 files, -162/+18)

      The bcache driver has always accepted arbitrarily large bios and
      split them internally. Now that every driver must accept arbitrarily
      large bios, this code isn't necessary anymore.

      Cc: linux-bcache@vger.kernel.org
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: add more description in commit message]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | block: simplify bio_add_page()  (Kent Overstreet, 2015-08-13; 1 file, -80/+55)

      Since generic_make_request() can now handle arbitrarily sized bios,
      all we have to do is make sure the bvec array doesn't overflow.
      __bio_add_page() doesn't need to call ->merge_bvec_fn(), so we can
      get rid of the unnecessary code paths. Removing the call to
      ->merge_bvec_fn() is also fine, as no driver that implements support
      for BLOCK_PC commands even has a ->merge_bvec_fn() method.

      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: rebase and resolve merge conflicts, change a couple of
       comments, make bio_add_page() warn once upon a cloned bio.]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

| * | block: make generic_make_request handle arbitrarily sized bios  (Kent Overstreet, 2015-08-13; 16 files, -22/+192)

      The way the block layer is currently written, it goes to great
      lengths to avoid having to split bios; upper layer code (such as
      bio_add_page()) checks what the underlying device can handle and
      tries to always create bios that don't need to be split.

      But this approach becomes unwieldy and eventually breaks down with
      stacked devices and devices with dynamic limits, and it adds a lot of
      complexity. If the block layer could split bios as needed, we could
      eliminate a lot of complexity elsewhere - particularly in stacked
      drivers. Code that creates bios can then create whatever size bios
      are convenient, and more importantly stacked drivers don't have to
      deal with both their own bio size limitations and the limitations of
      the (potentially multiple) devices underneath them. In the future
      this will let us delete merge_bvec_fn and a bunch of other code.

      We do this by adding calls to blk_queue_split() to the various
      make_request functions that need it - a few can already handle
      arbitrary size bios. Note that we add the call _after_ any call to
      blk_queue_bounce(); this means that blk_queue_split() and
      blk_recalc_rq_segments() don't need to be concerned with bouncing
      affecting segment merging.

      Some make_request_fn() callbacks were simple enough to audit and
      verify they don't need blk_queue_split() calls. The skipped ones are:

       * nfhd_make_request (arch/m68k/emu/nfblock.c)
       * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
       * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
       * brd_make_request (ramdisk - drivers/block/brd.c)
       * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
       * loop_make_request
       * null_queue_bio
       * bcache's make_request fns

      Some others are almost certainly safe to remove now, but will be left
      for future patches.

      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>

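      A hedged sketch of the blk_queue_split() placement described above;
      the surrounding driver function is illustrative:

          static void example_make_request(struct request_queue *q, struct bio *bio)
          {
                  /* Bounce first, then split: blk_queue_split() therefore
                   * never has to reason about bounced pages affecting
                   * segment merging. */
                  blk_queue_bounce(q, &bio);
                  blk_queue_split(q, &bio, q->bio_split);

                  /* ... driver-specific handling of the (possibly smaller) bio ... */
          }
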
| * | blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)  (Viresh Kumar, 2015-08-13; 1 file, -1/+1)

      IS_ERR(_OR_NULL) already contains an 'unlikely' compiler hint and
      there is no need to do that again from its callers. Drop it.

      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>