summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* kyber: fix hang on domain token wait queueOmar Sandoval2017-10-181-1/+9
| | | | | | | | | | | | | | | | | | When we're getting a domain token, if we fail to get a token on our first attempt, we put the current hardware queue on a wait queue and then try again just in case a token was freed after our initial attempt but before we got on the wait queue. If this second attempt succeeds, we currently leave the hardware queue on the wait queue. Usually this is okay; we'll just run the hardware queue one extra time when another token is freed. However, if the hardware queue doesn't have any other requests waiting, then when it it gets the extra wakeup, it won't have anything to free and therefore won't wake up any other hardware queues. If tokens are limited, then we won't make forward progress and the device will hang. Reported-by: Bin Zha <zhabin.zb@alibaba-inc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nullb: fix error return code in null_init()Wei Yongjun2017-10-171-1/+3
| | | | | | | | | Fix to return error code -ENOMEM from the null_alloc_dev() error handling case instead of 0, as done elsewhere in this function. Fixes: 2984c8684f96 ("nullb: factor disk parameters") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: fix Sphinx kernel-doc warningRandy Dunlap2017-10-161-1/+1
| | | | | | | | | | | Sphinx treats symbols that end with '_' as a kind of special documentation indicator, so fix that by adding an ending '*' to it. ../block/bio.c:404: ERROR: Unknown target name: "gfp". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: writeback rate clamping: make 32 bit safeMichael Lyle2017-10-161-2/+2
| | | | | | | | | | | Sorry this got through to linux-block, was detected by the kbuilds test robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a signed long constant. Fixes: e41166c5c44e ("bcache: writeback rate shouldn't artifically clamp") Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: MAINTAINERS: set bcache to MAINTAINEDMichael Lyle2017-10-161-1/+2
| | | | | | | Also add URL for IRC channel. Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: Add Michael Lyle to MAINTAINERSKent Overstreet2017-10-161-0/+1
| | | | | Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: safeguard a dangerous addressing in closure_queueLiang Chen2017-10-161-0/+6
| | | | | | | | | | | | | | The use of the union reduces the size of closure struct by taking advantage of the current size of its members. The offset of func in work_struct equals the size of the first three members, so that work.work_func will just reference the forth member - fn. This is smart but dangerous. It can be broken if work_struct or the other structs get changed, and can be a bit difficult to debug. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: rearrange writeback main thread ratelimitMichael Lyle2017-10-161-1/+4
| | | | | | | | | | | | | | The time spent searching for things to write back "counts" for the actual rate achieved, so don't flush the accumulated rate with each chunk. This will maintain better fidelity to user-commanded rates, but it may slightly increase the burstiness of writeback. The writeback lock needs improvement to help mitigate this. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: writeback rate shouldn't artifically clampMichael Lyle2017-10-163-6/+7
| | | | | | | | | | | | | | | | | | | The previous code artificially limited writeback rate to 1000000 blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast hardware. The rate limiting code works fine (though with decreased precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC. Additionally, ensure that uint32_t is used as a type for rate throughout the rate management so that type checking/clamp_t can work properly. bch_next_delay should be rewritten for increased precision and better handling of high rates and long sleep periods, but this is adequate for now. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reported-by: Coly Li <colyli@suse.de> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: smooth writeback rate controlMichael Lyle2017-10-162-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | This works in conjunction with the new PI controller. Currently, in real-world workloads, the rate controller attempts to write back 1 sector per second. In practice, these minimum-rate writebacks are between 4k and 60k in test scenarios, since bcache aggregates and attempts to do contiguous writes and because filesystems on top of bcachefs typically write 4k or more. Previously, bcache used to guarantee to write at least once per second. This means that the actual writeback rate would exceed the configured amount by a factor of 8-120 or more. This patch adjusts to be willing to sleep up to 2.5 seconds, and to target writing 4k/second. On the smallest writes, it will sleep 1 second like before, but many times it will sleep longer and load the backing device less. This keeps the loading on the cache and backing device related to writeback more consistent when writing back at low rates. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: implement PI controller for writeback rateMichael Lyle2017-10-163-52/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bcache uses a control system to attempt to keep the amount of dirty data in cache at a user-configured level, while not responding excessively to transients and variations in write rate. Previously, the system was a PD controller; but the output from it was integrated, turning the Proportional term into an Integral term, and turning the Derivative term into a crude Proportional term. Performance of the controller has been uneven in production, and it has tended to respond slowly, oscillate, and overshoot. This patch set replaces the current control system with an explicit PI controller and tuning that should be correct for most hardware. By default, it attempts to write at a rate that would retire 1/40th of the current excess blocks per second. An integral term in turn works to remove steady state errors. IMO, this yields benefits in simplicity (removing weighted average filtering, etc) and system performance. Another small change is a tunable parameter is introduced to allow the user to specify a minimum rate at which dirty blocks are retired. There is a slight difference from earlier versions of the patch in integral handling to prevent excessive negative integral windup. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: don't write back data if reading it failedMichael Lyle2017-10-161-6/+14
| | | | | | | | | If an IO operation fails, and we didn't successfully read data from the cache, don't writeback invalid/partial data to the backing disk. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: remove unused parameterYijing Wang2017-10-161-6/+6
| | | | | | | | Parameter bio is no longer used, clean it. Signed-off-by: Yijing Wang <wangyijing@huawei.com> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: update bio->bi_opf bypass/writeback REQ_ flag hintsEric Wheeler2017-10-162-1/+11
| | | | | | | | | | | | | | | | Flag for bypass if the IO is for read-ahead or background, unless the read-ahead request is for metadata (eg, from gfs2). Bypass if: bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) && !(bio->bi_opf & REQ_META)) Writeback if: op_is_sync(bio->bi_opf) || bio->bi_opf & (REQ_META|REQ_PRIO) Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: Remove redundant set_capacityYijing Wang2017-10-161-3/+0
| | | | | | | | | | set_capacity() has been called in bcache_device_init(), remove the redundant one. Signed-off-by: Yijing Wang <wangyijing@huawei.com> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: rewrite multiple partitions supportColy Li2017-10-161-12/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current partition support of bcache is confusing and buggy. It tries to trace non-continuous device minor numbers by an ida bit string, and mistakenly mixed bcache device index with minor numbers. This design generates several negative results, - Index of bcache device name is not consecutive under /dev/. If there are 3 bcache devices, they name will be, /dev/bcache0, /dev/bcache16, /dev/bcache32 Only bcache code indexes bcache device name is such an interesting way. - First minor number of each bcache device is traced by ida bit string. One bcache device will occupy 16 bits, this is not a good idea. Indeed only one bit is enough. - Because minor number and bcache device index are mixed, a device index is allocated by ida_simple_get(), but an first minor number is sent into ida_simple_remove() to release the device. It confused original author too. Root cause of the above errors is, bcache code should not handle device minor numbers at all! A standard process to support multiple partitions in Linux kernel is, - Device driver provides major device number, and indexes multiple device instances. - Device driver does not allocat nor trace device minor number, only provides a first minor number of a given device instance, and sets how many minor numbers (paritions) the device instance may have. All rested stuffs are handled by block layer code, most of the details can be found from block/{genhd, partition-generic}.c files. This patch re-writes multiple partitions support for bcache. It makes whole things to be more clear, and uses ida bit string in a more efficeint way. - Ida bit string only traces bcache device index, not minor number. For a bcache device with 128 partitions, only one bit in ida bit string is enough. - Device minor number and device index are separated in concept. Device index is used for /dev node naming, and ida bit string trace. Minor number is calculated from device index and only used to initialize first_minor of a bcache device. - It does not follow any standard for 16 partitions on a bcache device. This patch sets 128 partitions on single bcache device at max, this is the limitation from GPT (GUID Partition Table) and supported by fdisk. Considering a typical device minor number is 20 bits width, each bcache device may have 128 partitions (7 bits), there can be 8192 bcache devices existing on system. For most common deployment for a single server in now days, it should be enough. [minor spelling fixes in commit message by Michael Lyle] Signed-off-by: Coly Li <colyli@suse.de> Cc: Eric Wheeler <bcache@lists.ewheeler.net> Cc: Junhui Tang <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: fix a comments typo in bch_alloc_sectors()Coly Li2017-10-161-1/+1
| | | | | | | | | | | | Code comments in alloc.c:bch_alloc_sectors() mentions a function name find_data_bucket(), the correct function name should be pick_data_bucket() indeed. bch_alloc_sectors() is a quite important function in bcache allocation code, fixing the typo may help other people to have less confusion. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: check ca->alloc_thread initialized before wake up itColy Li2017-10-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | In bcache code, sysfs entries are created before all resources get allocated, e.g. allocation thread of a cache set. There is posibility for NULL pointer deference if a resource is accessed but which is not initialized yet. Indeed Jorg Bornschein catches one on cache set allocation thread and gets a kernel oops. The reason for this bug is, when bch_bucket_alloc() is called during cache set registration and attaching, ca->alloc_thread is not properly allocated and initialized yet, call wake_up_process() on ca->alloc_thread triggers NULL pointer deference failure. A simple and fast fix is, before waking up ca->alloc_thread, checking whether it is allocated, and only wake up ca->alloc_thread when it is not NULL. Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Jorg Bornschein <jb@capsec.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: Avoid nested function definitionPeter Foley2017-10-161-4/+6
| | | | | | | | | | | | | | | | | | | Fixes below error with clang: ../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here { return *((uint16_t *) r) - *((uint16_t *) l); } ^ ../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp' sort(p, n, sizeof(uint16_t), cmp, NULL); ^ 2 errors generated. v2: rename function to __bch_cache_cmp Signed-off-by: Peter Foley <pefoley2@pefoley.com> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* mm/page-writeback.c: make changes of dirty_writeback_centisecs take effect ↵Yafang Shao2017-10-141-1/+10
| | | | | | | | | | | | | | | | | | | | immediately This patch is the followup of the prvious patch: [writeback: schedule periodic writeback with sysctl]. There's another issue to fix. For example, - When the tunable was set to one hour and is reset to one second, the new setting will not take effect for up to one hour. Kicking the flusher threads immediately fixes it. Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* null_blk: add usage hints for no_schedweiping zhang2017-10-131-0/+4
| | | | | | | | This parameter provide an option to disable io scheduler when nullb* in multi-queue mode. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* null_blk: update usage hints for submit_queuesweiping zhang2017-10-131-3/+3
| | | | | | | update the range of submits_queues, and correct usage hints. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Revert "lightnvm: prevent bd removal if busy"Jens Axboe2017-10-131-14/+0
| | | | | | | | | | Christoph correctly points out that this issue is no different for other block devices, and poking at cross layer internals is not the right way to solve it. This reverts commit bb6aa6f08268bbce4e0185b18cab9e04505d6695. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: implement generic path for sync I/OJavier González2017-10-137-123/+143
| | | | | | | | | | Implement a generic path for sending sync I/O on LightNVM. This allows to reuse the standard synchronous path trough blk_execute_rq(), instead of implementing a wait_for_completion on the target side (e.g., pblk). Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: fail fast on passthrough commandsJavier González2017-10-131-2/+0
| | | | | | | | | | Make LightNVM passhtrough commands fail fast. User space will then take care of re-submitting. Fixes: 84d4add793c6 ('lightnvm: add ioctls for vector I/Os') Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: avoid being reported as hung on rated GCJavier González2017-10-131-1/+7
| | | | | | | | | | | | | | | | The amount of GC I/O on the write buffer is managed by the rate-limiter, which is calculated as a function of the number of available free blocks. When reaching the stable point, we risk having scheduled more I/Os for GC than are allowed on the write buffer. This would result on the GC semaphore balancing the outstanding read GC I/Os to be reported as "hung", though the behavior is normal. Solve this by allowing to schedule when we detect that the read GC path is not moving forward. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: cleanup unused and static functionsJavier González2017-10-134-111/+86
| | | | | | | | Cleanup up unused and static functions across the whole codebase. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: remove spinlock when freeing line metadataHans Holmberg2017-10-131-2/+0
| | | | | | | | | | | | | | Lockdep complains about being in atomic context while freeing line metadata - and rightly so as we take a spinlock and end up calling vfree that might sleep(in pblk_mfree). There is no need for holding the line manager free_lock while freeing line metadata as the pipeline as stopped, so remove the lock. Fixes: 588726d3ec68 ("lightnvm: pblk: fail gracefully on irrec. error") Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: correct valid lba count calculationHans Holmberg2017-10-131-3/+4
| | | | | | | | | | During garbage collect, lbas being written can end up being invalidated. Make sure that this is reflected in the valid lba count. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: gc all lines in the pipeline before exitHans Holmberg2017-10-134-9/+32
| | | | | | | | | | | | | | | | | Finish garbage collect of the lines that are in the gc pipeline before exiting. Ensure that all lines already in in the pipeline goes through, from read to write. Do this by keeping track of how many lines are in the pipeline and waiting for that number to reach zero before exiting the gc reader task. Since we're adding a new gc line counter, change the name of inflight_gc to read_inflight_gc to make the distinction clear. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: add l2p crc debug printoutsHans Holmberg2017-10-131-5/+34
| | | | | | | | | | | | | | Print the CRC of the logical-to-physical mapping during exit and after recovering the L2P table to facilitate detection of meta data corruption/recovery issues. The CRC printed after recovery should match the CRC printed during the previous exit - if it doesn't this indicates that either the meta data written to the disk is corrupt or recovery failed. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: shut down gc gracefully during exitHans Holmberg2017-10-131-8/+12
| | | | | | | | Shut down the GC workqueues and tasks in the right order. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: consider bad sectors in emeta during recoveryHans Holmberg2017-10-131-14/+30
| | | | | | | | | | | | | | | | | | | | | | When recovering lines we need to consider that bad blocks in a line affect the emeta area size. Previously it was assumed that the emeta area would grow by the number of sectors per page * number of bad blocks in the line. This assumption is not correct - the number of "extra" pages that are consumed could be both smaller (depending on emeta size) and bigger (depending on the placement of the bad blocks). Fix this by calculating the emeta start by iterating backwards through the line, skipping ppas that map to bad blocks. Also fix the data types used for ppa indices/counts in pblk_recov_l2p_from_emeta - we should use u64. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: start gc if needed during initHans Holmberg2017-10-134-1/+12
| | | | | | | | | | | | Start GC if needed, directly after init, as we might need to garbage collect in order to make room for user writes. Create a helper function that allows to kick GC without exposing the internals of the GC/rate-limiter interaction. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: free full lines during recoveryHans Holmberg2017-10-133-12/+24
| | | | | | | | | | | | | | When rebuilding the L2P table, any full lines (lines without any valid sectors) will be identified. If these lines are not freed, we risk not being able to allocate the first data line. This patch refactors the part of GC that frees empty lines into a separate function and adds a call to this after the L2P table has been rebuilt. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: recover partially written lines correctlyHans Holmberg2017-10-131-4/+4
| | | | | | | | | | | | | When recovering partially written lines, the valid sector count must be decreased by the number of padded sectors in the line. Update line recovery to take all ADDR_EMPTY(padded) sectors into account. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: prevent gc kicks when gc is not operationalHans Holmberg2017-10-132-4/+6
| | | | | | | | | | | | | GC can be kicked after it has been shut down when closing the last line during exit, resulting in accesses to freed structures. Make sure that GC is not triggered while it is not operational. Also make sure that GC won't be re-activated during exit when running on another processor by using timer_del_sync. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: fix releases of kmem cache in error pathRakesh Pandit2017-10-131-5/+12
| | | | | | | | If pblk_core_init fails lets destroy all global caches. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: reduce arguments in __pblk_rb_update_l2pRakesh Pandit2017-10-131-6/+5
| | | | | | | | We already pass the structure pointer so no need to pass the member. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: remove stale extern and unused exported symbolsRakesh Pandit2017-10-132-72/+64
| | | | | | | | | Not all exported symbols are being used outside core and there were some stale entries in lightnvm.h Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: remove unused argument from nvm_set_tgt_bb_tblRakesh Pandit2017-10-132-18/+13
| | | | | | | | | | vblk isn't being used anyway and if we ever have a usecase we can introduce this again. This makes the logic easier and removes unnecessary checks. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: remove useless lineRakesh Pandit2017-10-131-1/+0
| | | | | | Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: fix changing GC group list for a lineRakesh Pandit2017-10-131-2/+2
| | | | | | | | | | | | | pblk_line_gc_list seems to had a bug since the introduction of pblk in getting GC list for a line. In b20ba1bc7 while redesigning the GC algorithm, the naming for the GC thresholds was altered, but the values for high_thrs and mid_thrs were not. The result is that when moving to the GC lists, the mid threshold is never evaluated. Fixes: a4bd217b4("lightnvm: physical block device (pblk) target") Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: ensure right bad block calculationJavier González2017-10-131-1/+1
| | | | | | | | | | | Make sure that the variable controlling block threshold for allocating extra metadata sectors in case of a line with bad blocks does not get a negative value. Otherwise, the line will be marked as corrupted and wasted. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: enable 1 LUN configurationJavier González2017-10-134-19/+29
| | | | | | | | | | | | | | | | | | | | | | | | | Metadata I/Os are scheduled to minimize their impact on user data I/Os. When there are enough LUNs instantiated (i.e., enough bandwidth), it is easy to interleave metadata and data one after the other so that metadata I/Os are the ones being blocked and not vice-versa. We do this by calculating the distance between the I/Os in terms of the LUNs that are not in used, and selecting a free LUN that satisfies a the simple heuristic that metadata is scheduled behind. The per-LUN semaphores guarantee consistency. This works fine on >1 LUN configuration. However, when a single LUN is instantiated, this design leads to a deadlock, where metadata waits to be scheduled on a free LUN. This patch implements the 1 LUN case by simply scheduling the metadada I/O after the data I/O. In the process, we refactor the way a line is replaced to ensure that metadata writes are submitted after data writes in order to guarantee block sequentiality. Note that, since there is only one LUN, both I/Os will block each other by design. However, such configuration only pursues tight read latencies, not write bandwidth. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: remove I/O dependency on write pathJavier González2017-10-131-80/+65
| | | | | | | | | | | | | | pblk schedules user I/O, metadata I/O and erases on the write path in order to minimize collisions at the media level. Until now, there has been a dependency between user and metadata I/Os that could lead to a deadlock as both take the per-LUN semaphore to schedule submission. This path removes this dependency and guarantees forward progress at a per I/O granurality. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: remove redundant check on read pathJavier González2017-10-131-2/+2
| | | | | | | | | | | A partial read I/O in pblk is an I/O where some sectors reside in the write buffer in main memory and some are persisted on the device. Such an I/O must at least contain 2 lbas, therefore checking for the case where a single lba is mapped is not necessary. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: guarantee line integrity on readsJavier González2017-10-134-25/+118
| | | | | | | | | | | | | | | When a line is recycled during garbage collection, reads can still be issued to the line. If the line is freed in the middle of this process, data corruption might occur. This patch guarantees that lines are not freed in the middle of reads that target them (lines). Specifically, we use the existing line reference to decide when a line is eligible for being freed after the recycle process. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: check lba sanity on read pathJavier González2017-10-132-4/+51
| | | | | | | | | | | | | | | | | As part of pblk's recovery scheme, we store the lba mapped to each physical sector on the device's out-of-bound (OOB) area. On the read path, we can use this information to validate that the data being delivered to the upper layers corresponds to the lba being requested. The cost of this check is an extra copy on the DMA region on the device and an extra comparison in the host, given that (i) the OOB area is being read together with the data in the media, and (ii) the DMA region allocated for the ppa list can be reused for the metadata stored on the OOB area. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: use rqd->end_io for completionJavier González2017-10-132-10/+2
| | | | | | | | | For consistency with the rest of pblk, use rqd->end_io to point to the function taking care of ending the request on the completion path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>