| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull lightnvm support from Jens Axboe:
"This adds support for lightnvm, and adds support to NVMe as well.
This is pretty exciting, in that it enables new and interesting use
cases for compatible flash devices. There's a LWN writeup about an
earlier posting here:
https://lwn.net/Articles/641247/
This has been underway for a while, and should be ready for merging at
this point"
* 'for-4.4/lightnvm' of git://git.kernel.dk/linux-block:
nvme: lightnvm: clean up a data type
lightnvm: refactor phys addrs type to u64
nvme: LightNVM support
rrpc: Round-robin sector target with cost-based gc
gennvm: Generic NVM manager
lightnvm: Support for Open-Channel SSDs
|
| |
| |
| |
| |
| |
| |
| |
| | |
"nlb_pr_rq" can't be more than u32 because "len" is a u32. Later we
truncate it to u32 anyway when we calculate min_t().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
For cases where CONFIG_LBDAF is not set. The struct ppa_addr exceeds its
type on 32 bit architectures. ppa_addr requires a 64bit integer to hold
the generic ppa format. We therefore refactor it to u64 and
replaces the sector_t usages with u64 for physical addresses.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The first generation of Open-Channel SSDs is based on NVMe. The NVMe
driver is extended with support for the LightNVM command set.
Detection is made through PCI IDs. Current supported devices are the
qemu nvme simulator and CNEX Labs Westlake SSD. The qemu nvme enables
support through vendor specific bits in the namespace identification and
the CNEX Labs Westlake SSD implements a LightNVM compatible firmware and
is detected using the same method as qemu.
After detection, vendor specific codes are used to identify the device
and enumerate supported features.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Javier González <jg@lightnvm.io>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This target allows an Open-Channel SSD to be exposed asas a block
device.
It implements a round-robin approach for sector allocation,
together with a greedy cost-based garbage collector.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The implementation for Open-Channel SSDs is divided into media
management and targets. This patch implements a generic media manager
for open-channel SSDs. After a media manager has been initialized,
single or multiple targets can be instantiated with the media managed as
the backend.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Open-channel SSDs are devices that share responsibilities with the host
in order to implement and maintain features that typical SSDs keep
strictly in firmware. These include (i) the Flash Translation Layer
(FTL), (ii) bad block management, and (iii) hardware units such as the
flash controller, the interface controller, and large amounts of flash
chips. In this way, Open-channels SSDs exposes direct access to their
physical flash storage, while keeping a subset of the internal features
of SSDs.
LightNVM is a specification that gives support to Open-channel SSDs
LightNVM allows the host to manage data placement, garbage collection,
and parallelism. Device specific responsibilities such as bad block
management, FTL extensions to support atomic IOs, or metadata
persistence are still handled by the device.
The implementation of LightNVM consists of two parts: core and
(multiple) targets. The core implements functionality shared across
targets. This is initialization, teardown and statistics. The targets
implement the interface that exposes physical flash to user-space
applications. Examples of such targets include key-value store,
object-store, as well as traditional block devices, which can be
application-specific.
Contributions in this patch from:
Javier Gonzalez <jg@lightnvm.io>
Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Jesper Madsen <jmad@itu.dk>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|\|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull block driver updates from Jens Axboe:
"Here are the block driver changes for 4.4. This pull request
contains:
- NVMe:
- Refactor and moving of code to prepare for proper target
support. From Christoph and Jay.
- 32-bit nvme warning fix from Arnd.
- Error initialization fix from me.
- Proper namespace removal and reference counting support from
Keith.
- Device resume fix on IO failure, also from Keith.
- Dependency fix from Keith, now that nvme isn't under the
umbrella of the block anymore.
- Target location and maintainers update from Jay.
- From Ming Lei, the long awaited DIO/AIO support for loop.
- Enable BD-RE writeable opens, from Georgios"
* 'for-4.4/drivers' of git://git.kernel.dk/linux-block: (24 commits)
Update target repo for nvme patch contributions
NVMe: initialize error to '0'
nvme: use an integer value to Linux errno values
nvme: fix 32-bit build warning
NVMe: Add explicit block config dependency
nvme: include <linux/types.ĥ> in <linux/nvme.h>
nvme: move to a new drivers/nvme/host directory
nvme.h: add missing nvme_id_ctrl endianess annotations
nvme: move hardware structures out of the uapi version of nvme.h
nvme: add a local nvme.h header
nvme: properly handle partially initialized queues in nvme_create_io_queues
nvme: merge nvme_dev_start, nvme_dev_resume and nvme_async_probe
nvme: factor reset code into a common helper
nvme: merge nvme_dev_reset into nvme_reset_failed_dev
nvme: delete dev from dev_list in nvme_reset
NVMe: Simplify device resume on io queue failure
NVMe: Namespace removal simplifications
NVMe: Reference count open namespaces
cdrom: Random writing support for BD-RE media
block: loop: support DIO & AIO
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Per http://www.nvmexpress.org/resources/linux-driver-information/, the
old nvme git repo is stale. Updating MAINTAINERS to the Supported
target currently used by the community.
Signed-off-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Updated by me to add Keith as the maintainer, me as the co-maintainer.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| | |
Reported-by: Keith Busch <keith.busch@intel.com>
Fixes: 1951feae88c5 ("nvme: use an integer value to Linux errno values")
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Use a separate integer variable to hold the signed Linux errno
values we pass back to the block layer. Note that for pass through
commands those might still be NVMe values, but those fit into the
int as well.
Fixes: f4829a9b7a61: ("blk-mq: fix racy updates of rq->errors")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Compiling the nvme driver on 32-bit warns about a cast from a __u64
variable to a pointer:
drivers/block/nvme-core.c: In function 'nvme_submit_io':
drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
(void __user *)io.addr, length, NULL, 0);
The cast here is intentional and safe, so we can shut up the
gcc warning by adding an intermediate cast to 'uintptr_t'.
I had previously submitted a patch to fix this problem in the
nvme driver, but it was accepted on the same day that two new
warnings got added.
For clarification, I also change the third instance of this cast
to use uintptr_t instead of unsigned long now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: d29ec8241c10e ("nvme: submit internal commands through the block layer")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The nvme driver was moved from drivers/block, losing our implicit
dependency on CONFIG_BLOCK. This makes it an explicit driver dependency.
Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The buildbot complains about this even if it doesn't generate
a a build warning. But it's an easy fix, so here we go:
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch moves the NVMe driver from drivers/block/ to its own new
drivers/nvme/host/ directory. This is in preparation of splitting the
current monolithic driver up and add support for the upcoming NVMe
over Fabrics standard. The drivers/nvme/host/ is chose to leave space
for a NVMe target implementation in addition to this host side driver.
Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
[hch: rebased, renamed core.c to pci.c, slight tweaks]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| | |
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently all NVMe command and completion structures are exposed to userspace
through the uapi version of nvme.h. They are not an ABI between the kernel
and userspace, and will change in C-incompatible way for future versions of
the spec. Move them to the kernel version of the file and rename the uapi
header to nvme_ioctl.h so that userspace can easily detect the presence of
the new clean header. Nvme-cli already carries a local copy of the header,
so it won't be affected by this move.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add a new drivers/block/nvme.h which contains all the driver internal
interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
This avoids having to clean up later in a seemingly unrelated place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
And give the resulting function a sensible name. This keeps all the
error handling in a single place and will allow for further improvements
to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| | |
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
And give the resulting function a more descriptive name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Device resets need to delete the device from the device list before
kicking of the reset an re-probe, otherwise we get the device added
to the list twice. nvme_reset is the only side missing this deletion
at the moment, and this patch adds it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Releasing IO queues and disks was done in a work queue outside the
controller resume context to delete namespaces if the controller failed
after a resume from suspend. This is unnecessary since we can resume
a device asynchronously.
This patch makes resume use probe_work so it can directly remove
namespaces if the device is manageable but not IO capable. Since the
deleting disks was the only reason we had the convoluted "reset_workfn",
this patch removes that unnecessary indirection.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This liberates namespace removal from the device, allowing gendisk
references to be closed independent of the nvme controller reference
count.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Dynamic namespace attachment means the namespace may be removed at any
time, so the namespace reference count can not be tied to the device
reference count. This fixes a NULL dereference if an opened namespace
is detached from a controller.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |\
| | |
| | |
| | | |
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Recently, i bought a blu-ray writer and noticed that while cdrecord
worked perfectly, random writing didn't work on rewritable bd-re media.
For example, dd if=/dev/zero of=/dev/sr0 bs=32768 count=2 gave the usual
"read-only file system" message.
After checking if the problem lies with my burner or firmware, i grep-ed
the kernel source for EROFS. One of the results was in the cdrom driver.
I tried to follow the function chain and ended in the cdrom_is_dvd_rw
function where writing is permitted only for DVD-RAM and DVD+RW media.
I added a new case label for 0x43 which is the profile name of BD-RE
and now it works correctly for BD-RE too.
Maybe there is a better way of implementing this, like a new function
checking for blu-ray support and called from cdrom_open_write like
it happens for mrw and dvdram media, but adding the case label worked.
Thank you for your time.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There are at least 3 advantages to use direct I/O and AIO on
read/write loop's backing file:
1) double cache can be avoided, then memory usage gets
decreased a lot
2) not like user space direct I/O, there isn't cost of
pinning pages
3) avoid context switch for obtaining good throughput
- in buffered file read, random I/O top throughput is often obtained
only if they are submitted concurrently from lots of tasks; but for
sequential I/O, most of times they can be hit from page cache, so
concurrent submissions often introduce unnecessary context switch
and can't improve throughput much. There was such discussion[1]
to use non-blocking I/O to improve the problem for application.
- with direct I/O and AIO, concurrent submissions can be
avoided and random read throughput can't be affected meantime
xfstests(-g auto, ext4) is basically passed when running with
direct I/O(aio), one exception is generic/232, but it failed in
loop buffered I/O(4.2-rc6-next-20150814) too.
Follows the fio test result for performance purpose:
4 jobs fio test inside ext4 file system over loop block
1) How to run
- KVM: 4 VCPUs, 2G RAM
- linux kernel: 4.2-rc6-next-20150814(base) with the patchset
- the loop block is over one image on SSD.
- linux psync, 4 jobs, size 1500M, ext4 over loop block
- test result: IOPS from fio output
2) Throughput(IOPS) becomes a bit better with direct I/O(aio)
-------------------------------------------------------------
test cases |randread |read |randwrite |write |
-------------------------------------------------------------
base |8015 |113811 |67442 |106978
-------------------------------------------------------------
base+loop aio |8136 |125040 |67811 |111376
-------------------------------------------------------------
- somehow, it should be caused by more page cache avaiable for
application or one extra page copy is avoided in case of direct I/O
3) context switch
- context switch decreased by ~50% with loop direct I/O(aio)
compared with loop buffered I/O(4.2-rc6-next-20150814)
4) memory usage from /proc/meminfo
-------------------------------------------------------------
| Buffers | Cached
-------------------------------------------------------------
base | > 760MB | ~950MB
-------------------------------------------------------------
base+loop direct I/O(aio) | < 5MB | ~1.6GB
-------------------------------------------------------------
- so there are much more page caches available for application with
direct I/O
[1] https://lwn.net/Articles/612483/
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If loop block is mounted via 'mount -o loop', it isn't easy
to pass file descriptor opened as O_DIRECT, so this patch
introduces a new command to support direct IO for this case.
Cc: linux-api@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patches provides one interface for enabling direct IO
from user space:
- userspace(such as losetup) can pass 'file' which is
opened/fcntl as O_DIRECT
Also __loop_update_dio() is introduced to check if direct I/O
can be used on current loop setting.
The last big change is to introduce LO_FLAGS_DIRECT_IO flag
for userspace to know if direct IO is used to access backing
file.
Cc: linux-api@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The following patch will use dio/aio to submit IO to backing file,
then it needn't to schedule IO concurrently from work, so
use kthread_work for decreasing context switch cost a lot.
For non-AIO case, single thread has been used for long long time,
and it was just converted to work in v4.0, which has caused performance
regression for fedora live booting already. In discussion[1], even
though submitting I/O via work concurrently can improve random read IO
throughput, meantime it might hurt sequential read IO performance, so
better to restore to single thread behaviour.
For the following AIO support, it is better to use multi hw-queue
with per-hwq kthread than current work approach suppose there is so
high performance requirement for loop.
[1] http://marc.info/?t=143082678400002&r=1&w=2
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
It doesn't make sense to enable merge because the I/O
submitted to backing file is handled page by page.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull core block updates from Jens Axboe:
"This is the core block pull request for 4.4. I've got a few more
topic branches this time around, some of them will layer on top of the
core+drivers changes and will come in a separate round. So not a huge
chunk of changes in this round.
This pull request contains:
- Enable blk-mq page allocation tracking with kmemleak, from Catalin.
- Unused prototype removal in blk-mq from Christoph.
- Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
xchg()'s, from Davidlohr.
- A plug flush fix from Jeff.
- Also from Jeff, a fix that means we don't have to update shared tag
sets at init time unless we do a state change. This cuts down boot
times on thousands of devices a lot with scsi/blk-mq.
- blk-mq waitqueue barrier fix from Kosuke.
- Various fixes from Ming:
- Fixes for segment merging and splitting, and checks, for
the old core and blk-mq.
- Potential blk-mq speedup by marking ctx pending at the end
of a plug insertion batch in blk-mq.
- direct-io no page dirty on kernel direct reads.
- A WRITE_SYNC fix for mpage from Roman"
* 'for-4.4/core' of git://git.kernel.dk/linux-block:
blk-mq: avoid excessive boot delays with large lun counts
blktrace: re-write setting q->blk_trace
blk-mq: mark ctx as pending at batch in flush plug path
blk-mq: fix for trace_block_plug()
block: check bio_mergeable() early before merging
blk-mq: check bio_mergeable() early before merging
block: avoid to merge splitted bio
block: setup bi_phys_segments after splitting
block: fix plug list flushing for nomerge queues
blk-mq: remove unused blk_mq_clone_flush_request prototype
blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
block: kmemleak: Track the page allocations for struct request
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Hi,
Zhangqing Luo reported long boot times on a system with thousands of
LUNs when scsi-mq was enabled. He narrowed the problem down to
blk_mq_add_queue_tag_set, where every queue is frozen in order to set
the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues
added before it in sequence, which involves waiting for an RCU grace
period for each one. We don't need to do this. After the second queue
is added, only new queues need to be initialized with the shared tag.
We can do that by percolating the flag up to the blk_mq_tag_set, and
updating the newly added queue's hctxs if the flag is set.
This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve
support for shared tags maps).
Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This is really about simplifying the double xchg patterns into
a single cmpxchg, with the same logic. Other than the immediate
cleanup, there are some subtleties this change deals with:
(i) While the load of the old bt is fully ordered wrt everything,
ie:
old_bt = xchg(&q->blk_trace, bt); [barrier]
if (old_bt)
(void) xchg(&q->blk_trace, old_bt); [barrier]
blk_trace could still be changed between the xchg and the old_bt
load. Note that this description is merely theoretical and afaict
very small, but doing everything in a single context with cmpxchg
closes this potential race.
(ii) Ordering guarantees are obviously kept with cmpxchg.
(iii) Gets rid of the hacky-by-nature (void)xchg pattern.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
eviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Most of times, flush plug should be the hottest I/O path,
so mark ctx as pending after all requests in the list are
inserted.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The trace point is for tracing plug event of each request
queue instead of each task, so we should check the request
count in the plug list from current queue instead of
current task.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
After bio splitting is introduced, one bio can be splitted
and it is marked as NOMERGE because it is too fat to be merged,
so check bio_mergeable() earlier to avoid to try to merge it
unnecessarily.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
It isn't necessary to try to merge the bio which is marked
as NOMERGE.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The splitted bio has been already too fat to merge, so mark it
as NOMERGE.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The number of bio->bi_phys_segments is always obtained
during bio splitting, so it is natural to setup it
just after bio splitting, then we can avoid to compute
nr_segment again during merge.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Request queues with merging disabled will not flush the plug list after
BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
on blk_attempt_plug_merge to compute the request_count. Fix this by
computing the number of queued requests even for nomerge queues.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | | |
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |/
| |/|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
blk_mq_tag_update_depth() seems to be missing a memory barrier which
might cause the waker to not notice the waiter and fail to send a
wake_up as in the following figure.
blk_mq_tag_update_depth bt_get
------------------------------------------------------------------------
if (waitqueue_active(&bs->wait))
/* The CPU might reorder the test for
the waitqueue up here, before
prior writes complete */
prepare_to_wait(&bs->wait, &wait,
TASK_UNINTERRUPTIBLE);
tag = __bt_get(hctx, bt, last_tag,
tags);
/* Value set in bt_update_count not
visible yet */
bt_update_count(&tags->bitmap_tags, tdepth);
/* blk_mq_tag_wakeup_all(tags, false); */
bt = &tags->bitmap_tags;
wake_index = atomic_read(&bt->wake_index);
...
io_schedule();
------------------------------------------------------------------------
This patch adds the missing memory barrier.
I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).
Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |\ \
| | |/
| |/|
| | |
| | |
| | |
| | | |
Linux 4.3-rc4
Pulling in v4.3-rc4 to avoid conflicts with NVMe fixes that have gone
in since for-4.4/core was based.
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When direct read IO is submitted from kernel, it is often
unnecessary to dirty pages, for example of loop, dirtying pages
have been considered in the upper filesystem(over loop) side
already, and they don't need to be dirtied again.
So this patch doesn't dirtying pages for ITER_BVEC/ITER_KVEC
direct read, and loop should be the 1st case to use ITER_BVEC/ITER_KVEC
for direct read I/O.
The patch is based on previous Dave's patch.
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity
write, thus mark request as WRITE_SYNC.
akpm: afaict this change will cause the data integrity write bios to be
placed onto the second queue in cfq_io_cq.cfqq[], which presumably results
in special treatment. The documentation for REQ_SYNC is horrid.
Signed-off-by: Roman Pen <r.peniaev@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The pages allocated for struct request contain pointers to other slab
allocations (via ops->init_request). Since kmemleak does not track/scan
page allocations, the slab objects will be reported as leaks (false
positives). This patch adds kmemleak callbacks to allow tracking of such
pages.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche<bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael Wysocki:
"Quite a new features are included this time.
First off, the Collaborative Processor Performance Control interface
(version 2) defined by ACPI will now be supported on ARM64 along with
a cpufreq frontend for CPU performance scaling.
Second, ACPI gets a new infrastructure for the early probing of IRQ
chips and clock sources (along the lines of the existing similar
mechanism for DT).
Next, the ACPI core and the generic device properties API will now
support a recently introduced hierarchical properties extension of the
_DSD (Device Specific Data) ACPI device configuration object. If the
ACPI platform firmware uses that extension to organize device
properties in a hierarchical way, the kernel will automatically handle
it and make those properties available to device drivers via the
generic device properties API.
It also will be possible to build the ACPICA's AML interpreter
debugger into the kernel now and use that to diagnose AML-related
problems more efficiently. In the future, this should make it
possible to single-step AML execution and do similar things.
Interesting stuff, although somewhat experimental at this point.
Finally, the PM core gets a new mechanism that can be used by device
drivers to distinguish between suspend-to-RAM (based on platform
firmware support) and suspend-to-idle (or other variants of system
suspend the platform firmware is not involved in) and possibly
optimize their device suspend/resume handling accordingly.
In addition to that, some existing features are re-organized quite
substantially.
First, the ACPI-based handling of PCI host bridges on x86 and ia64 is
unified and the common code goes into the ACPI core (so as to reduce
code duplication and eliminate non-essential differences between the
two architectures in that area).
Second, the Operating Performance Points (OPP) framework is
reorganized to make the code easier to find and follow.
Next, the cpufreq core's sysfs interface is reorganized to get rid of
the "primary CPU" concept for configurations in which the same
performance scaling settings are shared between multiple CPUs.
Finally, some interfaces that aren't necessary any more are dropped
from the generic power domains framework.
On top of the above we have some minor extensions, cleanups and bug
fixes in multiple places, as usual.
Specifics:
- ACPICA update to upstream revision 20150930 (Bob Moore, Lv Zheng).
The most significant change is to allow the AML debugger to be
built into the kernel. On top of that there is an update related
to the NFIT table (the ACPI persistent memory interface) and a few
fixes and cleanups.
- ACPI CPPC2 (Collaborative Processor Performance Control v2) support
along with a cpufreq frontend (Ashwin Chaugule).
This can only be enabled on ARM64 at this point.
- New ACPI infrastructure for the early probing of IRQ chips and
clock sources (Marc Zyngier).
- Support for a new hierarchical properties extension of the ACPI
_DSD (Device Specific Data) device configuration object allowing
the kernel to handle hierarchical properties (provided by the
platform firmware this way) automatically and make them available
to device drivers via the generic device properties interface
(Rafael Wysocki).
- Generic device properties API extension to obtain an index of
certain string value in an array of strings, along the lines of
of_property_match_string(), but working for all of the supported
firmware node types, and support for the "dma-names" device
property based on it (Mika Westerberg).
- ACPI core fix to parse the MADT (Multiple APIC Description Table)
entries in the order expected by platform firmware (and mandated by
the specification) to avoid confusion on systems with more than 255
logical CPUs (Lukasz Anaczkowski).
- Consolidation of the ACPI-based handling of PCI host bridges on x86
and ia64 (Jiang Liu).
- ACPI core fixes to ensure that the correct IRQ number is used to
represent the SCI (System Control Interrupt) in the cases when it
has been re-mapped (Chen Yu).
- New ACPI backlight quirk for Lenovo IdeaPad S405 (Hans de Goede).
- ACPI EC driver fixes (Lv Zheng).
- Assorted ACPI fixes and cleanups (Dan Carpenter, Insu Yun, Jiri
Kosina, Rami Rosen, Rasmus Villemoes).
- New mechanism in the PM core allowing drivers to check if the
platform firmware is going to be involved in the upcoming system
suspend or if it has been involved in the suspend the system is
resuming from at the moment (Rafael Wysocki).
This should allow drivers to optimize their suspend/resume handling
in some cases and the changes include a couple of users of it (the
i8042 input driver, PCI PM).
- PCI PM fix to prevent runtime-suspended devices with PME enabled
from being resumed during system suspend even if they aren't
configured to wake up the system from sleep (Rafael Wysocki).
- New mechanism to report the number of a wakeup IRQ that woke up the
system from sleep last time (Alexandra Yates).
- Removal of unused interfaces from the generic power domains
framework and fixes related to latency measurements in that code
(Ulf Hansson, Daniel Lezcano).
- cpufreq core sysfs interface rework to make it handle CPUs that
share performance scaling settings (represented by a common cpufreq
policy object) more symmetrically (Viresh Kumar).
This should help to simplify the CPU offline/online handling among
other things.
- cpufreq core fixes and cleanups (Viresh Kumar).
- intel_pstate fixes related to the Turbo Activation Ratio (TAR)
mechanism on client platforms which causes the turbo P-states range
to vary depending on platform firmware settings (Srinivas
Pandruvada).
- intel_pstate sysfs interface fix (Prarit Bhargava).
- Assorted cpufreq driver (imx, tegra20, powernv, integrator) fixes
and cleanups (Bai Ping, Bartlomiej Zolnierkiewicz, Shilpasri G
Bhat, Luis de Bethencourt).
- cpuidle mvebu driver cleanups (Russell King).
- OPP (Operating Performance Points) framework code reorganization to
make it more maintainable (Viresh Kumar).
- Intel Broxton support for the RAPL (Running Average Power Limits)
power capping driver (Amy Wiles).
- Assorted power management code fixes and cleanups (Dan Carpenter,
Geert Uytterhoeven, Geliang Tang, Luis de Bethencourt, Rasmus
Villemoes)"
* tag 'pm+acpi-4.4-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (108 commits)
cpufreq: postfix policy directory with the first CPU in related_cpus
cpufreq: create cpu/cpufreq/policyX directories
cpufreq: remove cpufreq_sysfs_{create|remove}_file()
cpufreq: create cpu/cpufreq at boot time
cpufreq: Use cpumask_copy instead of cpumask_or to copy a mask
cpufreq: ondemand: Drop unnecessary locks from update_sampling_rate()
PM / Domains: Merge measurements for PM QoS device latencies
PM / Domains: Don't measure ->start|stop() latency in system PM callbacks
PM / clk: Fix broken build due to non-matching code and header #ifdefs
ACPI / Documentation: add copy_dsdt to ACPI format options
ACPI / sysfs: correctly check failing memory allocation
ACPI / video: Add a quirk to force native backlight on Lenovo IdeaPad S405
ACPI / CPPC: Fix potential memory leak
ACPI / CPPC: signedness bug in register_pcc_channel()
ACPI / PAD: power_saving_thread() is not freezable
ACPI / PM: Fix incorrect wakeup IRQ setting during suspend-to-idle
ACPI: Using correct irq when waiting for events
ACPI: Use correct IRQ when uninstalling ACPI interrupt handler
cpuidle: mvebu: disable the bind/unbind attributes and use builtin_platform_driver
cpuidle: mvebu: clean up multiple platform drivers
...
|