linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	md/raid1: prevent merging too large request	Shaohua Li	2012-07-31	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For SSD, if request size exceeds specific value (optimal io size), request size isn't important for bandwidth. In such condition, if making request size bigger will cause some disks idle, the total throughput will actually drop. A good example is doing a readahead in a two-disk raid1 setup. So when should we split big requests? We absolutly don't want to split big request to very small requests. Even in SSD, big request transfer is more efficient. This patch only considers request with size above optimal io size. If all disks are busy, is it worth doing a split? Say optimal io size is 16k, two requests 32k and two disks. We can let each disk run one 32k request, or split the requests to 4 16k requests and each disk runs two. It's hard to say which case is better, depending on hardware. So only consider case where there are idle disks. For readahead, split is always better in this case. And in my test, below patch can improve > 30% thoughput. Hmm, not 100%, because disk isn't 100% busy. Such case can happen not just in readahead, for example, in directio. But I suppose directio usually will have bigger IO depth and make all disks busy, so I ignored it. Note: if the raid uses any hard disk, we don't prevent merging. That will make performace worse. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid1: make sequential read detection per disk based	Shaohua Li	2012-07-31	1	-6/+5
\| \| \| \| \| \| \| \| \| \|	Currently the sequential read detection is global wide. It's natural to make it per disk based, which can improve the detection for concurrent multiple sequential reads. And next patch will make SSD read balance not use distance based algorithm, where this change help detect truly sequential read for SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	MD: Move macros from raid1.h to raid1.c	Jonathan Brassow	2012-07-31	1	-14/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MD RAID1/RAID10: Move some macros from .h file to .c file There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in there respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other. The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	MD RAID1: rename mirror_info structure	Jonathan Brassow	2012-07-31	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	MD RAID1: Rename the structure 'mirror_info' to 'raid1_info' The same structure name ('mirror_info') is used by raid10. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. While only one of these structure names needs to change, this patch adds consistency to the naming of the structure. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid1: Allocate spare to store replacement devices and their bios.	NeilBrown	2011-12-23	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In RAID1, a replacement is much like a normal device, so we just double the size of the relevant arrays and look at all possible devices for reads and writes. This means that the array looks like it is now double the size in some way - we need to be careful about that. In particular, we checking if the array is still degraded while creating a recovery request we need to only consider the first 'half' - i.e. the real (non-replacement) devices. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: add proper write-congestion reporting to RAID1 and RAID10.	NeilBrown	2011-10-11	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	RAID1 and RAID10 handle write requests by queuing them for handling by a separate thread. This is because when a write-intent-bitmap is active we might need to update the bitmap first, so it is good to queue a lot of writes, then do one big bitmap update for them all. However writeback request devices to appear to be congested after a while so it can make some guesstimate of throughput. The infinite queue defeats that (note that RAID5 has already has a finite queue so it doesn't suffer from this problem). So impose a limit on the number of pending write requests. By default it is 1024 which seems to be generally suitable. Make it configurable via module option just in case someone finds a regression. Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid1: typedef removal: conf_t -> struct r1conf	NeilBrown	2011-10-11	1	-3/+1
\| \| \| \|	Signed-off-by: NeilBrown <neilb@suse.de>
*	md: remove typedefs: mirror_info_t -> struct mirror_info	NeilBrown	2011-10-11	1	-3/+1
\| \| \| \|	Signed-off-by: NeilBrown <neilb@suse.de>
*	md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio	NeilBrown	2011-10-11	1	-9/+6
\| \| \| \|	Signed-off-by: NeilBrown <neilb@suse.de>
*	md: remove typedefs: mdk_thread_t -> struct md_thread	NeilBrown	2011-10-11	1	-1/+1
\| \| \| \|	Signed-off-by: NeilBrown <neilb@suse.de>
*	md: remove typedefs: mddev_t -> struct mddev	NeilBrown	2011-10-11	1	-4/+4
\| \| \| \| \| \|	Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>
*	md: removing typedefs: mdk_rdev_t -> struct md_rdev	NeilBrown	2011-10-11	1	-1/+1
\| \| \| \| \| \| \|	The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid1: add documentation to r1_private_data_s data structure.	NeilBrown	2011-10-07	1	-17/+42
\| \| \| \| \| \| \| \|	There wasn't much and it is inconsistent. Also rearrange fields to keep related fields together. Reported-by: Aapo Laine <aapo.laine@shiftmail.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid1: Handle write errors by updating badblock log.	NeilBrown	2011-07-28	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	When we get a write error (in the data area, not in metadata), update the badblock log rather than failing the whole device. As the write may well be many blocks, we trying writing each block individually and only log the ones which fail. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
*	md/raid1: store behind-write pages in bi_vecs.	NeilBrown	2011-07-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	When performing write-behind we allocate pages to store the data during write. Previously we just keep a list of pages. Now we keep a list of bi_vec which includes offset and size. This means that the r1bio has complete information to create a new bio which will be needed for retrying after write errors. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
*	md/raid1: clear bad-block record when write succeeds.	NeilBrown	2011-07-28	1	-1/+12
\| \| \| \| \| \| \| \| \| \| \|	If we succeed in writing to a block that was recorded as being bad, we clear the bad-block record. This requires some delayed handling as the bad-block-list update has to happen in process-context. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
*	md/raid1: avoid reading from known bad blocks.	NeilBrown	2011-07-28	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have a bad block list, we should not read from those blocks. There are several main parts to this: 1/ read_balance needs to check for bad blocks, and return not only the chosen device, but also how many good blocks are available there. 2/ fix_read_error needs to avoid trying to read from bad blocks. 3/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' 4/ retrying a read needs to also be ready to submit a smaller read and queue another request for the rest. This does not yet handle bad blocks when reading to perform resync, recovery, or check. 'md_trim_bio' will also be used for RAID10, so put it in md.c and export it. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: change managed of recovery_disabled.	NeilBrown	2011-07-27	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we hit a read error while recovering a mirror, we want to abort the recovery without necessarily failing the disk - as having a disk this a read error is better than not having an array at all. Currently this is managed with a per-array flag "recovery_disabled" and is only implemented for RAID1. For RAID10 we will need finer grained control as we might want to disable recovery for individual devices separately. So push more of the decision making into the personality. 'recovery_disabled' is now a 'cookie' which is copied when the personality want to disable recovery and is changed when a device is added to the array as this is used as a trigger to 'try recovery again'. This will allow RAID10 to get the control that it needs. Signed-off-by: NeilBrown <neilb@suse.de>
*	MD: raid1 changes to allow use by device mapper	Jonathan Brassow	2011-06-08	1	-0/+2
\| \| \| \| \| \| \| \| \| \|	MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c) Added the necessary congestion function and conditionalize calls requiring an array 'queue' or 'gendisk'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid1: improve handling of pages allocated for write-behind.	NeilBrown	2011-05-11	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \|	The current handling and freeing of these pages is a bit fragile. We only keep the list of allocated pages in each bio, so we need to still have a valid bio when freeing the pages, which is a bit clumsy. So simply store the allocated page list in the r1_bio so it can easily be found and freed when we are finished with the r1_bio. Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid1: discard unused variable.	NeilBrown	2010-10-29	1	-2/+0
\| \| \| \| \| \|	This structure field (flushing_bio_list) is never used, so remove it. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: implment REQ_FLUSH/FUA support	Tejun Heo	2010-09-10	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch converts md to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. In the core part (md.c), the following changes are notable. * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with processing of other requests and thus there is no reason to mark the queue congested while FLUSH/FUA is in progress. * REQ_FLUSH/FUA failures are final and its users don't need retry logic. Retry logic is removed. * Preflush needs to be issued to all member devices but FUA writes can be handled the same way as other writes - their processing can be deferred to request_queue of member devices. md_barrier_request() is renamed to md_flush_request() and simplified accordingly. For linear, raid0 and multipath, the core changes are enough. raid1, 5 and 10 need the following conversions. * raid1: Handling of FLUSH/FUA bio's can simply be deferred to request_queues of member devices. Barrier related logic removed. * raid5: Queue draining logic dropped. FUA bit is propagated through biodrain and stripe resconstruction such that all the updated parts of the stripe are written out with FUA writes if any of the dirtying writes was FUA. preread_active_stripes handling in make_request() is updated as suggested by Neil Brown. * raid10: FUA bit needs to be propagated to write clones. linear, raid0, 1, 5 and 10 tested. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
*	md/raid1: add takeover support for raid5->raid1	NeilBrown	2009-12-14	1	-0/+5
\| \| \| \| \| \|	A 2-device raid5 array can now be converted to raid1. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: remove mddev_to_conf "helper" macro	NeilBrown	2009-06-16	1	-6/+0
\| \| \| \| \| \| \| \| \| \|	Having a macro just to cast a void* isn't really helpful. I would must rather see that we are simply de-referencing ->private, than have to know what the macro does. So open code the macro everywhere and remove the pointless cast. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: move lots of #include lines out of .h files and into .c	NeilBrown	2009-03-31	1	-2/+0
\| \| \| \| \| \| \| \| \| \|	This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h Remove include/raid/md.h as its only remaining use was to #include other files. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: move headers out of include/linux/raid/	Christoph Hellwig	2009-03-31	1	-0/+134
	Move the headers with the local structures for the disciplines and bitmap.h into drivers/md/ so that they are more easily grepable for hacking and not far away. md.h is left where it is for now as there are some uses from the outside. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>