linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	md/bitmap: It is OK to clear bits during recovery.	NeilBrown	2011-12-22	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	commit d0a4bb492772ce5c4bdfba3744a99ed6f6fb238f introduced a regression which is annoying but fairly harmless. When writing to an array that is undergoing recovery (a spare in being integrated into the array), writing to the array will set bits in the bitmap, but they will not be cleared when the write completes. For bits covering areas that have not been recovered yet this is not a problem as the recovery will clear the bits. However bits set in already-recovered region will stay set and never be cleared. This doesn't risk data integrity. The only negatives are: - next time there is a crash, more resyncing than necessary will be done. - the bitmap doesn't look clean, which is confusing. While an array is recovering we don't want to update the 'events_cleared' setting in the bitmap but we do still want to clear bits that have very recently been set - providing they were written to the recovering device. So split those two needs - which previously both depended on 'success' and always clear the bit of the write went to all devices. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: don't give up looking for spares on first failure-to-add	NeilBrown	2011-12-22	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before performing a recovery we try to remove any spares that might not be working, then add any that might have become relevant. Currently we abort on the first spare that cannot be added. This is a false optimisation. It is conceivable that - depending on rules in the personality - a subsequent spare might be accepted. Also the loop does other things like count the available spares and reset the 'recovery_offset' value. If we abort early these might not happen properly. So remove the early abort. In particular if you have an array what is undergoing recovery and which has extra spares, then the recovery may not restart after as reboot as the could of 'spares' might end up as zero. Reported-by: Anssi Hannula <anssi.hannula@iki.fi> Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid5: ensure correct assessment of drives during degraded reshape.	NeilBrown	2011-12-22	1	-4/+10
\| \| \| \| \| \| \| \| \| \| \|	While reshaping a degraded array (as when reshaping a RAID0 by first converting it to a degraded RAID4) we currently get confused about which devices are in_sync. In most cases we get it right, but in the region that is being reshaped we need to treat non-failed devices as in-sync when we have the data but haven't actually written it out yet. Reported-by: Adam Kwolek <adam.kwolek@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md/linear: fix hot-add of devices to linear arrays.	NeilBrown	2011-12-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	commit d70ed2e4fafdbef0800e73942482bb075c21578b broke hot-add to a linear array. After that commit, metadata if not written to devices until they have been fully integrated into the array as determined by saved_raid_disk. That patch arranged to clear that field after a recovery completed. However for linear arrays, there is no recovery - the integration is instantaneous. So we need to explicitly clear the saved_raid_disk field. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid5 crash during degradation	Adam Kwolek	2011-12-09	1	-2/+2
\| \| \| \| \| \| \|	NULL pointer access causes crash in raid5 module. Signed-off-by: Adam Kwolek <adam.kwolek@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid5: never wait for bad-block acks on failed device.	NeilBrown	2011-12-08	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	Once a device is failed we really want to completely ignore it. It should go away soon anyway. In particular the presence of bad blocks on it should not cause us to block as we won't be trying to write there anyway. So as soon as we can check if a device is Faulty, do so and pretend that it is already gone if it is Faulty. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: ensure new badblocks are handled promptly.	NeilBrown	2011-12-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	When we mark blocks as bad we need them to be acknowledged by the metadata handler promptly. For an in-kernel metadata handler that was already being done. But for an external metadata handler we need to alert it of the change by sending a notification through the sysfs file. This adds that notification. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: bad blocks shouldn't cause a Blocked status on a Faulty device.	NeilBrown	2011-12-08	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Once a device is marked Faulty the badblocks - whether acknowledged or not - become irrelevant. So they shouldn't cause the device to be marked as Blocked. Without this patch, a process might write "-blocked" to clear the Blocked status, but while that will correctly fail the device, it won't remove the apparent 'blocked' status. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: take a reference to mddev during sysfs access.	NeilBrown	2011-12-08	1	-1/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we are accessing an mddev via sysfs we know that the mddev cannot disappear because it has an embedded kobj which is refcounted by sysfs. And we also take the mddev_lock. However this is not enough. The final mddev_put could have been called and the mddev_delayed_delete is waiting for sysfs to let go so it can destroy the kobj and mddev. In this state there are a lot of changes that should not be attempted. To to guard against this we: - initialise mddev->all_mddevs in on last put so the state can be easily detected. - in md_attr_show and md_attr_store, check ->all_mddevs under all_mddevs_lock and mddev_get the mddev if it still appears to be active. This means that if we get to sysfs as the mddev is being deleted we will get -EBUSY. rdev_attr_store and rdev_attr_show are similar but already have sufficient protection. They check that rdev->mddev still points to mddev after taking mddev_lock. As this is cleared before delayed removal which can only be requested under the mddev_lock, this ensure the rdev and mddev are still alive. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: refine interpretation of "hold_active == UNTIL_IOCTL".	NeilBrown	2011-12-08	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We like md devices to disappear when they really are not needed. However it is not possible to tell from the current state whether it is needed or not. We can only tell from recent history of changes. In particular immediately after we create an md device it looks very similar to immediately after we have finished with it. So we always preserve a newly created md device until something significant happens. This state is stored in 'hold_active'. The normal case is to keep it until an ioctl happens, as that will normally either activate it, or explicitly de-activate it. If it doesn't then it was probably created by mistake and it is now time to get rid of it. We can also modify an array via sysfs (instead of via ioctl) and we currently treat any change via sysfs like an ioctl as a sign that if it now isn't more active, it should be destroyed. However this is not appropriate as changes made via sysfs are more gradual so we should look for a more definitive change. So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when the array_state is changed via sysfs. Other changes via sysfs are ignored. Signed-off-by: NeilBrown <neilb@suse.de>
*	md/lock: ensure updates to page_attrs are properly locked.	NeilBrown	2011-11-23	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Page attributes are set using __set_bit rather than set_bit as it normally called under a spinlock so the extra atomicity is not needed. However there are two places where we might set or clear page attributes without holding the spinlock. So add the spinlock in those cases. This might be the cause of occasional reports that bits a aren't getting clear properly - theory is that BITMAP_PAGE_PENDING gets lost when BITMAP_PAGE_NEEDWRITE is set or cleared. This is an inconvenience, not a threat to data safety. Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid5: STRIPE_ACTIVE has lock semantics, add barriers	Dan Williams	2011-11-08	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	All updates that occur under STRIPE_ACTIVE should be globally visible when STRIPE_ACTIVE clears. test_and_set_bit() implies a barrier, but clear_bit() does not. This is suitable for 3.1-stable. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
*	md/raid5: abort any pending parity operations when array fails.	NeilBrown	2011-11-08	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the number of failed devices exceeds the allowed number we must abort any active parity operations (checks or updates) as they are no longer meaningful, and can lead to a BUG_ON in handle_parity_checks6. This bug was introduce by commit 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 in 2.6.29. Reported-by: Manish Katiyar <mkatiyar@gmail.com> Tested-by: Manish Katiyar <mkatiyar@gmail.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
*	device-mapper: using EXPORT_SYBOL in dm-space-map-checker.c needs export.h	Stephen Rothwell	2011-11-07	1	-0/+1
\| \| \| \| \| \| \|	Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	device-mapper: dm-bufio.c needs to include module.h	Stephen Rothwell	2011-11-07	1	-0/+1
\| \| \| \| \| \| \| \| \|	since it uses the module facilities. Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	drivers/md: change module.h -> export.h in persistent-data/dm-*	Paul Gortmaker	2011-11-07	4	-4/+4
\| \| \| \| \| \| \| \| \| \|	For the files which are not themselves modular, we can change them to include only the smaller export.h since all they are doing is looking for EXPORT_SYMBOL. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge branch 'modsplit-Oct31_2011' of ↵	Linus Torvalds	2011-11-07	17	-0/+17
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
\| *	md: Add in export.h for files using EXPORT_SYMBOL	Paul Gortmaker	2011-11-01	3	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These files were getting the defines for EXPORT_SYMBOL because device.h was including module.h. But we are going to put an end to that. So add the proper export.h include now. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
\| *	md: Add module.h to all files using it implicitly	Paul Gortmaker	2011-11-01	14	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A pending cleanup will mean that module.h won't be implicitly everywhere anymore. Make sure the modular drivers in md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* \|	Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block	Linus Torvalds	2011-11-05	10	-83/+53
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c
\| * \|	block: Remove the control of complete cpu from bio.	Tao Ma	2011-10-24	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bio originally has the functionality to set the complete cpu, but it is broken. Chirstoph said that "This code is unused, and from the all the discussions lately pretty obviously broken. The only thing keeping it serves is creating more confusion and possibly more bugs." And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine with leaving cpu control to the request based drivers, they are the only ones that can toggle the setting anyway". So this patch tries to remove all the work of controling complete cpu from a bio. Cc: Shaohua Li <shaohua.li@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
\| * \|	Merge branch 'v3.1-rc10' into for-3.2/core	Jens Axboe	2011-10-19	11	-58/+116
\| \|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Conflicts: block/blk-core.c include/linux/blkdev.h Signed-off-by: Jens Axboe <axboe@kernel.dk>
\| * \| \|	block: remove support for bio remapping from ->make_request	Christoph Hellwig	2011-09-12	10	-71/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
\| * \| \|	block: rename __make_request() to blk_queue_bio()	Jens Axboe	2011-09-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that it's exported, lets put it in a more sane namespace. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
\| * \| \|	block: export __make_request	Christoph Hellwig	2011-09-12	1	-12/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Avoid the hacks need for request based device mappers currently by simply exporting the symbol instead of trying to get it through the back door. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* \| \| \|	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm	Linus Torvalds	2011-11-03	35	-21/+11653
\|\ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm: dm: raid fix device status indicator when array initializing dm log userspace: add log device dependency dm log userspace: fix comment hyphens dm: add thin provisioning target dm: add persistent data library dm: add bufio dm: export dm get md dm table: add immutable feature dm table: add always writeable feature dm table: add singleton feature dm kcopyd: add dm_kcopyd_zero to zero an area dm: remove superfluous smp_mb dm: use local printk ratelimit dm table: propagate non rotational flag
\| * \| \| \|	dm: raid fix device status indicator when array initializing	Jonathan E Brassow	2011-10-31	1	-11/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When devices in a RAID array are not in-sync, they are supposed to be reported as such in the status output as an 'a' character, which means "alive, but not in-sync". But when the entire array is rebuilt 'A' is being used, which is incorrect. This patch corrects this to 'a'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm log userspace: add log device dependency	Jonathan E Brassow	2011-10-31	1	-3/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allow userspace dm log implementations to register their log device so it is no longer missing from the list of device dependencies. When device mapper targets use a device they normally call dm_get_device which includes it in the device list returned to userspace applications such as LVM through the DM_TABLE_DEPS ioctl. Userspace log devices don't use dm_get_device as userspace opens them so they are missing from the list of dependencies. This patch extends the DM_ULOG_CTR operation to allow userspace to respond with the name of the log device (if appropriate) to be registered via 'dm_get_device'. DM_ULOG_REQUEST_VERSION is incremented. This is backwards compatible. If the kernel and userspace log server have both been updated, the new information will be passed down to the kernel and the device will be registered. If the kernel is new, but the log server is old, the log server will not pass down any device information and the kernel will simply bypass the device registration as before. If the kernel is old but the log server is new, the log server will see the old version number and not pass the device info. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm log userspace: fix comment hyphens	Jonathan Brassow	2011-10-31	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix comments: clustered-disk needs a hyphen not an underscore. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm: add thin provisioning target	Joe Thornber	2011-10-31	5	-0/+4006
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Initial EXPERIMENTAL implementation of device-mapper thin provisioning with snapshot support. The 'thin' target is used to create instances of the virtual devices that are hosted in the 'thin-pool' target. The thin-pool target provides data sharing among devices. This sharing is made possible using the persistent-data library in the previous patch. The main highlight of this implementation, compared to the previous implementation of snapshots, is that it allows many virtual devices to be stored on the same data volume, simplifying administration and allowing sharing of data between volumes (thus reducing disk usage). Another big feature is support for arbitrary depth of recursive snapshots (snapshots of snapshots of snapshots ...). The previous implementation of snapshots did this by chaining together lookup tables, and so performance was O(depth). This new implementation uses a single data structure so we don't get this degradation with depth. For further information and examples of how to use this, please read Documentation/device-mapper/thin-provisioning.txt Signed-off-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm: add persistent data library	Joe Thornber	2011-10-31	21	-0/+5625
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The persistent-data library offers a re-usable framework for the storage and management of on-disk metadata in device-mapper targets. It's used by the thin-provisioning target in the next patch and in an upcoming hierarchical storage target. For further information, please read Documentation/device-mapper/persistent-data.txt Signed-off-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm: add bufio	Mikulas Patocka	2011-10-31	4	-0/+1820
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The dm-bufio interface allows you to do cached I/O on devices, holding recently-read blocks in memory and performing delayed writes. We don't use buffer cache or page cache already present in the kernel, because: * we need to handle block sizes larger than a page * we can't allocate memory to perform reads or we'd have deadlocks Currently, when a cache is required, we limit its size to a fraction of available memory. Usage can be viewed and changed in /sys/module/dm_bufio/parameters/ . The first user is thin provisioning, but more dm users are planned. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm: export dm get md	Alasdair G Kergon	2011-10-31	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Export dm_get_md() for the new thin provisioning target to use. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm table: add immutable feature	Alasdair G Kergon	2011-10-31	4	-0/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed with any other target type, and once loaded into a device, it cannot be replaced with a table containing a different type. The thin provisioning pool device will use this. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm table: add always writeable feature	Alasdair G Kergon	2011-10-31	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a target feature flag DM_TARGET_ALWAYS_WRITEABLE to indicate that a target does not support read-only mode. The initial implementation of the thin provisioning target uses this. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm table: add singleton feature	Alasdair G Kergon	2011-10-31	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce the concept of a singleton table which contains exactly one target. If a target type sets the DM_TARGET_SINGLETON feature bit device-mapper will ensure that any table that includes that target contains no others. The thin provisioning pool target uses this. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm kcopyd: add dm_kcopyd_zero to zero an area	Mikulas Patocka	2011-10-31	1	-5/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch introduces dm_kcopyd_zero() to make it easy to use kcopyd to write zeros into the requested areas instead instead of copying. It is implemented by passing a NULL copying source to dm_kcopyd_copy(). The forthcoming thin provisioning target uses this. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm: remove superfluous smp_mb	Namhyung Kim	2011-10-31	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since set_current_state() contains a memory barrier in it, an additional barrier isn't needed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm: use local printk ratelimit	Namhyung Kim	2011-10-31	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	printk_ratelimit() shares global ratelimiting state with all other subsystems, so its usage is discouraged. Instead, define and use dm's local state. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
\| * \| \| \|	dm table: propagate non rotational flag	Mandeep Singh Baines	2011-10-31	1	-0/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allow QUEUE_FLAG_NONROT to propagate up the device stack if all underlying devices are non-rotational. Tools like ureadahead will schedule IOs differently based on the rotational flag. With this patch, I see boot time go from 7.75 s to 7.46 s on my device. Suggested-by: J. Richard Barnette <jrbarnette@chromium.org> Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: dm-devel@redhat.com Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* \| \| \| \|	Merge branch 'for-linus' of git://neil.brown.name/md	Linus Torvalds	2011-10-31	1	-1/+1
\|\ \ \ \ \ \| \|_\|_\|_\|/ \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \|	* 'for-linus' of git://neil.brown.name/md: md/raid10: Fix bug when activating a hot-spare.
\| * \| \| \|	md/raid10: Fix bug when activating a hot-spare.	NeilBrown	2011-10-31	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a fairly serious bug in RAID10. When a RAID10 array is degraded and a hot-spare is activated, the spare does not take up the empty slot, but rather replaces the first working device. This is likely to make the array non-functional. It would normally be possible to recover the data, but that would need care and is not guaranteed. This bug was introduced in commit 2bb77736ae5dca0a189829fbb7379d43364a9dac which first appeared in 3.1. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
* \| \| \| \|	Merge branch 'for-linus' of git://neil.brown.name/md	Linus Torvalds	2011-10-26	18	-1272/+1216
\|\\| \| \| \| \| \|/ / / \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* 'for-linus' of git://neil.brown.name/md: (34 commits) md: Fix some bugs in recovery_disabled handling. md/raid5: fix bug that could result in reads from a failed device. lib/raid6: Fix filename emitted in generated code md.c: trivial comment fix MD: Allow restarting an interrupted incremental recovery. md: clear In_sync bit on devices added to an active array. md: add proper write-congestion reporting to RAID1 and RAID10. md: rename "mdk_personality" to "md_personality" md/bitmap remove fault injection options. md/raid5: typedef removal: raid5_conf_t -> struct r5conf md/raid1: typedef removal: conf_t -> struct r1conf md/raid10: typedef removal: conf_t -> struct r10conf md/raid0: typedef removal: raid0_conf_t -> struct r0conf md/multipath: typedef removal: multipath_conf_t -> struct mpconf md/linear: typedef removal: linear_conf_t -> struct linear_conf md/faulty: remove typedef: conf_t -> struct faulty_conf md/linear: remove typedefs: dev_info_t -> struct dev_info md: remove typedefs: mirror_info_t -> struct mirror_info md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio md: remove typedefs: mdk_thread_t -> struct md_thread ...
\| * \| \|	md: Fix some bugs in recovery_disabled handling.	NeilBrown	2011-10-26	3	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In 3.0 we changed the way recovery_disabled was handle so that instead of testing against zero, we test an mddev-> value against a conf-> value. Two problems: 1/ one place in raid1 was missed and still sets to '1'. 2/ We didn't explicitly set the conf-> value at array creation time. It defaulted to '0' just like the mddev value does so they could appear equal and thus disable recovery. This did not affect normal 'md' as it calls bind_rdev_to_array which changes the mddev value. However the dmraid interface doesn't call this and so doesn't change ->recovery_disabled; so at array start all recovery is incorrectly disabled. So initialise the 'conf' value to one less that the mddev value, so the will only be the same when explicitly set that way. Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| * \| \|	md/raid5: fix bug that could result in reads from a failed device.	NeilBrown	2011-10-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This bug was introduced in 415e72d034c50520ddb7ff79e7d1792c1306f0c9 which was in 2.6.36. There is a small window of time between when a device fails and when it is removed from the array. During this time we might still read from it, but we won't write to it - so it is possible that we could read stale data. We didn't need the test of 'Faulty' before because the test on In_sync is sufficient. Since we started allowing reads from the early part of non-In_sync devices we need a test on Faulty too. This is suitable for any kernel from 2.6.36 onwards, though the patch might need a bit of tweaking in 3.0 and earlier. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
\| * \| \|	md.c: trivial comment fix	Chris Dunlop	2011-10-19	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Trivial comment fix Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: NeilBrown <neilb@suse.de>
\| * \| \|	MD: Allow restarting an interrupted incremental recovery.	Andrei Warkentin	2011-10-18	1	-7/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If an incremental recovery was interrupted, a subsequent re-add will result in a full recovery, even though an incremental should be possible (seen with raid1). Solve this problem by not updating the superblock on the recovering device until array is not degraded any longer. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| * \| \|	md: clear In_sync bit on devices added to an active array.	NeilBrown	2011-10-18	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we add a device to an active array it can be meaningful to set the 'insync' flag. This indicates that the device is in-sync with the array except for locations recorded in the bitmap. A bitmap-based recovery can then bring it completely in-sync. Internally we move that flag to 'saved_raid_disk' but forgot to clear In_sync like we do in add_new_disk. So clear In_sync after moving its value to saved_raid_disk. Reported-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| * \| \|	md: add proper write-congestion reporting to RAID1 and RAID10.	NeilBrown	2011-10-11	4	-1/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	RAID1 and RAID10 handle write requests by queuing them for handling by a separate thread. This is because when a write-intent-bitmap is active we might need to update the bitmap first, so it is good to queue a lot of writes, then do one big bitmap update for them all. However writeback request devices to appear to be congested after a while so it can make some guesstimate of throughput. The infinite queue defeats that (note that RAID5 has already has a finite queue so it doesn't suffer from this problem). So impose a limit on the number of pending write requests. By default it is 1024 which seems to be generally suitable. Make it configurable via module option just in case someone finds a regression. Signed-off-by: NeilBrown <neilb@suse.de>
\| * \| \|	md: rename "mdk_personality" to "md_personality"	NeilBrown	2011-10-11	9	-22/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>