summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* dm raid: support metadata devicesJonathan Brassow2011-08-023-24/+412
| | | | | | | | | | | | | | | | | | | | | | Add the ability to parse and use metadata devices to dm-raid. Although not strictly required, without the metadata devices, many features of RAID are unavailable. They are used to store a superblock and bitmap. The role, or position in the array, of each device must be recorded in its superblock. This is to help with fault handling, array reshaping, and sanity checks. RAID 4/5/6 devices must be loaded in a specific order: in this way, the 'array_position' field helps validate the correctness of the mapping when it is loaded. It can be used during reshaping to identify which devices are added/removed. Fault handling is impossible without this field. For example, when a device fails it is recorded in the superblock. If this is a RAID1 device and the offending device is removed from the array, there must be a way during subsequent array assembly to determine that the failed device was the one removed. This is done by correlating the 'array_position' field and the bit-field variable 'failed_devices'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid: add write_mostly parameterJonathan Brassow2011-08-022-2/+28
| | | | | | | | | | Add the write_mostly parameter to RAID1 dm-raid tables. This allows the user to set the WriteMostly flag on a RAID1 device that should normally be avoided for read I/O. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid: add region_size parameterJonathan Brassow2011-08-022-3/+83
| | | | | | | | | | Allow the user to specify the region_size. Ensures that the supplied value meets md's constraints, viz. the number of regions does not exceed 2^21. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid: improve table parameters documentationJonathan Brassow2011-08-021-46/+78
| | | | | | | | Add more information about some dm-raid table parameters and clarify how parameters are printed when 'dmsetup table' is issued. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm ioctl: forbid multiple device specifiersMikulas Patocka2011-08-021-0/+6
| | | | | | | | | | Exactly one of name, uuid or device must be specified when referencing an existing device. This removes the ambiguity (risking the wrong device being updated) if two conflicting parameters were specified. Previously one parameter got used and any others were ignored silently. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm ioctl: introduce __get_dev_cellMikulas Patocka2011-08-021-17/+24
| | | | | | | | | Move logic to find device based on major/minor number to a separate function __get_dev_cell (similar to __get_uuid_cell and __get_name_cell). This makes the function __find_device_hash_cell more straightforward. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm ioctl: fill in device parameters in more ioctlsMikulas Patocka2011-08-022-29/+38
| | | | | | | | | | | | Move parameter filling from find_device to __find_device_hash_cell. This patch causes ioctls using __find_device_hash_cell (DM_DEV_REMOVE_CMD, DM_DEV_SUSPEND_CMD - resume, DM_TABLE_CLEAR_CMD) to return device parameters, bringing them into line with the other ioctls. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm flakey: add corrupt_bio_byte featureMike Snitzer2011-08-022-15/+159
| | | | | | | | | Add corrupt_bio_byte feature to simulate corruption by overwriting a byte at a specified position with a specified value during intervals when the device is "down". Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm flakey: add drop_writesMike Snitzer2011-08-022-14/+78
| | | | | | | | Add 'drop_writes' option to drop writes silently while the device is 'down'. Reads are not touched. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm flakey: support feature argsMike Snitzer2011-08-021-19/+64
| | | | | | | | | | | | Add the ability to specify arbitrary feature flags when creating a flakey target. This code uses the same target argument helpers that the multipath target does. Also remove the superfluous 'dm-flakey' prefixes from the error messages, as they already contain the prefix 'flakey'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm flakey: use dm_target_offset and support discardsMike Snitzer2011-08-021-1/+2
| | | | | | | Use dm_target_offset() and support discards. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: share target argument parsing functionsMike Snitzer2011-08-024-112/+147
| | | | | | | | Move multipath target argument parsing code into dm-table so other targets can share it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm snapshot: skip reading origin when overwriting complete chunkMikulas Patocka2011-08-023-3/+103
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we write a full chunk in the snapshot, skip reading the origin device because the whole chunk will be overwritten anyway. This patch changes the snapshot write logic when a full chunk is written. In this case: 1. allocate the exception 2. dispatch the bio (but don't report the bio completion to device mapper) 3. write the exception record 4. report bio completed Callbacks must be done through the kcopyd thread, because callbacks must not race with each other. So we create two new functions: dm_kcopyd_prepare_callback: allocate a job structure and prepare the callback. (This function must not be called from interrupt context.) dm_kcopyd_do_callback: submit callback. (This function may be called from interrupt context.) Performance test (on snapshots with 4k chunk size): without the patch: non-direct-io sequential write (dd): 17.7MB/s direct-io sequential write (dd): 20.9MB/s non-direct-io random write (mkfs.ext2): 0.44s with the patch: non-direct-io sequential write (dd): 26.5MB/s direct-io sequential write (dd): 33.2MB/s non-direct-io random write (mkfs.ext2): 0.27s Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: ignore merge_bvec for snapshots when safeMikulas Patocka2011-08-023-2/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate whether the device can accept bios larger than the size its merge function returns. When set, use this to send large bios to snapshots which can split them if necessary. Snapshot I/O may be significantly fragmented and this approach seems to improve peformance. Before the patch, dm_set_device_limits restricted bio size to page size if the underlying device had a merge function and the target didn't provide a merge function. After the patch, dm_set_device_limits restricts bio size to page size if the underlying device has a merge function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't provide a merge function. The snapshot target can't provide a merge function because when the merge function is called, it is impossible to determine where the bio will be remapped. Previously this led us to impose a 4k limit, which we can now remove if the snapshot store is located on a device without a merge function. Together with another patch for optimizing full chunk writes, it improves performance from 29MB/s to 40MB/s when writing to the filesystem on snapshot store. If the snapshot store is placed on a non-dm device with a merge function (such as md-raid), device mapper still limits all bios to page size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: clean dm_get_device and move exportsMike Snitzer2011-08-021-20/+13
| | | | | | | | There is no need for __table_get_device to be factored out. Also move the exports to the end of their respective functions. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid: tidy includesAlasdair G Kergon2011-08-021-1/+2
| | | | | | A dm target only needs to use include/linux dm headers. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm ioctl: prevent empty messageAlasdair G Kergon2011-08-021-0/+5
| | | | | | | Detect invalid empty messages in core dm instead of requiring every target to check this. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid: cleanup parameter handlingJonathan Brassow2011-08-021-19/+23
| | | | | | | | | | | | | | | | Re-order the parameters so they are handled consistently in the same order where defined, parsed and output. Only include rebuild parameters in the STATUSTYPE_TABLE output if they were supplied in the original table line. Correct the parameter count when outputting rebuild: there are two words, not one. Use case-independent checks for keywords (as in other device-mapper targets). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm snapshot: style cleanupsJonathan Brassow2011-08-022-10/+8
| | | | | | | Coding style cleanups. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
* dm snapshot: remove unused definitionsMikulas Patocka2011-08-021-10/+0
| | | | | | | Remove a couple of unused #defines. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm kcopyd: remove nr_pages field from job structureMikulas Patocka2011-08-021-4/+2
| | | | | | | | | The nr_pages field in struct kcopyd_job is only used temporarily in run_pages_job() to count the number of required pages. We can use a local variable instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm kcopyd: remove offset field from job structureMikulas Patocka2011-08-021-5/+2
| | | | | | | The offset field in struct kcopyd_job is always zero so remove it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: use vzallocJoe Perches2011-08-023-7/+3
| | | | | | | Use vzalloc() instead of vmalloc()+memset(). Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm log: userspace use list_moveKirill A. Shutemov2011-08-021-2/+1
| | | | | | | Replace list_del() followed by list_add() with list_move(). Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm log: clean up bit little endian bitopsAkinobu Mita2011-08-021-5/+4
| | | | | | | | | | Using __test_and_{set,clear}_bit_le() with ignoring its return value can be replaced with __{set,clear}_bit_le(). This also removes unnecessary casts. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: fix discard supportMike Snitzer2011-08-022-9/+9
| | | | | | | | | | | | | | | | | | | | Remove 'discards_supported' from the dm_table structure. The same information can be easily discovered from the table's target(s) in dm_table_supports_discards(). Before this fix dm_table_supports_discards() would skip checking the individual targets' 'discards_supported' flag if any one target in the table didn't set num_discard_requests > 0. Now the per-target 'discards_supported' flag is effective at insuring the final DM device advertises discard support. But, to be clear, targets that don't support discards (!num_discard_requests) will not receive discard requests. Also DMWARN if a target sets 'discards_supported' override but forgets to set 'num_discard_requests'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: suppress endian warningsAlasdair G Kergon2011-08-023-43/+54
| | | | | | | Suppress sparse warnings about cpu_to_le32() by using __le32 types for on-disk data etc. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: fix idr leak on module removalAlasdair G Kergon2011-08-021-2/+8
| | | | | | | Destroy _minor_idr when unloading the core dm module. (Found by kmemleak.) Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm io: flush cpu cache with vmapped ioMikulas Patocka2011-08-021-2/+27
| | | | | | | | | | | | | | | | | | | For normal kernel pages, CPU cache is synchronized by the dma layer. However, this is not done for pages allocated with vmalloc. If we do I/O to/from vmallocated pages, we must synchronize CPU cache explicitly. Prior to doing I/O on vmallocated page we must call flush_kernel_vmap_range to flush dirty cache on the virtual address. After finished read we must call invalidate_kernel_vmap_range to invalidate cache on the virtual address, so that accesses to the virtual address return newly read data and not stale data from CPU cache. This patch fixes metadata corruption on dm-snapshots on PA-RISC and possibly other architectures with caches indexed by virtual address. Cc: stable <stable@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: fix potential NULL pointer in feature arg processingMike Snitzer2011-08-021-0/+5
| | | | | | | | | Avoid dereferencing a NULL pointer if the number of feature arguments supplied is fewer than indicated. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
* dm snapshot: flush disk cache when mergingMikulas Patocka2011-08-021-1/+1
| | | | | | | | | | | | | This patch makes dm-snapshot flush disk cache when writing metadata for merging snapshot. Without cache flushing the disk may reorder metadata write and other data writes and there is a possibility of data corruption in case of power fault. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds2011-07-2812-1379/+3093
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://neil.brown.name/md: (75 commits) md/raid10: handle further errors during fix_read_error better. md/raid10: Handle read errors during recovery better. md/raid10: simplify read error handling during recovery. md/raid10: record bad blocks due to write errors during resync/recovery. md/raid10: attempt to fix read errors during resync/check md/raid10: Handle write errors by updating badblock log. md/raid10: clear bad-block record when write succeeds. md/raid10: avoid writing to known bad blocks on known bad drives. md/raid10 record bad blocks as needed during recovery. md/raid10: avoid reading known bad blocks during resync/recovery. md/raid10 - avoid reading from known bad blocks - part 3 md/raid10: avoid reading from known bad blocks - part 2 md/raid10: avoid reading from known bad blocks - part 1 md/raid10: Split handle_read_error out from raid10d. md/raid10: simplify/reindent some loops. md/raid5: Clear bad blocks on successful write. md/raid5. Don't write to known bad block on doubtful devices. md/raid5: write errors should be recorded as bad blocks if possible. md/raid5: use bad-block log to improve handling of uncorrectable read errors. md/raid5: avoid reading from known bad blocks. ...
| * md/raid10: handle further errors during fix_read_error better.NeilBrown2011-07-281-15/+44
| | | | | | | | | | | | | | If we find more read/write errors we should record a bad block before failing the device. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: Handle read errors during recovery better.NeilBrown2011-07-281-33/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when we get a read error during recovery, we simply abort the recovery. Instead, repeat the read in page-sized blocks. On successful reads, write to the target. On read errors, record a bad block on the destination, and only if that fails do we abort the recovery. As we now retry reads we need to know where we read from. This was in bi_sector but that can be changed during a read attempt. So store the correct from_addr and to_addr in the r10_bio for later access. Signed-off-by: NeilBrown<neilb@suse.de>
| * md/raid10: simplify read error handling during recovery.NeilBrown2011-07-281-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | If a read error is detected during recovery the code currently fails the read device. This isn't really necessary. recovery_request_write will signal a write error to end_sync_write and it will record a write error on the destination device which will record a bad block there or kick it from the array. So just remove this call to do md_error. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: record bad blocks due to write errors during resync/recovery.NeilBrown2011-07-281-10/+23
| | | | | | | | | | | | | | | | If we get a write error during resync/recovery don't fail the device but instead record a bad block. If that fails we can then fail the device. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: attempt to fix read errors during resync/checkNeilBrown2011-07-281-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already attempt to fix read errors found during normal IO and a 'repair' process. It is best to try to repair them at any time they are found, so move a test so that during sync and check a read error will be corrected by over-writing with good data. If both (all) devices have known bad blocks in the sync section we won't try to fix even though the bad blocks might not overlap. That should be considered later. Also if we hit a read error during recovery we don't try to fix it. It would only be possible to fix if there were at least three copies of data, which is not very common with RAID10. But it should still be considered later. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: Handle write errors by updating badblock log.NeilBrown2011-07-282-17/+117
| | | | | | | | | | | | | | | | | | | | When we get a write error (in the data area, not in metadata), update the badblock log rather than failing the whole device. As the write may well be many blocks, we trying writing each block individually and only log the ones which fail. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: clear bad-block record when write succeeds.NeilBrown2011-07-282-12/+100
| | | | | | | | | | | | | | | | | | | | If we succeed in writing to a block that was recorded as being bad, we clear the bad-block record. This requires some delayed handling as the bad-block-list update has to happen in process-context. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: avoid writing to known bad blocks on known bad drives.NeilBrown2011-07-281-12/+93
| | | | | | | | | | | | | | Writing to known bad blocks on drives that have seen a write error is asking for trouble. So try to avoid these blocks. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10 record bad blocks as needed during recovery.NeilBrown2011-07-282-13/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | When recovering one or more devices, if all the good devices have bad blocks we should record a bad block on the device being rebuilt. If this fails, we need to abort the recovery. To ensure we don't think that we aborted later than we actually did, we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync, in particular before mddev->curr_resync is updated. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: avoid reading known bad blocks during resync/recovery.NeilBrown2011-07-281-9/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | During resync/recovery limit the size of the request to avoid reading into a bad block that does not start at-or-before the current read address. Similarly if there is a bad block at this address, don't allow the current request to extend beyond the end of that bad block. Now that we don't ever read from known bad blocks, it is safe to allow devices with those blocks into the array. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10 - avoid reading from known bad blocks - part 3NeilBrown2011-07-281-1/+6
| | | | | | | | | | | | | | | | | | | | | | When attempting to repair a read error, don't read from devices with a known bad block. As we are only reading PAGE_SIZE blocks, we don't try to narrow down to smaller regions in the hope that only part of this page is bad - it isn't worth the effort. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: avoid reading from known bad blocks - part 2NeilBrown2011-07-281-5/+40
| | | | | | | | | | | | | | | | | | When redirecting a read error to a different device, we must again avoid bad blocks and possibly split the request. Spin_lock typo fixed thanks to Dan Carpenter <error27@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: avoid reading from known bad blocks - part 1NeilBrown2011-07-282-16/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch just covers the basic read path: 1/ read_balance needs to check for badblocks, and return not only the chosen slot, but also how many good blocks are available there. 2/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' On read error we currently just fail the request if another target cannot handle the whole request. Next patch refines that a bit. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: Split handle_read_error out from raid10d.NeilBrown2011-07-281-57/+66
| | | | | | | | | | | | | | raid10d() is too big and is about to get bigger, so split handle_read_error() out as a separate function. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: simplify/reindent some loops.NeilBrown2011-07-281-62/+65
| | | | | | | | | | | | | | | | When a loop ends with a large if, it can be neater to change the if to invert the condition and just 'continue'. Then the body of the if can be indented to a lower level. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5: Clear bad blocks on successful write.NeilBrown2011-07-282-1/+19
| | | | | | | | | | | | | | On a successful write to a known bad block, flag the sh so that raid5d can remove the known bad block from the list. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5. Don't write to known bad block on doubtful devices.NeilBrown2011-07-281-1/+30
| | | | | | | | | | | | | | If a device has seen write errors, don't write to any known bad blocks on that device. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5: write errors should be recorded as bad blocks if possible.NeilBrown2011-07-282-10/+41
| | | | | | | | | | | | | | | | | | | | When a write error is detected, don't mark the device as failed immediately but rather record the fact for handle_stripe to deal with. Handle_stripe then attempts to record a bad block. Only if that fails does the device get marked as faulty. Signed-off-by: NeilBrown <neilb@suse.de>