author | Boaz Harrosh <bharrosh@panasas.com> | 2010-02-01 12:35:51 +0100 |
---|---|---|
committer | Boaz Harrosh <bharrosh@panasas.com> | 2010-02-28 12:43:08 +0100 |
commit | 5d952b8391692553c31e620a92d6e09262a9a307 (patch) | |
tree | b3a1a0490fc98b6304685d64bb4774235ec94a2d /fs/exofs/super.c | |
parent | exofs: Define on-disk per-inode optional layout attribute (diff) | |
exofs: RAID0 support
We now support striping over mirrored devices, including a
variable-sized stripe_unit.
Some limits:
* stripe_unit must be a multiple of PAGE_SIZE
* stripe_unit * stripe_count must be less than 2^32 bytes (4 GiB)
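The two limits above can be checked as in the following userspace sketch. It is only an illustration: `SKETCH_PAGE_SIZE` (assumed to be 4096) and `stripe_limits_ok()` are hypothetical names that do not exist in exofs, and rejecting a zero stripe_unit is an extra assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel would use the real PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical helper: returns true when stripe_unit and stripe_count
 * satisfy the two limits listed in the commit message.
 */
static bool stripe_limits_ok(uint64_t stripe_unit, uint32_t stripe_count)
{
	/* Precondition of this sketch: at least one stripe unit per stripe. */
	assert(stripe_count > 0);

	/* stripe_unit must be a multiple of the page size (zero rejected
	 * here as an extra assumption of the sketch). */
	if (stripe_unit == 0 || (stripe_unit % SKETCH_PAGE_SIZE) != 0)
		return false;

	/* stripe_unit * stripe_count < 2^32, written without overflow. */
	return stripe_unit <= (uint64_t)(UINT32_MAX / stripe_count);
}
```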
Tested RAID0 over mirrors, RAID0 only, mirrors only. All check.
Design notes:
* I'm not using a vectored raid-engine mechanism yet. Following the
pnfs-objects-layout data-map structure, "Mirror" is just a special
case of "group_width" == 1, and RAID0 is a special case of
"Mirrors" == 1. The performance loss of the general case compared with
a special-case optimization is negligible, also considering the
extra code size that special-casing would add.
* In general I added a prepare_stripes() stage that divides the
to-be-io pages among the participating devices. The previous
exofs_ios_write/read now becomes _write/read_mirrors, and a new
write/read upper layer loops over all devices calling
_write/read_mirrors. Effectively, the prepare_stripes stage is the
whole secret.
Truncate also needed fixing to accommodate striping.
* In a RAID0 arrangement, in a regular usage scenario, if all inode
layouts start at the same device, small files fill up the
first device and the later devices stay empty; the farther the device,
the emptier it is.
To fix that, each inode starts at a different stripe_unit,
according to its obj_id modulo the number of stripe units, and
then spans all stripe units in the same incrementing order,
wrapping back to the beginning of the device table. We call it
a stripe-units moving window.
Special consideration was taken to keep all devices in a mirror
arrangement identical, so a broken osd-device can just be cloned
from one of the mirrors and no FS scrubbing is needed. (We do that
by rotating a stripe unit at a time and not a single device at a time.)
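The offset mapping and the moving window described in the design notes can be sketched in userspace C as below. This is an illustration under stated assumptions, not the code added by this patch: `struct sketch_layout`, `struct sketch_target`, `sketch_first_unit()` and `sketch_map_offset()` are hypothetical names, the starting unit is assumed to be `obj_id % group_width`, and the device table is assumed to keep the mirrors of each stripe unit in consecutive slots.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical copy of the three layout fields the patch sets up. */
struct sketch_layout {
	uint32_t stripe_unit;	/* bytes per stripe unit */
	uint32_t mirrors_p1;	/* mirror count plus one */
	uint32_t group_width;	/* stripe units per stripe */
};

/* Result of mapping one file offset. */
struct sketch_target {
	uint32_t first_dev;	/* device-table slot of the first mirror */
	uint64_t obj_offset;	/* byte offset inside the per-device object */
};

/* Assumed rule: each inode starts at stripe unit obj_id % group_width. */
static uint32_t sketch_first_unit(const struct sketch_layout *l,
				  uint64_t obj_id)
{
	assert(l->group_width > 0);
	return (uint32_t)(obj_id % l->group_width);
}

static struct sketch_target sketch_map_offset(const struct sketch_layout *l,
					      uint64_t obj_id,
					      uint64_t file_offset)
{
	struct sketch_target t;
	uint64_t stripe_length, stripe_no, in_stripe;
	uint32_t unit;

	assert(l->stripe_unit > 0 && l->mirrors_p1 > 0 && l->group_width > 0);

	stripe_length = (uint64_t)l->stripe_unit * l->group_width;
	stripe_no = file_offset / stripe_length;
	in_stripe = file_offset % stripe_length;

	/* Plain RAID0 unit index, shifted by the per-inode window start. */
	unit = (uint32_t)(in_stripe / l->stripe_unit);
	unit = (unit + sketch_first_unit(l, obj_id)) % l->group_width;

	/* Rotate by whole stripe units so each mirror set stays identical. */
	t.first_dev = unit * l->mirrors_p1;
	t.obj_offset = stripe_no * l->stripe_unit + in_stripe % l->stripe_unit;
	return t;
}
```

Under these assumptions, with stripe_unit = 4096, mirrors_p1 = 2 and group_width = 3, object 0 starts on device-table slots 0 and 1, while object 1 starts on slots 2 and 3.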
TODO:
We no longer verify object_length == inode->i_size in exofs_iget
(since i_size is now striped across multiple objects).
I should introduce a multiple-device attribute read, and use
it in exofs_iget.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Diffstat (limited to 'fs/exofs/super.c')
-rw-r--r-- | fs/exofs/super.c | 52 |
1 files changed, 45 insertions, 7 deletions
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index fc8875186ae8..8f4e4b37a578 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -308,6 +308,8 @@ static void exofs_put_super(struct super_block *sb)
 static int _read_and_match_data_map(struct exofs_sb_info *sbi, unsigned numdevs,
 				    struct exofs_device_table *dt)
 {
+	u64 stripe_length;
+
 	sbi->data_map.odm_num_comps =
 				le32_to_cpu(dt->dt_data_map.cb_num_comps);
 	sbi->data_map.odm_stripe_unit =
@@ -321,14 +323,47 @@ static int _read_and_match_data_map(struct exofs_sb_info *sbi, unsigned numdevs,
 	sbi->data_map.odm_raid_algorithm =
 				le32_to_cpu(dt->dt_data_map.cb_raid_algorithm);
 
-/* FIXME: Hard coded mirror only for now. if not so do not mount */
-	if ((sbi->data_map.odm_num_comps != numdevs) ||
-	    (sbi->data_map.odm_stripe_unit != EXOFS_BLKSIZE) ||
-	    (sbi->data_map.odm_raid_algorithm != PNFS_OSD_RAID_0) ||
-	    (sbi->data_map.odm_mirror_cnt != (numdevs - 1)))
+/* FIXME: Only raid0 !group_width/depth for now. if not so, do not mount */
+	if (sbi->data_map.odm_group_width || sbi->data_map.odm_group_depth) {
+		EXOFS_ERR("Group width/depth not supported\n");
 		return -EINVAL;
-	else
-		return 0;
+	}
+	if (sbi->data_map.odm_num_comps != numdevs) {
+		EXOFS_ERR("odm_num_comps(%u) != numdevs(%u)\n",
+			  sbi->data_map.odm_num_comps, numdevs);
+		return -EINVAL;
+	}
+	if (sbi->data_map.odm_raid_algorithm != PNFS_OSD_RAID_0) {
+		EXOFS_ERR("Only RAID_0 for now\n");
+		return -EINVAL;
+	}
+	if (0 != (numdevs % (sbi->data_map.odm_mirror_cnt + 1))) {
+		EXOFS_ERR("Data Map wrong, numdevs=%d mirrors=%d\n",
+			  numdevs, sbi->data_map.odm_mirror_cnt);
+		return -EINVAL;
+	}
+
+	stripe_length = sbi->data_map.odm_stripe_unit *
+			(numdevs / (sbi->data_map.odm_mirror_cnt + 1));
+	if (stripe_length >= (1ULL << 32)) {
+		EXOFS_ERR("Total Stripe length(0x%llx)"
+			  " >= 32bit is not supported\n", _LLU(stripe_length));
+		return -EINVAL;
+	}
+
+	if (0 != (sbi->data_map.odm_stripe_unit & ~PAGE_MASK)) {
+		EXOFS_ERR("Stripe Unit(0x%llx)"
+			  " must be Multples of PAGE_SIZE(0x%lx)\n",
+			  _LLU(sbi->data_map.odm_stripe_unit), PAGE_SIZE);
+		return -EINVAL;
+	}
+
+	sbi->layout.stripe_unit = sbi->data_map.odm_stripe_unit;
+	sbi->layout.mirrors_p1 = sbi->data_map.odm_mirror_cnt + 1;
+	sbi->layout.group_width = sbi->data_map.odm_num_comps /
+						sbi->layout.mirrors_p1;
+
+	return 0;
 }
 
 /* @odi is valid only as long as @fscb_dev is valid */
@@ -502,6 +537,9 @@ static int exofs_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	/* Default layout in case we do not have a device-table */
+	sbi->layout.stripe_unit = PAGE_SIZE;
+	sbi->layout.mirrors_p1 = 1;
+	sbi->layout.group_width = 1;
 	sbi->layout.s_ods[0] = od;
 	sbi->layout.s_numdevs = 1;
 	sbi->layout.s_pid = opts->pid;
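As a worked example of the arithmetic at the end of `_read_and_match_data_map()`, the userspace sketch below derives the two layout counts from a device count and a mirror count. `struct sketch_derived` and `sketch_derive_layout()` are hypothetical names used only for illustration; just the divisibility check and the two assignments follow the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the two derived layout counts. */
struct sketch_derived {
	uint32_t mirrors_p1;	/* mirror count plus one */
	uint32_t group_width;	/* number of stripe units per stripe */
};

/*
 * Returns 0 on success, or -EINVAL when numdevs is not a whole number
 * of mirror sets (the same condition the patch rejects at mount time).
 */
static int sketch_derive_layout(uint32_t numdevs, uint32_t mirror_cnt,
				struct sketch_derived *out)
{
	uint32_t mirrors_p1 = mirror_cnt + 1;

	assert(out != NULL);

	/* mirrors_p1 == 0 only if mirror_cnt wrapped around; reject it. */
	if (mirrors_p1 == 0 || (numdevs % mirrors_p1) != 0)
		return -EINVAL;

	out->mirrors_p1 = mirrors_p1;
	out->group_width = numdevs / mirrors_p1;
	return 0;
}
```

For instance, 6 devices with mirror_cnt = 1 give mirrors_p1 = 2 and group_width = 3; 4 devices with mirror_cnt = 3 give group_width = 1 (mirrors only); and 4 devices with mirror_cnt = 0 give group_width = 4 (RAID0 only).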