| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Starting from [1], kernel requires suspend lock on member drive remove
path. It causes deadlock with external management because monitor
thread may be locked on suspend and is unable to switch array to active,
for example if badblock is reported in this time.
It is blocking action now, so it must be delegated to managemon thread
but we must ensure that monitor does metadata update first, just after
detecting faulty.
This patch adds appropriative support. Monitor thread detects "faulty",
and updates the metadata. After that, it is asking manager thread to
remove the device. Manager must be careful because closing descriptors
used by select() may lead to abort with D_FORTIFY_SOURCE=2. First, it
must ensure that device descriptors are not used by monitor.
There is unlimited numer of remove retries and recovery is blocked
until all failed drives are removed. It is safe because "faulty"
device is not longer used by MD.
Issue will be also mitigated by optimalization on badlbock recording path
in kernel. It will check if device is not failed before badblock is
recorded but relying on this is not ideologically correct. Userspace
must keep compatibility with kernel and since it is blocking action,
we must tract is as blocking action.
[1] kernel commit cfa078c8b80d ("md: use new apis to suspend array
for adding/removing rdev from state_store()")
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
|
|
|
|
|
|
|
|
|
|
| |
Move memory declaration helpers outside mdadm.h. They seems to be
useful so keep them but include separatelly. Rework them to not reffer
to Name[] declared internally in mdadm/mdmon.
This is first step to start decomplexing mdadm.h.
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
|
|
|
|
|
|
|
|
|
|
| |
Proposed function sysfs_wrte_descriptor() unifies error handling for
write() done to sysfs files. Main purpose is to use it with MD sysfs
file but it can be used elsewhere.
No functional changes.
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixing the following coding errors the coverity tools found:
* Event check_return: Calling "fcntl(fd, 4, fl)" without checking
return value. This library function may fail and return an error code.
* Event check_after_deref: Null-checking "new" suggests that it may
be null, but it has already been dereferenced on all paths leading
to the check.
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
sysfs_get_str() usages have inconsistant buffer size.
This results in wild buffer declarations and redundant memory usage.
Define maximum buffer size for sysfs strings.
Replace wild sysfs string buffer sizes for globaly defined value.
Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
|
|
|
|
|
|
|
|
|
| |
According to POSIX.1-2001, usleep is considered obsolete.
Replace it with a wrapper that uses nanosleep, as recommended in man.
Add handy macros for conversions between msec, usec and nsec.
Signed-off-by: Mateusz Grzonka <mateusz.grzonka@intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Up to this date signal() was used which implementation could vary [1].
Sigaction() call is preferred. This commit introduces replacement
from signal() to sigaction() by the use of signal_s() wrapper.
Also remove redundant signal.h header includes.
[1] https://man7.org/linux/man-pages/man2/signal.2.html
Signed-off-by: Lukasz Florczak <lukasz.florczak@linux.intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
| |
Use parse_num instead of atoi to parse optarg. Replace atoi by strtol.
Move inst to int conversion into manage_new. Add better error handling.
Signed-off-by: Mateusz Grzonka <mateusz.grzonka@intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a member drive disappears and is set faulty by the kernel during
mdmon startup, after ss->load_container() but before manage_new(), mdmon
will try to readd the faulty drive to the array and start rebuilding.
Metadata on the active drive is updated, but the faulty drive is not
removed from the array and is left in a "blocked" state and any write
request to the array will block. If the faulty drive reappears in the
system e.g. after a reboot, the array will not assemble because metadata
on the drives will be incompatible (at least on imsm).
Fix this by adding a new option for sysfs_read(): "GET_DEVS_ALL". This
is an extension for the "GET_DEVS" option and causes all member devices
to be returned, even if the associated block device has been removed.
Use this option in manage_new() to include the faulty device on the
active_array's devices list. Mdmon will then properly remove the faulty
device from the array and update the metadata to reflect the degraded
state.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When mdmon gets a SIGTERM, it stops managing arrays that are clean. If
there is more that one array in the container and one of them is dirty
and the clean one is still present in mdstat, mdmon will treat it as a
new array and start managing it again. This leads to a cycle of
remove_old() / manage_new() calls for the clean array, until the other
one also becomes clean.
Prevent this by not calling manage_new() if sigterm is set. Also, remove
a check for sigterm in manage_new() because the condition will never be
true.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If disk has disappeared from the system and appears again, it is added to the
corresponding container as long the metadata matches and disk number is set.
This code had no effect on imsm until commit 20dc76d15b40 ("imsm: Set disk slot
number"). Now the disk is added to container but not to the array - it is
correct as the disk is out-of-sync. Rebuild should start for the disk but it
doesn't. There is the same behaviour for both imsm and ddf metadata.
There is no point to handle out-of-sync disk as "good member of array" so
remove that part of code. There are no scenarios when monitor is already
running and disk can be safely added to the array. Just write initial metadata
to the disk so it's taken for rebuild.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After switch root new mdmon is started. It sends initrd mdmon a signal
to terminate. initrd mdmon receives it and switches the safe mode delay
to 1 ms in order to get array to clean state and flush last version of
metadata. The problem is sysfs filesystem is not available to initrd mdmon
after switch root so the original safe mode delay is unchanged. The delay
is set to few seconds - if there is a lot of traffic on the filesystem,
initrd mdmon doesn't terminate for a long time (no clean state). There
are 2 instances of mdmon. initrd mdmon flushes metadata when array goes
to clean state but this metadata might be already outdated.
Use file descriptor obtained on mdmon start to change safe mode delay.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Change the behavior of assemble and create for consistency-policy=ppl
for external metadata arrays. If the kernel does not support ppl, don't
abort but print a warning and start the array without ppl
(consistency-policy=resync). No change for native md arrays because the
kernel will not allow starting the array if it finds an unsupported
feature bit in the superblock.
In sysfs_add_disk() check consistency_policy in the mdinfo structure
that represents the array, not the disk and read the current consistency
policy from sysfs in mdmon's manage_member(). This is necessary to make
sysfs_add_disk() honor the actual consistency policy and not what is in
the metadata. Also remove all the places where consistency_policy is set
for a disk's mdinfo - it is a property of the array, not the disk.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Recent commit has changed the way failed disks are counted. It breaks
recovery for external metadata arrays as failed disks are not part of
the array and have no corresponding entries is sysfs (they are only
reported for containers) so degraded arrays show no failed disks.
Recent commit overwrites GET_DEGRADED result prior to GET_STATE and it
is not set again if GET_STATE has not been requested. As GET_STATE
provides the same information as GET_DEGRADED, the latter is not needed
anymore. Remove GET_DEGRADED option and replace it with GET_STATE
option.
Don't count number of failed disks looking at sysfs entries but
calculate it at the end. Do it only for arrays as containers report
no disks, just spares.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
| |
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
| |
Logical oprators never belong at the beginning of a line.
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
|
|
|
|
|
|
|
|
|
| |
Open 'badblocks' and 'unacknowledged_bad_blocks' sysfs files for each
disk in the array. Add them to the list of files observed by monitor.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
|
|
|
|
|
|
| |
New gcc sometimes complains about this.
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
make dprintf() print program name and __func__, so that
this messaging is consistent.
Also remove all __func__ messages from pr_err(). We shouldn't
leak that internal data in error message.
If we really want function name there, we new pr_XXX might
be wanted.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
| |
seq_file in the kernel will allocate a read buffer on
first read. We want this to happen under the managemon thread,
not the 'monitor' thread, as the latter is not allow to allocate
memory (might deadlock).
So do a first read after opening.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
| |
If 'prepare_update' fails for some reason there is little
point continuing on to 'process_update'.
For now only malloc failures are caught, but other failures
will be considered in future.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
| |
There is not guarantee that 'inst' is a number, and even if there
were there is no point converting it str->int and then int->str again.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
| |
Without this, array may not go clean and mdmon will then
not exit.
A safe_mode of '0' (which is the only one that is handled differently
by this patch) means "never switch to 'active_idle'". We don't want
that when mdmon is stopping.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is possible for mdmon to see (in /proc/mdstat) and array
in 'inactive' state, "mdadm -S" has written "inactive" to
"array_state".
In this state values such as "raid_disk" are not meaningful
and so should be ignored by manage_member().
Reported-by: "Dorau, Lukasz" <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
| |
This clearly should be 'st2'.
As it is the 'raid_disk' value being tested is completely
meaningless in the context of the new device.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
| |
commit 71d68ff62 uses the array layout. It needs to be initialized.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
| |
When we receive a signal, set the safemode delay to v.small
so that we can ge clean arrays and exit quickly
Signed-off-by: NeilBrown <neilb@suse.de>o
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In order to track kernel state changes, the monitor needs to
notice changes in sysfs. If the changes are transient, and the
monitor is busy writing meta data, it can happen that the changes
are missed. This will cause the meta data to be inconsistent with
the real state of the array.
I can reproduce this in a test scenario with a DDF container and
two subarrays, where I set a disk to "failed" and then add a global
hot-spare. On a typical MD test setup with loop devices, I can
reliably reproduce a failure where the metadata show degraded members
although the kernel finished the recovery successfully.
This patch fixes this problem by applying two changes. First, when
a metadata update is queued, wait until it is certain that the monitor
actually applied these meta data (the for loop is actually needed to
avoid failures completely in my test case). Second, after triggering the
recovery, set prev_state of the changed array to "recover", in case
the monitor misses the transient "recover" state.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
| |
Add debug messages to watch the manager's steps.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
| |
Now that I am using white-space mode in Emacs I can see all of this,
and I don't like it :-)
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
| |
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Mdmon does not update component_size now. It is wrong because in case
of size's expansion component_size is changed by mdadm but mdmon does not
reread its new value and uses a wrong, old one. As a result the metadata
is incorrect during size's expansion. It contains no information that
resync is in progress (there is no checkpoint too). The metadata is
as if resync has already been finished but it has not.
Component_size will be set to match information in sysfs. This value
will be updated by manager thread in manage_member() function.
Now mdmon uses the correct, current value of component_size and the
correct metadata (containing information about resync and checkpoint)
is written.
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We widely use a "devnum" which is 0 or +ve for md%d devices
and -ve for md_d%d devices.
But I want to be able to use md_%s device names.
So get rid of devnum (a number) and use devnm (a 32char string).
eg.
md0
md_d2
md_home
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
| |
mdadm --create /dev/md0 .... /dev/sda1:1024 /dev/sdb1:2048 ...
The size is in K unless a suffix: K M G is given.
The suffix 's' means sectors.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
malloc should never fail, and if it does it is unlikely
that anything else useful can be done. Best approach is to
abort and let some super-daemon restart.
So define xmalloc, xcalloc, xrealloc, xstrdup which don't
fail but just print a message and exit. Then use those
removing all the tests for failure.
Also replace all "malloc;memset" sequences with 'xcalloc'.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
Sometimes mdmon throws core dump due to NULL pointer exception.
Problem occurs in scenario:
- managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
- managemon: detect level change and signals monitor (manage_member() calls replace_array())
- monitor: detects transition raid4/5->raid0 and sets a->container to NULL
to indicate array deactivation
- managemon : continues his work and tries to activate spare (a->check_degraded is set).
NULL pointer is passed to metadata handler activate_spare()
Core dump is generated.
To resolve this situation managemon (after monitor kick) checks again
a->container pointer to learn if current array is not to be deactivated.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
| |
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
| |
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Description of the bug:
Sometimes mdmon crashes after changing RAID level from 1 to 0 (takeover).
Cause of the bug:
The managemon marks an active_array for removal from monitoring
by assigning a->container to NULL value (in the "manage_member" function).
Sometimes (during stress test) it happens right when the monitor
is in the "read_and_act" function and a->container pointer is in use.
This causes the monitor crashes.
Solution:
The active array has to be marked for removal in another way
than setting NULL pointer when it can be in use.
A new field "to_remove" was added to the "active_array" structure.
It is used in the managemon to mark a container to remove
(instead of the old assigment: a->container = NULL)
and monitor checks it to determine if the array should be removed.
The field "to_remove" should be checked in some other places
to avoid managing of the array which is going to be removed.
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following test fails when the md_check_recovery() event triggered by
the ro->rw transition causes remove_and_add_spares() to run while mdmon
is attempting spare activation.
Result is that the kernel races to set the slot immediately after
sysfs_add_disk() writes new_dev. mdmon thinks the spare activation
failed and declines to send the monitor a new acitve_array. We show
degraded after the wait because the monitor cannot notify the metadata
that all disks are in_sync.
#!/bin/bash
i=0
false
while [ $? == 1 ]
do
i=$((i+1))
mdadm -Ss
mdadm -CR /dev/md0 /dev/loop[0-2] -n 3 -e imsm
mdadm -CR /dev/md1 /dev/loop[01] missing -n 3 -l 5
mdadm --wait /dev/md1
mdadm -E /dev/loop2 | grep -i degraded
done
echo "failed: $i"
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When raid0 expansion occurs, takeover operation is used.
After backward takeover monitor remains in memory.
This happens due to remaining just removed active array in mdmon structures.
If there is no other monitored arrays, mdmon has to finish his work.
Problem was introduced in patch (2011.03.22):
mdmon: Stop keeping track of RAID0 (and LINEAR) arrays.
Prior to this patch mdmon kicking occurs via replace_array() where
wakeup_monitor() was called.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Tracking RAID0 arrays doesn't really work. There is no need,
and there are some sysfs files which won't exist when the array
appears and then won't be opened when the level is changed.
So simply ignore RAID0 and LINEAR arrays - don't add them when they
appear and if an array we are monitoring turns into one of these,
discard it promptly.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
| |
As monitor() can set ->container to NULL, we need to be careful
about dereferencing it.
So take a copy in manage_member, return if it is NULL, and only
use the copy.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|\
| |
| |
| |
| |
| |
| |
| | |
Conflicts:
Manage.c
managemon.c
super-ddf.c
super-intel.c
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This is needed to remove devices from mdmon's knowledge when the
device is removed from the md container.
Now that ddf have a remove_from_super we don't need the code
that allows some personalities not to implement this.
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Manager thread shall pass the information to monitor thread (mdmon)
that some devices are removed from container. Otherwise, monitor
(mdmon) might use such devices (spares) to rebuild the array that has
gone degraded.
This problem happens for imsm containers, since a list of the
container disks is maintained in intel_super structure. When array
goes degraded, the list is searched to find a spare disks to start
rebuild. Without this fix the rebuild could be stared on the spare
device that was a member of the container, but has been removed from
it.
New super type function handler has been introduced to prepare
metadata format specific information about removed devices.
int (*remove_from_super)(struct supertype *st, mdu_disk_info_t *dinfo)
The message prepared in remove_from_super is later processed by
process_update handler in monitor thread.
Signed-off-by: Marcin Labun <marcin.labun@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Spare assignment requires full knowledge of array state. A pending
update might modify that state (such as a pending spare assignment)
so don't try while there are updates pending.
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
As chunk_size in mdstat_ent is never set, we shouldn't copy
it into a->info.array.
In fact, it is safest to get rid of the field altogether.
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This is needed to remove devices from mdmon's knowledge when the
device is removed from the md container.
Now that ddf have a remove_from_super we don't need the code
that allows some personalities not to implement this.
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
last_checkpoint is variable that tracks sync_complete sysfs entry.
sync_complete is per disk counter, so initializing during starting from checkpoint
has to have this in mind and convert reshape position properly.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|