| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint. Some devices in some cases don't do what it
says on the label.
The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero. If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks. If all were zero, this would
be the case. If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.
As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.
As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.
If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
raid456.devices_handle_discard_safely=Y
As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7
Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org (3.7+)
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf8ff0c4969b79653b54d7bcc9d40637
Signed-off-by: NeilBrown <neilb@suse.de>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull bugfixes for md/raid1 from Neil Brown:
"It is amazing how much easier it is to find bugs when you know one is
there. Two bug reports resulted in finding 7 bugs!
All are tagged for -stable. Those that can't cause (rare) data
corruption, cause lockups.
Particularly, but not only, fixing new "resync" code"
* tag 'md/3.17-more-fixes' of git://git.neil.brown.name/md:
md/raid1: fix_read_error should act on all non-faulty devices.
md/raid1: count resync requests in nr_pending.
md/raid1: update next_resync under resync_lock.
md/raid1: Don't use next_resync to determine how far resync has progressed
md/raid1: make sure resync waits for conflicting writes to complete.
md/raid1: clean up request counts properly in close_sync()
md/raid1: be more cautious where we read-balance during resync.
md/raid1: intialise start_next_window for READ case to avoid hang
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If a devices is being recovered it is not InSync and is not Faulty.
If a read error is experienced on that device, fix_read_error()
will be called, but it ignores non-InSync devices. So it will
neither fix the error nor fail the device.
It is incorrect that fix_read_error() ignores non-InSync devices.
It should only ignore Faulty devices. So fix it.
This became a bug when we allowed reading from a device that was being
recovered. It is suitable for any subsequent -stable kernel.
Fixes: da8840a747c0dbf49506ec906757a6b87b9741e9
Cc: stable@vger.kernel.org (v3.5+)
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Both normal IO and resync IO can be retried with reschedule_retry()
and so be counted into ->nr_queued, but only normal IO gets counted in
->nr_pending.
Before the recent improvement to RAID1 resync there could only
possibly have been one or the other on the queue. When handling a
read failure it could only be normal IO. So when handle_read_error()
called freeze_array() the fact that freeze_array only compares
->nr_queued against ->nr_pending was safe.
But now that these two types can interleave, we can have both normal
and resync IO requests queued, so we need to count them both in
nr_pending.
This error can lead to freeze_array() hanging if there is a read
error, so it is suitable for -stable.
Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
cc: stable@vger.kernel.org (v3.13+)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
raise_barrier() uses next_resync as part of its calculations, so it
really should be updated first, instead of afterwards.
next_resync is always used under resync_lock so update it under
resync lock to, just before it is used. That is safest.
This could cause normal IO and resync IO to interact badly so
it suitable for -stable.
Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
next_resync is (approximately) the location for the next resync request.
However it does *not* reliably determine the earliest location
at which resync might be happening.
This is because resync requests can complete out of order, and
we only limit the number of current requests, not the distance
from the earliest pending request to the latest.
mddev->curr_resync_completed is a reliable indicator of the earliest
position at which resync could be happening. It is updated less
frequently, but is actually reliable which is more important.
So use it to determine if a write request is before the region
being resynced and so safe from conflict.
This error can allow resync IO to interfere with normal IO which
could lead to data corruption. Hence: stable.
Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The resync/recovery process for raid1 was recently changed
so that writes could happen in parallel with resync providing
they were in different regions of the device.
There is a problem though: While a write request will always
wait for conflicting resync to complete, a resync request
will *not* always wait for conflicting writes to complete.
Two changes are needed to fix this:
1/ raise_barrier (which waits until it is safe to do resync)
must wait until current_window_requests is zero
2/ wait_battier (which waits at the start of a new write request)
must update current_window_requests if the request could
possible conflict with a concurrent resync.
As concurrent writes and resync can lead to data loss,
this patch is suitable for -stable.
Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
Cc: stable@vger.kernel.org (v3.13+)
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If there are outstanding writes when close_sync is called,
the change to ->start_next_window might cause them to
decrement the wrong counter when they complete. Fix this
by merging the two counters into the one that will be decremented.
Having an incorrect value in a counter can cause raise_barrier()
to hangs, so this is suitable for -stable.
Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 made
it possible for reads to happen concurrently with resync.
This means that we need to be more careful where read_balancing
is allowed during resync - we can no longer be sure that any
resync that has already started will definitely finish.
So keep read_balancing to before recovery_cp, which is conservative
but safe.
This bug makes it possible to read from a device that doesn't
have up-to-date data, so it can cause data corruption.
So it is suitable for any kernel since 3.11.
Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
r1_bio->start_next_window is not initialised in the READ
case, so allow_barrier may incorrectly decrement
conf->current_window_requests
which can cause raise_barrier() to block forever.
Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
cc: stable@vger.kernel.org (v3.13+)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a writeback or a promotion of a block is completed, the cell of
that block is removed from the prison, the block is marked as clean, and
the clear_dirty() callback of the cache policy is called.
Unfortunately, performing those actions in this order allows an incoming
new write bio for that block to come in before clearing the dirty status
is completed and therefore possibly causing one of these two scenarios:
Scenario A:
Thread 1 Thread 2
cell_defer() .
- cell removed from prison .
- detained bios queued .
. incoming write bio
. remapped to cache
. set_dirty() called,
. but block already dirty
. => it does nothing
clear_dirty() .
- block marked clean .
- policy clear_dirty() called .
Result: Block is marked clean even though it is actually dirty. No
writeback will occur.
Scenario B:
Thread 1 Thread 2
cell_defer() .
- cell removed from prison .
- detained bios queued .
clear_dirty() .
- block marked clean .
. incoming write bio
. remapped to cache
. set_dirty() called
. - block marked dirty
. - policy set_dirty() called
- policy clear_dirty() called .
Result: Block is properly marked as dirty, but policy thinks it is clean
and therefore never asks us to writeback it.
This case is visible in "dmsetup status" dirty block count (which
normally decreases to 0 on a quiet device).
Fix these issues by calling clear_dirty() before calling cell_defer().
Incoming bios for that block will then be detained in the cell and
released only after clear_dirty() has completed, so the race will not
occur.
Found by inspecting the code after noticing spurious dirty counts
(scenario B).
Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The DM crypt target accesses memory beyond allocated space resulting in
a crash on 32 bit x86 systems.
This bug is very old (it dates back to 2.6.25 commit 3a7f6c990ad04 "dm
crypt: use async crypto"). However, this bug was masked by the fact
that kmalloc rounds the size up to the next power of two. This bug
wasn't exposed until 3.17-rc1 commit 298a9fa08a ("dm crypt: use per-bio
data"). By switching to using per-bio data there was no longer any
padding beyond the end of a dm-crypt allocated memory block.
To minimize allocation overhead dm-crypt puts several structures into one
block allocated with kmalloc. The block holds struct ablkcipher_request,
cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))),
struct dm_crypt_request and an initialization vector.
The variable dmreq_start is set to offset of struct dm_crypt_request
within this memory block. dm-crypt allocates the block with this size:
cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size.
When accessing the initialization vector, dm-crypt uses the function
iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq
+ 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1).
dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request
structure. However, when dm-crypt accesses the initialization vector, it
takes a pointer to the end of dm_crypt_request, aligns it, and then uses
it as the initialization vector. If the end of dm_crypt_request is not
aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the
alignment causes the initialization vector to point beyond the allocated
space.
Fix this bug by calculating the variable iv_size_padding and adding it
to the allocated size.
Also correct the alignment of dm_crypt_request. struct dm_crypt_request
is specific to dm-crypt (it isn't used by the crypto subsystem at all),
so it is aligned on __alignof__(struct dm_crypt_request).
Also align per_bio_data_size on ARCH_KMALLOC_MINALIGN, so that it is
aligned as if the block was allocated with kmalloc.
Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most places which allocate an r10_bio zero the ->state, some don't.
As the r10_bio comes from a mempool, and the allocation function uses
kzalloc it is often zero anyway. But sometimes it isn't and it is
best to be safe.
I only noticed this because of the bug fixed by an earlier patch
where the r10_bios allocated for a reshape were left around to
be used by a subsequent resync. In that case the R10BIO_IsReshape
flag caused problems.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
| |
If raid10 reshape fails to find somewhere to read a block
from, it returns without freeing memory...
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a raid10 commences a resync/recovery/reshape it allocates
some buffer space.
When a resync/recovery completes the buffer space is freed. But not
when the reshape completes.
This can result in a small memory leak.
There is a subtle side-effect of this bug. When a RAID10 is reshaped
to a larger array (more devices), the reshape is immediately followed
by a "resync" of the new space. This "resync" will use the buffer
space which was allocated for "reshape". This can cause problems
including a "BUG" in the SCSI layer. So this is suitable for -stable.
Cc: stable@vger.kernel.org (v3.5+)
Fixes: 3ea7daa5d7fde47cd41f4d56c2deb949114da9d6
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
raid10 reshape clears unwanted bits from a bio->bi_flags using
a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
was added.
Since then it clears that bit but shouldn't. This results in a
memory leak.
So change to used the approved method of clearing unwanted bits.
As this causes a memory leak which can consume all of memory
the fix is suitable for -stable.
Fixes: a38352e0ac02dbbd4fa464dc22d1352b5fbd06fd
Cc: stable@vger.kernel.org (v3.10+)
Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.
If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.
This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.
Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then. In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().
Cc: stable@vger.kernel.org (2.6.32+)
Fixes: 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a stripe in a raid6 array received a write to each data block while
the array is degraded, and if any of these writes to a missing device
are not page-aligned, then a live-lock happens.
In this case the P and Q blocks need to be read so that the part of
the missing block which is *not* being updated by the write can be
constructed. Due to a logic error, these blocks are not loaded, so
the update cannot proceed and the stripe is 'handled' repeatedly in an
infinite loop.
This bug is unlikely as most writes are page aligned. However as it
can lead to a livelock it is suitable for -stable. It was introduced
in 3.16.
Cc: stable@vger.kernel.org (v3.16)
Fixed: 67f455486d2ea20b2d94d6adf5b9b783d079e321
Signed-off-by: NeilBrown <neilb@suse.de>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper changes from Mike Snitzer:
- Allow the thin target to paired with any size external origin; also
allow thin snapshots to be larger than the external origin.
- Add support for quickly loading a repetitive pattern into the
dm-switch target.
- Use per-bio data in the dm-crypt target instead of always using a
mempool for each allocation. Required switching to kmalloc alignment
for the bio slab.
- Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag
- Fix the dm-cache and dm-thin targets' export of the minimum_io_size
to match the data block size -- this fixes an issue where mkfs.xfs
would improperly infer raid striping was in place on the underlying
storage.
- Small cleanups in dm-io, dm-mpath and dm-cache
* tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm table: propagate QUEUE_FLAG_NO_SG_MERGE
dm switch: efficiently support repetitive patterns
dm switch: factor out switch_region_table_read
dm cache: set minimum_io_size to cache's data block size
dm thin: set minimum_io_size to pool's data block size
dm crypt: use per-bio data
block: use kmalloc alignment for bio slab
dm table: make dm_table_supports_discards static
dm cache metadata: use dm-space-map-metadata.h defined size limits
dm cache: fail migrations in the do_worker error path
dm cache: simplify deferred set reference count increments
dm thin: relax external origin size constraints
dm thin: switch to an atomic_t for tracking pending new block preparations
dm mpath: eliminate pg_ready() wrapper
dm io: simplify dec_count and sync_io
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 05f1dd5 ("block: add queue flag for disabling SG merging")
introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE. This gets set by
default in blk_mq_init_queue for mq-enabled devices. The effect of
the flag is to bypass the SG segment merging. Instead, the
bio->bi_vcnt is used as the number of hardware segments.
With a device mapper target on top of a device with
QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments
than a driver is prepared to handle. I ran into this when backporting
the virtio_blk mq support. It triggerred this BUG_ON, in
virtio_queue_rq:
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
The queue's max is set here:
blk_queue_max_segments(q, vblk->sg_elems-2);
Basically, what happens is that a bio is built up for the dm device
(which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using
bio_add_page. That path will call into __blk_recalc_rq_segments, so
what you end up with is bi_phys_segments being much smaller than bi_vcnt
(and bi_vcnt grows beyond the maximum sg elements). Then, when the bio
is submitted, it gets cloned. When the cloned bio is submitted, it will
end up in blk_recount_segments, here:
if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
bio->bi_phys_segments = bio->bi_vcnt;
and now we've set bio->bi_phys_segments to a number that is beyond what
was registered as queue_max_segments by the driver.
The right way to fix this is to propagate the queue flag up the stack.
The rules for propagating the flag are simple:
- if the flag is set for any underlying device, it must be set for the
upper device
- consequently, if the flag is not set for any underlying device, it
should not be set for the upper device.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.16+
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add support for quickly loading a repetitive pattern into the
dm-switch target.
In the "set_regions_mappings" message, the user may now use "Rn,m" as
one of the arguments. "n" and "m" are hexadecimal numbers. The "Rn,m"
argument repeats the last "n" arguments in the following "m" slots.
For example:
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2
Requested-by: Jay Wang <jwang@nimblestorage.com>
Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Move code that reads the table to a switch_region_table_read.
It will be needed for the next commit. No functional change.
Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the cache's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).
Update cache_io_hints() to set both minimum_io_size and optimal_io_size
to the cache's data block size. This fixes an issue where mkfs.xfs
would create more XFS Allocation Groups on cache volumes than on a
normal linear LV of comparable size.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the thin-pool's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).
Update pool_io_hints() to set both minimum_io_size and optimal_io_size
to the thin-pool's data block size. This fixes an issue reported where
mkfs.xfs would create more XFS Allocation Groups on thinp volumes than
on a normal linear LV of comparable size, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1003227
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Change dm-crypt so that it uses auxiliary data allocated with the bio.
Dm-crypt requires two allocations per request - struct dm_crypt_io and
struct ablkcipher_request (with other data appended to it). It
previously only used mempool allocations.
Some requests may require more dm_crypt_ios and ablkcipher_requests,
however most requests need just one of each of these two structures to
complete.
This patch changes it so that the first dm_crypt_io and ablkcipher_request
are allocated with the bio (using target per_bio_data_size option). If
the request needs additional values, they are allocated from the mempool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The function dm_table_supports_discards is only called from
dm-table.c:dm_table_set_restrictions(). So move it above
dm_table_set_restrictions and make it static.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 7d48935e cleaned up the persistent-data's space-map-metadata
limits by elevating them to dm-space-map-metadata.h. Update
dm-cache-metadata to use these same limits.
The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the
sizeof the disk_bitmap_header. So the supported maximum metadata size
is a bit smaller (reduced from 33423360 to 33292800 sectors).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
| |
| |
| |
| |
| | |
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Factor out inc_and_issue and inc_ds helpers to simplify deferred set
reference count increments. Also cleanup cache_map to consistently call
cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED.
No functional change.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Track the size of any external origin. Previously the external origin's
size had to be a multiple of the thin-pool's block size, that is no
longer a requirement. In addition, snapshots that are larger than the
external origin are now supported.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Previously we used separate boolean values to track quiescing and
copying actions. By switching to an atomic_t we can support blocks that
need a partial copy and partial zero.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
pg_ready() is not comprehensive in its logic and only serves to
obfuscate code. Replace pg_ready() with the appropriate logic in
multipath_map().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Remove the io struct off the stack in sync_io() and allocate it from
the mempool like is done in async_io().
dec_count() now always calls a callback function and always frees the io
struct back to the mempool (so sync_io and async_io share this pattern).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull block driver changes from Jens Axboe:
"Nothing out of the ordinary here, this pull request contains:
- A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
and Surbhi Palande. No new features, just a lot of fixes.
- The usual round of drbd updates from Andreas Gruenbacher, Lars
Ellenberg, and Philipp Reisner.
- virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
has taken it one step further and added support for actually using
more than one queue.
- Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
compliment the the default behavior of adding to the tail of the
queue. From Douglas Gilbert"
* 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
bcache: Drop unneeded blk_sync_queue() calls
bcache: add mutex lock for bch_is_open
bcache: Correct printing of btree_gc_max_duration_ms
bcache: try to set b->parent properly
bcache: fix memory corruption in init error path
bcache: fix crash with incomplete cache set
bcache: Fix more early shutdown bugs
bcache: fix use-after-free in btree_gc_coalesce()
bcache: Fix an infinite loop in journal replay
bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
bcache: bcache_write tracepoint was crashing
bcache: fix typo in bch_bkey_equal_header
bcache: Allocate bounce buffers with GFP_NOWAIT
bcache: Make sure to pass GFP_WAIT to mempool_alloc()
bcache: fix uninterruptible sleep in writeback thread
bcache: wait for buckets when allocating new btree root
bcache: fix crash on shutdown in passthrough mode
bcache: fix lockdep warnings on shutdown
bcache allocator: send discards with correct size
bcache: Fix to remove the rcu_sched stalls.
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
this is needed for the queue/block device we created (it's done by
blk_cleanup_queue() which we do call) - but calling it for the block devices we
only opened is pointless.
Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Since bch_is_open will iterate linked list bch_cache_sets and
uncached_devices, it needs bch_register_lock.
Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
time_stats::btree_gc_max_duration_mc is not bit shifted by 8
Fixes BUG #138
Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
Signed-off-by: Surbhi Palande <sap@daterainc.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket size
before; now it passes.
Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If register_cache_set() failed, we would touch ca->set after
it had already been freed. Also, fix an assertion to catch
this.
Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d
|
| | |
| | |
| | |
| | | |
Change-Id: I6abde52afe917633480caaf4e2518f42a816d886
|
| | |
| | |
| | |
| | | |
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
new_nodes[0] again. This was generating a lockdep warning. The fix is
to set new_nodes[0] to NULL, since the out_nocoalesce path safely
ignores NULL entries in the new_nodes array.
This regression was introduced in 2d7f9531.
Change-Id: I76564d7257800583214376b4bacf236cda90c89c
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When running with multiple cache devices, if one of the devices has a completely
empty journal but we'd already found some journal entries on a previosu device
we'd go into an infinite loop.
Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
| | |
| | |
| | |
| | |
| | |
| | | |
'b' was NULL.
Change-Id: Icac0fd04afa2d23f213d96d51afd53374e6dd0c0
|
| | |
| | |
| | |
| | | |
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
| | |
| | |
| | |
| | | |
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.
Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There were two issues here:
- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running
Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Tested:
- sometimes bcache_tier test would hang on startup with a failure
to allocate the btree root -- no longer seeing this
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|