| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
| |
commit 5f9d1fde7d54a5(raid5: fix memory leak of bio integrity data)
moves bio_reset to bio_endio. But it introduces a small race condition.
It does bio_reset after raid5_release_stripe, which could make the
stripe reusable and hence reuse the bio just before bio_reset. Moving
bio_reset before raid5_release_stripe is called should fix the race.
Reported-and-tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The md-cluster is compiled as module by default,
if it is compiled by built-in way, then we can't
make md-cluster works.
[64782.630008] md/raid1:md127: active with 2 out of 2 mirrors
[64782.630528] md-cluster module not found.
[64782.630530] md127: Could not setup cluster service (-2)
Fixes: edb39c9 ("Introduce md_cluster_operations to handle cluster functions")
Cc: stable@vger.kernel.org (v4.1+)
Reported-by: Marc Smith <marc.smith@mcc.edu>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|
|
|
|
|
|
|
|
|
| |
If there aren't enough stripes, reshape will hang. We have a check for
this in new reshape, but miss it for reshape resume, hence we could see
hang in reshape resume. This patch forces enough stripes existed if
reshape resumes.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is a potential deadlock in superblock write. Discard could zero data, so
before discard we must make sure superblock is updated to new log tail.
Updating superblock (either directly call md_update_sb() or depend on md
thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called
with reconfig_mutex hold. The first step of raid5_quiesce() is waitting for all
IO finish, hence waitting for reclaim thread, while reclaim thread is calling
this function and waitting for reconfig mutex. So there is a deadlock. We
workaround this issue with a trylock. The downside of the solution is we could
miss discard if we can't take reconfig mutex. But this should happen rarely
(mainly in raid array stop), so miss discard shouldn't be a big problem.
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull MD fixes from Shaohua Li:
"This includes several bug fixes:
- Alexey Obitotskiy fixed a hang for faulty raid5 array with external
management
- Song Liu fixed two raid5 journal related bugs
- Tomasz Majchrzak fixed a bad block recording issue and an
accounting issue for raid10
- ZhengYuan Liu fixed an accounting issue for raid5
- I fixed a potential race condition and memory leak with DIF/DIX
enabled
- other trival fixes"
* tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
raid5: avoid unnecessary bio data set
raid5: fix memory leak of bio integrity data
raid10: record correct address of bad block
md-cluster: fix error return code in join()
r5cache: set MD_JOURNAL_CLEAN correctly
md: don't print the same repeated messages about delayed sync operation
md: remove obsolete ret in md_start_sync
md: do not count journal as spare in GET_ARRAY_INFO
md: Prevent IO hold during accessing to faulty raid5 array
MD: hold mddev lock to change bitmap location
raid5: fix incorrectly counter of conf->empty_inactive_list_nr
raid10: increment write counter after bio is split
|
| |
| |
| |
| |
| |
| |
| |
| | |
bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to
set them every time. bi_private will be set before the bio is
dispatched.
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bio, instead it reuses bios. There are two issues in
current code:
1. the code calls bio_init (from
init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
The bio is reused, so likely there is integrity data attached. bio_init
will clear a pointer to integrity data and makes bio_reset can't release
the data
2. bio_reset is called before dispatching bio. After bio is finished,
it's possible we don't free bio's integrity data (eg, we don't call
bio_reset again)
Both issues will cause memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This will fix the two issues.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
For failed write request record block address on a device, not block
address in an array.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Fix to return error code -ENOMEM from the lockres_init() error
handling case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.
With this patch, the MD_JOURNAL_CLEAN is only set when the journal
device presents.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This fixes a long-standing bug that caused a flood of messages like:
"md: delaying data-check of md1 until md2 has finished (they share one
or more physical units)"
It can be reproduced like this:
1. Create at least 3 raid1 arrays on a pair of disks, each on different
partitions.
2. Request a sync operation like 'check' or 'repair' on 2 arrays by
writing to their md/sync_action attribute files. One operation should
start and one should be delayed and a message like the above will be
printed.
3. Issue a write to the third array. Each write will cause 2 copies of
the message to be printed.
This happens when wake_up(&resync_wait) is called, usually by
md_check_recovery(). Then the delayed sync thread again prints the
message and is put to sleep. This patch adds a check in md_do_sync() to
prevent printing this message more than once for the same pair of
devices.
Reported-by: Sven Koehler <sven.koehler@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The ret is not needed anymore since we have already
move resync_start into md_do_sync in commit 41a9a0d.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
GET_ARRAY_INFO counts journal as spare (spare_disks), which is not
accurate. This patch fixes this.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
After array enters in faulty state (e.g. number of failed drives
becomes more then accepted for raid5 level) it sets error flags
(one of this flags is MD_CHANGE_PENDING). For internal metadata
arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for
external metadata arrays. MD_CHANGE_PENDING flag set prevents to
finish all new or non-finished IOs to array and hold them in
pending state. In some cases this can leads to deadlock situation.
For example, we have faulty array (2 of 4 drives failed) and
udev handle array state changes and blkid started (or other
userspace application that used array to read/write) but unable
to finish reads due to IO hold. At the same time we unable to get
exclusive access to array (to stop array in our case) because
another external application still use this array.
Fix makes possible to return IO with errors immediately.
So external application can finish working with array and
give exclusive access to other applications to perform
required management actions with array.
Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Changing the location changes a lot of things. Holding the lock to avoid race.
This makes the .quiesce called with mddev lock hold too.
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The counter conf->empty_inactive_list_nr is only used for determine if the
raid5 is congested which is deal with in function raid5_congested().
It was increased in get_free_stripe() when conf->inactive_list got to be
empty and decreased in release_inactive_stripe_list() when splice
temp_inactive_list to conf->inactive_list. However, this may have a
problem when raid5_get_active_stripe or stripe_add_to_batch_list was called,
because these two functions may call list_del_init(&sh->lru) to delete sh from
"conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to
be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be
done at these two point and increase empty_inactive_list_nr accordingly.
Otherwise the counter may get to be negative number which would influence
async readahead from VFS.
Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
md pending write counter must be incremented after bio is split,
otherwise it gets decremented too many times in end bio callback and
becomes negative.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable patches:
- Fix a refcount leak in nfs_callback_up_net
- Fix an Oopsable condition when the flexfile pNFS driver connection
to the DS fails
- Fix an Oopsable condition in NFSv4.1 server callback races
- Ensure pNFS clients stop doing I/O to the DS if their lease has
expired, as required by the NFSv4.1 protocol
Bugfixes:
- Fix potential looping in the NFSv4.x migration code
- Patch series to close callback races for OPEN, LAYOUTGET and
LAYOUTRETURN
- Silence WARN_ON when NFSv4.1 over RDMA is in use
- Fix a LAYOUTCOMMIT race in the pNFS/blocks client
- Fix pNFS timeout issues when the DS fails"
* tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.x: Fix a refcount leak in nfs_callback_up_net
NFS4: Avoid migration loops
pNFS/flexfiles: Fix an Oopsable condition when connection to the DS fails
NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequence
NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURN
NFSv4.1: Defer bumping the slot sequence number until we free the slot
NFSv4.1: Delay callback processing when there are referring triples
NFSv4.1: Fix Oopsable condition in server callback races
SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in use
pnfs/blocklayout: update last_write_offset atomically with extents
pNFS: The client must not do I/O to the DS if it's lease has expired
pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT calls
pNFS/flexfiles: Set reasonable default retrans values for the data channel
NFS: Allow the mount option retrans=0
pNFS/flexfiles: Fix layoutstat periodic reporting
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
On error, the callers expect us to return without bumping
nn->cb_users[].
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v3.7+
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If a server returns itself as a location while migrating, the client may
end up getting stuck attempting to migrate twice to the same server. Catch
this by checking if the nfs_client found is the same as the existing
client. For the other two callers to nfs4_set_client, the nfs_client will
always be ERR_PTR(-EINVAL).
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If the attempt to connect to a DS fails inside ff_layout_pg_init_read or
ff_layout_pg_init_write, then we currently end up clearing the layout
segment carried by the struct nfs_pageio_descriptor, causing an Oops
when we later call into ff_layout_read_pagelist/ff_layout_write_pagelist.
The fix is to ensure we return the layout and then retry.
Fixes: 446ca2195303 ("pNFS/flexfiles: When initing reads or writes, we...")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | | |
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Defer freeing the slot until after we have processed the results from
OPEN and LAYOUTGET. This means that the server can rely on the
mechanism in RFC5661 Section 2.10.6.3 to ensure that replies to an
OPEN or LAYOUTGET/RETURN RPC call don't race with the callbacks that
apply to them.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
For operations like OPEN or LAYOUTGET, which return recallable state
(i.e. delegations and layouts) we want to enable the mechanism for
resolving recall races in RFC5661 Section 2.10.6.3.
To do so, we will want to defer bumping the slot's sequence number until
we have finished processing the RPC results.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If CB_SEQUENCE tells us that the processing of this request depends on
the completion of one or more referring triples (see RFC 5661 Section
2.10.6.3), delay the callback processing until after the RPC requests
being referred to have completed.
If we end up delaying for more than 1/2 second, then fall back to
returning NFS4ERR_DELAY in reply to the callback.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The slot table hasn't been an array since v3.7. Ensure that we
use nfs4_lookup_slot() to access the slot correctly.
Fixes: 87dda67e7386 ("NFSv4.1: Allow SEQUENCE to resize the slot table...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v3.8+
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Using NFSv4.1 on RDMA should be safe, so broaden the new checks in
rpc_create().
WARN_ON_ONCE is used, matching most other WARN call sites in clnt.c.
Fixes: 39a9beab5acb ("rpc: share one xps between all backchannels")
Fixes: d50039ea5ee6 ("nfsd4/rpc: move backchannel create logic...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Block/SCSI layout write completion may add committable extents to the
extent tree before updating the layout's last-written byte under the inode
lock. If a sync happens before this value is updated, then
prepare_layoutcommit may find and encode these extents which would produce
a LAYOUTCOMMIT request whose encoded extents are larger than the request's
loca_length.
Fix this by using a last-written byte value that is updated atomically with
the extent tree so that commitable extents always match.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Ensure that the client conforms to the normative behaviour described in
RFC5661 Section 12.7.2: "If a client believes its lease has expired,
it MUST NOT send I/O to the storage device until it has validated its
lease."
So ensure that we wait for the lease to be validated before using
the layout.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v3.20+
|
| | |
| | |
| | |
| | |
| | |
| | | |
We normally want to update the stateid and then retry,
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Prior to this patch, the retrans value was set at 5, meaning that we
could see a maximum retransmission timeout value of more than 6 minutes.
That's a tad high for NFSv3 where the protocol does allow the server to
drop requests at any time.
Since this is a data channel, let's just set retrans to 0, and the default
timeout to 60s. The user can continue to adjust these defaults using the
dataserver_retrans and dataserver_timeo module parameters.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We should allow retrans=0 as just meaning that every timeout is a major
timeout, and that there is no increment in the timeout value.
For instance, this means that we would allow TCP users to specify a
flat timeout value of 60s, by specifying "timeo=600,retrans=0" in their
mount option string.
Siged-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Putting the periodicity timer in the mirror instances is causing
non-scalable reporting behaviour and missed reporting intervals.
When you recall layouts and/or implement client side mirroring, it
leads to consecutive reports with only a few ms between RPC calls.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: d0379a5d066a9 ("pNFS/flexfiles: Support server-supplied...")
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There are three usercopy warnings which are currently being silenced for
gcc 4.6 and newer:
1) "copy_from_user() buffer size is too small" compile warning/error
This is a static warning which happens when object size and copy size
are both const, and copy size > object size. I didn't see any false
positives for this one. So the function warning attribute seems to
be working fine here.
Note this scenario is always a bug and so I think it should be
changed to *always* be an error, regardless of
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS.
2) "copy_from_user() buffer size is not provably correct" compile warning
This is another static warning which happens when I enable
__compiletime_object_size() for new compilers (and
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS). It happens when object size
is const, but copy size is *not*. In this case there's no way to
compare the two at build time, so it gives the warning. (Note the
warning is a byproduct of the fact that gcc has no way of knowing
whether the overflow function will be called, so the call isn't dead
code and the warning attribute is activated.)
So this warning seems to only indicate "this is an unusual pattern,
maybe you should check it out" rather than "this is a bug".
I get 102(!) of these warnings with allyesconfig and the
__compiletime_object_size() gcc check removed. I don't know if there
are any real bugs hiding in there, but from looking at a small
sample, I didn't see any. According to Kees, it does sometimes find
real bugs. But the false positive rate seems high.
3) "Buffer overflow detected" runtime warning
This is a runtime warning where object size is const, and copy size >
object size.
All three warnings (both static and runtime) were completely disabled
for gcc 4.6 with the following commit:
2fb0815c9ee6 ("gcc4: disable __compiletime_object_size for GCC 4.6+")
That commit mistakenly assumed that the false positives were caused by a
gcc bug in __compiletime_object_size(). But in fact,
__compiletime_object_size() seems to be working fine. The false
positives were instead triggered by #2 above. (Though I don't have an
explanation for why the warnings supposedly only started showing up in
gcc 4.6.)
So remove warning #2 to get rid of all the false positives, and re-enable
warnings #1 and #3 by reverting the above commit.
Furthermore, since #1 is a real bug which is detected at compile time,
upgrade it to always be an error.
Having done all that, CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is no longer
needed.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
"Two libata driver specific fixes for v4.8-rc4. Nothing too scary"
* 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
pata_ninja32: Avoid corrupting status flags
ahci: disable correct irq for dummy ports
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Ninja32 needs to set some flags to indicate it does 32bit IO. However it currently assigns this which
loses the initializing flag and causes a warning spew. Fix it to use a logical or as is intended.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Tested-by: Ellmar Stelnberger <estellnb@elstel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
irq already contains the interrupt number for the port, don't add the
port index to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: d684a90d38e2 ("ahci: per-port msix support")
Cc: stable@vger.kernel.org v4.5+
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Two fixes for cgroup.
- There still was a hole in enforcing cpuset rules, fixed by Li.
- The recent switch to global percpu_rwseom for threadgroup locking
revealed a couple issues in how percpu_rwsem is implemented and
used by cgroup. Balbir found that the read locking section was too
wide unnecessarily including operations which can often depend on
IOs. With percpu_rwsem updates (coming through a different tree)
and reduction of read locking section, all the reported locking
latency issues, including the android one, are resolved.
It looks like we can keep global percpu_rwsem locking for now. If
there actually are cases which can't be resolved, we can go back to
more complex per-signal_struct locking"
* 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
cpuset: make sure new tasks conform to the current config of the cpuset
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
cgroup_threadgroup_rwsem is acquired in read mode during process exit
and fork. It is also grabbed in write mode during
__cgroups_proc_write(). I've recently run into a scenario with lots
of memory pressure and OOM and I am beginning to see
systemd
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
percpu_down_write+0x114/0x170
__cgroup_procs_write.isra.12+0xb8/0x3c0
cgroup_file_write+0x74/0x1a0
kernfs_fop_write+0x188/0x200
__vfs_write+0x6c/0xe0
vfs_write+0xc0/0x230
SyS_write+0x6c/0x110
system_call+0x38/0xb4
This thread is waiting on the reader of cgroup_threadgroup_rwsem to
exit. The reader itself is under memory pressure and has gone into
reclaim after fork. There are times the reader also ends up waiting on
oom_lock as well.
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
jbd2_log_wait_commit+0xd4/0x180
ext4_evict_inode+0x88/0x5c0
evict+0xf8/0x2a0
dispose_list+0x50/0x80
prune_icache_sb+0x6c/0x90
super_cache_scan+0x190/0x210
shrink_slab.part.15+0x22c/0x4c0
shrink_zone+0x288/0x3c0
do_try_to_free_pages+0x1dc/0x590
try_to_free_pages+0xdc/0x260
__alloc_pages_nodemask+0x72c/0xc90
alloc_pages_current+0xb4/0x1a0
page_table_alloc+0xc0/0x170
__pte_alloc+0x58/0x1f0
copy_page_range+0x4ec/0x950
copy_process.isra.5+0x15a0/0x1870
_do_fork+0xa8/0x4b0
ppc_clone+0x8/0xc
In the meanwhile, all processes exiting/forking are blocked almost
stalling the system.
This patch moves the threadgroup_change_begin from before
cgroup_fork() to just before cgroup_canfork(). There is no nee to
worry about threadgroup changes till the task is actually added to the
threadgroup. This avoids having to call reclaim with
cgroup_threadgroup_rwsem held.
tj: Subject and description edits.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| |/ / /
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
A new task inherits cpus_allowed and mems_allowed masks from its parent,
but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems
before this new task is inserted into the cgroup's task list, the new task
won't be updated accordingly.
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
"Add missing sysfs attribute group terminator to it87 driver"
* tag 'hwmon-for-linus-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (it87) Add missing sysfs attribute group terminator
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Attribute array it87_attributes_in lacks its NULL terminator,
causing random behavior when operating on the attribute group.
Fixes: 52929715634a ("hwmon: (it87) Use is_visible for voltage sensors")
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix bugs that could cause kernel deadlocks or file system corruption
while moving xattrs to expand the extended inode.
Also add some sanity checks to the block group descriptors to make
sure we don't end up overwriting the superblock"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: avoid deadlock when expanding inode size
ext4: properly align shifted xattrs when expanding inodes
ext4: fix xattr shifting when expanding inodes part 2
ext4: fix xattr shifting when expanding inodes
ext4: validate that metadata blocks do not overlap superblock
ext4: reserve xattr index for the Hurd
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
When we need to move xattrs into external xattr block, we call
ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
up calling ext4_mark_inode_dirty() again which will recurse back into
the inode expansion code leading to deadlocks.
Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
its management into ext4_expand_extra_isize_ea() since its manipulation
is safe there (due to xattr_sem) from possible races with
ext4_xattr_set_handle() which plays with it as well.
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
We did not count with the padding of xattr value when computing desired
shift of xattrs in the inode when expanding i_extra_isize. As a result
we could create unaligned start of inline xattrs. Account for alignment
properly.
CC: stable@vger.kernel.org # 4.4.x-
Signed-off-by: Jan Kara <jack@suse.cz>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
When multiple xattrs need to be moved out of inode, we did not properly
recompute total size of xattr headers in the inode and the new header
position. Thus when moving the second and further xattr we asked
ext4_xattr_shift_entries() to move too much and from the wrong place,
resulting in possible xattr value corruption or general memory
corruption.
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
The code in ext4_expand_extra_isize_ea() treated new_extra_isize
argument sometimes as the desired target i_extra_isize and sometimes as
the amount by which we need to grow current i_extra_isize. These happen
to coincide when i_extra_isize is 0 which used to be the common case and
so nobody noticed this until recently when we added i_projid to the
inode and so i_extra_isize now needs to grow from 28 to 32 bytes.
The result of these bugs was that we sometimes unnecessarily decided to
move xattrs out of inode even if there was enough space and we often
ended up corrupting in-inode xattrs because arguments to
ext4_xattr_shift_entries() were just wrong. This could demonstrate
itself as BUG_ON in ext4_xattr_shift_entries() triggering.
Fix the problem by introducing new isize_diff variable and use it where
appropriate.
CC: stable@vger.kernel.org # 4.4.x
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
A number of fuzzing failures seem to be caused by allocation bitmaps
or other metadata blocks being pointed at the superblock.
This can cause kernel BUG or WARNings once the superblock is
overwritten, so validate the group descriptor blocks to make sure this
doesn't happen.
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
The Hurd is using inode fields which restricts it from using more
advanced ext4 file system features, due to design choices made over a
decade ago. By giving the Hurd an extended attribute index field we
allow it to move the translator and author fields out of the core
inode fields, and hopefully we can get rid of ugly hacks such as
EXT4_OS_HURD and EXT4_MOUNT2_HURD_COMPAT somday.
For more information please see:
https://summerofcode.withgoogle.com/projects/#5869799859027968
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Pull networking fixes from David Miller:
1) Segregate namespaces properly in conntrack dumps, from Liping Zhang.
2) tcp listener refcount fix in netfilter tproxy, from Eric Dumazet.
3) Fix timeouts in qed driver due to xmit_more, from Yuval Mintz.
4) Fix use-after-free in tcp_xmit_retransmit_queue().
5) Userspace header fixups (use of __u32, missing includes, etc.) from
Mikko Rapeli.
6) Further refinements to fragmentation wrt gso and tunnels, from
Shmulik Ladkani.
7) Trigger poll correctly for zero length UDP packets, from Eric
Dumazet.
8) TCP window scaling fix, also from Eric Dumazet.
9) SLAB_DESTROY_BY_RCU is not relevant any more for UDP sockets.
10) Module refcount leak in qdisc_create_dflt(), from Eric Dumazet.
11) Fix deadlock in cp_rx_poll() of 8139cp driver, from Gao Feng.
12) Memory leak in rhashtable's alloc_bucket_locks(), from Eric Dumazet.
13) Add new device ID to alx driver, from Owen Lin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
Add Killer E2500 device ID in alx driver.
net: smc91x: fix SMC accesses
Documentation: networking: dsa: Remove platform device TODO
net/mlx5: Increase number of ethtool steering priorities
net/mlx5: Add error prints when validate ETS failed
net/mlx5e: Fix memory leak if refreshing TIRs fails
net/mlx5e: Add ethtool counter for TX xmit_more
net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ
net/mlx5e: Don't wait for SQ completions on close
net/mlx5e: Don't post fragmented MPWQE when RQ is disabled
net/mlx5e: Don't wait for RQ completions on close
net/mlx5e: Limit UMR length to the device's limitation
rhashtable: fix a memory leak in alloc_bucket_locks()
sfc: fix potential stack corruption from running past stat bitmask
team: loadbalance: push lacpdus to exact delivery
net: hns: dereference ppe_cb->ppe_common_cb if it is non-null
8139cp: Fix one possible deadloop in cp_rx_poll
i40e: Change some init flow for the client
Revert "phy: IRQ cannot be shared"
net: dsa: bcm_sf2: Fix race condition while unmasking interrupts
...
|