summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* raid5: fix a small race conditionShaohua Li2016-09-091-2/+2
| | | | | | | | | | | commit 5f9d1fde7d54a5(raid5: fix memory leak of bio integrity data) moves bio_reset to bio_endio. But it introduces a small race condition. It does bio_reset after raid5_release_stripe, which could make the stripe reusable and hence reuse the bio just before bio_reset. Moving bio_reset before raid5_release_stripe is called should fix the race. Reported-and-tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: make md-cluster also can work when compiled into kernelGuoqing Jiang2016-09-081-8/+4
| | | | | | | | | | | | | | | | | The md-cluster is compiled as module by default, if it is compiled by built-in way, then we can't make md-cluster works. [64782.630008] md/raid1:md127: active with 2 out of 2 mirrors [64782.630528] md-cluster module not found. [64782.630530] md127: Could not setup cluster service (-2) Fixes: edb39c9 ("Introduce md_cluster_operations to handle cluster functions") Cc: stable@vger.kernel.org (v4.1+) Reported-by: Marc Smith <marc.smith@mcc.edu> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* raid5: guarantee enough stripes to avoid reshape hangShaohua Li2016-08-311-0/+10
| | | | | | | | | | If there aren't enough stripes, reshape will hang. We have a check for this in new reshape, but miss it for reshape resume, hence we could see hang in reshape resume. This patch forces enough stripes existed if reshape resumes. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* raid5-cache: fix a deadlock in superblock writeShaohua Li2016-08-311-31/+15
| | | | | | | | | | | | | | | | There is a potential deadlock in superblock write. Discard could zero data, so before discard we must make sure superblock is updated to new log tail. Updating superblock (either directly call md_update_sb() or depend on md thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called with reconfig_mutex hold. The first step of raid5_quiesce() is waitting for all IO finish, hence waitting for reclaim thread, while reclaim thread is calling this function and waitting for reconfig mutex. So there is a deadlock. We workaround this issue with a trylock. The downside of the solution is we could miss discard if we can't take reconfig mutex. But this should happen rarely (mainly in raid array stop), so miss discard shouldn't be a big problem. Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* Merge tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds2016-08-305-57/+107
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull MD fixes from Shaohua Li: "This includes several bug fixes: - Alexey Obitotskiy fixed a hang for faulty raid5 array with external management - Song Liu fixed two raid5 journal related bugs - Tomasz Majchrzak fixed a bad block recording issue and an accounting issue for raid10 - ZhengYuan Liu fixed an accounting issue for raid5 - I fixed a potential race condition and memory leak with DIF/DIX enabled - other trival fixes" * tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid5: avoid unnecessary bio data set raid5: fix memory leak of bio integrity data raid10: record correct address of bad block md-cluster: fix error return code in join() r5cache: set MD_JOURNAL_CLEAN correctly md: don't print the same repeated messages about delayed sync operation md: remove obsolete ret in md_start_sync md: do not count journal as spare in GET_ARRAY_INFO md: Prevent IO hold during accessing to faulty raid5 array MD: hold mddev lock to change bitmap location raid5: fix incorrectly counter of conf->empty_inactive_list_nr raid10: increment write counter after bio is split
| * raid5: avoid unnecessary bio data setShaohua Li2016-08-241-8/+5
| | | | | | | | | | | | | | | | bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to set them every time. bi_private will be set before the bio is dispatched. Signed-off-by: Shaohua Li <shli@fb.com>
| * raid5: fix memory leak of bio integrity dataShaohua Li2016-08-241-7/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5 doesn't alloc/free bio, instead it reuses bios. There are two issues in current code: 1. the code calls bio_init (from init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io). The bio is reused, so likely there is integrity data attached. bio_init will clear a pointer to integrity data and makes bio_reset can't release the data 2. bio_reset is called before dispatching bio. After bio is finished, it's possible we don't free bio's integrity data (eg, we don't call bio_reset again) Both issues will cause memory leak. The patch moves bio_init to stripe creation and bio_reset to bio end io. This will fix the two issues. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * raid10: record correct address of bad blockTomasz Majchrzak2016-08-241-4/+5
| | | | | | | | | | | | | | | | For failed write request record block address on a device, not block address in an array. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md-cluster: fix error return code in join()Wei Yongjun2016-08-241-3/+9
| | | | | | | | | | | | | | | | Fix to return error code -ENOMEM from the lockres_init() error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * r5cache: set MD_JOURNAL_CLEAN correctlySong Liu2016-08-242-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | Currently, the code sets MD_JOURNAL_CLEAN when the array has MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array will be MD_JOURNAL_CLEAN even if the journal device is missing. With this patch, the MD_JOURNAL_CLEAN is only set when the journal device presents. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: don't print the same repeated messages about delayed sync operationArtur Paszkiewicz2016-08-171-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a long-standing bug that caused a flood of messages like: "md: delaying data-check of md1 until md2 has finished (they share one or more physical units)" It can be reproduced like this: 1. Create at least 3 raid1 arrays on a pair of disks, each on different partitions. 2. Request a sync operation like 'check' or 'repair' on 2 arrays by writing to their md/sync_action attribute files. One operation should start and one should be delayed and a message like the above will be printed. 3. Issue a write to the third array. Each write will cause 2 copies of the message to be printed. This happens when wake_up(&resync_wait) is called, usually by md_check_recovery(). Then the delayed sync thread again prints the message and is put to sleep. This patch adds a check in md_do_sync() to prevent printing this message more than once for the same pair of devices. Reported-by: Sven Koehler <sven.koehler@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801 Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: remove obsolete ret in md_start_syncGuoqing Jiang2016-08-171-5/+2
| | | | | | | | | | | | | | | | | | The ret is not needed anymore since we have already move resync_start into md_do_sync in commit 41a9a0d. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: do not count journal as spare in GET_ARRAY_INFOSong Liu2016-08-171-0/+3
| | | | | | | | | | | | | | | | | | GET_ARRAY_INFO counts journal as spare (spare_disks), which is not accurate. This patch fixes this. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: Prevent IO hold during accessing to faulty raid5 arrayAlexey Obitotskiy2016-08-061-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After array enters in faulty state (e.g. number of failed drives becomes more then accepted for raid5 level) it sets error flags (one of this flags is MD_CHANGE_PENDING). For internal metadata arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for external metadata arrays. MD_CHANGE_PENDING flag set prevents to finish all new or non-finished IOs to array and hold them in pending state. In some cases this can leads to deadlock situation. For example, we have faulty array (2 of 4 drives failed) and udev handle array state changes and blkid started (or other userspace application that used array to read/write) but unable to finish reads due to IO hold. At the same time we unable to get exclusive access to array (to stop array in our case) because another external application still use this array. Fix makes possible to return IO with errors immediately. So external application can finish working with array and give exclusive access to other applications to perform required management actions with array. Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * MD: hold mddev lock to change bitmap locationShaohua Li2016-08-061-14/+33
| | | | | | | | | | | | | | | | Changing the location changes a lot of things. Holding the lock to avoid race. This makes the .quiesce called with mddev lock hold too. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * raid5: fix incorrectly counter of conf->empty_inactive_list_nrZhengYuan Liu2016-08-021-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The counter conf->empty_inactive_list_nr is only used for determine if the raid5 is congested which is deal with in function raid5_congested(). It was increased in get_free_stripe() when conf->inactive_list got to be empty and decreased in release_inactive_stripe_list() when splice temp_inactive_list to conf->inactive_list. However, this may have a problem when raid5_get_active_stripe or stripe_add_to_batch_list was called, because these two functions may call list_del_init(&sh->lru) to delete sh from "conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be done at these two point and increase empty_inactive_list_nr accordingly. Otherwise the counter may get to be negative number which would influence async readahead from VFS. Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
| * raid10: increment write counter after bio is splitTomasz Majchrzak2016-07-301-2/+2
| | | | | | | | | | | | | | | | | | | | md pending write counter must be incremented after bio is split, otherwise it gets decremented too many times in end bio callback and becomes negative. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | Merge tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2016-08-3018-78/+244
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - Fix a refcount leak in nfs_callback_up_net - Fix an Oopsable condition when the flexfile pNFS driver connection to the DS fails - Fix an Oopsable condition in NFSv4.1 server callback races - Ensure pNFS clients stop doing I/O to the DS if their lease has expired, as required by the NFSv4.1 protocol Bugfixes: - Fix potential looping in the NFSv4.x migration code - Patch series to close callback races for OPEN, LAYOUTGET and LAYOUTRETURN - Silence WARN_ON when NFSv4.1 over RDMA is in use - Fix a LAYOUTCOMMIT race in the pNFS/blocks client - Fix pNFS timeout issues when the DS fails" * tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.x: Fix a refcount leak in nfs_callback_up_net NFS4: Avoid migration loops pNFS/flexfiles: Fix an Oopsable condition when connection to the DS fails NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequence NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURN NFSv4.1: Defer bumping the slot sequence number until we free the slot NFSv4.1: Delay callback processing when there are referring triples NFSv4.1: Fix Oopsable condition in server callback races SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in use pnfs/blocklayout: update last_write_offset atomically with extents pNFS: The client must not do I/O to the DS if it's lease has expired pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT calls pNFS/flexfiles: Set reasonable default retrans values for the data channel NFS: Allow the mount option retrans=0 pNFS/flexfiles: Fix layoutstat periodic reporting
| * | NFSv4.x: Fix a refcount leak in nfs_callback_up_netTrond Myklebust2016-08-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | On error, the callers expect us to return without bumping nn->cb_users[]. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.7+
| * | NFS4: Avoid migration loopsBenjamin Coddington2016-08-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a server returns itself as a location while migrating, the client may end up getting stuck attempting to migrate twice to the same server. Catch this by checking if the nfs_client found is the same as the existing client. For the other two callers to nfs4_set_client, the nfs_client will always be ERR_PTR(-EINVAL). Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | pNFS/flexfiles: Fix an Oopsable condition when connection to the DS failsTrond Myklebust2016-08-292-28/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the attempt to connect to a DS fails inside ff_layout_pg_init_read or ff_layout_pg_init_write, then we currently end up clearing the layout segment carried by the struct nfs_pageio_descriptor, causing an Oops when we later call into ff_layout_read_pagelist/ff_layout_write_pagelist. The fix is to ensure we return the layout and then retry. Fixes: 446ca2195303 ("pNFS/flexfiles: When initing reads or writes, we...") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequenceTrond Myklebust2016-08-281-1/+0
| | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURNTrond Myklebust2016-08-281-13/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Defer freeing the slot until after we have processed the results from OPEN and LAYOUTGET. This means that the server can rely on the mechanism in RFC5661 Section 2.10.6.3 to ensure that replies to an OPEN or LAYOUTGET/RETURN RPC call don't race with the callbacks that apply to them. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | NFSv4.1: Defer bumping the slot sequence number until we free the slotTrond Myklebust2016-08-282-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For operations like OPEN or LAYOUTGET, which return recallable state (i.e. delegations and layouts) we want to enable the mechanism for resolving recall races in RFC5661 Section 2.10.6.3. To do so, we will want to defer bumping the slot's sequence number until we have finished processing the RPC results. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | NFSv4.1: Delay callback processing when there are referring triplesTrond Myklebust2016-08-284-4/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If CB_SEQUENCE tells us that the processing of this request depends on the completion of one or more referring triples (see RFC 5661 Section 2.10.6.3), delay the callback processing until after the RPC requests being referred to have completed. If we end up delaying for more than 1/2 second, then fall back to returning NFS4ERR_DELAY in reply to the callback. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | NFSv4.1: Fix Oopsable condition in server callback racesTrond Myklebust2016-08-283-4/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | The slot table hasn't been an array since v3.7. Ensure that we use nfs4_lookup_slot() to access the slot correctly. Fixes: 87dda67e7386 ("NFSv4.1: Allow SEQUENCE to resize the slot table...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.8+
| * | SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in useChuck Lever2016-08-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using NFSv4.1 on RDMA should be safe, so broaden the new checks in rpc_create(). WARN_ON_ONCE is used, matching most other WARN call sites in clnt.c. Fixes: 39a9beab5acb ("rpc: share one xps between all backchannels") Fixes: d50039ea5ee6 ("nfsd4/rpc: move backchannel create logic...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | pnfs/blocklayout: update last_write_offset atomically with extentsBenjamin Coddington2016-08-233-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block/SCSI layout write completion may add committable extents to the extent tree before updating the layout's last-written byte under the inode lock. If a sync happens before this value is updated, then prepare_layoutcommit may find and encode these extents which would produce a LAYOUTCOMMIT request whose encoded extents are larger than the request's loca_length. Fix this by using a last-written byte value that is updated atomically with the extent tree so that commitable extents always match. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | pNFS: The client must not do I/O to the DS if it's lease has expiredTrond Myklebust2016-08-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that the client conforms to the normative behaviour described in RFC5661 Section 12.7.2: "If a client believes its lease has expired, it MUST NOT send I/O to the storage device until it has validated its lease." So ensure that we wait for the lease to be validated before using the layout. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.20+
| * | pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT callsTrond Myklebust2016-08-192-6/+29
| | | | | | | | | | | | | | | | | | We normally want to update the stateid and then retry, Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | pNFS/flexfiles: Set reasonable default retrans values for the data channelTrond Myklebust2016-08-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to this patch, the retrans value was set at 5, meaning that we could see a maximum retransmission timeout value of more than 6 minutes. That's a tad high for NFSv3 where the protocol does allow the server to drop requests at any time. Since this is a data channel, let's just set retrans to 0, and the default timeout to 60s. The user can continue to adjust these defaults using the dataserver_retrans and dataserver_timeo module parameters. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | NFS: Allow the mount option retrans=0Trond Myklebust2016-08-163-8/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should allow retrans=0 as just meaning that every timeout is a major timeout, and that there is no increment in the timeout value. For instance, this means that we would allow TCP users to specify a flat timeout value of 60s, by specifying "timeo=600,retrans=0" in their mount option string. Siged-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | pNFS/flexfiles: Fix layoutstat periodic reportingTrond Myklebust2016-08-152-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Putting the periodicity timer in the mirror instances is causing non-scalable reporting behaviour and missed reporting intervals. When you recall layouts and/or implement client side mirroring, it leads to consecutive reports with only a few ms between RPC calls. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Fixes: d0379a5d066a9 ("pNFS/flexfiles: Support server-supplied...")
* | | mm/usercopy: get rid of CONFIG_DEBUG_STRICT_USER_COPY_CHECKSJosh Poimboeuf2016-08-3019-128/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are three usercopy warnings which are currently being silenced for gcc 4.6 and newer: 1) "copy_from_user() buffer size is too small" compile warning/error This is a static warning which happens when object size and copy size are both const, and copy size > object size. I didn't see any false positives for this one. So the function warning attribute seems to be working fine here. Note this scenario is always a bug and so I think it should be changed to *always* be an error, regardless of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS. 2) "copy_from_user() buffer size is not provably correct" compile warning This is another static warning which happens when I enable __compiletime_object_size() for new compilers (and CONFIG_DEBUG_STRICT_USER_COPY_CHECKS). It happens when object size is const, but copy size is *not*. In this case there's no way to compare the two at build time, so it gives the warning. (Note the warning is a byproduct of the fact that gcc has no way of knowing whether the overflow function will be called, so the call isn't dead code and the warning attribute is activated.) So this warning seems to only indicate "this is an unusual pattern, maybe you should check it out" rather than "this is a bug". I get 102(!) of these warnings with allyesconfig and the __compiletime_object_size() gcc check removed. I don't know if there are any real bugs hiding in there, but from looking at a small sample, I didn't see any. According to Kees, it does sometimes find real bugs. But the false positive rate seems high. 3) "Buffer overflow detected" runtime warning This is a runtime warning where object size is const, and copy size > object size. All three warnings (both static and runtime) were completely disabled for gcc 4.6 with the following commit: 2fb0815c9ee6 ("gcc4: disable __compiletime_object_size for GCC 4.6+") That commit mistakenly assumed that the false positives were caused by a gcc bug in __compiletime_object_size(). But in fact, __compiletime_object_size() seems to be working fine. The false positives were instead triggered by #2 above. (Though I don't have an explanation for why the warnings supposedly only started showing up in gcc 4.6.) So remove warning #2 to get rid of all the false positives, and re-enable warnings #1 and #3 by reverting the above commit. Furthermore, since #1 is a real bug which is detected at compile time, upgrade it to always be an error. Having done all that, CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is no longer needed. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Nilay Vaish <nilayvaish@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for-4.8-fixes' of ↵Linus Torvalds2016-08-302-2/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata fixes from Tejun Heo: "Two libata driver specific fixes for v4.8-rc4. Nothing too scary" * 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: pata_ninja32: Avoid corrupting status flags ahci: disable correct irq for dummy ports
| * | | pata_ninja32: Avoid corrupting status flagsAlan Cox2016-08-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ninja32 needs to set some flags to indicate it does 32bit IO. However it currently assigns this which loses the initializing flag and causes a warning spew. Fix it to use a logical or as is intended. Signed-off-by: Alan Cox <alan@linux.intel.com> Tested-by: Ellmar Stelnberger <estellnb@elstel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | ahci: disable correct irq for dummy portsChristoph Hellwig2016-08-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | irq already contains the interrupt number for the port, don't add the port index to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: d684a90d38e2 ("ahci: per-port msix support") Cc: stable@vger.kernel.org v4.5+
* | | | Merge branch 'for-4.8-fixes' of ↵Linus Torvalds2016-08-302-2/+17
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two fixes for cgroup. - There still was a hole in enforcing cpuset rules, fixed by Li. - The recent switch to global percpu_rwseom for threadgroup locking revealed a couple issues in how percpu_rwsem is implemented and used by cgroup. Balbir found that the read locking section was too wide unnecessarily including operations which can often depend on IOs. With percpu_rwsem updates (coming through a different tree) and reduction of read locking section, all the reported locking latency issues, including the android one, are resolved. It looks like we can keep global percpu_rwsem locking for now. If there actually are cases which can't be resolved, we can go back to more complex per-signal_struct locking" * 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork cpuset: make sure new tasks conform to the current config of the cpuset
| * | | | cgroup: reduce read locked section of cgroup_threadgroup_rwsem during forkBalbir Singh2016-08-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_threadgroup_rwsem is acquired in read mode during process exit and fork. It is also grabbed in write mode during __cgroups_proc_write(). I've recently run into a scenario with lots of memory pressure and OOM and I am beginning to see systemd __switch_to+0x1f8/0x350 __schedule+0x30c/0x990 schedule+0x48/0xc0 percpu_down_write+0x114/0x170 __cgroup_procs_write.isra.12+0xb8/0x3c0 cgroup_file_write+0x74/0x1a0 kernfs_fop_write+0x188/0x200 __vfs_write+0x6c/0xe0 vfs_write+0xc0/0x230 SyS_write+0x6c/0x110 system_call+0x38/0xb4 This thread is waiting on the reader of cgroup_threadgroup_rwsem to exit. The reader itself is under memory pressure and has gone into reclaim after fork. There are times the reader also ends up waiting on oom_lock as well. __switch_to+0x1f8/0x350 __schedule+0x30c/0x990 schedule+0x48/0xc0 jbd2_log_wait_commit+0xd4/0x180 ext4_evict_inode+0x88/0x5c0 evict+0xf8/0x2a0 dispose_list+0x50/0x80 prune_icache_sb+0x6c/0x90 super_cache_scan+0x190/0x210 shrink_slab.part.15+0x22c/0x4c0 shrink_zone+0x288/0x3c0 do_try_to_free_pages+0x1dc/0x590 try_to_free_pages+0xdc/0x260 __alloc_pages_nodemask+0x72c/0xc90 alloc_pages_current+0xb4/0x1a0 page_table_alloc+0xc0/0x170 __pte_alloc+0x58/0x1f0 copy_page_range+0x4ec/0x950 copy_process.isra.5+0x15a0/0x1870 _do_fork+0xa8/0x4b0 ppc_clone+0x8/0xc In the meanwhile, all processes exiting/forking are blocked almost stalling the system. This patch moves the threadgroup_change_begin from before cgroup_fork() to just before cgroup_canfork(). There is no nee to worry about threadgroup changes till the task is actually added to the threadgroup. This avoids having to call reclaim with cgroup_threadgroup_rwsem held. tj: Subject and description edits. Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Zefan Li <lizefan@huawei.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | cpuset: make sure new tasks conform to the current config of the cpusetZefan Li2016-08-101-0/+15
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A new task inherits cpus_allowed and mems_allowed masks from its parent, but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems before this new task is inserted into the cgroup's task list, the new task won't be updated accordingly. Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
* | | | Merge tag 'hwmon-for-linus-v4.8-rc5' of ↵Linus Torvalds2016-08-301-0/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon fix from Guenter Roeck: "Add missing sysfs attribute group terminator to it87 driver" * tag 'hwmon-for-linus-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: (it87) Add missing sysfs attribute group terminator
| * | | | hwmon: (it87) Add missing sysfs attribute group terminatorJean Delvare2016-08-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Attribute array it87_attributes_in lacks its NULL terminator, causing random behavior when operating on the attribute group. Fixes: 52929715634a ("hwmon: (it87) Use is_visible for voltage sensors") Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: stable@vger.kernel.org Signed-off-by: Guenter Roeck <linux@roeck-us.net>
* | | | | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2016-08-294-25/+49
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix bugs that could cause kernel deadlocks or file system corruption while moving xattrs to expand the extended inode. Also add some sanity checks to the block group descriptors to make sure we don't end up overwriting the superblock" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: avoid deadlock when expanding inode size ext4: properly align shifted xattrs when expanding inodes ext4: fix xattr shifting when expanding inodes part 2 ext4: fix xattr shifting when expanding inodes ext4: validate that metadata blocks do not overlap superblock ext4: reserve xattr index for the Hurd
| * | | | | ext4: avoid deadlock when expanding inode sizeJan Kara2016-08-112-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we need to move xattrs into external xattr block, we call ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end up calling ext4_mark_inode_dirty() again which will recurse back into the inode expansion code leading to deadlocks. Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move its management into ext4_expand_extra_isize_ea() since its manipulation is safe there (due to xattr_sem) from possible races with ext4_xattr_set_handle() which plays with it as well. CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: properly align shifted xattrs when expanding inodesJan Kara2016-08-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We did not count with the padding of xattr value when computing desired shift of xattrs in the inode when expanding i_extra_isize. As a result we could create unaligned start of inline xattrs. Account for alignment properly. CC: stable@vger.kernel.org # 4.4.x- Signed-off-by: Jan Kara <jack@suse.cz>
| * | | | | ext4: fix xattr shifting when expanding inodes part 2Jan Kara2016-08-111-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When multiple xattrs need to be moved out of inode, we did not properly recompute total size of xattr headers in the inode and the new header position. Thus when moving the second and further xattr we asked ext4_xattr_shift_entries() to move too much and from the wrong place, resulting in possible xattr value corruption or general memory corruption. CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: fix xattr shifting when expanding inodesJan Kara2016-08-111-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code in ext4_expand_extra_isize_ea() treated new_extra_isize argument sometimes as the desired target i_extra_isize and sometimes as the amount by which we need to grow current i_extra_isize. These happen to coincide when i_extra_isize is 0 which used to be the common case and so nobody noticed this until recently when we added i_projid to the inode and so i_extra_isize now needs to grow from 28 to 32 bytes. The result of these bugs was that we sometimes unnecessarily decided to move xattrs out of inode even if there was enough space and we often ended up corrupting in-inode xattrs because arguments to ext4_xattr_shift_entries() were just wrong. This could demonstrate itself as BUG_ON in ext4_xattr_shift_entries() triggering. Fix the problem by introducing new isize_diff variable and use it where appropriate. CC: stable@vger.kernel.org # 4.4.x Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: validate that metadata blocks do not overlap superblockTheodore Ts'o2016-08-011-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A number of fuzzing failures seem to be caused by allocation bitmaps or other metadata blocks being pointed at the superblock. This can cause kernel BUG or WARNings once the superblock is overwritten, so validate the group descriptor blocks to make sure this doesn't happen. Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | ext4: reserve xattr index for the HurdTheodore Ts'o2016-08-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Hurd is using inode fields which restricts it from using more advanced ext4 file system features, due to design choices made over a decade ago. By giving the Hurd an extended attribute index field we allow it to move the translator and author fields out of the core inode fields, and hopefully we can get rid of ugly hacks such as EXT4_OS_HURD and EXT4_MOUNT2_HURD_COMPAT somday. For more information please see: https://summerofcode.withgoogle.com/projects/#5869799859027968 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2016-08-29100-453/+813
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Segregate namespaces properly in conntrack dumps, from Liping Zhang. 2) tcp listener refcount fix in netfilter tproxy, from Eric Dumazet. 3) Fix timeouts in qed driver due to xmit_more, from Yuval Mintz. 4) Fix use-after-free in tcp_xmit_retransmit_queue(). 5) Userspace header fixups (use of __u32, missing includes, etc.) from Mikko Rapeli. 6) Further refinements to fragmentation wrt gso and tunnels, from Shmulik Ladkani. 7) Trigger poll correctly for zero length UDP packets, from Eric Dumazet. 8) TCP window scaling fix, also from Eric Dumazet. 9) SLAB_DESTROY_BY_RCU is not relevant any more for UDP sockets. 10) Module refcount leak in qdisc_create_dflt(), from Eric Dumazet. 11) Fix deadlock in cp_rx_poll() of 8139cp driver, from Gao Feng. 12) Memory leak in rhashtable's alloc_bucket_locks(), from Eric Dumazet. 13) Add new device ID to alx driver, from Owen Lin. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits) Add Killer E2500 device ID in alx driver. net: smc91x: fix SMC accesses Documentation: networking: dsa: Remove platform device TODO net/mlx5: Increase number of ethtool steering priorities net/mlx5: Add error prints when validate ETS failed net/mlx5e: Fix memory leak if refreshing TIRs fails net/mlx5e: Add ethtool counter for TX xmit_more net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ net/mlx5e: Don't wait for SQ completions on close net/mlx5e: Don't post fragmented MPWQE when RQ is disabled net/mlx5e: Don't wait for RQ completions on close net/mlx5e: Limit UMR length to the device's limitation rhashtable: fix a memory leak in alloc_bucket_locks() sfc: fix potential stack corruption from running past stat bitmask team: loadbalance: push lacpdus to exact delivery net: hns: dereference ppe_cb->ppe_common_cb if it is non-null 8139cp: Fix one possible deadloop in cp_rx_poll i40e: Change some init flow for the client Revert "phy: IRQ cannot be shared" net: dsa: bcm_sf2: Fix race condition while unmasking interrupts ...