linux - linux

	Commit message (Collapse)	Author	Files	Lines
2012-12-05	cifs: don't override the uid/gid in getattr when cifsacl is enabled	Jeff Layton	1	-3/+4
	If we're using cifsacl, then we don't want to override the uid/gid with the current uid/gid, since that would prevent you from being able to upcall for this info. Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: remove uneeded __KERNEL__ block from cifsacl.h	Jeff Layton	2	-7/+2
	...and make those symbols static in cifsacl.c. Nothing outside of that file refers to them. Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: fix the format specifiers in sid_to_str	Jeff Layton	1	-7/+4
	The format specifiers are for signed values, but these are unsigned. Given that '-' is a delimiter between fields, I don't think you'd get what you'd expect if you got a value here that would overflow the sign bit. The version and authority fields are 8 bit values so use a "hh" length modifier there. The subauths are 32 bit values, so there's no need to use a "l" length modifier there. Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: redefine NUM_SUBAUTH constant from 5 to 15	Jeff Layton	2	-6/+19
	According to several places on the Internet and the samba winbind code, this is hard limited to 15 in windows, not 5. This does balloon out the allocation of each by 40 bytes, but I don't see any alternative. Also, rename it to SID_MAX_SUB_AUTHORITIES to match the alleged name of this constant in the windows header files Finally, rename SIDLEN to SID_STRING_MAX, fix the value to reflect the change to SID_MAX_SUB_AUTHORITIES and document how it was determined. Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: make cifs_copy_sid handle a source sid with variable size subauth arrays	Jeff Layton	2	-2/+11
	...and lift the restriction in id_to_sid upcall that the size must be at least as big as a full cifs_sid. Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: make compare_sids static	Jeff Layton	2	-50/+50
	..nothing outside of cifsacl.c calls it. Also fix the incorrect comment on the function. It returns 0 when they match. Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: use the NUM_AUTHS and NUM_SUBAUTHS constants in cifsacl code	Jeff Layton	2	-5/+5
	...instead of hardcoding in '5' and '6' all over the place. Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: move num_subauth check inside of CONFIG_CIFS_DEBUG2 check in parse_sid()	Jeff Layton	1	-2/+2
	Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: clean up id_mode_to_cifs_acl	Jeff Layton	1	-28/+25
	Add a label we can goto on error, and get rid of some excess indentation. Also move to kernel-style comments. Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	cifs: fix types on module parameters	Jeff Layton	1	-6/+5
	Most of these are unsigned ints, so we should be passing "uint" to module_param. Also, get rid of the extra "(bool)" in the description of enable_oplocks. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-05	default authentication needs to be at least ntlmv2 security for cifs mounts	Steve French	2	-11/+1
	We had planned to upgrade to ntlmv2 security a few releases ago, and have been warning users in dmesg on mount about the impending upgrade, but had to make a change (to use nltmssp with ntlmv2) due to testing issues with some non-Windows, non-Samba servers. The approach in this patch is simpler than earlier patches, and changes the default authentication mechanism to ntlmv2 password hashes (encapsulated in ntlmssp) from ntlm (ntlm is too weak for current use and ntlmv2 has been broadly supported for many, many years). Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2012-12-05	vfs: clear to the end of the buffer on partial buffer reads	Dan Carpenter	1	-1/+1
	READ is zero so the "rw & READ" test is always false. The intended test was "((rw & RW_MASK) == READ)". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-04	vfs: avoid "attempt to access beyond end of device" warnings	Linus Torvalds	1	-0/+52
	The block device access simplification that avoided accessing the (racy) block size information (commit bbec0270bdd8: "blkdev_max_block: make private to fs/buffer.c") no longer checks the maximum block size in the block mapping path. That was _almost_ as simple as just removing the code entirely, because the readers and writers all check the size of the device anyway, so under normal circumstances it "just worked". However, the block size may be such that the end of the device may straddle one single buffer_head. At which point we may still want to access the end of the device, but the buffer we use to access it partially extends past the end. The 'bd_set_size()' function intentionally sets the block size to avoid this, but mounting the device - or setting the block size by hand to some other value - can modify that block size. So instead, teach 'submit_bh()' about the special case of the buffer head straddling the end of the device, and turning such an access into a smaller IO access, avoiding the problem. This, btw, also means that unlike before, we can now access the whole device regardless of device block size setting. So now, even if the device size is only 512-byte aligned, we can read and write even the last sector even when having a much bigger block size for accessing the rest of the device. So with this, we could now get rid of the 'bd_set_size()' block size code entirely - resulting in faster IO for the common case - but that would be a separate patch. Reported-and-tested-by: Romain Francoise <romain@orebokech.com> Reporeted-and-tested-by: Meelis Roos <mroos@linux.ee> Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-04	workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s	Tejun Heo	1	-2/+2
	8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in megaraid - it allocated work_struct, casted it to delayed_work and then pass that into queue_delayed_work(). Previously, this was okay because 0 @delay short-circuited to queue_work() before doing anything with delayed_work. 8852aac25e moved 0 @delay test into __queue_delayed_work() after sanity check on delayed_work making megaraid trigger BUG_ON(). Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix BUG_ON() from incorrect use of delayed work"), this patch converts BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such abusers, if there are more, trigger warning but don't crash the machine. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Xiaotian Feng <xtfeng@gmail.com>
2012-12-04	megaraid: fix BUG_ON() from incorrect use of delayed work	Xiaotian Feng	2	-9/+7
	megaraid use INIT_WORK to declare a hotplug_work, but cast the hotplug_work from work_struct to delayed_work and schedule_delayed_work on it. This is very dangerous, as other part of delayed_work might be kernel memories allocated by others. With commit 8852aac ("workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay"), schedule_delayed_work() will check dwork->timer before queue_work even when @delay is 0, this causes megaraid code to hit the BUG_ON() in workqueue code. Change megaraid code to use delayed work. Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Neela Syam Kolli <megaraidlinux@lsi.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: linux-scsi@vger.kernel.org
2012-12-04	UBI: dont call ubi_self_check_all_ff() in __wl_get_peb()	Richard Weinberger	1	-9/+9
	As ubi_self_check_all_ff() might sleep we are not allowed to call it from atomic context. For now we call it only from ubi_wl_get_peb(). There are some code paths where it would also make sense, but these paths are currently atomic and only enabled when fastmap is used. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-12-04	UBI: remove PEB from free tree in get_peb_for_wl()	Richard Weinberger	1	-1/+7
	If UBI is built without fastmap, get_peb_for_wl() has to remove the PEB manially from the free tree. Otherwise the requested PEB lives in two trees. Reported-by: Zach Sadecki <zsadecki@itwatchdogs.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-12-03	sparc: Fix piggyback with newer binutils.	David S. Miller	1	-6/+6
	Newer versions of binutils mark '_end' as 'B' instead of 'A' for whatever reason. To be honest, the piggyback code doesn't actually care what kind of symbol _start and _end are, it just wants to find them and record the address. So remove the type from the match strings. Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-03	Linux 3.7-rc8v3.7-rc8	Linus Torvalds	1	-1/+1

2012-12-03	sparc64: exit_group should kill register windows just like plain exit.	David S. Miller	3	-4/+14
	Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-03	[parisc] open(2) compat bug	Al Viro	1	-1/+1
	In commit 9d73fc2d641f ("open(2) compat fixes (s390, arm64)") I said: > > The usual rules for open()/openat()/open_by_handle_at() are > 1) native 32bit - don't force O_LARGEFILE in flags > 2) native 64bit - force O_LARGEFILE in flags > 3) compat on 64bit host - as for native 32bit > 4) native 32bit ABI for 64bit system (mips/n32, x86/x32) - as for native 64bit > > There are only two exceptions - s390 compat has open() forcing O_LARGEFILE and > arm64 compat has open_by_handle_at() doing the same thing. The same binaries > on native host (s390/31 and arm resp.) will not* force O_LARGEFILE, so IMO > both are emulation bugs. Three exceptions, actually - parisc open() is another case like that. Native 32bit won't force O_LARGEFILE, the same binary on parisc64 will. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-03	Revert "sched, autogroup: Stop going ahead if autogroup is disabled"	Mike Galbraith	2	-9/+0
	This reverts commit 800d4d30c8f20bd728e5741a3b77c4859a613f7c. Between commits 8323f26ce342 ("sched: Fix race in task_group()") and 800d4d30c8f2 ("sched, autogroup: Stop going ahead if autogroup is disabled"), autogroup is a wreck. With both applied, all you have to do to crash a box is disable autogroup during boot up, then reboot.. boom, NULL pointer dereference due to commit 800d4d30c8f2 not allowing autogroup to move things, and commit 8323f26ce342 making that the only way to switch runqueues: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90 Pid: 7047, comm: systemd-user-se Not tainted 3.6.8-smp #7 MEDIONPC MS-7502/MS-7502 RIP: effective_load.isra.43+0x50/0x90 Process systemd-user-se (pid: 7047, threadinfo ffff880221dde000, task ffff88022618b3a0) Call Trace: select_task_rq_fair+0x255/0x780 try_to_wake_up+0x156/0x2c0 wake_up_state+0xb/0x10 signal_wake_up+0x28/0x40 complete_signal+0x1d6/0x250 __send_signal+0x170/0x310 send_signal+0x40/0x80 do_send_sig_info+0x47/0x90 group_send_sig_info+0x4a/0x70 kill_pid_info+0x3a/0x60 sys_kill+0x97/0x1a0 ? vfs_read+0x120/0x160 ? sys_read+0x45/0x90 system_call_fastpath+0x16/0x1b Code: 49 0f af 41 50 31 d2 49 f7 f0 48 83 f8 01 48 0f 46 c6 48 2b 07 48 8b bf 40 01 00 00 48 85 ff 74 3a 45 31 c0 48 8b 8f 50 01 00 00 <48> 8b 11 4c 8b 89 80 00 00 00 49 89 d2 48 01 d0 45 8b 59 58 4c RIP [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90 RSP <ffff880221ddfbd8> CR2: 0000000000000000 Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@vger.kernel.org # 2.6.39+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-03	modsign: add symbol prefix to certificate list	James Hogan	1	-2/+2
	Add the arch symbol prefix (if applicable) to the asm definition of modsign_certificate_list and modsign_certificate_list_end. This uses the recently defined SYMBOL_PREFIX which is derived from CONFIG_SYMBOL_PREFIX. This fixes the build of module signing on the blackfin and metag architectures. Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: David Howells <dhowells@redhat.com> Cc: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-12-03	linux/kernel.h: define SYMBOL_PREFIX	James Hogan	1	-0/+7
	Define SYMBOL_PREFIX to be the same as CONFIG_SYMBOL_PREFIX if set by the architecture, or "" otherwise. This avoids the need for ugly #ifdefs whenever symbols are referenced in asm blocks. Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Joe Perches <joe@perches.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-12-02	open*(2) compat fixes (s390, arm64)	Al Viro	2	-2/+2
	The usual rules for open()/openat()/open_by_handle_at() are 1) native 32bit - don't force O_LARGEFILE in flags 2) native 64bit - force O_LARGEFILE in flags 3) compat on 64bit host - as for native 32bit 4) native 32bit ABI for 64bit system (mips/n32, x86/x32) - as for native 64bit There are only two exceptions - s390 compat has open() forcing O_LARGEFILE and arm64 compat has open_by_handle_at() doing the same thing. The same binaries on native host (s390/31 and arm resp.) will not force O_LARGEFILE, so IMO both are emulation bugs. Objections? The fix is obvious... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-02	8139cp: fix coherent mapping leak in error path.	françois romieu	1	-3/+8
	cp_open [...] rc = cp_alloc_rings(cp); if (rc) return rc; cp_alloc_rings [...] mem = dma_alloc_coherent(&cp->pdev->dev, CP_RING_BYTES, &cp->ring_dma, GFP_KERNEL); - cp_alloc_rings never frees the coherent mapping it allocates - neither do cp_open when cp_alloc_rings fails Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-02	tcp: fix crashes in do_tcp_sendpages()	Eric Dumazet	1	-9/+6
	Recent network changes allowed high order pages being used for skb fragments. This uncovered a bug in do_tcp_sendpages() which was assuming its caller provided an array of order-0 page pointers. We only have to deal with a single page in this function, and its order is irrelevant. Reported-by: Willy Tarreau <w@1wt.eu> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-02	workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay	Tejun Heo	1	-3/+11
	8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()") implemented mod_delayed_work[_on]() using the improved try_to_grab_pending(). The function is later used, among others, to replace [__]candel_delayed_work() + queue_delayed_work() combinations. Unfortunately, a delayed_work item w/ zero @delay is handled slightly differently by mod_delayed_work_on() compared to queue_delayed_work_on(). The latter skips timer altogether and directly queues it using queue_work_on() while the former schedules timer which will expire on the closest tick. This means, when @delay is zero, that [__]cancel_delayed_work() + queue_delayed_work_on() makes the target item immediately executable while mod_delayed_work_on() may induce delay of upto a full tick. This somewhat subtle difference breaks some of the converted users. e.g. block queue plugging uses delayed_work for deferred processing and uses mod_delayed_work_on() when the queue needs to be immediately unplugged. The above problem manifested as noticeably higher number of context switches under certain circumstances. The difference in behavior was caused by missing special case handling for 0 delay in mod_delayed_work_on() compared to queue_delayed_work_on(). Joonsoo Kim posted a patch to add it - ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1]. The patch was queued for 3.8 but it was described as optimization and I missed that it was a correctness issue. As both queue_delayed_work_on() and mod_delayed_work_on() use __queue_delayed_work() for queueing, it seems that the better approach is to move the 0 delay special handling to the function instead of duplicating it in mod_delayed_work_on(). Fix the problem by moving 0 delay special case handling from queue_delayed_work_on() to __queue_delayed_work(). This replaces Joonsoo's patch. [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU> Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr> LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu> LKML-Reference: <50A78AA9.5040904@iskon.hr> Cc: Joonsoo Kim <js1304@gmail.com>
2012-12-02	workqueue: exit rescuer_thread() as TASK_RUNNING	Mike Galbraith	1	-1/+3
	A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling off, never to be seen again. In the case where this occurred, an exiting thread hit reiserfs homebrew conditional resched while holding a mutex, bringing the box to its knees. PID: 18105 TASK: ffff8807fd412180 CPU: 5 COMMAND: "kdmflush" #0 [ffff8808157e7670] schedule at ffffffff8143f489 #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs] #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14 #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs] #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2 #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41 #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88 #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850 #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f [exception RIP: kernel_thread_helper] RIP: ffffffff8144a5c0 RSP: ffff8808157e7f58 RFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff8107af60 RDI: ffff8803ee491d18 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2012-11-30	x86, fpu: Avoid FPU lazy restore after suspend	Vincent Palatin	2	-6/+14
	When a cpu enters S3 state, the FPU state is lost. After resuming for S3, if we try to lazy restore the FPU for a process running on the same CPU, this will result in a corrupted FPU context. Ensure that "fpu_owner_task" is properly invalided when (re-)initializing a CPU, so nobody will try to lazy restore a state which doesn't exist in the hardware. Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off, by doing thousands of suspend/resume cycles with 4 processes doing FPU operations running. Without the patch, a process is killed after a few hundreds cycles by a SIGFPE. Cc: Duncan Laurie <dlaurie@chromium.org> Cc: Olof Johansson <olofj@chromium.org> Cc: <stable@kernel.org> v3.4+ # for 3.4 need to replace this_cpu_write by percpu_write Signed-off-by: Vincent Palatin <vpalatin@chromium.org> Link: http://lkml.kernel.org/r/1354306532-1014-1-git-send-email-vpalatin@chromium.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-30	drivers/rtc/rtc-tps65910.c: fix invalid pointer access on _remove()	Kim, Milo	1	-3/+3
	The tps65910_rtc data is registered as the platform driver data in _probe(= ). Therefore the tps65910_rtc should be used on unregistering the rtc device. And device pointer should be retrieved from the platform_device structure. This patch fixes the below oops: Unable to handle kernel NULL pointer dereference at virtual address 00000008 Modules linked in: rtc_tps65910(-) CPU: 0 Not tainted (3.7.0-rc7-next-20121128-g6b1f974-dirty #7) PC is at tps65910_rtc_alarm_irq_enable+0x20/0x2c [rtc_tps65910] (tps65910_rtc_alarm_irq_enable+0x20/0x2c [rtc_tps65910]) (tps65910_rtc_remove+0x18/0x28 [rtc_tps65910]) (platform_drv_remove+0x18/0x1c) (__device_release_driver+0x70/0xcc) (driver_detach+0xb4/0xb8) (bus_remove_driver+0x7c/0xc0) (sys_delete_module+0x148/0x21c) Signed-off-by: Milo(Woogyom) Kim <milo.kim@ti.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30	mm: soft offline: split thp at the beginning of soft_offline_page()	Naoya Horiguchi	1	-0/+8
	When we try to soft-offline a thp tail page, put_page() is called on the tail page unthinkingly and VM_BUG_ON is triggered in put_compound_page(). This patch splits thp before going into the main body of soft-offlining. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30	mm: avoid waking kswapd for THP allocations when compaction is deferred or ↵	Mel Gorman	1	-10/+27
	contended With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" reverted, Zdenek Kabelac reported the following Hmm, so it's just took longer to hit the problem and observe kswapd0 spinning on my CPU again - it's not as endless like before - but still it easily eats minutes - it helps to turn off Firefox or TB (memory hungry apps) so kswapd0 stops soon - and restart those apps again. (And I still have like >1GB of cached memory) kswapd0 R running task 0 30 2 0x00000000 Call Trace: preempt_schedule+0x42/0x60 _raw_spin_unlock+0x55/0x60 put_super+0x31/0x40 drop_super+0x22/0x30 prune_super+0x149/0x1b0 shrink_slab+0xba/0x510 The sysrq+m indicates the system has no swap so it'll never reclaim anonymous pages as part of reclaim/compaction. That is one part of the problem but not the root cause as file-backed pages could also be reclaimed. The likely underlying problem is that kswapd is woken up or kept awake for each THP allocation request in the page allocator slow path. If compaction fails for the requesting process then compaction will be deferred for a time and direct reclaim is avoided. However, if there are a storm of THP requests that are simply rejected, it will still be the the case that kswapd is awake for a prolonged period of time as pgdat->kswapd_max_order is updated each time. This is noticed by the main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead it will loopp, shrinking a small number of pages and calling shrink_slab() on each iteration. This patch defers when kswapd gets woken up for THP allocations. For !THP allocations, kswapd is always woken up. For THP allocations, kswapd is woken up iff the process is willing to enter into direct reclaim/compaction. [akpm@linux-foundation.org: fix typo in comment] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Glauber Costa <glommer@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30	revert "Revert "mm: remove __GFP_NO_KSWAPD""	Andrew Morton	4	-17/+10
	It apepars that this patch was innocent, and we hope that "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended" will fix the final kswapd-spinning cause. Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30	mm: vmscan: fix endless loop in kswapd balancing	Johannes Weiner	1	-9/+18
	Kswapd does not in all places have the same criteria for a balanced zone. Zones are only being reclaimed when their high watermark is breached, but compaction checks loop over the zonelist again when the zone does not meet the low watermark plus two times the size of the allocation. This gets kswapd stuck in an endless loop over a small zone, like the DMA zone, where the high watermark is smaller than the compaction requirement. Add a function, zone_balanced(), that checks the watermark, and, for higher order allocations, if compaction has enough free memory. Then use it uniformly to check for balanced zones. This makes sure that when the compaction watermark is not met, at least reclaim happens and progress is made - or the zone is declared unreclaimable at some point and skipped entirely. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: George Spelvin <linux@horizon.com> Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Reported-by: Tomas Racek <tracek@redhat.com> Tested-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30	mm/vmemmap: fix wrong use of virt_to_page	Jianguo Wu	1	-6/+4
	I enable CONFIG_DEBUG_VIRTUAL and CONFIG_SPARSEMEM_VMEMMAP, when doing memory hotremove, there is a kernel BUG at arch/x86/mm/physaddr.c:20. It is caused by free_section_usemap()->virt_to_page(), virt_to_page() is only used for kernel direct mapping address, but sparse-vmemmap uses vmemmap address, so it is going wrong here. ------------[ cut here ]------------ kernel BUG at arch/x86/mm/physaddr.c:20! invalid opcode: 0000 [#1] SMP Modules linked in: acpihp_drv acpihp_slot edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_intel ipv6 ixgbe igb iTCO_wdt i7core_edac edac_core pcspkr iTCO_vendor_support ioatdma microcode joydev sr_mod i2c_i801 dca lpc_ich mfd_core mdio tpm_tis i2c_core hid_generic tpm cdrom sg tpm_bios rtc_cmos button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod CPU 39 Pid: 6454, comm: sh Not tainted 3.7.0-rc1-acpihp-final+ #45 QCI QSSC-S4R/QSSC-S4R RIP: 0010:[<ffffffff8103c908>] [<ffffffff8103c908>] __phys_addr+0x88/0x90 RSP: 0018:ffff8804440d7c08 EFLAGS: 00010006 RAX: 0000000000000006 RBX: ffffea0012000000 RCX: 000000000000002c ... Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Reviewd-by: Wen Congyang <wency@cn.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30	mm: compaction: fix return value of capture_free_page()	Mel Gorman	1	-1/+1
	Commit ef6c5be658f6 ("fix incorrect NR_FREE_PAGES accounting (appears like memory leak)") fixes a NR_FREE_PAGE accounting leak but missed the return value which was also missed by this reviewer until today. That return value is used by compaction when adding pages to a list of isolated free pages and without this follow-up fix, there is a risk of free list corruption. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30	fix off-by-one in argument passed by iterate_fd() to callbacks	Al Viro	1	-6/+8
	Noticed by Pavel Roskin; the thing in his patch I disagree with was compensating for that shite in callbacks instead of fixing it once in the iterator itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-30	lookup_one_len: don't accept . and ..	Al Viro	1	-0/+5
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-30	cifs: get rid of blind d_drop() in readdir	Al Viro	1	-1/+4
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-30	nfs_lookup_revalidate(): fix a leak	Al Viro	1	-2/+2
	We are leaking fattr and fhandle if we decide that dentry is not to be invalidated, after all (e.g. happens to be a mountpoint). Just free both before that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-30	don't do blind d_drop() in nfs_prime_dcache()	Al Viro	1	-1/+2
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-30	blkdev_max_block: make private to fs/buffer.c	Linus Torvalds	3	-56/+14
	We really don't want to look at the block size for the raw block device accesses in fs/block-dev.c, because it may be changing from under us. So get rid of the max_block logic entirely, since the caller should already have done it anyway. That leaves the only user of this function in fs/buffer.c, so move the whole function there and make it static. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-29	direct-io: don't read inode->i_blkbits multiple times	Linus Torvalds	1	-3/+5
	Since directio can work on a raw block device, and the block size of the device can change under it, we need to do the same thing that fs/buffer.c now does: read the block size a single time, using ACCESS_ONCE(). Reading it multiple times can get different results, which will then confuse the code because it actually encodes the i_blksize in relationship to the underlying logical blocksize. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-29	blockdev: remove bd_block_size_semaphore again	Linus Torvalds	3	-106/+5
	This reverts the block-device direct access code to the previous unlocked code, now that fs/buffer.c no longer needs external locking. With this, fs/block_dev.c is back to the original version, apart from a whitespace cleanup that I didn't want to revert. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-29	fs/buffer.c: make block-size be per-page and protected by the page lock	Linus Torvalds	1	-31/+48
	This makes the buffer size handling be a per-page thing, which allows us to not have to worry about locking too much when changing the buffer size. If a page doesn't have buffers, we still need to read the block size from the inode, but we can do that with ACCESS_ONCE(), so that even if the size is changing, we get a consistent value. This doesn't convert all functions - many of the buffer functions are used purely by filesystems, which in turn results in the buffer size being fixed at mount-time. So they don't have the same consistency issues that the raw device access can have. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-29	bonding: fix race condition in bonding_store_slaves_active	nikolay@redhat.com	1	-0/+2
	Race between bonding_store_slaves_active() and slave manipulation functions. The bond_for_each_slave use in bonding_store_slaves_active() is not protected by any synchronization mechanism. NULL pointer dereference is easy to reach. Fixed by acquiring the bond->lock for the slave walk. v2: Make description text < 75 columns Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-29	bonding: make arp_ip_target parameter checks consistent with sysfs	nikolay@redhat.com	1	-2/+3
	The module can be loaded with arp_ip_target="255.255.255.255" which makes it impossible to remove as the function in sysfs checks for that value, so we make the parameter checks consistent with sysfs. v2: Fix formatting v3: Make description text < 75 columns Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-29	bonding: fix miimon and arp_interval delayed work race conditions	nikolay@redhat.com	2	-86/+36
	First I would give three observations which will be used later. Observation 1: if (delayed_work_pending(wq)) cancel_delayed_work(wq) This usage is wrong because the pending bit is cleared just before the work's fn is executed and if the function re-arms itself we might end up with the work still running. It's safe to call cancel_delayed_work_sync() even if the work is not queued at all. Observation 2: Use of INIT_DELAYED_WORK() Work needs to be initialized only once prior to (de/en)queueing. Observation 3: IFF_UP is set only after ndo_open is called Related race conditions: 1. Race between bonding_store_miimon() and bonding_store_arp_interval() Because of Obs.1 we can end up having both works enqueued. 2. Multiple races with INIT_DELAYED_WORK() Since the works are not protected by anything between INIT_DELAYED_WORK() and calls to (en/de)queue it is possible for races between the following functions: (races are also possible between the calls to INIT_DELAYED_WORK() and workqueue code) bonding_store_miimon() - bonding_store_arp_interval(), bond_close(), bond_open(), enqueued functions bonding_store_arp_interval() - bonding_store_miimon(), bond_close(), bond_open(), enqueued functions 3. By Obs.1 we need to change bond_cancel_all() Bugs 1 and 2 are fixed by moving all work initializations in bond_open which by Obs. 2 and Obs. 3 and the fact that we make sure that all works are cancelled in bond_close(), is guaranteed not to have any work enqueued. Also RTNL lock is now acquired in bonding_store_miimon/arp_interval so they can't race with bond_close and bond_open. The opposing work is cancelled only if the IFF_UP flag is set and it is cancelled unconditionally. The opposing work is already cancelled if the interface is down so no need to cancel it again. This way we don't need new synchronizations for the bonding workqueue. These bugs (and fixes) are tied together and belong in the same patch. Note: I have left 1 line intentionally over 80 characters (84) because I didn't like how it looks broken down. If you'd prefer it otherwise, then simply break it. v2: Make description text < 75 columns Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-29	remoteproc: fix error path of ->find_vqs	Ohad Ben-Cohen	1	-6/+12
	Eliminate an erroneous invocation of rproc_shutdown inside the error path of rproc_virtio_find_vqs. Reported-by: Ido Yariv <ido@wizery.com> Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>