summaryrefslogtreecommitdiffstats
path: root/ipc (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of ↵Linus Torvalds2014-08-101-3/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This is a bunch of small changes built against 3.16-rc6. The most significant change for users is the first patch which makes setns drmatically faster by removing unneded rcu handling. The next chunk of changes are so that "mount -o remount,.." will not allow the user namespace root to drop flags on a mount set by the system wide root. Aks this forces read-only mounts to stay read-only, no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec mounts to stay no exec and it prevents unprivileged users from messing with a mounts atime settings. I have included my test case as the last patch in this series so people performing backports can verify this change works correctly. The next change fixes a bug in NFS that was discovered while auditing nsproxy users for the first optimization. Today you can oops the kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever with pid namespaces. I rebased and fixed the build of the !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given that no one to my knowledge bases anything on my tree fixing the typo in place seems more responsible that requiring a typo-fix to be backported as well. The last change is a small semantic cleanup introducing /proc/thread-self and pointing /proc/mounts and /proc/net at it. This prevents several kinds of problemantic corner cases. It is a user-visible change so it has a minute chance of causing regressions so the change to /proc/mounts and /proc/net are individual one line commits that can be trivially reverted. Unfortunately I lost and could not find the email of the original reporter so he is not credited. From at least one perspective this change to /proc/net is a refgression fix to allow pthread /proc/net uses that were broken by the introduction of the network namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net proc: Implement /proc/thread-self to point at the directory of the current thread proc: Have net show up under /proc/<tgid>/task/<tid> NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes mnt: Add tests for unprivileged remount cases that have found to be faulty mnt: Change the default remount atime from relatime to the existing value mnt: Correct permission checks in do_remount mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount mnt: Only change user settable mount flags in remount namespaces: Use task_lock and not rcu to protect nsproxy
| * namespaces: Use task_lock and not rcu to protect nsproxyEric W. Biederman2014-07-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The synchronous syncrhonize_rcu in switch_task_namespaces makes setns a sufficiently expensive system call that people have complained. Upon inspect nsproxy no longer needs rcu protection for remote reads. remote reads are rare. So optimize for same process reads and write by switching using rask_lock instead. This yields a simpler to understand lock, and a faster setns system call. In particular this fixes a performance regression observed by Rafael David Tinoco <rafael.tinoco@canonical.com>. This is effectively a revert of Pavel Emelyanov's commit cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter from 2007. The race this originialy fixed no longer exists as do_notify_parent uses task_active_pid_ns(parent) instead of parent->nsproxy. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | shm: allow exit_shm in parallel if only marking orphansJack Miller2014-08-091-32/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If shm_rmid_force (the default state) is not set then the shmids are only marked as orphaned and does not require any add, delete, or locking of the tree structure. Seperate the sysctl on and off case, and only obtain the read lock. The newly added list head can be deleted under the read lock because we are only called with current and will only change the semids allocated by this task and not manipulate the list. This commit assumes that up_read includes a sufficient memory barrier for the writes to be seen my others that later obtain a write lock. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Jack Miller <millerjo@us.ibm.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | shm: make exit_shm work proportional to task activityJack Miller2014-08-091-11/+11
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is small set of patches our team has had kicking around for a few versions internally that fixes tasks getting hung on shm_exit when there are many threads hammering it at once. Anton wrote a simple test to cause the issue: http://ozlabs.org/~anton/junkcode/bust_shm_exit.c Before applying this patchset, this test code will cause either hanging tracebacks or pthread out of memory errors. After this patchset, it will still produce output like: root@somehost:~# ./bust_shm_exit 1024 160 ... INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113) INFO: Stall ended before state dump start ... But the task will continue to run along happily, so we consider this an improvement over hanging, even if it's a bit noisy. This patch (of 3): exit_shm obtains the ipc_ns shm rwsem for write and holds it while it walks every shared memory segment in the namespace. Thus the amount of work is related to the number of shm segments in the namespace not the number of segments that might need to be cleaned. In addition, this occurs after the task has been notified the thread has exited, so the number of tasks waiting for the ns shm rwsem can grow without bound until memory is exausted. Add a list to the task struct of all shmids allocated by this task. Init the list head in copy_process. Use the ns->rwsem for locking. Add segments after id is added, remove before removing from id. On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar to handling of semaphore undo. I chose a define for the init sequence since its a simple list init, otherwise it would require a function call to avoid include loops between the semaphore code and the task struct. Converting the list_del to list_del_init for the unshare cases would remove the exit followed by init, but I left it blow up if not inited. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Jack Miller <millerjo@us.ibm.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: convert use of typedef ctl_table to struct ctl_tableJoe Perches2014-06-072-13/+13
| | | | | | | | This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/sem.c: add a printk_once for semctl(GETNCNT/GETZCNT)Manfred Spraul2014-06-071-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT) always (since 0.99.10) reported a thread as sleeping on all semaphores that are listed in the semop() call. The documented behavior (both in the Linux man page and in the Single Unix Specification) is that a task should be reported on exactly one semaphore: The semaphore that caused the thread to got to sleep. This patch adds a pr_info_once() that is triggered if a thread hits the relevant case. The code triggers slightly too often, otherwise it would be necessary to replicate the old code. As there are no known users of GETNCNT or GETZCNT, this is done to prevent unnecessary bloat. The task that triggered is reported with name (tsk->comm) and pid. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/sem.c: make semctl(,,{GETNCNT,GETZCNT}) standard compliantManfred Spraul2014-06-071-21/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task waits on exactly one semaphore: The semaphore from the first operation in the sop array that cannot proceed. The Linux implementation never followed the standard, it tried to count all semaphores that might be the reason why a task sleeps. This patch fixes that. Note: a) The implementation assumes that GETNCNT and GETZCNT are rare operations, therefore the code counts them only on demand. (If they wouldn't be rare, then the non-compliance would have been found earlier) b) compared to the initial version of the patch, the BUG_ONs were removed and it was clarified that the new behavior conforms to SUS. Back-compatibility concerns: Manfred: : - there is no application in Fedora that uses GETNCNT or GETZCNT. : : - application that use only single-sop semop() are also safe, the : difference only affects complex apps. : : - portable application are also safe, the new behavior is standard : compliant. : : But that's it. The old behavior existed in Linux from 0.99.something : until now. Michael: : * These operations seem to be very little used. Grepping the public : source that is contained Fedora 20 source DVD, there appear to be no : uses. Of course, this says nothing about uses in private / : non-mainstream FOSS code, but it seems likely that the same pattern : is followed there. : : * The existing behavior is hard enough to understand that I suspect : that no one understood it well enough to rely on it anyway : (especially as that behavior contradicted both man page and POSIX). : : So, there's a chance of breakage, but I estimate that it's minute. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/sem.c: store which operation blocks in perform_atomic_semop()Manfred Spraul2014-06-071-0/+3
| | | | | | | | | | | | | Preparation for the next patch: In the slow-path of perform_atomic_semop(), store a pointer to the operation that caused the operation to block. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/sem.c: change perform_atomic_semop parametersManfred Spraul2014-06-071-19/+19
| | | | | | | | | | | | | | Right now, perform_atomic_semop gets the content of sem_queue as individual fields. Changes that, instead pass a pointer to sem_queue. This is a preparation for the next patch: it uses sem_queue to store the reason why a task must sleep. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/sem.c: remove code duplicationManfred Spraul2014-06-071-56/+51
| | | | | | | | | | | | | | count_semzcnt and count_semncnt are more of less identical. The patch creates a single function that either counts the number of tasks waiting for zero or waiting due to a decrease operation. Compared to the initial version, the BUG_ONs were removed. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/sem.c: bugfix for semctl(,,GETZCNT)Manfred Spraul2014-06-071-0/+10
| | | | | | | | | | | | | | | | | | GETZCNT is supposed to return the number of threads that wait until a semaphore value becomes 0. The current implementation overlooks complex operations that contain both wait-for-zero operation and operations that alter at least one semaphore. The patch fixes that. It's intentionally copy&paste, this will be cleaned up in the next patch. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc,msg: document volatile r_msgDavidlohr Bueso2014-06-071-3/+7
| | | | | | | | | | The need for volatile is not obvious, document it. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc,msg: move some msgq ns code aroundDavidlohr Bueso2014-06-071-69/+63
| | | | | | | | | | | | Nothing big and no logical changes, just get rid of some redundant function declarations. Move msg_[init/exit]_ns down the end of the file. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc,msg: use current->state helpersDavidlohr Bueso2014-06-071-2/+2
| | | | | | | | | | Call __set_current_state() instead of assigning the new state directly. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullif.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/shm.c: check for integer overflow during shmget.Manfred Spraul2014-06-071-0/+3
| | | | | | | | | | | | | | | | SHMMAX is the upper limit for the size of a shared memory segment, counted in bytes. The actual allocation is that size, rounded up to the next full page. Add a check that prevents the creation of segments where the rounded up size causes an integer overflow. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/shm.c: check for overflows of shm_totManfred Spraul2014-06-071-1/+2
| | | | | | | | | | | | | | | | | shm_tot counts the total number of pages used by shm segments. If SHMALL is ULONG_MAX (or nearly ULONG_MAX), then the number can overflow. Subsequent calls to shmctl(,SHM_INFO,) would return wrong values for shm_tot. The patch adds a detection for overflows. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/shm.c: check for ulong overflows in shmatManfred Spraul2014-06-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The increase of SHMMAX/SHMALL is a 4 patch series. The change itself is trivial, the only problem are interger overflows. The overflows are not new, but if we make huge values the default, then the code should be free from overflows. SHMMAX: - shmmem_file_setup places a hard limit on the segment size: MAX_LFS_FILESIZE. On 32-bit, the limit is > 1 TB, i.e. 4 GB-1 byte segments are possible. Rounded up to full pages the actual allocated size is 0. --> must be fixed, patch 3 - shmat: - find_vma_intersection does not handle overflows properly. --> must be fixed, patch 1 - the rest is fine, do_mmap_pgoff limits mappings to TASK_SIZE and checks for overflows (i.e.: map 2 GB, starting from addr=2.5GB fails). SHMALL: - after creating 8192 segments size (1L<<63)-1, shm_tot overflows and returns 0. --> must be fixed, patch 2. Userspace: - Obviously, there could be overflows in userspace. There is nothing we can do, only use values smaller than ULONG_MAX. I ended with "ULONG_MAX - 1L<<24": - TASK_SIZE cannot be used because it is the size of the current task. Could be 4G if it's a 32-bit task on a 64-bit kernel. - The maximum size is not standardized across archs: I found TASK_MAX_SIZE, TASK_SIZE_MAX and TASK_SIZE_64. - Just in case some arch revives a 4G/4G split, nearly ULONG_MAX is a valid segment size. - Using "0" as a magic value for infinity is even worse, because right now 0 means 0, i.e. fail all allocations. This patch (of 4): find_vma_intersection() does not work as intended if addr+size overflows. The patch adds a manual check before the call to find_vma_intersection. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc, kernel: clear whitespacePaul McQuade2014-06-075-28/+27
| | | | | | | | trailing whitespace Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc, kernel: use Linux headersPaul McQuade2014-06-075-5/+5
| | | | | | | | | Use #include <linux/uaccess.h> instead of <asm/uaccess.h> Use #include <linux/types.h> instead of <asm/types.h> Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: constify ipc_opsMathias Krause2014-06-075-20/+19
| | | | | | | | | | | | | | | | There is no need to recreate the very same ipc_ops structure on every kernel entry for msgget/semget/shmget. Just declare it static and be done with it. While at it, constify it as we don't modify the structure at runtime. Found in the PaX patch, written by the PaX Team. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: use device_initcallDavidlohr Bueso2014-04-083-3/+3
| | | | | | | | | ... since __initcall is now deprecated. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/compat.c: remove sc_semopm macroDavidlohr Bueso2014-04-081-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | This macro appears to have been introduced back in the 2.5 era for semtimedop32 backward compatibility on ia32: https://lkml.org/lkml/2003/4/28/78 Nowadays, this syscall in compat just defaults back to the code found in sem.c, so it is no longer used and can thus be removed: long compat_sys_semtimedop(int semid, struct sembuf __user *tsems, unsigned nsops, const struct compat_timespec __user *timeout) { struct timespec __user *ts64; if (compat_convert_timespec(&ts64, timeout)) return -EFAULT; return sys_semtimedop(semid, tsems, nsops, ts64); } Furthermore, there are no users in compat.c. After this change, kernel builds just fine with both CONFIG_SYSVIPC_COMPAT and CONFIG_SYSVIPC. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'x86-x32-for-linus' of ↵Linus Torvalds2014-04-022-25/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull compat time conversion changes from Peter Anvin: "Despite the branch name this is really neither an x86 nor an x32-specific patchset, although it the implementation of the discussions that followed the x32 security hole a few months ago. This removes get/put_compat_timespec/val() and replaces them with compat_get/put_timespec/val() which are savvy as to the current status of COMPAT_USE_64BIT_TIME. It removes several unused and/or incorrect/misleading functions (like compat_put_timeval_convert which doesn't in fact do any conversion) and also replaces several open-coded implementations what is now called compat_convert_timespec() with that function" * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: compat: Fix sparse address space warnings compat: Get rid of (get|put)_compat_time(val|spec)
| * compat: Get rid of (get|put)_compat_time(val|spec)H. Peter Anvin2014-02-022-25/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have two APIs for compatiblity timespec/val, with confusingly similar names. compat_(get|put)_time(val|spec) *do* handle the case where COMPAT_USE_64BIT_TIME is set, whereas (get|put)_compat_time(val|spec) do not. This is an accident waiting to happen. Clean it up by favoring the full-service version; the limited version is replaced with double-underscore versions static to kernel/compat.c. A common pattern is to convert a struct timespec to kernel format in an allocation on the user stack. Unfortunately it is open-coded in several places. Since this allocation isn't actually needed if COMPAT_USE_64BIT_TIME is true (since user format == kernel format) encapsulate that whole pattern into the function compat_convert_timespec(). An equivalent function should be written for struct timeval if it is needed in the future. Finally, get rid of compat_(get|put)_timeval_convert(): each was only used once, and the latter was not even doing what the function said (no conversion actually was being done.) Moving the conversion into compat_sys_settimeofday() itself makes the code much more similar to sys_settimeofday() itself. v3: Remove unused compat_convert_timeval(). v2: Drop bogus "const" in the destination argument for compat_convert_time*(). Cc: Mauro Carvalho Chehab <m.chehab@samsung.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Hans Verkuil <hans.verkuil@cisco.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | Merge branch 'compat' of ↵Linus Torvalds2014-03-312-22/+23
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 compat wrapper rework from Heiko Carstens: "S390 compat system call wrapper simplification work. The intention of this work is to get rid of all hand written assembly compat system call wrappers on s390, which perform proper sign or zero extension, or pointer conversion of compat system call parameters. Instead all of this should be done with C code eg by using Al's COMPAT_SYSCALL_DEFINEx() macro. Therefore all common code and s390 specific compat system calls have been converted to the COMPAT_SYSCALL_DEFINEx() macro. In order to generate correct code all compat system calls may only have eg compat_ulong_t parameters, but no unsigned long parameters. Those patches which change parameter types from unsigned long to compat_ulong_t parameters are separate in this series, but shouldn't cause any harm. The only compat system calls which intentionally have 64 bit parameters (preadv64 and pwritev64) in support of the x86/32 ABI haven't been changed, but are now only available if an architecture defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64. System calls which do not have a compat variant but still need proper zero extension on s390, like eg "long sys_brk(unsigned long brk)" will get a proper wrapper function with the new s390 specific COMPAT_SYSCALL_WRAPx() macro: COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk); which generates the following code (simplified): asmlinkage long sys_brk(unsigned long brk); asmlinkage long compat_sys_brk(long brk) { return sys_brk((u32)brk); } Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines includes both linux/syscall.h and linux/compat.h, it will generate build errors, if the declaration of sys_brk() doesn't match, or if there exists a non-matching compat_sys_brk() declaration. In addition this will intentionally result in a link error if somewhere else a compat_sys_brk() function exists, which probably should have been used instead. Two more BUILD_BUG_ONs make sure the size and type of each compat syscall parameter can be handled correctly with the s390 specific macros. I converted the compat system calls step by step to verify the generated code is correct and matches the previous code. In fact it did not always match, however that was always a bug in the hand written asm code. In result we get less code, less bugs, and much more sanity checking" * 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits) s390/compat: add copyright statement compat: include linux/unistd.h within linux/compat.h s390/compat: get rid of compat wrapper assembly code s390/compat: build error for large compat syscall args mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types ipc/compat: convert to COMPAT_SYSCALL_DEFINE fs/compat: convert to COMPAT_SYSCALL_DEFINE security/compat: convert to COMPAT_SYSCALL_DEFINE mm/compat: convert to COMPAT_SYSCALL_DEFINE net/compat: convert to COMPAT_SYSCALL_DEFINE kernel/compat: convert to COMPAT_SYSCALL_DEFINE fs/compat: optional preadv64/pwrite64 compat system calls ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t s390/compat: partial parameter conversion within syscall wrappers s390/compat: automatic zero, sign and pointer conversion of syscalls s390/compat: add sync_file_range and fallocate compat syscalls ...
| * | ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter typesHeiko Carstens2014-03-061-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | In order to allow the COMPAT_SYSCALL_DEFINE macro generate code that performs proper zero and sign extension convert all 64 bit parameters to their corresponding 32 bit compat counterparts. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
| * | ipc/compat: convert to COMPAT_SYSCALL_DEFINEHeiko Carstens2014-03-062-12/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert all compat system call functions where all parameter types have a size of four or less than four bytes, or are pointer types to COMPAT_SYSCALL_DEFINE. The implicit casts within COMPAT_SYSCALL_DEFINE will perform proper zero and sign extension to 64 bit of all parameters if needed. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
| * | ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_tHeiko Carstens2014-03-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the type of compat_sys_msgrcv's msgtyp parameter from long to compat_long_t, since compat user space passes only a 32 bit signed value. Let the compat wrapper do proper sign extension to 64 bit of this parameter. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
* | | ipc: Fix 2 bugs in msgrcv() MSG_COPY implementationMichael Kerrisk2014-03-161-0/+2
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While testing and documenting the msgrcv() MSG_COPY flag that Stanislav Kinsbursky added in commit 4a674f34ba04 ("ipc: introduce message queue copy feature" => kernel 3.8), I discovered a couple of bugs in the implementation. The two bugs concern MSG_COPY interactions with other msgrcv() flags, namely: (A) MSG_COPY + MSG_EXCEPT (B) MSG_COPY + !IPC_NOWAIT The bugs are distinct (and the fix for the first one is obvious), however my fix for both is a single-line patch, which is why I'm combining them in a single mail, rather than writing two mails+patches. ===== (A) MSG_COPY + MSG_EXCEPT ===== With the addition of the MSG_COPY flag, there are now two msgrcv() flags--MSG_COPY and MSG_EXCEPT--that modify the meaning of the 'msgtyp' argument in unrelated ways. Specifying both in the same call is a logical error that is currently permitted, with the effect that MSG_COPY has priority and MSG_EXCEPT is ignored. The call should give an error if both flags are specified. The patch below implements that behavior. ===== (B) (B) MSG_COPY + !IPC_NOWAIT ===== The test code that was submitted in commit 3a665531a3b7 ("selftests: IPC message queue copy feature test") shows MSG_COPY being used in conjunction with IPC_NOWAIT. In other words, if there is no message at the position 'msgtyp'. return immediately with the error in ENOMSG. What was not (fully) tested is the behavior if MSG_COPY is specified *without* IPC_NOWAIT, and there is an odd behavior. If the queue contains less than 'msgtyp' messages, then the call blocks until the next message is written to the queue. At that point, the msgrcv() call returns a copy of the newly added message, regardless of whether that message is at the ordinal position 'msgtyp'. This is clearly bogus, and problematic for applications that might want to make use of the MSG_COPY flag. I considered the following possible solutions to this problem: (1) Force the call to block until a message *does* appear at the position 'msgtyp'. (2) If the MSG_COPY flag is specified, the kernel should implicitly add IPC_NOWAIT, so that the call fails with ENOMSG for this case. (3) If the MSG_COPY flag is specified, but IPC_NOWAIT is not, generate an error (probably, EINVAL is the right one). I do not know if any application would really want to have the functionality of solution (1), especially since an application can determine in advance the number of messages in the queue using msgctl() IPC_STAT. Obviously, this solution would be the most work to implement. Solution (2) would have the effect of silently fixing any applications that tried to employ broken behavior. However, it would mean that if we later decided to implement solution (1), then user-space could not easily detect what the kernel supports (but, since I'm somewhat doubtful that solution (1) is needed, I'm not sure that this is much of a problem). Solution (3) would have the effect of informing broken applications that they are doing something broken. The downside is that this would cause a ABI breakage for any applications that are currently employing the broken behavior. However: a) Those applications are almost certainly not getting the results they expect. b) Possibly, those applications don't even exist, because MSG_COPY is currently hidden behind CONFIG_CHECKPOINT_RESTORE. The upside of solution (3) is that if we later decided to implement solution (1), user-space could determine what the kernel supports, via the error return. In my view, solution (3) is mildly preferable to solution (2), and solution (1) could still be done later if anyone really cares. The patch below implements solution (3). PS. For anyone out there still listening, it's the usual story: documenting an API (and the thinking about, and the testing of the API, that documentation entails) is the one of the single best ways of finding bugs in the API, as I've learned from a lot of experience. Best to do that documentation before releasing the API. Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: stable@vger.kernel.org Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* / ipc,mqueue: remove limits for the amount of system-wide queuesDavidlohr Bueso2014-02-262-9/+15
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 93e6f119c0ce ("ipc/mqueue: cleanup definition names and locations") added global hardcoded limits to the amount of message queues that can be created. While these limits are per-namespace, reality is that it ends up breaking userspace applications. Historically users have, at least in theory, been able to create up to INT_MAX queues, and limiting it to just 1024 is way too low and dramatic for some workloads and use cases. For instance, Madars reports: "This update imposes bad limits on our multi-process application. As our app uses approaches that each process opens its own set of queues (usually something about 3-5 queues per process). In some scenarios we might run up to 3000 processes or more (which of-course for linux is not a problem). Thus we might need up to 9000 queues or more. All processes run under one user." Other affected users can be found in launchpad bug #1155695: https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695 Instead of increasing this limit, revert it entirely and fallback to the original way of dealing queue limits -- where once a user's resource limit is reached, and all memory is used, new queues cannot be created. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reported-by: Madars Vitolins <m@silodev.com> Acked-by: Doug Ledford <dledford@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: fix compat msgrcv with negative msgtypMateusz Guzik2014-01-281-1/+1
| | | | | | | | | | | | | | | | Compat function takes msgtyp argument as u32 and passes it down to do_msgrcv which results in casting to long, thus the sign is lost and we get a big positive number instead. Cast the argument to signed type before passing it down. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Reported-by: Gabriellla Schmidt <gsc@bruker.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc,msg: document barriersDavidlohr Bueso2014-01-281-2/+17
| | | | | | | | | | | | | | | | | | Both expunge_all() and pipeline_send() rely on both a nil msg value and a full barrier to guarantee the correct ordering when waking up a task. While its counterpart at the receiving end is well documented for the lockless recv algorithm, we still need to document these specific smp_mb() calls. [akpm@linux-foundation.org: fix typo, per Mike] [akpm@linux-foundation.org: mroe tpyos] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: delete seq_max field in struct ipc_idsDavidlohr Bueso2014-01-282-11/+3
| | | | | | | | | | | | | | | | | | | | This field is only used to reset the ids seq number if it exceeds the smaller of INT_MAX/SEQ_MULTIPLIER and USHRT_MAX, and can therefore be moved out of the structure and into its own macro. Since each ipc_namespace contains a table of 3 pointers to struct ipc_ids we can save space in instruction text: text data bss dec hex filename 56232 2348 24 58604 e4ec ipc/built-in.o 56216 2348 24 58588 e4dc ipc/built-in.o-after Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Jonathan Gonzalez <jgonzalez@linets.cl> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: simplify sysvipc_proc_open() returnDavidlohr Bueso2014-01-281-5/+4
| | | | | | | | | | | Get rid of silly/useless label jumping. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: remove useless return statementDavidlohr Bueso2014-01-281-4/+0
| | | | | | | | | | | Only found in ipc_rmid(). Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: remove braces for single statementsDavidlohr Bueso2014-01-284-19/+16
| | | | | | | | | | | | Deal with checkpatch messages: WARNING: braces {} are not necessary for single statement blocks Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: standardize code commentsDavidlohr Bueso2014-01-283-139/+125
| | | | | | | | | | | | IPC commenting style is all over the place, *specially* in util.c. This patch orders things a bit. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: whitespace cleanupManfred Spraul2014-01-289-113/+113
| | | | | | | | | | | | | | | | | | | | | | | The ipc code does not adhere the typical linux coding style. This patch fixes lots of simple whitespace errors. - mostly autogenerated by scripts/checkpatch.pl -f --fix \ --types=pointer_location,spacing,space_before_tab - one manual fixup (keep structure members tab-aligned) - removal of additional space_before_tab that were not found by --fix Tested with some of my msg and sem test apps. Andrew: Could you include it in -mm and move it towards Linus' tree? Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Suggested-by: Li Bin <huawei.libin@huawei.com> Cc: Joe Perches <joe@perches.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: change kern_ipc_perm.deleted type to boolRafael Aquini2014-01-283-5/+5
| | | | | | | | | | | | | struct kern_ipc_perm.deleted is meant to be used as a boolean toggle, and the changes introduced by this patch are just to make the case explicit. Signed-off-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Greg Thelen <gthelen@google.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc: introduce ipc_valid_object() helper to sort out IPC_RMID racesRafael Aquini2014-01-284-19/+41
| | | | | | | | | | | | | | | | | | | | After the locking semantics for the SysV IPC API got improved, a couple of IPC_RMID race windows were opened because we ended up dropping the 'kern_ipc_perm.deleted' check performed way down in ipc_lock(). The spotted races got sorted out by re-introducing the old test within the racy critical sections. This patch introduces ipc_valid_object() to consolidate the way we cope with IPC_RMID races by using the same abstraction across the API implementation. Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Greg Thelen <gthelen@google.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc/sem.c: avoid overflow of semop undo (semadj) valuePetr Mladek2014-01-281-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When trying to understand semop code, I found a small mistake in the check for semadj (undo) value overflow. The new undo value is not stored immediately and next potential checks are done against the old value. The failing scenario is not much practical. One semop call has to do more operations on the same semaphore. Also semval and semadj must have different values, so there has to be some operations without SEM_UNDO flag. For example: struct sembuf depositor_op[1]; struct sembuf collector_op[2]; depositor_op[0].sem_num = 0; depositor_op[0].sem_op = 20000; depositor_op[0].sem_flg = 0; collector_op[0].sem_num = 0; collector_op[0].sem_op = -10000; collector_op[0].sem_flg = SEM_UNDO; collector_op[1].sem_num = 0; collector_op[1].sem_op = -10000; collector_op[1].sem_flg = SEM_UNDO; if (semop(semid, depositor_op, 1) == -1) { perror("Failed to do 1st deposit"); return 1; } if (semop(semid, collector_op, 2) == -1) { perror("Failed to do 1st collect"); return 1; } if (semop(semid, depositor_op, 1) == -1) { perror("Failed to do 2nd deposit"); return 1; } if (semop(semid, collector_op, 2) == -1) { perror("Failed to do 2nd collect"); return 1; } return 0; It passes without error now but the semadj value has overflown in the 2nd collector operation. [akpm@linux-foundation.org: restore lessened scope of local `undo'] [davidlohr@hp.com: correct header comment for perform_atomic_semop] Signed-off-by: Petr Mladek <pmladek@suse.cz> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc,shm: correct error return value in shmctl (SHM_UNLOCK)Jesper Nilsson2013-11-221-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl") restructured the ipc shm to shorten critical region, but introduced a path where the return value could be -EPERM, even if the operation actually was performed. Before the commit, the err return value was reset by the return value from security_shm_shmctl() after the if (!ns_capable(...)) statement. Now, we still exit the if statement with err set to -EPERM, and in the case of SHM_UNLOCK, it is not reset at all, and used as the return value from shmctl. To fix this, we only set err when errors occur, leaving the fallthrough case alone. Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> [3.12.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipc,shm: fix shm_file deletion racesGreg Thelen2013-11-221-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When IPC_RMID races with other shm operations there's potential for use-after-free of the shm object's associated file (shm_file). Here's the race before this patch: TASK 1 TASK 2 ------ ------ shm_rmid() ipc_lock_object() shmctl() shp = shm_obtain_object_check() shm_destroy() shum_unlock() fput(shp->shm_file) ipc_lock_object() shmem_lock(shp->shm_file) <OOPS> The oops is caused because shm_destroy() calls fput() after dropping the ipc_lock. fput() clears the file's f_inode, f_path.dentry, and f_path.mnt, which causes various NULL pointer references in task 2. I reliably see the oops in task 2 if with shmlock, shmu This patch fixes the races by: 1) set shm_file=NULL in shm_destroy() while holding ipc_object_lock(). 2) modify at risk operations to check shm_file while holding ipc_object_lock(). Example workloads, which each trigger oops... Workload 1: while true; do id=$(shmget 1 4096) shm_rmid $id & shmlock $id & wait done The oops stack shows accessing NULL f_inode due to racing fput: _raw_spin_lock shmem_lock SyS_shmctl Workload 2: while true; do id=$(shmget 1 4096) shmat $id 4096 & shm_rmid $id & wait done The oops stack is similar to workload 1 due to NULL f_inode: touch_atime shmem_mmap shm_mmap mmap_region do_mmap_pgoff do_shmat SyS_shmat Workload 3: while true; do id=$(shmget 1 4096) shmlock $id shm_rmid $id & shmunlock $id & wait done The oops stack shows second fput tripping on an NULL f_inode. The first fput() completed via from shm_destroy(), but a racing thread did a get_file() and queued this fput(): locks_remove_flock __fput ____fput task_work_run do_notify_resume int_signal Fixes: c2c737a0461e ("ipc,shm: shorten critical region for shmat") Fixes: 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl") Signed-off-by: Greg Thelen <gthelen@google.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> # 3.10.17+ 3.11.6+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds2013-11-133-15/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge first patch-bomb from Andrew Morton: "Quite a lot of other stuff is banked up awaiting further next->mainline merging, but this batch contains: - Lots of random misc patches - OCFS2 - Most of MM - backlight updates - lib/ updates - printk updates - checkpatch updates - epoll tweaking - rtc updates - hfs - hfsplus - documentation - procfs - update gcov to gcc-4.7 format - IPC" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (269 commits) ipc, msg: fix message length check for negative values ipc/util.c: remove unnecessary work pending test devpts: plug the memory leak in kill_sb ./Makefile: export initial ramdisk compression config option init/Kconfig: add option to disable kernel compression drivers: w1: make w1_slave::flags long to avoid memory corruption drivers/w1/masters/ds1wm.cuse dev_get_platdata() drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page() drivers/memstick/core/mspro_block.c: fix attributes array allocation drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr kernel/panic.c: reduce 1 byte usage for print tainted buffer gcov: reuse kbasename helper kernel/gcov/fs.c: use pr_warn() kernel/module.c: use pr_foo() gcov: compile specific gcov implementation based on gcc version gcov: add support for gcc 4.7 gcov format gcov: move gcov structs definitions to a gcc version specific file kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener() kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end() kernel/sysctl_binary.c: use scnprintf() instead of snprintf() ...
| * ipc, msg: fix message length check for negative valuesMathias Krause2013-11-132-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 64 bit systems the test for negative message sizes is bogus as the size, which may be positive when evaluated as a long, will get truncated to an int when passed to load_msg(). So a long might very well contain a positive value but when truncated to an int it would become negative. That in combination with a small negative value of msg_ctlmax (which will be promoted to an unsigned type for the comparison against msgsz, making it a big positive value and therefore make it pass the check) will lead to two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too small buffer as the addition of alen is effectively a subtraction. 2/ The copy_from_user() call in load_msg() will first overflow the buffer with userland data and then, when the userland access generates an access violation, the fixup handler copy_user_handle_tail() will try to fill the remainder with zeros -- roughly 4GB. That almost instantly results in a system crash or reset. ,-[ Reproducer (needs to be run as root) ]-- | #include <sys/stat.h> | #include <sys/msg.h> | #include <unistd.h> | #include <fcntl.h> | | int main(void) { | long msg = 1; | int fd; | | fd = open("/proc/sys/kernel/msgmax", O_WRONLY); | write(fd, "-1", 2); | close(fd); | | msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT); | | return 0; | } '--- Fix the issue by preventing msgsz from getting truncated by consistently using size_t for the message length. This way the size checks in do_msgsnd() could still be passed with a negative value for msg_ctlmax but we would fail on the buffer allocation in that case and error out. Also change the type of m_ts from int to size_t to avoid similar nastiness in other code paths -- it is used in similar constructs, i.e. signed vs. unsigned checks. It should never become negative under normal circumstances, though. Setting msg_ctlmax to a negative value is an odd configuration and should be prevented. As that might break existing userland, it will be handled in a separate commit so it could easily be reverted and reworked without reintroducing the above described bug. Hardening mechanisms for user copy operations would have catched that bug early -- e.g. checking slab object sizes on user copy operations as the usercopy feature of the PaX patch does. Or, for that matter, detect the long vs. int sign change due to truncation, as the size overflow plugin of the very same patch does. [akpm@linux-foundation.org: fix i386 min() warnings] Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Pax Team <pageexec@freemail.hu> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Brad Spengler <spender@grsecurity.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> [ v2.3.27+ -- yes, that old ;) ] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * ipc/util.c: remove unnecessary work pending testXie XiuQi2013-11-131-3/+1
| | | | | | | | | | | | | | | | | | | | | | Remove unnecessary work pending test before calling schedule_work(). It has been tested in queue_work_on() already. No functional changed. Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Cc: Tejun Heo <tj@kernel.org> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-11-131-1/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "All kinds of stuff this time around; some more notable parts: - RCU'd vfsmounts handling - new primitives for coredump handling - files_lock is gone - Bruce's delegations handling series - exportfs fixes plus misc stuff all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits) ecryptfs: ->f_op is never NULL locks: break delegations on any attribute modification locks: break delegations on link locks: break delegations on rename locks: helper functions for delegation breaking locks: break delegations on unlink namei: minor vfs_unlink cleanup locks: implement delegations locks: introduce new FL_DELEG lock flag vfs: take i_mutex on renamed file vfs: rename I_MUTEX_QUOTA now that it's not used for quotas vfs: don't use PARENT/CHILD lock classes for non-directories vfs: pull ext4's double-i_mutex-locking into common code exportfs: fix quadratic behavior in filehandle lookup exportfs: better variable name exportfs: move most of reconnect_path to helper function exportfs: eliminate unused "noprogress" counter exportfs: stop retrying once we race with rename/remove exportfs: clear DISCONNECTED on all parents sooner exportfs: more detailed comment for path_reconnect ...
| * locks: break delegations on unlinkJ. Bruce Fields2013-11-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to break delegations on any operation that changes the set of links pointing to an inode. Start with unlink. Such operations also hold the i_mutex on a parent directory. Breaking a delegation may require waiting for a timeout (by default 90 seconds) in the case of a unresponsive NFS client. To avoid blocking all directory operations, we therefore drop locks before waiting for the delegation. The logic then looks like: acquire locks ... test for delegation; if found: take reference on inode release locks wait for delegation break drop reference on inode retry It is possible this could never terminate. (Even if we take precautions to prevent another delegation being acquired on the same inode, we could get a different inode on each retry.) But this seems very unlikely. The initial test for a delegation happens after the lock on the target inode is acquired, but the directory inode may have been acquired further up the call stack. We therefore add a "struct inode **" argument to any intervening functions, which we use to pass the inode back up to the caller in the case it needs a delegation synchronously broken. Cc: David Howells <dhowells@redhat.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Dustin Kirkland <dustin.kirkland@gazzang.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ipc, msg: forbid negative values for "msg{max,mnb,mni}"Mathias Krause2013-11-031-8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Negative message lengths make no sense -- so don't do negative queue lenghts or identifier counts. Prevent them from getting negative. Also change the underlying data types to be unsigned to avoid hairy surprises with sign extensions in cases where those variables get evaluated in unsigned expressions with bigger data types, e.g size_t. In case a user still wants to have "unlimited" sizes she could just use INT_MAX instead. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ipc/sem.c: synchronize semop and semctl with IPC_RMIDManfred Spraul2013-10-171-13/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After acquiring the semlock spinlock, operations must test that the array is still valid. - semctl() and exit_sem() would walk stale linked lists (ugly, but should be ok: all lists are empty) - semtimedop() would sleep forever - and if woken up due to a signal - access memory after free. The patch also: - standardizes the tests for .deleted, so that all tests in one function leave the function with the same approach. - unconditionally tests for .deleted immediately after every call to sem_lock - even it it means that for semctl(GETALL), .deleted will be tested twice. Both changes make the review simpler: After every sem_lock, there must be a test of .deleted, followed by a goto to the cleanup code (if the function uses "goto cleanup"). The only exception is semctl_down(): If sem_ids().rwsem is locked, then the presence in ids->ipcs_idr is equivalent to !.deleted, thus no additional test is required. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Mike Galbraith <efault@gmx.de> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>