summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'y2038-for-4.21' of ↵Linus Torvalds2018-12-2835-755/+847
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull y2038 updates from Arnd Bergmann: "More syscalls and cleanups This concludes the main part of the system call rework for 64-bit time_t, which has spread over most of year 2018, the last six system calls being - ppoll - pselect6 - io_pgetevents - recvmmsg - futex - rt_sigtimedwait As before, nothing changes for 64-bit architectures, while 32-bit architectures gain another entry point that differs only in the layout of the timespec structure. Hopefully in the next release we can wire up all 22 of those system calls on all 32-bit architectures, which gives us a baseline version for glibc to start using them. This does not include the clock_adjtime, getrusage/waitid, and getitimer/setitimer system calls. I still plan to have new versions of those as well, but they are not required for correct operation of the C library since they can be emulated using the old 32-bit time_t based system calls. Aside from the system calls, there are also a few cleanups here, removing old kernel internal interfaces that have become unused after all references got removed. The arch/sh cleanups are part of this, there were posted several times over the past year without a reaction from the maintainers, while the corresponding changes made it into all other architectures" * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: timekeeping: remove obsolete time accessors vfs: replace current_kernel_time64 with ktime equivalent timekeeping: remove timespec_add/timespec_del timekeeping: remove unused {read,update}_persistent_clock sh: remove board_time_init() callback sh: remove unused rtc_sh_get/set_time infrastructure sh: sh03: rtc: push down rtc class ops into driver sh: dreamcast: rtc: push down rtc class ops into driver y2038: signal: Add compat_sys_rt_sigtimedwait_time64 y2038: signal: Add sys_rt_sigtimedwait_time32 y2038: socket: Add compat_sys_recvmmsg_time64 y2038: futex: Add support for __kernel_timespec y2038: futex: Move compat implementation into futex.c io_pgetevents: use __kernel_timespec pselect6: use __kernel_timespec ppoll: use __kernel_timespec signal: Add restore_user_sigmask() signal: Add set_user_sigmask()
| * timekeeping: remove obsolete time accessorsArnd Bergmann2018-12-182-23/+0
| | | | | | | | | | | | | | | | | | | | | | There are no more remaining users of these deprecated wrappers, so let's remove them before new users have a chance to make it in. See Documentation/core-api/timekeeping.rst for replacements when porting old drivers that contain calls to this function. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: John Stultz <john.stultz@linaro.org>
| * vfs: replace current_kernel_time64 with ktime equivalentArnd Bergmann2018-12-181-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | current_time is the last remaining caller of current_kernel_time64(), which is a wrapper around ktime_get_coarse_real_ts64(). This calls the latter directly for consistency with the rest of the kernel that is moving to the ktime_get_ family of time accessors, as now documented in Documentation/core-api/timekeeping.rst. An open questions is whether we may want to actually call the more accurate ktime_get_real_ts64() for file systems that save high-resolution timestamps in their on-disk format. This would add a small overhead to each update of the inode stamps but lead to inode timestamps to actually have a usable resolution better than one jiffy (1 to 10 milliseconds normally). Experiments on a variety of hardware platforms show a typical time of around 100 CPU cycles to read the cycle counter and calculate the accurate time from that. On old platforms without a cycle counter, this can be signiciantly higher, up to several microseconds to access a hardware clock, but those have become very rare by now. I traced the original addition of the current_kernel_time() call to set the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch with subject "nanosecond stat timefields". Andi explains that the motivation was to introduce as little overhead as possible back then. At this time, reading the clock hardware was also more expensive when most architectures did not have a cycle counter. One side effect of having more accurate inode timestamp would be having to write out the inode every time that mtime/ctime/atime get touched on most systems, whereas many file systems today only write it when the timestamps have changed, i.e. at most once per jiffy unless something else changes as well. That change would certainly be noticed in some workloads, which is enough reason to not do it without a good reason, regardless of the cost of reading the time. One thing we could still consider however would be to round the timestamps from current_time() to multiples of NSEC_PER_JIFFY, e.g. full milliseconds rather than having six or seven meaningless but confusing digits at the end of the timestamp. Link: http://lkml.kernel.org/r/20180726130820.4174359-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * timekeeping: remove timespec_add/timespec_delArnd Bergmann2018-12-182-61/+0
| | | | | | | | | | | | | | | | | | | | | | The last users were removed a while ago since everyone moved to ktime_t, so we can remove the two unused interfaces for old timespec structures. With those two gone, set_normalized_timespec() is also unused, so remove that as well. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: John Stultz <john.stultz@linaro.org>
| * timekeeping: remove unused {read,update}_persistent_clockArnd Bergmann2018-12-183-25/+3
| | | | | | | | | | | | | | | | | | | | After arch/sh has removed the last reference to these functions, we can remove them completely and just rely on the 64-bit time_t based versions. This cleans up a rather ugly use of __weak functions. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: John Stultz <john.stultz@linaro.org>
| * sh: remove board_time_init() callbackArnd Bergmann2018-12-184-21/+1
| | | | | | | | | | | | | | | | | | The only remaining user of board_time_init() is the of-generic machine, and that just calls the global timer_init() function. Calling that one has no effect on non-DT platforms, so we can simply call it unconditionally in place of board_time_init(). Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * sh: remove unused rtc_sh_get/set_time infrastructureArnd Bergmann2018-12-182-71/+0
| | | | | | | | | | | | | | | | | | | | | | All platforms are now converted to RTC drivers, so this has become obsolete. The board_time_init() callback still has one caller, but could otherwise also get killed. This removes one more usage of the deprecated timespec structure, which overflows in y2038. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * sh: sh03: rtc: push down rtc class ops into driverArnd Bergmann2018-12-184-30/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SH RTC support has an extra level of indirection to provide either the old read_persistent_clock/update_persistent_clock interface or the rtc-generic device for hctosys/systohc. By removing the indirection and always using the RTC_CLASS interface, we can avoid the lossy double conversion between rtc_time and timespec, so we end up supporting the entire range of 'year' values, and clarifying the rtc_set_time callback. I did not change the behavior of sh03_rtc_settimeofday(), which keeps just updating the seconds/minutes by calling set_rtc_mmss(), this could be improved if anyone cares. Also, the file should ideally be moved into drivers/rtc and not use rtc-generic. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * sh: dreamcast: rtc: push down rtc class ops into driverArnd Bergmann2018-12-185-18/+35
| | | | | | | | | | | | | | | | | | | | | | The SH RTC support has an extra level of indirection to provide either the old read_persistent_clock/update_persistent_clock interface or the rtc-generic device for hctosys/systohc. Both do the same thing for dreamcast, so we can do away with the abstraction and simply let the RTC core code to take care of it. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * y2038: signal: Add compat_sys_rt_sigtimedwait_time64Arnd Bergmann2018-12-182-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that 32-bit architectures have two variants of sys_rt_sigtimedwaid() for 32-bit and 64-bit time_t, we also need to have a second compat system call entry point on the corresponding 64-bit architectures. The traditional system call keeps getting handled by compat_sys_rt_sigtimedwait(), and this adds a new compat_sys_rt_sigtimedwait_time64() that differs only in the timeout argument type. The naming remains a bit asymmetric for the moment. Ideally we would want to have compat_sys_rt_sigtimedwait_time32() for the old version and compat_sys_rt_sigtimedwait() for the new one to mirror the names of the native entry points, but renaming the existing system call tables causes unnecessary churn. I would suggest renaming all such system calls together at a later point. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * y2038: signal: Add sys_rt_sigtimedwait_time32Arnd Bergmann2018-12-182-0/+37
| | | | | | | | | | | | | | | | | | | | | | | | Once sys_rt_sigtimedwait() gets changed to a 64-bit time_t, we have to provide compatibility support for existing binaries. An earlier version of this patch reused the compat_sys_rt_sigtimedwait entry point to avoid code duplication, but this newer approach duplicates the existing native entry point instead, which seems a bit cleaner. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * y2038: socket: Add compat_sys_recvmmsg_time64Arnd Bergmann2018-12-186-41/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | recvmmsg() takes two arguments to pointers of structures that differ between 32-bit and 64-bit architectures: mmsghdr and timespec. For y2038 compatbility, we are changing the native system call from timespec to __kernel_timespec with a 64-bit time_t (in another patch), and use the existing compat system call on both 32-bit and 64-bit architectures for compatibility with traditional 32-bit user space. As we now have two variants of recvmmsg() for 32-bit tasks that are both different from the variant that we use on 64-bit tasks, this means we also require two compat system calls! The solution I picked is to flip things around: The existing compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c and now handles the case for old user space on all architectures that have set CONFIG_COMPAT_32BIT_TIME. A new compat_sys_recvmmsg_time64() call gets added in the old place for 64-bit architectures only, this one handles the case of a compat mmsghdr structure combined with __kernel_timespec. In the indirect sys_socketcall(), we now need to call either do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of architecture we are on. For compat_sys_socketcall(), no such change is needed, we always call __compat_sys_recvmmsg(). I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc implementation for 64-bit time_t will need significant changes including an updated asm/unistd.h, and it seems better to consistently use the separate syscalls that configuration, leaving the socketcall only for backward compatibility with 32-bit time_t based libc. The naming is asymmetric for the moment, so both existing syscalls entry points keep their names, while the new ones are recvmmsg_time32 and compat_recvmmsg_time64 respectively. I expect that we will rename the compat syscalls later as we start using generated syscall tables everywhere and add these entry points. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * y2038: futex: Add support for __kernel_timespecArnd Bergmann2018-12-072-11/+13
| | | | | | | | | | | | | | | | | | | | | | This prepares sys_futex for y2038 safe calling: the native syscall is changed to receive a __kernel_timespec argument, which will be switched to 64-bit time_t in the future. All the internal time handling gets changed to timespec64, and the compat_sys_futex entry point is moved under the CONFIG_COMPAT_32BIT_TIME check to provide compatibility for existing 32-bit architectures. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * y2038: futex: Move compat implementation into futex.cArnd Bergmann2018-12-074-216/+192
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are going to share the compat_sys_futex() handler between 64-bit architectures and 32-bit architectures that need to deal with both 32-bit and 64-bit time_t, and this is easier if both entry points are in the same file. In fact, most other system call handlers do the same thing these days, so let's follow the trend here and merge all of futex_compat.c into futex.c. In the process, a few minor changes have to be done to make sure everything still makes sense: handle_futex_death() and futex_cmpxchg_enabled() become local symbol, and the compat version of the fetch_robust_entry() function gets renamed to compat_fetch_robust_entry() to avoid a symbol clash. This is intended as a purely cosmetic patch, no behavior should change. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * io_pgetevents: use __kernel_timespecDeepa Dinamani2018-12-063-5/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct timespec is not y2038 safe. struct __kernel_timespec is the new y2038 safe structure for all syscalls that are using struct timespec. Update io_pgetevents interfaces to use struct __kernel_timespec. sigset_t also has different representations on 32 bit and 64 bit architectures. Hence, we need to support the following different syscalls: New y2038 safe syscalls: (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs) Native 64 bit(unchanged) and native 32 bit : sys_io_pgetevents Compat : compat_sys_io_pgetevents_time64 Older y2038 unsafe syscalls: (Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs) Native 32 bit : sys_io_pgetevents_time32 Compat : compat_sys_io_pgetevents Note that io_getevents syscalls do not have a y2038 safe solution. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * pselect6: use __kernel_timespecDeepa Dinamani2018-12-063-14/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct timespec is not y2038 safe. struct __kernel_timespec is the new y2038 safe structure for all syscalls that are using struct timespec. Update pselect interfaces to use struct __kernel_timespec. sigset_t also has different representations on 32 bit and 64 bit architectures. Hence, we need to support the following different syscalls: New y2038 safe syscalls: (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs) Native 64 bit(unchanged) and native 32 bit : sys_pselect6 Compat : compat_sys_pselect6_time64 Older y2038 unsafe syscalls: (Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs) Native 32 bit : pselect6_time32 Compat : compat_sys_pselect6 Note that all other versions of select syscalls will not have y2038 safe versions. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * ppoll: use __kernel_timespecDeepa Dinamani2018-12-063-56/+120
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct timespec is not y2038 safe. struct __kernel_timespec is the new y2038 safe structure for all syscalls that are using struct timespec. Update ppoll interfaces to use struct __kernel_timespec. sigset_t also has different representations on 32 bit and 64 bit architectures. Hence, we need to support the following different syscalls: New y2038 safe syscalls: (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs) Native 64 bit(unchanged) and native 32 bit : sys_ppoll Compat : compat_sys_ppoll_time64 Older y2038 unsafe syscalls: (Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs) Native 32 bit : ppoll_time32 Compat : compat_sys_ppoll Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * signal: Add restore_user_sigmask()Deepa Dinamani2018-12-065-103/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor the logic to restore the sigmask before the syscall returns into an api. This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during the execution and restored after the execution of the syscall. With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * signal: Add set_user_sigmask()Deepa Dinamani2018-12-066-70/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor reading sigset from userspace and updating sigmask into an api. This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during, and restored after, the execution of the syscall. With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll, and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise. Note that the calls to sigprocmask() ignored the return value from the api as the function only returns an error on an invalid first argument that is hardcoded at these call sites. The updated logic uses set_current_blocked() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* | Fix failure path in alloc_pid()Matthew Wilcox2018-12-281-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | The failure path removes the allocated PIDs from the wrong namespace. This could lead to us inadvertently reusing PIDs in the leaf namespace and leaking PIDs in parent namespaces. Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API") Cc: <stable@vger.kernel.org> Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'locks-v4.21-1' of ↵Linus Torvalds2018-12-289-158/+253
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "The main change in this set is Neil Brown's work to reduce the thundering herd problem when a heavily-contended file lock is released. Previously we'd always wake up all waiters when this occurred. With this set, we'll now we only wake up waiters that were blocked on the range being released" * tag 'locks-v4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: locks: Use inode_is_open_for_write fs/locks: remove unnecessary white space. fs/locks: merge posix_unblock_lock() and locks_delete_block() fs/locks: create a tree of dependent requests. fs/locks: change all *_conflict() functions to return bool. fs/locks: always delete_block after waiting. fs/locks: allow a lock request to block other requests. fs/locks: use properly initialized file_lock when unlocking. ocfs2: properly initial file_lock used for unlock. gfs2: properly initial file_lock used for unlock. NFS: use locks_copy_lock() to copy locks. fs/locks: split out __locks_wake_up_blocks(). fs/locks: rename some lists and pointers.
| * | locks: Use inode_is_open_for_writeNikolay Borisov2018-12-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Use the aptly named function rather than open coding it. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: remove unnecessary white space.NeilBrown2018-12-071-21/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - spaces before tabs, - spaces at the end of lines, - multiple blank lines, - blank lines before EXPORT_SYMBOL, can all go. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: merge posix_unblock_lock() and locks_delete_block()NeilBrown2018-12-075-31/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | posix_unblock_lock() is not specific to posix locks, and behaves nearly identically to locks_delete_block() - the former returning a status while the later doesn't. So discard posix_unblock_lock() and use locks_delete_block() instead, after giving that function an appropriate return value. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: create a tree of dependent requests.NeilBrown2018-12-071-6/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we find an existing lock which conflicts with a request, and the request wants to wait, we currently add the request to a list. When the lock is removed, the whole list is woken. This can cause the thundering-herd problem. To reduce the problem, we make use of the (new) fact that a pending request can itself have a list of blocked requests. When we find a conflict, we look through the existing blocked requests. If any one of them blocks the new request, the new request is attached below that request, otherwise it is added to the list of blocked requests, which are now known to be mutually non-conflicting. This way, when the lock is released, only a set of non-conflicting locks will be woken, the rest can stay asleep. If the lock request cannot be granted and the request needs to be requeued, all the other requests it blocks will then be woken To make this more concrete: If you have a many-core machine, and have many threads all wanting to briefly lock a give file (udev is known to do this), you can get quite poor performance. When one thread releases a lock, it wakes up all other threads that are waiting (classic thundering-herd) - one will get the lock and the others go to sleep. When you have few cores, this is not very noticeable: by the time the 4th or 5th thread gets enough CPU time to try to claim the lock, the earlier threads have claimed it, done what was needed, and released. So with few cores, many of the threads don't end up contending. With 50+ cores, lost of threads can get the CPU at the same time, and the contention can easily be measured. This patchset creates a tree of pending lock requests in which siblings don't conflict and each lock request does conflict with its parent. When a lock is released, only requests which don't conflict with each other a woken. Testing shows that lock-acquisitions-per-second is now fairly stable even as the number of contending process goes to 1000. Without this patch, locks-per-second drops off steeply after a few 10s of processes. There is a small cost to this extra complexity. At 20 processes running a particular test on 72 cores, the lock acquisitions per second drops from 1.8 million to 1.4 million with this patch. For 100 processes, this patch still provides 1.4 million while without this patch there are about 700,000. Reported-and-tested-by: Martin Wilck <mwilck@suse.de> Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: change all *_conflict() functions to return bool.NeilBrown2018-12-071-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | posix_locks_conflict() and flock_locks_conflict() both return int. leases_conflict() returns bool. This inconsistency will cause problems for the next patch if not fixed. So change posix_locks_conflict() and flock_locks_conflict() to return bool. Also change the locks_conflict() helper. And convert some return (foo); to return foo; Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: always delete_block after waiting.NeilBrown2018-12-071-16/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that requests can block other requests, we need to be careful to always clean up those blocked requests. Any time that we wait for a request, we might have other requests attached, and when we stop waiting, we must clean them up. If the lock was granted, the requests might have been moved to the new lock, though when merged with a pre-exiting lock, this might not happen. In all cases we don't want blocked locks to remain attached, so we remove them to be safe. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Tested-by: syzbot+a4a3d526b4157113ec6a@syzkaller.appspotmail.com Tested-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: allow a lock request to block other requests.NeilBrown2018-11-301-6/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, a lock can block pending requests, but all pending requests are equal. If lots of pending requests are mutually exclusive, this means they will all be woken up and all but one will fail. This can hurt performance. So we will allow pending requests to block other requests. Only the first request will be woken, and it will wake the others. This patch doesn't implement this fully, but prepares the way. - It acknowledges that a request might be blocking other requests, and when the request is converted to a lock, those blocked requests are moved across. - When a request is requeued or discarded, all blocked requests are woken. - When deadlock-detection looks for the lock which blocks a given request, we follow the chain of ->fl_blocker all the way to the top. Tested-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: use properly initialized file_lock when unlocking.NeilBrown2018-11-301-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both locks_remove_posix() and locks_remove_flock() use a struct file_lock without calling locks_init_lock() on it. This means the various list_heads are not initialized, which will become a problem with a later patch. So change them both to initialize properly. For flock locks, this involves using flock_make_lock(), and changing it to allow a file_lock to be passed in, so memory allocation isn't always needed. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | ocfs2: properly initial file_lock used for unlock.NeilBrown2018-11-301-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than assuming all-zeros is sufficient, use the available API to initialize the file_lock structure use for unlock. VFS-level changes will soon make it important that the list_heads in file_lock are always properly initialized. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | gfs2: properly initial file_lock used for unlock.NeilBrown2018-11-301-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than assuming all-zeros is sufficient, use the available API to initialize the file_lock structure use for unlock. VFS-level changes will soon make it important that the list_heads in file_lock are always properly initialized. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | NFS: use locks_copy_lock() to copy locks.NeilBrown2018-11-301-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using memcpy() to copy lock requests leaves the various list_head in an inconsistent state. As we will soon attach a list of conflicting request to another pending request, we need these lists to be consistent. So change NFS to use locks_init_lock/locks_copy_lock instead of memcpy. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: split out __locks_wake_up_blocks().NeilBrown2018-11-301-11/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | This functionality will be useful in future patches, so split it out from locks_wake_up_blocks(). Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
| * | fs/locks: rename some lists and pointers.NeilBrown2018-11-304-39/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct file lock contains an 'fl_next' pointer which is used to point to the lock that this request is blocked waiting for. So rename it to fl_blocker. The fl_blocked list_head in an active lock is the head of a list of blocked requests. In a request it is a node in that list. These are two distinct uses, so replace with two list_heads with different names. fl_blocked_requests is the head of a list of blocked requests fl_blocked_member is a node in a member of that list. The two different list_heads are never used at the same time, but that will change in a future patch. Note that a tracepoint is changed to report fl_blocker instead of fl_next. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
* | | Merge tag 'ext4_for_linus' of ↵Linus Torvalds2018-12-2815-177/+296
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "All cleanups and bug fixes; most notably, fix some problems discovered in ext4's NFS support, and fix an ioctl (EXT4_IOC_GROUP_ADD) used by old versions of e2fsprogs which we accidentally broke a while back. Also fixed some error paths in ext4's quota and inline data support. Finally, improve tail latency in jbd2's commit code" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: check for shutdown and r/o file system in ext4_write_inode() ext4: force inode writes when nfsd calls commit_metadata() ext4: avoid declaring fs inconsistent due to invalid file handles ext4: include terminating u32 in size of xattr entries when expanding inodes ext4: compare old and new mode before setting update_mode flag ext4: fix EXT4_IOC_GROUP_ADD ioctl ext4: hard fail dax mount on unsupported devices jbd2: update locking documentation for transaction_t ext4: remove redundant condition check jbd2: clean up indentation issue, replace spaces with tab ext4: clean up indentation issues, remove extraneous tabs ext4: missing unlock/put_page() in ext4_try_to_write_inline_data() ext4: fix possible use after free in ext4_quota_enable jbd2: avoid long hold times of j_state_lock while committing a transaction ext4: add ext4_sb_bread() to disambiguate ENOMEM cases
| * | | ext4: check for shutdown and r/o file system in ext4_write_inode()Theodore Ts'o2018-12-191-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the file system has been shut down or is read-only, then ext4_write_inode() needs to bail out early. Also use jbd2_complete_transaction() instead of ext4_force_commit() so we only force a commit if it is needed. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | | ext4: force inode writes when nfsd calls commit_metadata()Theodore Ts'o2018-12-192-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some time back, nfsd switched from calling vfs_fsync() to using a new commit_metadata() hook in export_operations(). If the file system did not provide a commit_metadata() hook, it fell back to using sync_inode_metadata(). Unfortunately doesn't work on all file systems. In particular, it doesn't work on ext4 due to how the inode gets journalled --- the VFS writeback code will not always call ext4_write_inode(). So we need to provide our own ext4_nfs_commit_metdata() method which calls ext4_write_inode() directly. Google-Bug-Id: 121195940 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | | ext4: avoid declaring fs inconsistent due to invalid file handlesTheodore Ts'o2018-12-198-41/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we receive a file handle, either from NFS or open_by_handle_at(2), and it points at an inode which has not been initialized, and the file system has metadata checksums enabled, we shouldn't try to get the inode, discover the checksum is invalid, and then declare the file system as being inconsistent. This can be reproduced by creating a test file system via "mke2fs -t ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that directory, and then running the following program. #define _GNU_SOURCE #include <fcntl.h> struct handle { struct file_handle fh; unsigned char fid[MAX_HANDLE_SZ]; }; int main(int argc, char **argv) { struct handle h = {{8, 1 }, { 12, }}; open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY); return 0; } Google-Bug-Id: 120690101 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | | ext4: include terminating u32 in size of xattr entries when expanding inodesTheodore Ts'o2018-12-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ext4_expand_extra_isize_ea(), we calculate the total size of the xattr header, plus the xattr entries so we know how much of the beginning part of the xattrs to move when expanding the inode extra size. We need to include the terminating u32 at the end of the xattr entries, or else if there is uninitialized, non-zero bytes after the xattr entries and before the xattr values, the list of xattr entries won't be properly terminated. Reported-by: Steve Graham <stgraham2000@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | | ext4: compare old and new mode before setting update_mode flagChengguang Xu2018-12-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If new mode is the same as old mode we don't have to reset inode mode in the rest of the code, so compare old and new mode before setting update_mode flag. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix EXT4_IOC_GROUP_ADD ioctlruippan (潘睿)2018-12-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e2b911c53584 ("ext4: clean up feature test macros with predicate functions") broke the EXT4_IOC_GROUP_ADD ioctl. This was not noticed since only very old versions of resize2fs (before e2fsprogs 1.42) use this ioctl. However, using a new kernel with an enterprise Linux userspace will cause attempts to use online resize to fail with "No reserved GDT blocks". Fixes: e2b911c53584 ("ext4: clean up feature test macros with predicate...") Cc: stable@kernel.org # v4.4 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: ruippan (潘睿) <ruippan@tencent.com>
| * | | ext4: hard fail dax mount on unsupported devicesEric Sandeen2018-12-041-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As dax inches closer to production use, an administrator should not be surprised by silently disabling the feature they asked for. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | jbd2: update locking documentation for transaction_tAlexander Lochmann2018-12-041-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following members of struct transaction_s aka transaction_t were turned into lock-free variables in the past: - t_updates - t_outstanding_credits - t_handle_count However, the documentation has not been updated yet. This commit replaced the annotated lock by [none]. Found by LockDoc (Alexander Lochmann, Horst Schirmeier and Olaf Spinczyk) Signed-off-by: Alexander Lochmann <alexander.lochmann@tu-dortmund.de> Signed-off-by: Horst Schirmeier <horst.schirmeier@tu-dortmund.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: remove redundant condition checkChengguang Xu2018-12-041-16/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_xattr_destroy_cache() can handle NULL pointer correctly, so there is no need to check NULL pointer before calling ext4_xattr_destroy_cache(). Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | jbd2: clean up indentation issue, replace spaces with tabColin Ian King2018-12-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a statement that is indented with spaces, replace it with a tab. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: clean up indentation issues, remove extraneous tabsColin Ian King2018-12-042-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several lines that are indented too far, clean these up by removing the tabs. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: missing unlock/put_page() in ext4_try_to_write_inline_data()Maurizio Lombardi2018-12-041-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of error, ext4_try_to_write_inline_data() should unlock and release the page it holds. Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Cc: stable@kernel.org # 3.8 Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix possible use after free in ext4_quota_enablePan Bian2018-12-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function frees qf_inode via iput but then pass qf_inode to lockdep_set_quota_inode on the failure path. This may result in a use-after-free bug. The patch frees df_inode only when it is never used. Fixes: daf647d2dd5 ("ext4: add lockdep annotations for i_data_sem") Cc: stable@kernel.org # 4.6 Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | jbd2: avoid long hold times of j_state_lock while committing a transactionJan Kara2018-12-043-5/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can hold j_state_lock for writing at the beginning of jbd2_journal_commit_transaction() for a rather long time (reportedly for 30 ms) due cleaning revoke bits of all revoked buffers under it. The handling of revoke tables as well as cleaning of t_reserved_list, and checkpoint lists does not need j_state_lock for anything. It is only needed to prevent new handles from joining the transaction. Generally T_LOCKED transaction state prevents new handles from joining the transaction - except for reserved handles which have to allowed to join while we wait for other handles to complete. To prevent reserved handles from joining the transaction while cleaning up lists, add new transaction state T_SWITCH and watch for it when starting reserved handles. With this we can just drop the lock for operations that don't need it. Reported-and-tested-by: Adrian Hunter <adrian.hunter@intel.com> Suggested-by: "Theodore Y. Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: add ext4_sb_bread() to disambiguate ENOMEM casesTheodore Ts'o2018-11-255-94/+115
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today, when sb_bread() returns NULL, this can either be because of an I/O error or because the system failed to allocate the buffer. Since it's an old interface, changing would require changing many call sites. So instead we create our own ext4_sb_bread(), which also allows us to set the REQ_META flag. Also fixed a problem in the xattr code where a NULL return in a function could also mean that the xattr was not found, which could lead to the wrong error getting returned to userspace. Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3") Cc: stable@kernel.org # 2.6.19 Signed-off-by: Theodore Ts'o <tytso@mit.edu>