summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'exec-for-v5.11' of ↵Linus Torvalds2020-12-1619-244/+158
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull execve updates from Eric Biederman: "This set of changes ultimately fixes the interaction of posix file lock and exec. Fundamentally most of the change is just moving where unshare_files is called during exec, and tweaking the users of files_struct so that the count of files_struct is not unnecessarily played with. Along the way fcheck and related helpers were renamed to more accurately reflect what they do. There were also many other small changes that fell out, as this is the first time in a long time much of this code has been touched. Benchmarks haven't turned up any practical issues but Al Viro has observed a possibility for a lot of pounding on task_lock. So I have some changes in progress to convert put_files_struct to always rcu free files_struct. That wasn't ready for the merge window so that will have to wait until next time" * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) exec: Move io_uring_task_cancel after the point of no return coredump: Document coredump code exclusively used by cell spufs file: Remove get_files_struct file: Rename __close_fd_get_file close_fd_get_file file: Replace ksys_close with close_fd file: Rename __close_fd to close_fd and remove the files parameter file: Merge __alloc_fd into alloc_fd file: In f_dupfd read RLIMIT_NOFILE once. file: Merge __fd_install into fd_install proc/fd: In fdinfo seq_show don't use get_files_struct bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu file: Implement task_lookup_next_fd_rcu kcmp: In get_file_raw_ptr use task_lookup_fd_rcu proc/fd: In tid_fd_mode use task_lookup_fd_rcu file: Implement task_lookup_fd_rcu file: Rename fcheck lookup_fd_rcu file: Replace fcheck_files with files_lookup_fd_rcu file: Factor files_lookup_fd_locked out of fcheck_files file: Rename __fcheck_files to files_lookup_fd_raw ...
| * exec: Move io_uring_task_cancel after the point of no returnEric W. Biederman2020-12-101-5/+5
| | | | | | | | | | | | | | | | | | | | | | Now that unshare_files happens in begin_new_exec after the point of no return, io_uring_task_cancel can also happen later. Effectively this means io_uring activities for a task are only canceled when exec succeeds. Link: https://lkml.kernel.org/r/878saih2op.fsf@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * coredump: Document coredump code exclusively used by cell spufsEric W. Biederman2020-12-102-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Oleg Nesterov recently asked[1] why is there an unshare_files in do_coredump. After digging through all of the callers of lookup_fd it turns out that it is arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context that needs the unshare_files in do_coredump. Looking at the history[2] this code was also the only piece of coredump code that required the unshare_files when the unshare_files was added. Looking at that code it turns out that cell is also the only architecture that implements elf_coredump_extra_notes_size and elf_coredump_extra_notes_write. I looked at the gdb repo[3] support for cell has been removed[4] in binutils 2.34. Geoff Levand reports he is still getting questions on how to run modern kernels on the PS3, from people using 3rd party firmware so this code is not dead. According to Wikipedia the last PS3 shipped in Japan sometime in 2017. So it will probably be a little while before everyone's hardware dies. Add some comments briefly documenting the coredump code that exists only to support cell spufs to make it easier to understand the coredump code. Eventually the hardware will be dead, or their won't be userspace tools, or the coredump code will be refactored and it will be too difficult to update a dead architecture and these comments make it easy to tell where to pull to remove cell spufs support. [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com [2] 179e037fc137 ("do_coredump(): make sure that descriptor table isn't shared") [3] git://sourceware.org/git/binutils-gdb.git [4] abf516c6931a ("Remove Cell Broadband Engine debugging support"). Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * exec: Move unshare_files and guarantee files_struct.count is correctEric W. Biederman2020-12-1018-247/+153
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A while ago it was reported that posix file locking goes wrong when a multi-threaded process calls exec. I looked into the history and this is definitely a regression, that should be fixed if we can. This set of changes cleanups of the code in exec so hopefully this code will not regress again. Then it adds helpers and fixes the users of files_struct so the reference count is only incremented if COPY_FILES is passed to clone (or if io_uring takes a reference). Then it removes helpers (get_files_struct, __install_fd, __alloc_fd, __close_fd) that are no longer needed and if used would encourage code that increments the count of files_struct somewhere besides in clone when COPY_FILES is passed. In addition to fixing the bug in exec and simplifing the code this set of changes by virtue of getting files_struct.count correct it optimizes fdget. With proc and other places not temporarily increasing the count on files_struct __fget_light should succeed more often in being able to return a struct file without touching it's reference count. Fixing the count in files_struct was suggested by Oleg[1]. For those that are interested in the history of this issue I have included as much of it as I could find in the first change. Since v1: - Renamed the functions __fcheck_files -> files_lookup_fd_raw fcheck_files -> files_lookup_fd_locked fcheck_files -> files_lookup_fd_rcu fcheck_files -> lookup_fd_rcu fcheck_task -> task_lookup_fd_rcu fnext_task -> task_lookup_next_fd_rcu __close_fd_get_file -> close_fd_get_file - Simplified get_file_raw_ptr - Removed ksys_close - Examined the penalty for taking task_lock. The helper task_lookup_next_fd_rcu takes task_lock each iteration. Concern was expressed that this might be a problem. The function tid_fd_mode isn called from tid_fd_revalidate which is called when ever a file descriptor file is stat'ed, opened, or otherwise accessed. The function tid_fd_mode histrocally called get_files_struct which took and dropped task_lock. So the volume of task_lock calls is already proportional to the number of file descriptors. A micro benchmark did not see the move to task_lookup_next_fd_rcu making a difference in performance. Which suggests that the change to taking the task lock for every file descriptor found in task_lookup_next_fd will not be a problem. - Reviewed the code for conflicts with io_uring (especially the removal of get_files_struct). To my surprise no conflicts were found as io_uring does not use standard helpers but instead rolls it's own version of get_files_struct by hand. Documentation/filesystems/files.rst | 8 +- arch/powerpc/platforms/cell/spufs/coredump.c | 2 +- drivers/android/binder.c | 2 +- fs/autofs/dev-ioctl.c | 5 +- fs/coredump.c | 5 +- fs/exec.c | 29 +++---- fs/file.c | 124 +++++++++++++-------------- fs/io_uring.c | 2 +- fs/locks.c | 14 +-- fs/notify/dnotify/dnotify.c | 2 +- fs/open.c | 2 +- fs/proc/fd.c | 48 ++++------- include/linux/fdtable.h | 40 +++++---- include/linux/syscalls.h | 12 --- kernel/bpf/syscall.c | 20 +---- kernel/bpf/task_iter.c | 44 +++------- kernel/fork.c | 12 +-- kernel/kcmp.c | 29 ++----- 18 files changed, 153 insertions(+), 247 deletions(-) Eric W. Biederman (25): exec: Don't open code get_close_on_exec exec: Move unshare_files to fix posix file locking during exec exec: Simplify unshare_files exec: Remove reset_files_struct kcmp: In kcmp_epoll_target use fget_task bpf: In bpf_task_fd_query use fget_task proc/fd: In proc_fd_link use fget_task file: Rename __fcheck_files to files_lookup_fd_raw file: Factor files_lookup_fd_locked out of fcheck_files file: Replace fcheck_files with files_lookup_fd_rcu file: Rename fcheck lookup_fd_rcu file: Implement task_lookup_fd_rcu proc/fd: In tid_fd_mode use task_lookup_fd_rcu kcmp: In get_file_raw_ptr use task_lookup_fd_rcu file: Implement task_lookup_next_fd_rcu proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu proc/fd: In fdinfo seq_show don't use get_files_struct file: Merge __fd_install into fd_install file: In f_dupfd read RLIMIT_NOFILE once. file: Merge __alloc_fd into alloc_fd file: Rename __close_fd to close_fd and remove the files parameter file: Replace ksys_close with close_fd file: Rename __close_fd_get_file close_fd_get_file file: Remove get_files_struct [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com v1: https://lkml.kernel.org/r/87ft8l6ic3.fsf@x220.int.ebiederm.org Reported-by: Jeff Layton <jlayton@redhat.com> Reported-by: Daniel P. Berrangé <berrange@redhat.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/87r1on1v62.fsf@x220.int.ebiederm.org Link: https://lists.openvz.org/pipermail/criu/2020-November/045123.html Link: https://marc.info/?l=openvz-criu&m=160591423214257 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| | * file: Remove get_files_structEric W. Biederman2020-12-102-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Now that get_files_struct has no more users and can not cause the problems for posix file locking and fget_light remove get_files_struct so that it does not gain any new users. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-13-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-24-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Rename __close_fd_get_file close_fd_get_fileEric W. Biederman2020-12-104-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function close_fd_get_file is explicitly a variant of __close_fd[1]. Now that __close_fd has been renamed close_fd, rename close_fd_get_file to be consistent with close_fd. When __alloc_fd, __close_fd and __fd_install were introduced the double underscore indicated that the function took a struct files_struct parameter. The function __close_fd_get_file never has so the naming has always been inconsistent. This just cleans things up so there are not any lingering mentions or references __close_fd left in the code. [1] 80cd795630d6 ("binder: fix use-after-free due to ksys_close() during fdget()") Link: https://lkml.kernel.org/r/20201120231441.29911-23-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Replace ksys_close with close_fdEric W. Biederman2020-12-102-14/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that ksys_close is exactly identical to close_fd replace the one caller of ksys_close with close_fd. [1] https://lkml.kernel.org/r/20200818112020.GA17080@infradead.org Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20201120231441.29911-22-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Rename __close_fd to close_fd and remove the files parameterEric W. Biederman2020-12-104-12/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function __close_fd was added to support binder[1]. Now that binder has been fixed to no longer need __close_fd[2] all calls to __close_fd pass current->files. Therefore transform the files parameter into a local variable initialized to current->files, and rename __close_fd to close_fd to reflect this change, and keep it in sync with the similar changes to __alloc_fd, and __fd_install. This removes the need for callers to care about the extra care that needs to be take if anything except current->files is passed, by limiting the callers to only operation on current->files. [1] 483ce1d4b8c3 ("take descriptor-related part of close() to file.c") [2] 44d8047f1d87 ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Merge __alloc_fd into alloc_fdEric W. Biederman2020-12-102-10/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function __alloc_fd was added to support binder[1]. With binder fixed[2] there are no more users. As alloc_fd just calls __alloc_fd with "files=current->files", merge them together by transforming the files parameter into a local variable initialized to current->files. [1] dcfadfa4ec5a ("new helper: __alloc_fd()") [2] 44d8047f1d87 ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-16-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-20-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: In f_dupfd read RLIMIT_NOFILE once.Eric W. Biederman2020-12-101-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify the code, and remove the chance of races by reading RLIMIT_NOFILE only once in f_dupfd. Pass the read value of RLIMIT_NOFILE into alloc_fd which is the other location the rlimit was read in f_dupfd. As f_dupfd is the only caller of alloc_fd this changing alloc_fd is trivially safe. Further this causes alloc_fd to take all of the same arguments as __alloc_fd except for the files_struct argument. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-15-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-19-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Merge __fd_install into fd_installEric W. Biederman2020-12-102-21/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function __fd_install was added to support binder[1]. With binder fixed[2] there are no more users. As fd_install just calls __fd_install with "files=current->files", merge them together by transforming the files parameter into a local variable initialized to current->files. [1] f869e8a7f753 ("expose a low-level variant of fd_install() for binder") [2] 44d8047f1d87 ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1:https://lkml.kernel.org/r/20200817220425.9389-14-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-18-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * proc/fd: In fdinfo seq_show don't use get_files_structEric W. Biederman2020-12-101-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Instead hold task_lock for the duration that task->files needs to be stable in seq_show. The task_lock was already taken in get_files_struct, and so skipping get_files_struct performs less work overall, and avoids the problems with the files_struct reference count. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-12-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-17-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcuEric W. Biederman2020-12-101-34/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Using task_lookup_next_fd_rcu simplifies task_file_seq_get_next, by moving the checking for the maximum file descritor into the generic code, and by remvoing the need for capturing and releasing a reference on files_struct. As the reference count of files_struct no longer needs to be maintained bpf_iter_seq_task_file_info can have it's files member removed and task_file_seq_get_next no longer needs it's fstruct argument. The curr_fd local variable does need to become unsigned to be used with fnext_task. As curr_fd is assigned from and assigned a u32 making curr_fd an unsigned int won't cause problems and might prevent them. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-11-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-16-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * proc/fd: In proc_readfd_common use task_lookup_next_fd_rcuEric W. Biederman2020-12-101-12/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Using task_lookup_next_fd_rcu simplifies proc_readfd_common, by moving the checking for the maximum file descritor into the generic code, and by remvoing the need for capturing and releasing a reference on files_struct. As task_lookup_fd_rcu may update the fd ctx->pos has been changed to be the fd +2 after task_lookup_fd_rcu returns. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Andy Lavr <andy.lavr@gmail.com> v1: https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-15-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Implement task_lookup_next_fd_rcuEric W. Biederman2020-12-102-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As a companion to fget_task and task_lookup_fd_rcu implement task_lookup_next_fd_rcu that will return the struct file for the first file descriptor number that is equal or greater than the fd argument value, or NULL if there is no such struct file. This allows file descriptors of foreign processes to be iterated through safely, without needed to increment the count on files_struct. Some concern[1] has been expressed that this function takes the task_lock for each iteration and thus for each file descriptor. This place where this function will be called in a commonly used code path is for listing /proc/<pid>/fd. I did some small benchmarks and did not see any measurable performance differences. For ordinary users ls is likely to stat each of the directory entries and tid_fd_mode called from tid_fd_revalidae has always taken the task lock for each file descriptor. So this does not look like it will be a big change in practice. At some point is will probably be worth changing put_files_struct to free files_struct after an rcu grace period so that task_lock won't be needed at all. [1] https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com v1: https://lkml.kernel.org/r/20200817220425.9389-9-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-14-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * kcmp: In get_file_raw_ptr use task_lookup_fd_rcuEric W. Biederman2020-12-101-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Modify get_file_raw_ptr to use task_lookup_fd_rcu. The helper task_lookup_fd_rcu does the work of taking the task lock and verifying that task->files != NULL and then calls files_lookup_fd_rcu. So let use the helper to make a simpler implementation of get_file_raw_ptr. Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Link: https://lkml.kernel.org/r/20201120231441.29911-13-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * proc/fd: In tid_fd_mode use task_lookup_fd_rcuEric W. Biederman2020-12-101-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Instead of manually coding finding the files struct for a task and then calling files_lookup_fd_rcu, use the helper task_lookup_fd_rcu that combines those to steps. Making the code simpler and removing the need to get a reference on a files_struct. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-7-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-12-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Implement task_lookup_fd_rcuEric W. Biederman2020-12-102-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As a companion to lookup_fd_rcu implement task_lookup_fd_rcu for querying an arbitrary process about a specific file. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200818103713.aw46m7vprsy4vlve@wittgenstein Link: https://lkml.kernel.org/r/20201120231441.29911-11-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Rename fcheck lookup_fd_rcuEric W. Biederman2020-12-104-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also remove the confusing comment about checking if a fd exists. I could not find one instance in the entire kernel that still matches the description or the reason for the name fcheck. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-10-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Replace fcheck_files with files_lookup_fd_rcuEric W. Biederman2020-12-106-13/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change renames fcheck_files to files_lookup_fd_rcu. All of the remaining callers take the rcu_read_lock before calling this function so the _rcu suffix is appropriate. This change also tightens up the debug check to verify that all callers hold the rcu_read_lock. All callers that used to call files_check with the files->file_lock held have now been changed to call files_lookup_fd_locked. This change of name has helped remind me of which locks and which guarantees are in place helping me to catch bugs later in the patchset. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-9-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Factor files_lookup_fd_locked out of fcheck_filesEric W. Biederman2020-12-104-8/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To make it easy to tell where files->file_lock protection is being used when looking up a file create files_lookup_fd_locked. Only allow this function to be called with the file_lock held. Update the callers of fcheck and fcheck_files that are called with the files->file_lock held to call files_lookup_fd_locked instead. Hopefully this makes it easier to quickly understand what is going on. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-8-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * file: Rename __fcheck_files to files_lookup_fd_rawEric W. Biederman2020-12-102-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function fcheck despite it's comment is poorly named as it has no callers that only check it's return value. All of fcheck's callers use the returned file descriptor. The same is true for fcheck_files and __fcheck_files. A new less confusing name is needed. In addition the names of these functions are confusing as they do not report the kind of locks that are needed to be held when these functions are called making error prone to use them. To remedy this I am making the base functio name lookup_fd and will and prefixes and sufficies to indicate the rest of the context. Name the function (previously called __fcheck_files) that proceeds from a struct files_struct, looks up the struct file of a file descriptor, and requires it's callers to verify all of the appropriate locks are held files_lookup_fd_raw. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-7-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * proc/fd: In proc_fd_link use fget_taskEric W. Biederman2020-12-101-10/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Simplifying proc_fd_link is a little bit tricky. It is necessary to know that there is a reference to fd_f ile while path_get is running. This reference can either be guaranteed to exist either by locking the fdtable as the code currently does or by taking a reference on the file in question. Use fget_task to remove the need for get_files_struct and to take a reference to file in question. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-8-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-6-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * bpf: In bpf_task_fd_query use fget_taskEric W. Biederman2020-12-101-17/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the helper fget_task to simplify bpf_task_fd_query. As well as simplifying the code this removes one unnecessary increment of struct files_struct. This unnecessary increment of files_struct.count can result in exec unnecessarily unsharing files_struct and breaking posix locks, and it can result in fget_light having to fallback to fget reducing performance. This simplification comes from the observation that none of the callers of get_files_struct actually need to call get_files_struct that was made when discussing[1] exec and posix file locks. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-5-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-5-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * kcmp: In kcmp_epoll_target use fget_taskEric W. Biederman2020-12-101-16/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the helper fget_task and simplify the code. As well as simplifying the code this removes one unnecessary increment of struct files_struct. This unnecessary increment of files_struct.count can result in exec unnecessarily unsharing files_struct and breaking posix locks, and it can result in fget_light having to fallback to fget reducing performance. Suggested-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> v1: https://lkml.kernel.org/r/20200817220425.9389-4-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-4-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * exec: Remove reset_files_structEric W. Biederman2020-12-102-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that exec no longer needs to restore the previous value of current->files on error there are no more callers of reset_files_struct so remove it. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-3-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-3-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * exec: Simplify unshare_filesEric W. Biederman2020-12-104-15/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that exec no longer needs to return the unshared files to their previous value there is no reason to return displaced. Instead when unshare_fd creates a copy of the file table, call put_files_struct before returning from unshare_files. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * exec: Move unshare_files to fix posix file locking during execEric W. Biederman2020-12-101-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many moons ago the binfmts were doing some very questionable things with file descriptors and an unsharing of the file descriptor table was added to make things better[1][2]. The helper steal_lockss was added to avoid breaking the userspace programs[3][4][6]. Unfortunately it turned out that steal_locks did not work for network file systems[5], so it was removed to see if anyone would complain[7][8]. It was thought at the time that NPTL would not be affected as the unshare_files happened after the other threads were killed[8]. Unfortunately because there was an unshare_files in binfmt_elf.c before the threads were killed this analysis was incorrect. This unshare_files in binfmt_elf.c resulted in the unshares_files happening whenever threads were present. Which led to unshare_files being moved to the start of do_execve[9]. Later the problems were rediscovered and the suggested approach was to readd steal_locks under a different name[10]. I happened to be reviewing patches and I noticed that this approach was a step backwards[11]. I proposed simply moving unshare_files[12] and it was pointed out that moving unshare_files without auditing the code was also unsafe[13]. There were then several attempts to solve this[14][15][16] and I even posted this set of changes[17]. Unfortunately because auditing all of execve is time consuming this change did not make it in at the time. Well now that I am cleaning up exec I have made the time to read through all of the binfmts and the only playing with file descriptors is either the security modules closing them in security_bprm_committing_creds or is in the generic code in fs/exec.c. None of it happens before begin_new_exec is called. So move unshare_files into begin_new_exec, after the point of no return. If memory is very very very low and the application calling exec is sharing file descriptor tables between processes we might fail past the point of no return. Which is unfortunate but no different than any of the other places where we allocate memory after the point of no return. This movement allows another process that shares the file table, or another thread of the same process and that closes files or changes their close on exec behavior and races with execve to cause some unexpected things to happen. There is only one time of check to time of use race and it is just there so that execve fails instead of an interpreter failing when it tries to open the file it is supposed to be interpreting. Failing later if userspace is being silly is not a problem. With this change it the following discription from the removal of steal_locks[8] finally becomes true. Apps using NPTL are not affected, since all other threads are killed before execve. Apps using LinuxThreads are only affected if they - have multiple threads during exec (LinuxThreads doesn't kill other threads, the app may do it with pthread_kill_other_threads_np()) - rely on POSIX locks being inherited across exec Both conditions are documented, but not their interaction. Apps using clone() natively are affected if they - use clone(CLONE_FILES) - rely on POSIX locks being inherited across exec I have investigated some paths to make it possible to solve this without moving unshare_files but they all look more complicated[18]. Reported-by: Daniel P. Berrangé <berrange@redhat.com> Reported-by: Jeff Layton <jlayton@redhat.com> History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git [1] 02cda956de0b ("[PATCH] unshare_files" [2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper") [3] 088f5d7244de ("[PATCH] add steal_locks helper") [4] 02c541ec8ffa ("[PATCH] use new steal_locks helper") [5] https://lkml.kernel.org/r/E1FLIlF-0007zR-00@dorka.pomaz.szeredi.hu [6] https://lkml.kernel.org/r/0060321191605.GB15997@sorel.sous-sol.org [7] https://lkml.kernel.org/r/E1FLwjC-0000kJ-00@dorka.pomaz.szeredi.hu [8] c89681ed7d0e ("[PATCH] remove steal_locks()") [9] fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()") [10] https://lkml.kernel.org/r/20180317142520.30520-1-jlayton@kernel.org [11] https://lkml.kernel.org/r/87r2nwqk73.fsf@xmission.com [12] https://lkml.kernel.org/r/87bmfgvg8w.fsf@xmission.com [13] https://lkml.kernel.org/r/20180322111424.GE30522@ZenIV.linux.org.uk [14] https://lkml.kernel.org/r/20180827174722.3723-1-jlayton@kernel.org [15] https://lkml.kernel.org/r/20180830172423.21964-1-jlayton@kernel.org [16] https://lkml.kernel.org/r/20180914105310.6454-1-jlayton@kernel.org [17] https://lkml.kernel.org/r/87a7ohs5ow.fsf@xmission.com [18] https://lkml.kernel.org/r/87pn8c1uj6.fsf_-_@x220.int.ebiederm.org Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-1-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-1-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | * exec: Don't open code get_close_on_execEric W. Biederman2020-12-101-2/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Al Viro pointed out that using the phrase "close_on_exec(fd, rcu_dereference_raw(current->files->fdt))" instead of wrapping it in rcu_read_lock(), rcu_read_unlock() is a very questionable optimization[1]. Once wrapped with rcu_read_lock()/rcu_read_unlock() that phrase becomes equivalent the helper function get_close_on_exec so simplify the code and make it more robust by simply using get_close_on_exec. [1] https://lkml.kernel.org/r/20201207222214.GA4115853@ZenIV.linux.org.uk Suggested-by: Al Viro <viro@ftp.linux.org.uk> Link: https://lkml.kernel.org/r/87k0tqr6zi.fsf_-_@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | Merge branch 'signal-for-v5.11' of ↵Linus Torvalds2020-12-161-2/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull signal cleanup from Eric Biederman: "Remove a never used HP-UX compatibility from parisc headers and consolidating the SA_* flags definitions into a generic header as much as possible. We only have 32 SA_* flag bits total, so we need to be careful. But as this is the first addition in a decade or so I think we are fine for the forseeable future" * 'signal-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal/parisc: Remove parisc specific definition of __ARCH_UAPI_SA_FLAGS
| * | signal/parisc: Remove parisc specific definition of __ARCH_UAPI_SA_FLAGSEric W. Biederman2020-11-301-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Randy Dunlap wrote: > On 11/27/20 10:43 AM, Randy Dunlap wrote: > > > on parisc, _SA_SIGGFAULT is undefined and causing build errors. > > > > commit 23acdc76f1798b090bb9dcc90671cd29d929834e > > Author: Peter Collingbourne <pcc@google.com> > > Date: Thu Nov 12 18:53:34 2020 -0800 > > > > signal: clear non-uapi flag bits when passing/returning sa_flags > > > > > > > > _SA_SIGGFAULT is not used or defined anywhere else in the > > kernel source tree. > > > Here is the build error (although it should be obvious): > > ../kernel/signal.c: In function 'do_sigaction': > ../arch/parisc/include/asm/signal.h:24:30: error: '_SA_SIGGFAULT' undeclared (first use in this function) > 24 | #define __ARCH_UAPI_SA_FLAGS _SA_SIGGFAULT > | ^~~~~~~~~~~~~ Stephen Rothwell pointed out: > _SA_SIGGFAULT was removed by commit > > 41f5a81c07cd ("parisc: Drop HP-UX specific fcntl and signal flags") > > which was added to Linus' tree in v5.10-rc1. Solve this by removing the the parisc specific definition of __ARCH_UAPI_SA_FLAGS that was just added. Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Fixes: 23acdc76f179 ("signal: clear non-uapi flag bits when passing/returning sa_flags") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | | Merge tag 'close-range-openat2-v5.11' of ↵Linus Torvalds2020-12-165-11/+122
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull close_range/openat2 updates from Christian Brauner: "This contains a fix for openat2() to make RESOLVE_BENEATH and RESOLVE_IN_ROOT mutually exclusive. It doesn't make sense to specify both at the same time. The openat2() selftests have been extended to verify that these two flags can't be specified together. This also adds the CLOSE_RANGE_CLOEXEC flag to close_range() which allows to mark a range of file descriptors as close-on-exec without actually closing them. This is useful in general but the use-case that triggered the patch is installing a seccomp profile in the calling task before exec. If the seccomp profile wants to block the close_range() syscall it obviously can't use it to close all fds before exec. If it calls close_range() before installing the seccomp profile it needs to take care not to close fds that it will still need before the exec meaning it would have to call close_range() multiple times on different ranges and then still fall back to closing fds one by one right before the exec. CLOSE_RANGE_CLOEXEC allows to solve this problem relying on the exec codepath to get rid of the unwanted fds. The close_range() tests have been expanded to verify that CLOSE_RANGE_CLOEXEC works" * tag 'close-range-openat2-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: selftests: core: add tests for CLOSE_RANGE_CLOEXEC fs, close_range: add flag CLOSE_RANGE_CLOEXEC selftests: openat2: add RESOLVE_ conflict test openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
| * | | selftests: core: add tests for CLOSE_RANGE_CLOEXECGiuseppe Scrivano2020-12-041-0/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | check that close_range(initial_fd, last_fd, CLOSE_RANGE_CLOEXEC) correctly sets the close-on-exec bit for the specified file descriptors. Open 100 file descriptors and set the close-on-exec flag for a subset of them first, then set it for every file descriptor above 2. Make sure RLIMIT_NOFILE doesn't affect the result. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Link: https://lore.kernel.org/r/20201118104746.873084-3-gscrivan@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
| * | | fs, close_range: add flag CLOSE_RANGE_CLOEXECGiuseppe Scrivano2020-12-042-10/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't immediately close the files but it sets the close-on-exec bit. It is useful for e.g. container runtimes that usually install a seccomp profile "as late as possible" before execv'ing the container process itself. The container runtime could either do: 1 2 - install_seccomp_profile(); - close_range(MIN_FD, MAX_INT, 0); - close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile(); - execve(...); - execve(...); Both alternative have some disadvantages. In the first variant the seccomp_profile cannot block the close_range syscall, as well as opendir/read/close/... for the fallback on older kernels. In the second variant, close_range() can be used only on the fds that are not going to be needed by the runtime anymore, and it must be potentially called multiple times to account for the different ranges that must be closed. Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues. The runtime is able to use the existing open fds, the seccomp profile can block close_range() and the syscalls used for its fallback. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
| * | | selftests: openat2: add RESOLVE_ conflict testAleksa Sarai2020-12-031-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we reject conflicting RESOLVE_ flags, add a selftest to avoid regressions. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20201027235044.5240-3-cyphar@cyphar.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
| * | | openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOTAleksa Sarai2020-12-031-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was an oversight in the original implementation, as it makes no sense to specify both scoping flags to the same openat2(2) invocation (before this patch, the result of such an invocation was equivalent to RESOLVE_IN_ROOT being ignored). This is a userspace-visible ABI change, but the only user of openat2(2) at the moment is LXC which doesn't specify both flags and so no userspace programs will break as a result. Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: <stable@vger.kernel.org> # v5.6+ Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* | | | Merge branch 'regset.followup' of ↵Linus Torvalds2020-12-168-44/+28
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull regset updates from Al Viro: "Dead code removal, mostly. The only exception is a bit of cleanups on itanic (getting rid of redundant stack unwinds - each access_uarea() call does it and we call that 7 times in a row in ptrace_[sg]etregs(), *after* having done it ourselves in the caller; location where the user registers have been spilled won't change under us, and we can bloody well just call access_elf_reg() directly, giving it the unw_frame_info we'd calculated for our own purposes)" * 'regset.followup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: c6x: kill ELF_CORE_COPY_FPREGS whack-a-mole: USE_ELF_CORE_DUMP [ia64] ptrace_[sg]etregs(): use access_elf_reg() instead of access_uarea() [ia64] missed cleanups from switch to regset coredumps arm: kill dump_task_regs()
| * | | | c6x: kill ELF_CORE_COPY_FPREGSAl Viro2020-10-261-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | not used anymore, now that fdpic coredump got switched to regset Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | whack-a-mole: USE_ELF_CORE_DUMPAl Viro2020-10-264-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's been killed off back in 2009. Not a damn thing checks it. Just die, already... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | [ia64] ptrace_[sg]etregs(): use access_elf_reg() instead of access_uarea()Al Viro2020-10-261-20/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of positions passed by ptrace_[sg]etregs() to access_uarea() the latter sets the stack unwind up, walks all the way up the stack and proceeds to pass the resulting info to access_elf_reg(). The thing is, we'd *already* obtained that info, so we can bloody well call access_elf_reg() directly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | [ia64] missed cleanups from switch to regset coredumpsAl Viro2020-10-262-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | a bunch of function could've been made static back in 2008 when ia64 switched to regset-based coredumps Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | arm: kill dump_task_regs()Al Viro2020-10-262-13/+0
| | |_|/ | |/| | | | | | | | | | | | | | | | | | the last user had been fdpic Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | Merge branch 'work.epoll' of ↵Linus Torvalds2020-12-164-429/+305
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull epoll updates from Al Viro: "Deal with epoll loop check/removal races sanely (among other things). The solution merged last cycle (pinning a bunch of struct file instances) had been forced by the wrong data structures; untangling that takes a bunch of preparations, but it's worth doing - control flow in there is ridiculously overcomplicated. Memory footprint has also gone down, while we are at it. This is not all I want to do in the area, but since I didn't get around to posting the followups they'll have to wait for the next cycle" * 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (27 commits) epoll: take epitem list out of struct file epoll: massage the check list insertion lift rcu_read_lock() into reverse_path_check() convert ->f_ep_links/->fllink to hlist ep_insert(): move creation of wakeup source past the fl_ep_links insertion fold ep_read_events_proc() into the only caller take the common part of ep_eventpoll_poll() and ep_item_poll() into helper ep_insert(): we only need tep->mtx around the insertion itself ep_insert(): don't open-code ep_remove() on failure exits lift locking/unlocking ep->mtx out of ep_{start,done}_scan() ep_send_events_proc(): fold into the caller lift the calls of ep_send_events_proc() into the callers lift the calls of ep_read_events_proc() into the callers ep_scan_ready_list(): prepare to splitup ep_loop_check_proc(): saner calling conventions get rid of ep_push_nested() ep_loop_check_proc(): lift pushing the cookie into callers clean reverse_path_check_proc() a bit reverse_path_check_proc(): don't bother with cookies reverse_path_check_proc(): sane arguments ...
| * | | | epoll: take epitem list out of struct fileAl Viro2020-10-264-56/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the head of epitem list out of struct file; for epoll ones it's moved into struct eventpoll (->refs there), for non-epoll - into the new object (struct epitem_head). In place of ->f_ep_links we leave a pointer to the list head (->f_ep). ->f_ep is protected by ->f_lock and it's zeroed as soon as the list of epitems becomes empty (that can happen only in ep_remove() by now). The list of files for reverse path check is *not* going through struct file now - it's a single-linked list going through epitem_head instances. It's terminated by ERR_PTR(-1) (== EP_UNACTIVE_POINTER), so the elements of list can be distinguished by head->next != NULL. epitem_head instances are allocated at ep_insert() time (by attach_epitem()) and freed either by ep_remove() (if it empties the set of epitems *and* epitem_head does not belong to the reverse path check list) or by clear_tfile_check_list() when the list is emptied (if the set of epitems is empty by that point). Allocations are done from a separate slab - minimal kmalloc() size is too large on some architectures. As the result, we trim struct file _and_ get rid of the games with temporary file references. Locking and barriers are interesting (aren't they always); see unlist_file() and ep_remove() for details. The non-obvious part is that ep_remove() needs to decide if it will be the one to free the damn thing *before* actually storing NULL to head->epitems.first - that's what smp_load_acquire is for in there. unlist_file() lockless path is safe, since we hit it only if we observe NULL in head->epitems.first and whoever had done that store is guaranteed to have observed non-NULL in head->next. IOW, their last access had been the store of NULL into ->epitems.first and we can safely free the sucker. OTOH, we are under rcu_read_lock() and both epitem and epitem->file have their freeing RCU-delayed. So if we see non-NULL ->epitems.first, we can grab ->f_lock (all epitems in there share the same struct file) and safely recheck the emptiness of ->epitems; again, ->next is still non-NULL, so ep_remove() couldn't have freed head yet. ->f_lock serializes us wrt ep_remove(); the rest is trivial. Note that once head->epitems becomes NULL, nothing can get inserted into it - the only remaining reference to head after that point is from the reverse path check list. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | epoll: massage the check list insertionAl Viro2020-10-261-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | in the "non-epoll target" cases do it in ep_insert() rather than in do_epoll_ctl(), so that we do it only with some epitem is already guaranteed to exist. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | lift rcu_read_lock() into reverse_path_check()Al Viro2020-10-261-2/+2
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | convert ->f_ep_links/->fllink to hlistAl Viro2020-10-263-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | we don't care about the order of elements there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | ep_insert(): move creation of wakeup source past the fl_ep_links insertionAl Viro2020-10-261-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | That's the beginning of preparations for taking f_ep_links out of struct file. If insertion might fail, we will need a new failure exit. Having wakeup source creation done after that point will simplify life there; ep_remove() can (and commonly does) live with NULL epi->ws, so it can be used for cleanup after ep_create_wakeup_source() failure. It can't be used before the rbtree insertion, though, so if we are to unify all old failure exits, we need to move that thing down. Then we would be free to do simple kmem_cache_free() on the failure to insert into f_ep_links - no wakeup source to leak on that failure exit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | fold ep_read_events_proc() into the only callerAl Viro2020-10-261-29/+20
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | take the common part of ep_eventpoll_poll() and ep_item_poll() into helperAl Viro2020-10-261-30/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only reason why ep_item_poll() can't simply call ep_eventpoll_poll() (or, better yet, call vfs_poll() in all cases) is that we need to tell lockdep how deep into the hierarchy of ->mtx we are. So let's add a variant of ep_eventpoll_poll() that would take depth explicitly and turn ep_eventpoll_poll() into wrapper for that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>