summaryrefslogtreecommitdiffstats
path: root/fs
diff options
context:
space:
mode:
authorKirill Tkhai <ktkhai@virtuozzo.com>2018-02-09 16:04:54 +0100
committerJan Kara <jack@suse.cz>2018-02-14 11:16:28 +0100
commite1603b6effe177210701d3d7132d1b68e7bd2c93 (patch)
treef2c768d3bc82d0fff7a233cb0242901cc8427654 /fs
parentMerge tag 'mips_4.16_2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhog... (diff)
downloadlinux-e1603b6effe177210701d3d7132d1b68e7bd2c93.tar.xz
linux-e1603b6effe177210701d3d7132d1b68e7bd2c93.zip
inotify: Extend ioctl to allow to request id of new watch descriptor
Watch descriptor is id of the watch created by inotify_add_watch(). It is allocated in inotify_add_to_idr(), and takes the numbers starting from 1. Every new inotify watch obtains next available number (usually, old + 1), as served by idr_alloc_cyclic(). CRIU (Checkpoint/Restore In Userspace) project supports inotify files, and restores watched descriptors with the same numbers, they had before dump. Since there was no kernel support, we had to use cycle to add a watch with specific descriptor id: while (1) { int wd; wd = inotify_add_watch(inotify_fd, path, mask); if (wd < 0) { break; } else if (wd == desired_wd_id) { ret = 0; break; } inotify_rm_watch(inotify_fd, wd); } (You may find the actual code at the below link: https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577) The cycle is suboptiomal and very expensive, but since there is no better kernel support, it was the only way to restore that. Happily, we had met mostly descriptors with small id, and this approach had worked somehow. But recent time containers with inotify with big watch descriptors begun to come, and this way stopped to work at all. When descriptor id is something about 0x34d71d6, the restoring process spins in busy loop for a long time, and the restore hungs and delay of migration from node to node could easily be watched. This patch aims to solve this problem. It introduces new ioctl INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created watch descriptor from userspace. It simply calls idr_set_cursor() primitive to populate idr::idr_next, so that next idr_alloc_cyclic() allocation will return this id, if it is not occupied. This is the way which is used to restore some other resources from userspace. For example, /proc/sys/kernel/ns_last_pid works the same for task pids. The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system may exclude it. v2: Use INT_MAX instead of custom definition of max id, as IDR subsystem guarantees id is between 0 and INT_MAX. CC: Jan Kara <jack@suse.cz> CC: Matthew Wilcox <willy@infradead.org> CC: Andrew Morton <akpm@linux-foundation.org> CC: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>
Diffstat (limited to 'fs')
-rw-r--r--fs/notify/inotify/inotify_user.c14
1 files changed, 14 insertions, 0 deletions
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 2c908b31d6c9..8f17719842ec 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -307,6 +307,20 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
spin_unlock(&group->notification_lock);
ret = put_user(send_len, (int __user *) p);
break;
+#ifdef CONFIG_CHECKPOINT_RESTORE
+ case INOTIFY_IOC_SETNEXTWD:
+ ret = -EINVAL;
+ if (arg >= 1 && arg <= INT_MAX) {
+ struct inotify_group_private_data *data;
+
+ data = &group->inotify_data;
+ spin_lock(&data->idr_lock);
+ idr_set_cursor(&data->idr, (unsigned int)arg);
+ spin_unlock(&data->idr_lock);
+ ret = 0;
+ }
+ break;
+#endif /* CONFIG_CHECKPOINT_RESTORE */
}
return ret;