summaryrefslogtreecommitdiffstats
path: root/fs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* ext4: fix kernel BUG on large-scale rm -rf commandsTheodore Ts'o2012-08-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 968dee7722: "ext4: fix hole punch failure when depth is greater than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel crashes when users ran run "rm -rf" on large directory hierarchy on ext4 filesystems on RAID devices: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710) Call Trace: [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110 [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0 [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0 [<ffffffff81207e05>] ext4_truncate+0xf5/0x100 [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490 [<ffffffff811a1312>] evict+0xa2/0x1a0 [<ffffffff811a1513>] iput+0x103/0x1f0 [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0 [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40 [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50 [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00 RIP [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0 This could be reproduced as follows: The problem in commit 968dee7722 was that caused the variable 'i' to be left uninitialized if the truncate required more space than was available in the journal. This resulted in the function ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused ext4_ext_remove_space() to restart the truncate operation after starting a new jbd2 handle. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: Marti Raudsepp <marti@juffo.org> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: fix long mount times on very big file systemsTheodore Ts'o2012-08-171-0/+4
| | | | | | | | | | | Commit 8aeb00ff85a: "ext4: fix overhead calculation used by ext4_statfs()" introduced a O(n**2) calculation which makes very large file systems take forever to mount. Fix this with an optimization for non-bigalloc file systems. (For bigalloc file systems the overhead needs to be set in the the superblock.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: don't call ext4_error while block group is lockedTheodore Ts'o2012-08-172-26/+37
| | | | | | | | | | | | | | | | | | | While in ext4_validate_block_bitmap(), if an block allocation bitmap is found to be invalid, we call ext4_error() while the block group is still locked. This causes ext4_commit_super() to call a function which might sleep while in an atomic context. There's no need to keep the block group locked at this point, so hoist the ext4_error() call up to ext4_validate_block_bitmap() and release the block group spinlock before calling ext4_error(). The reported stack trace can be found at: http://article.gmane.org/gmane.comp.file-systems.ext4/33731 Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: avoid kmemcheck complaint from reading uninitialized memoryTheodore Ts'o2012-08-061-0/+1
| | | | | | | | | | | | | | | | Commit 03179fe923 introduced a kmemcheck complaint in ext4_da_get_block_prep() because we save and restore ei->i_da_metadata_calc_last_lblock even though it is left uninitialized in the case where i_da_metadata_calc_len is zero. This doesn't hurt anything, but silencing the kmemcheck complaint makes it easier for people to find real bugs. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=45631 (which is marked as a regression). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: make sure the journal sb is written in ext4_clear_journal_err()Theodore Ts'o2012-08-062-1/+3
| | | | | | | | | | | | | | | | | | | | After we transfer set the EXT4_ERROR_FS bit in the file system superblock, it's not enough to call jbd2_journal_clear_err() to clear the error indication from journal superblock --- we need to call jbd2_journal_update_sb_errno() as well. Otherwise, when the root file system is mounted read-only, the journal is replayed, and the error indicator is transferred to the superblock --- but the s_errno field in the jbd2 superblock is left set (since although we cleared it in memory, we never flushed it out to disk). This can end up confusing e2fsck. We should make e2fsck more robust in this case, but the kernel shouldn't be leaving things in this confused state, either. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
* Merge branch 'for-linus' of ↵Linus Torvalds2012-08-023-66/+40
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull two ceph fixes from Sage Weil: "The first patch fixes up the old crufty open intent code to use the atomic_open stuff properly, and the second fixes a possible null deref and memory leak with the crypto keys." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: fix crypto key null deref, memory leak ceph: simplify+fix atomic_open
| * ceph: simplify+fix atomic_openSage Weil2012-08-023-66/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | The initial ->atomic_open op was carried over from the old intent code, which was incomplete and didn't really work. Replace it with a fresh method. In particular: * always attempt to do an atomic open+lookup, both for the create case and for lookups of existing files. * fix symlink handling by returning 1 to the VFS so that we can follow the link to its destination. This fixes a longstanding ceph bug (#2392). Signed-off-by: Sage Weil <sage@inktank.com>
* | Merge tag 'ecryptfs-3.6-rc1-fixes' of ↵Linus Torvalds2012-08-027-316/+158
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs Pull ecryptfs fixes from Tyler Hicks: - Fixes a bug when the lower filesystem mount options include 'acl', but the eCryptfs mount options do not - Cleanups in the messaging code - Better handling of empty files in the lower filesystem to improve usability. Failed file creations are now cleaned up and empty lower files are converted into eCryptfs during open(). - The write-through cache changes are being reverted due to bugs that are not easy to fix. Stability outweighs the performance enhancements here. - Improvement to the mount code to catch unsupported ciphers specified in the mount options * tag 'ecryptfs-3.6-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs: eCryptfs: check for eCryptfs cipher support at mount eCryptfs: Revert to a writethrough cache model eCryptfs: Initialize empty lower files when opening them eCryptfs: Unlink lower inode when ecryptfs_create() fails eCryptfs: Make all miscdev functions use daemon ptr in file private_data eCryptfs: Remove unused messaging declarations and function eCryptfs: Copy up POSIX ACL and read-only flags from lower mount
| * | eCryptfs: check for eCryptfs cipher support at mountTim Sally2012-07-141-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue occurs when eCryptfs is mounted with a cipher supported by the crypto subsystem but not by eCryptfs. The mount succeeds and an error does not occur until a write. This change checks for eCryptfs cipher support at mount time. Resolves Launchpad issue #338914, reported by Tyler Hicks in 03/2009. https://bugs.launchpad.net/ecryptfs/+bug/338914 Signed-off-by: Tim Sally <tsally@atomicpeace.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
| * | eCryptfs: Revert to a writethrough cache modelTyler Hicks2012-07-143-63/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A change was made about a year ago to get eCryptfs to better utilize its page cache during writes. The idea was to do the page encryption operations during page writeback, rather than doing them when initially writing into the page cache, to reduce the number of page encryption operations during sequential writes. This meant that the encrypted page would only be written to the lower filesystem during page writeback, which was a change from how eCryptfs had previously wrote to the lower filesystem in ecryptfs_write_end(). The change caused a few eCryptfs-internal bugs that were shook out. Unfortunately, more grave side effects have been identified that will force changes outside of eCryptfs. Because the lower filesystem isn't consulted until page writeback, eCryptfs has no way to pass lower write errors (ENOSPC, mainly) back to userspace. Additionaly, it was reported that quotas could be bypassed because of the way eCryptfs may sometimes open the lower filesystem using a privileged kthread. It would be nice to resolve the latest issues, but it is best if the eCryptfs commits be reverted to the old behavior in the meantime. This reverts: 32001d6f "eCryptfs: Flush file in vma close" 5be79de2 "eCryptfs: Flush dirty pages in setattr" 57db4e8d "ecryptfs: modify write path to encrypt page in writepage" Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Tested-by: Colin King <colin.king@canonical.com> Cc: Colin King <colin.king@canonical.com> Cc: Thieu Le <thieule@google.com>
| * | eCryptfs: Initialize empty lower files when opening themTyler Hicks2012-07-083-28/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically, eCryptfs has only initialized lower files in the ecryptfs_create() path. Lower file initialization is the act of writing the cryptographic metadata from the inode's crypt_stat to the header of the file. The ecryptfs_open() path already expects that metadata to be in the header of the file. A number of users have reported empty lower files in beneath their eCryptfs mounts. Most of the causes for those empty files being left around have been addressed, but the presence of empty files causes problems due to the lack of proper cryptographic metadata. To transparently solve this problem, this patch initializes empty lower files in the ecryptfs_open() error path. If the metadata is unreadable due to the lower inode size being 0, plaintext passthrough support is not in use, and the metadata is stored in the header of the file (as opposed to the user.ecryptfs extended attribute), the lower file will be initialized. The number of nested conditionals in ecryptfs_open() was getting out of hand, so a helper function was created. To avoid the same nested conditional problem, the conditional logic was reversed inside of the helper function. https://launchpad.net/bugs/911507 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Cc: John Johansen <john.johansen@canonical.com> Cc: Colin Ian King <colin.king@canonical.com>
| * | eCryptfs: Unlink lower inode when ecryptfs_create() failsTyler Hicks2012-07-081-23/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ecryptfs_create() creates a lower inode, allocates an eCryptfs inode, initializes the eCryptfs inode and cryptographic metadata attached to the inode, and then writes the metadata to the header of the file. If an error was to occur after the lower inode was created, an empty lower file would be left in the lower filesystem. This is a problem because ecryptfs_open() refuses to open any lower files which do not have the appropriate metadata in the file header. This patch properly unlinks the lower inode when an error occurs in the later stages of ecryptfs_create(), reducing the chance that an empty lower file will be left in the lower filesystem. https://launchpad.net/bugs/872905 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Cc: John Johansen <john.johansen@canonical.com> Cc: Colin Ian King <colin.king@canonical.com>
| * | eCryptfs: Make all miscdev functions use daemon ptr in file private_dataTyler Hicks2012-07-083-172/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that a pointer to a valid struct ecryptfs_daemon is stored in the private_data of an opened /dev/ecryptfs file, the remaining miscdev functions can utilize the pointer rather than looking up the ecryptfs_daemon at the beginning of each operation. The security model of /dev/ecryptfs is simplified a little bit with this patch. Upon opening /dev/ecryptfs, a per-user ecryptfs_daemon is registered. Another daemon cannot be registered for that user until the last file reference is released. During the lifetime of the ecryptfs_daemon, access checks are not performed on the /dev/ecryptfs operations because it is assumed that the application securely handles the opened file descriptor and does not unintentionally leak it to processes that are not trusted. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Cc: Sasha Levin <levinsasha928@gmail.com>
| * | eCryptfs: Remove unused messaging declarations and functionTyler Hicks2012-07-082-37/+0
| | | | | | | | | | | | | | | | | | | | | These are no longer needed. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Cc: Sasha Levin <levinsasha928@gmail.com>
| * | eCryptfs: Copy up POSIX ACL and read-only flags from lower mountTyler Hicks2012-07-081-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the eCryptfs mount options do not include '-o acl', but the lower filesystem's mount options do include 'acl', the MS_POSIXACL flag is not flipped on in the eCryptfs super block flags. This flag is what the VFS checks in do_last() when deciding if the current umask should be applied to a newly created inode's mode or not. When a default POSIX ACL mask is set on a directory, the current umask is incorrectly applied to new inodes created in the directory. This patch ignores the MS_POSIXACL flag passed into ecryptfs_mount() and sets the flag on the eCryptfs super block depending on the flag's presence on the lower super block. Additionally, it is incorrect to allow a writeable eCryptfs mount on top of a read-only lower mount. This missing check did not allow writes to the read-only lower mount because permissions checks are still performed on the lower filesystem's objects but it is best to simply not allow a rw mount on top of ro mount. However, a ro eCryptfs mount on top of a rw mount is valid and still allowed. https://launchpad.net/bugs/1009207 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Stefan Beller <stefanbeller@googlemail.com> Cc: John Johansen <john.johansen@canonical.com>
* | | Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2012-08-028-158/+270
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull CIFS update from Steve French: "Adds SMB2 rmdir/mkdir capability to the SMB2/SMB2.1 support in cifs. I am holding up a few more days on merging the remainder of the SMB2/SMB2.1 enablement although it is nearing review completion, in order to address some review comments from Jeff Layton on a few of the subsequent SMB2 patches, and also to debug an unrelated cifs problem that Pavel discovered." * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Add SMB2 support for rmdir CIFS: Move rmdir code to ops struct CIFS: Add SMB2 support for mkdir operation CIFS: Separate protocol specific part from mkdir CIFS: Simplify cifs_mkdir call
| * | CIFS: Add SMB2 support for rmdirPavel Shilovsky2012-07-273-0/+12
| | | | | | | | | | | | | | | Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>
| * | CIFS: Move rmdir code to ops structPavel Shilovsky2012-07-275-13/+24
| | | | | | | | | | | | | | | Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>
| * | CIFS: Add SMB2 support for mkdir operationPavel Shilovsky2012-07-273-0/+38
| | | | | | | | | | | | | | | Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>
| * | CIFS: Separate protocol specific part from mkdirPavel Shilovsky2012-07-275-25/+49
| | | | | | | | | | | | | | | Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>
| * | CIFS: Simplify cifs_mkdir callPavel Shilovsky2012-07-273-140/+167
| | | | | | | | | | | | | | | Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2012-08-0157-412/+937
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs pile from Al Viro: "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the deadlock reproduced by xfstests 068), symlink and hardlink restriction patches, plus assorted cleanups and fixes. Note that another fsfreeze deadlock (emergency thaw one) is *not* dealt with - the series by Fernando conflicts a lot with Jan's, breaks userland ABI (FIFREEZE semantics gets changed) and trades the deadlock for massive vfsmount leak; this is going to be handled next cycle. There probably will be another pull request, but that stuff won't be in it." Fix up trivial conflicts due to unrelated changes next to each other in drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) delousing target_core_file a bit Documentation: Correct s_umount state for freeze_fs/unfreeze_fs fs: Remove old freezing mechanism ext2: Implement freezing btrfs: Convert to new freezing mechanism nilfs2: Convert to new freezing mechanism ntfs: Convert to new freezing mechanism fuse: Convert to new freezing mechanism gfs2: Convert to new freezing mechanism ocfs2: Convert to new freezing mechanism xfs: Convert to new freezing code ext4: Convert to new freezing mechanism fs: Protect write paths by sb_start_write - sb_end_write fs: Skip atime update on frozen filesystem fs: Add freezing handling to mnt_want_write() / mnt_drop_write() fs: Improve filesystem freezing handling switch the protection of percpu_counter list to spinlock nfsd: Push mnt_want_write() outside of i_mutex btrfs: Push mnt_want_write() outside of i_mutex fat: Push mnt_want_write() outside of i_mutex ...
| * | | fs: Remove old freezing mechanismJan Kara2012-07-311-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that all users are converted, we can remove functions, variables, and constants defined by the old freezing mechanism. BugLink: https://bugs.launchpad.net/bugs/897421 Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | ext2: Implement freezingJan Kara2012-07-312-1/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only missing piece to make freezing work reliably with ext2 is to stop iput() of unlinked inode from deleting the inode on frozen filesystem. So add a necessary protection to ext2_evict_inode(). We also provide appropriate ->freeze_fs and ->unfreeze_fs functions. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | btrfs: Convert to new freezing mechanismJan Kara2012-07-314-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We convert btrfs_file_aio_write() to use new freeze check. We also add proper freeze protection to btrfs_page_mkwrite(). We also add freeze protection to the transaction mechanism to avoid starting transactions on frozen filesystem. At minimum this is necessary to stop iput() of unlinked file to change frozen filesystem during truncation. Checks in cleaner_kthread() and transaction_kthread() can be safely removed since btrfs_freeze() will lock the mutexes and thus block the threads (and they shouldn't have anything to do anyway). CC: linux-btrfs@vger.kernel.org CC: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | nilfs2: Convert to new freezing mechanismJan Kara2012-07-313-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We change nilfs_page_mkwrite() to provide proper freeze protection for writeable page faults (we must wait for frozen filesystem even if the page is fully mapped). We remove all vfs_check_frozen() checks since they are now handled by the generic code. CC: linux-nilfs@vger.kernel.org CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | ntfs: Convert to new freezing mechanismJan Kara2012-07-311-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move check in ntfs_file_aio_write_nolock() to ntfs_file_aio_write() and use new freeze protection. CC: linux-ntfs-dev@lists.sourceforge.net CC: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fuse: Convert to new freezing mechanismJan Kara2012-07-311-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert check in fuse_file_aio_write() to using new freeze protection. CC: fuse-devel@lists.sourceforge.net CC: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | gfs2: Convert to new freezing mechanismJan Kara2012-07-312-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We update gfs2_page_mkwrite() to use new freeze protection and the transaction code to use freeze protection while the transaction is running. That is needed to stop iput() of unlinked file from modifying the filesystem. The rest is handled by the generic code. CC: cluster-devel@redhat.com CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | ocfs2: Convert to new freezing mechanismJan Kara2012-07-314-5/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new freeze protection. We also protect several ioctl entry points which were missing the protection. Finally, we add freeze protection to the journaling mechanism so that iput() of unlinked inode cannot modify a frozen filesystem. CC: Mark Fasheh <mfasheh@suse.com> CC: Joel Becker <jlbec@evilplan.org> CC: ocfs2-devel@oss.oracle.com Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | xfs: Convert to new freezing codeJan Kara2012-07-3110-16/+109
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Generic code now blocks all writers from standard write paths. So we add blocking of all writers coming from ioctl (we get a protection of ioctl against racing remount read-only as a bonus) and convert xfs_file_aio_write() to a non-racy freeze protection. We also keep freeze protection on transaction start to block internal filesystem writes such as removal of preallocated blocks. CC: Ben Myers <bpm@sgi.com> CC: Alex Elder <elder@kernel.org> CC: xfs@oss.sgi.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | ext4: Convert to new freezing mechanismJan Kara2012-07-313-29/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We remove most of frozen checks since upper layer takes care of blocking all writes. We have to handle protection in ext4_page_mkwrite() in a special way because we cannot use generic block_page_mkwrite(). Also we add a freeze protection to ext4_evict_inode() so that iput() of unlinked inode cannot modify a frozen filesystem (we cannot easily instrument ext4_journal_start() / ext4_journal_stop() with freeze protection because we are missing the superblock pointer in ext4_journal_stop() in nojournal mode). CC: linux-ext4@vger.kernel.org CC: "Theodore Ts'o" <tytso@mit.edu> BugLink: https://bugs.launchpad.net/bugs/897421 Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fs: Protect write paths by sb_start_write - sb_end_writeJan Kara2012-07-313-19/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several entry points which dirty pages in a filesystem. mmap (handled by block_page_mkwrite()), buffered write (handled by __generic_file_aio_write()), splice write (generic_file_splice_write), truncate, and fallocate (these can dirty last partial page - handled inside each filesystem separately). Protect these places with sb_start_write() and sb_end_write(). ->page_mkwrite() calls are particularly complex since they are called with mmap_sem held and thus we cannot use standard sb_start_write() due to lock ordering constraints. We solve the problem by using a special freeze protection sb_start_pagefault() which ranks below mmap_sem. BugLink: https://bugs.launchpad.net/bugs/897421 Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fs: Skip atime update on frozen filesystemJan Kara2012-07-311-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is unexpected to block reading of frozen filesystem because of atime update. Also handling blocking on frozen filesystem because of atime update would make locking more complex than it already is. So just skip atime update when filesystem is frozen like we skip it when filesystem is remounted read-only. BugLink: https://bugs.launchpad.net/bugs/897421 Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fs: Add freezing handling to mnt_want_write() / mnt_drop_write()Jan Kara2012-07-315-24/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of places where we want freeze protection coincides with the places where we also have remount-ro protection. So make mnt_want_write() and mnt_drop_write() (and their _file alternative) prevent freezing as well. For the few cases that are really interested only in remount-ro protection provide new function variants. BugLink: https://bugs.launchpad.net/bugs/897421 Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fs: Improve filesystem freezing handlingJan Kara2012-07-311-21/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vfs_check_frozen() tests are racy since the filesystem can be frozen just after the test is performed. Thus in write paths we can end up marking some pages or inodes dirty even though the file system is already frozen. This creates problems with flusher thread hanging on frozen filesystem. Another problem is that exclusion between ->page_mkwrite() and filesystem freezing has been handled by setting page dirty and then verifying s_frozen. This guaranteed that either the freezing code sees the faulted page, writes it, and writeprotects it again or we see s_frozen set and bail out of page fault. This works to protect from page being marked writeable while filesystem freezing is running but has an unpleasant artefact of leaving dirty (although unmodified and writeprotected) pages on frozen filesystem resulting in similar problems with flusher thread as the first problem. This patch aims at providing exclusion between write paths and filesystem freezing. We implement a writer-freeze read-write semaphore in the superblock. Actually, there are three such semaphores because of lock ranking reasons - one for page fault handlers (->page_mkwrite), one for all other writers, and one of internal filesystem purposes (used e.g. to track running transactions). Write paths which should block freezing (e.g. directory operations, ->aio_write(), ->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem takes the writer side. Only that we don't really want to bounce cachelines of the semaphores between CPUs for each write happening. So we implement the reader side of the semaphore as a per-cpu counter and the writer side is implemented using s_writers.frozen superblock field. [AV: microoptimize sb_start_write(); we want it fast in normal case] BugLink: https://bugs.launchpad.net/bugs/897421 Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | nfsd: Push mnt_want_write() outside of i_mutexJan Kara2012-07-305-46/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When mnt_want_write() starts to handle freezing it will get a full lock semantics requiring proper lock ordering. So push mnt_want_write() call consistently outside of i_mutex. CC: linux-nfs@vger.kernel.org CC: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | btrfs: Push mnt_want_write() outside of i_mutexJan Kara2012-07-301-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When mnt_want_write() starts to handle freezing it will get a full lock semantics requiring proper lock ordering. So push mnt_want_write() call consistently outside of i_mutex. CC: Chris Mason <chris.mason@oracle.com> CC: linux-btrfs@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fat: Push mnt_want_write() outside of i_mutexJan Kara2012-07-301-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When mnt_want_write() starts to handle freezing it will get a full lock semantics requiring proper lock ordering. So push mnt_want_write() call outside of i_mutex as in other places. CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fs: Push mnt_want_write() outside of i_mutexJan Kara2012-07-301-21/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, mnt_want_write() is sometimes called with i_mutex held and sometimes without it. This isn't really a problem because mnt_want_write() is a non-blocking operation (essentially has a trylock semantics) but when the function starts to handle also frozen filesystems, it will get a full lock semantics and thus proper lock ordering has to be established. So move all mnt_want_write() calls outside of i_mutex. One non-trivial case needing conversion is kern_path_create() / user_path_create() which didn't include mnt_want_write() but now needs to because it acquires i_mutex. Because there are virtual file systems which don't bother with freeze / remount-ro protection we actually provide both versions of the function - one which calls mnt_want_write() and one which does not. [AV: scratch the previous, mnt_want_write() has been moved to kern_path_create() by now] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | sysfs: Push file_update_time() into bin_page_mkwrite()Jan Kara2012-07-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | gfs2: Push file_update_time() into gfs2_page_mkwrite()Jan Kara2012-07-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CC: Steven Whitehouse <swhiteho@redhat.com> CC: cluster-devel@redhat.com Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | 9p: Push file_update_time() into v9fs_vm_page_mkwrite()Jan Kara2012-07-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CC: Eric Van Hensbergen <ericvh@gmail.com> CC: Ron Minnich <rminnich@sandia.gov> CC: Latchesar Ionkov <lucho@ionkov.net> CC: v9fs-developer@lists.sourceforge.net Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | ceph: Push file_update_time() into ceph_page_mkwrite()Jan Kara2012-07-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CC: Sage Weil <sage@newdream.net> CC: ceph-devel@vger.kernel.org Acked-by: Sage Weil <sage@newdream.net> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fs: Push file_update_time() into __block_page_mkwrite()Jan Kara2012-07-301-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | simplify lookup_open()/atomic_open() - do the temporary mnt_want_write() earlyAl Viro2012-07-301-22/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The write ref to vfsmount taken in lookup_open()/atomic_open() is going to be dropped; we take the one to stay in dentry_open(). Just grab the temporary in caller if it looks like we are going to need it (create/truncate/writable open) and pass (by value) "has it succeeded" flag. Instead of doing mnt_want_write() inside, check that flag and treat "false" as "mnt_want_write() has just failed". mnt_want_write() is cheap and the things get considerably simpler and more robust that way - we get it and drop it in the same function, to start with, rather than passing a "has something in the guts of really scary functions taken it" back to caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fix O_EXCL handling for devicesAl Viro2012-07-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | O_EXCL without O_CREAT has different semantics; it's "fail if already opened", not "fail if already exists". commit 71574865 broke that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | lockd: handle lockowner allocation failure in nlmclnt_proc()Al Viro2012-07-291-0/+5
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | lockd: shift grabbing a reference to nlm_host into nlm_alloc_call()Al Viro2012-07-294-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's used both for client and server hosts; we can't do nlmclnt_release_host() on failure exits, since the host might need nlmsvc_release_host(), with BUG_ON() for calling the wrong one. Makes life simpler for callers, actually... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fs: add link restriction audit reportingKees Cook2012-07-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds audit messages for unexpected link restriction violations so that system owners will have some sort of potentially actionable information about misbehaving processes. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>