summaryrefslogtreecommitdiffstats
path: root/fs/jbd2 (follow)
Commit message (Collapse)AuthorAgeFilesLines
* mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman2015-11-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd2: limit number of reserved creditsLukas Czerner2015-08-041-9/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently there is no limitation on number of reserved credits we can ask for. If we ask for more reserved credits than 1/2 of maximum transaction size, or if total number of credits exceeds the maximum transaction size per operation (which is currently only possible with the former) we will spin forever in start_this_handle(). Fix this by adding this limitation at the start of start_this_handle(). This patch also removes the credit limitation 1/2 of maximum transaction size, since we really only want to limit the number of reserved credits. There is not much point to limit the credits if there is still space in the journal. This accidentally also fixes the online resize, where due to the limitation of the journal credits we're unable to grow file systems with 1k block size and size between 16M and 32M. It has been partially fixed by 2c869b262a10ca99cb866d04087d75311587a30c, but not entirely. Thanks Jan Kara for helping me getting the correct fix. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: avoid infinite loop when destroying aborted journalJan Kara2015-07-283-8/+44
| | | | | | | | | | | | | | | | | | | Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal superblock fails" changed jbd2_cleanup_journal_tail() to return EIO when the journal is aborted. That makes logic in jbd2_log_do_checkpoint() bail out which is fine, except that jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make a progress in cleaning the journal. Without it jbd2_journal_destroy() just loops in an infinite loop. Fix jbd2_journal_destroy() to cleanup journal checkpoint lists of jbd2_log_do_checkpoint() fails with error. Reported-by: Eryu Guan <guaneryu@gmail.com> Tested-by: Eryu Guan <guaneryu@gmail.com> Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4, jbd2: add REQ_FUA flag when recording an error in the superblockDaeho Jeong2015-07-231-1/+1
| | | | | | | | | | | | | | | | | | When an error condition is detected, an error status should be recorded into superblocks of EXT4 or JBD2. However, the write request is submitted now without REQ_FUA flag, even in "barrier=1" mode, which is followed by panic() function in "errors=panic" mode. On mobile devices which make whole system reset as soon as kernel panic occurs, this write request containing an error flag will disappear just from storage cache without written to the physical cells. Therefore, when next start, even forever, the error flag cannot be shown in both superblocks, and e2fsck cannot fix the filesystem problems automatically, unless e2fsck is executed in force checking mode. [ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ] Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: speedup jbd2_journal_dirty_metadata()Jan Kara2015-07-131-6/+32
| | | | | | | | | | | | | | | | It is often the case that we mark buffer as having dirty metadata when the buffer is already in that state (frequent for bitmaps, inode table blocks, superblock). Thus it is unnecessary to contend on grabbing journal head reference and bh_state lock. Avoid that by checking whether any modification to the buffer is needed before grabbing any locks or references. [ Note: this is a fixed version of commit 2143c1965a761, which was reverted in ebeaa8ddb3663b5 due to a false positive triggering of an assertion check. -- Ted ] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Revert "jbd2: speedup jbd2_journal_dirty_metadata()"Linus Torvalds2015-06-271-27/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 2143c1965a761332ae417b22fd477b636e4f54ec. This commit seems to be the cause of the following jbd2 assertion failure: ------------[ cut here ]------------ kernel BUG at fs/jbd2/transaction.c:1325! invalid opcode: 0000 [#1] SMP Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ... CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679b411 #1 Hardware name: /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014 task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000 RIP: jbd2_journal_dirty_metadata+0x237/0x290 Call Trace: __ext4_handle_dirty_metadata+0x43/0x1f0 ext4_handle_dirty_dirent_node+0xde/0x160 ? jbd2_journal_get_write_access+0x36/0x50 ext4_delete_entry+0x112/0x160 ? __ext4_journal_start_sb+0x52/0xb0 ext4_unlink+0xfa/0x260 vfs_unlink+0xec/0x190 do_unlinkat+0x24a/0x270 SyS_unlink+0x11/0x20 entry_SYSCALL_64_fastpath+0x12/0x6a ---[ end trace ae033ebde8d080b4 ]--- which is not easily reproducible (I've seen it just once, and then Ted was able to reproduce it once). Revert it while Ted and Jan try to figure out what is wrong. Cc: Jan Kara <jack@suse.cz> Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2015-06-261-8/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge second patchbomb from Andrew Morton: - most of the rest of MM - lots of misc things - procfs updates - printk feature work - updates to get_maintainer, MAINTAINERS, checkpatch - lib/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits) exit,stats: /* obey this comment */ coredump: add __printf attribute to cn_*printf functions coredump: use from_kuid/kgid when formatting corename fs/reiserfs: remove unneeded cast NILFS2: support NFSv2 export fs/befs/btree.c: remove unneeded initializations fs/minix: remove unneeded cast init/do_mounts.c: add create_dev() failure log kasan: remove duplicate definition of the macro KASAN_FREE_PAGE fs/efs: femove unneeded cast checkpatch: emit "NOTE: <types>" message only once after multiple files checkpatch: emit an error when there's a diff in a changelog checkpatch: validate MODULE_LICENSE content checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr() checkpatch: fix processing of MEMSET issues checkpatch: suggest using ether_addr_equal*() checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files checkpatch: remove local from codespell path checkpatch: add --showfile to allow input via pipe to show filenames ...
| * fs/jbd2/journal.c: use strreplace()Rasmus Villemoes2015-06-261-8/+2
| | | | | | | | | | | | | | | | | | | | In one case, we eliminate a local variable; in the other a strlen() call and some .text. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | jbd2: speedup jbd2_journal_dirty_metadata()Jan Kara2015-06-211-6/+27
| | | | | | | | | | | | | | | | | | | | | | | | It is often the case that we mark buffer as having dirty metadata when the buffer is already in that state (frequent for bitmaps, inode table blocks, superblock). Thus it is unnecessary to contend on grabbing journal head reference and bh_state lock. Avoid that by checking whether any modification to the buffer is needed before grabbing any locks or references. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | jbd2: get rid of open coded allocation retry loopMichal Hocko2015-06-152-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | insert_revoke_hash does an open coded endless allocation loop if journal_oom_retry is true. It doesn't implement any allocation fallback strategy between the retries, though. The memory allocator doesn't know about the never fail requirement so it cannot potentially help to move on with the allocation (e.g. use memory reserves). Get rid of the retry loop and use __GFP_NOFAIL instead. We will lose the debugging message but I am not sure it is anyhow helpful. Do the same for journal_alloc_journal_head which is doing a similar thing. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | jbd2: fix ocfs2 corrupt when updating journal superblock failsJoseph Qi2015-06-152-10/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If updating journal superblock fails after journal data has been flushed, the error is omitted and this will mislead the caller as a normal case. In ocfs2, the checkpoint will be treated successfully and the other node can get the lock to update. Since the sb_start is still pointing to the old log block, it will rewrite the journal data during journal recovery by the other node. Thus the new updates will be overwritten and ocfs2 corrupts. So in above case we have to return the error, and ocfs2_commit_cache will take care of the error and prevent the other node to do update first. And only after recovering journal it can do the new updates. The issue discussion mail can be found at: https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html http://comments.gmane.org/gmane.comp.file-systems.ext4/48841 [ Fixed bug in patch which allowed a non-negative error return from jbd2_cleanup_journal_tail() to leak out of jbd2_fjournal_flush(); this was causing xfstests ext4/306 to fail. -- Ted ] Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: stable@vger.kernel.org
* | jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()Dmitry Monakhov2015-06-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start() So allocations should be done with GFP_NOFS [Full stack trace snipped from 3.10-rh7] [<ffffffff815c4bd4>] dump_stack+0x19/0x1b [<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80 [<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20 [<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17 [<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210 [<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20 [<ffffffff81147939>] mempool_alloc+0x69/0x170 [<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20 [<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150 [<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0 [<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120 [<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL [<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2] [<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2] [<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2] [<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30 [<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210 [<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2] [<ffffffff811532ce>] ? lru_cache_add+0xe/0x10 [<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4] [<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4] [<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4] [<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270 [<ffffffff81146918>] __generic_file_write_iter+0x178/0x390 [<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0 [<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0 [<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4] [<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0 [<ffffffff811b93f0>] do_sync_write+0x90/0xe0 [<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0 [<ffffffff811ba5b8>] SyS_write+0x58/0xb0 [<ffffffff815d4799>] system_call_fastpath+0x16/0x1b Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* | jbd2: speedup jbd2_journal_get_[write|undo]_access()Jan Kara2015-06-082-5/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are frequently called for buffers that are already part of the running transaction - most frequently it is the case for bitmaps, inode table blocks, and superblock. Since in such cases we have nothing to do, it is unfortunate we still grab reference to journal head, lock the bh, lock bh_state only to find out there's nothing to do. Improving this is a bit subtle though since until we find out journal head is attached to the running transaction, it can disappear from under us because checkpointing / commit decided it's no longer needed. We deal with this by protecting journal_head slab with RCU. We still have to be careful about journal head being freed & reallocated within slab and about exposing journal head in consistent state (in particular b_modified and b_frozen_data must be in correct state before we allow user to touch the buffer). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | jbd2: more simplifications in do_get_write_access()Jan Kara2015-06-081-71/+59
| | | | | | | | | | | | | | | | | | Check for the simple case of unjournaled buffer first, handle it and bail out. This allows us to remove one if and unindent the difficult case by one tab. The result is easier to read. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | jbd2: simplify error path on allocation failure in do_get_write_access()Jan Kara2015-06-081-2/+1
| | | | | | | | | | | | | | | | | | | | We were acquiring bh_state_lock when allocation of buffer failed in do_get_write_access() only to be able to jump to a label that releases the lock and does all other checks that don't make sense for this error path. Just jump into the right label instead. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | jbd2: simplify code flow in do_get_write_access()Jan Kara2015-06-081-24/+25
| | | | | | | | | | | | | | | | | | needs_copy is set only in one place in do_get_write_access(), just move the frozen buffer copying into that place and factor it out to a separate function to make do_get_write_access() slightly more readable. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | jbd2: revert must-not-fail allocation loops back to GFP_NOFAILMichal Hocko2015-06-082-23/+8
|/ | | | | | | | | | | | | | | | | | | This basically reverts 47def82672b3 (jbd2: Remove __GFP_NOFAIL from jbd2 layer). The deprecation of __GFP_NOFAIL was a bad choice because it led to open coding the endless loop around the allocator rather than removing the dependency on the non failing allocation. So the deprecation was a clear failure and the reality tells us that __GFP_NOFAIL is not even close to go away. It is still true that __GFP_NOFAIL allocations are generally discouraged and new uses should be evaluated and an alternative (pre-allocations or reservations) should be considered but it doesn't make any sense to lie the allocator about the requirements. Allocator can take steps to help making a progress if it knows the requirements. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Acked-by: David Rientjes <rientjes@google.com>
* jbd2: fix r_count overflows leading to buffer overflow in journal recoveryDarrick J. Wong2015-05-152-9/+19
| | | | | | | | | | | | | | | | | | The journal revoke block recovery code does not check r_count for sanity, which means that an evil value of r_count could result in the kernel reading off the end of the revoke table and into whatever garbage lies beyond. This could crash the kernel, so fix that. However, in testing this fix, I discovered that the code to write out the revoke tables also was not correctly checking to see if the block was full -- the current offset check is fine so long as the revoke table space size is a multiple of the record size, but this is not true when either journal_csum_v[23] are set. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
* ext4: fix NULL pointer dereference when journal restart failsLukas Czerner2015-05-151-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when journal restart fails, we'll have the h_transaction of the handle set to NULL to indicate that the handle has been effectively aborted. We handle this situation quietly in the jbd2_journal_stop() and just free the handle and exit because everything else has been done before we attempted (and failed) to restart the journal. Unfortunately there are a number of problems with that approach introduced with commit 41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart() fails" First of all in ext4 jbd2_journal_stop() will be called through __ext4_journal_stop() where we would try to get a hold of the superblock by dereferencing h_transaction which in this case would lead to NULL pointer dereference and crash. In addition we're going to free the handle regardless of the refcount which is bad as well, because others up the call chain will still reference the handle so we might potentially reference already freed memory. Moreover it's expected that we'll get aborted handle as well as detached handle in some of the journalling function as the error propagates up the stack, so it's unnecessary to call WARN_ON every time we get detached handle. And finally we might leak some memory by forgetting to free reserved handle in jbd2_journal_stop() in the case where handle was detached from the transaction (h_transaction is NULL). Fix the NULL pointer dereference in __ext4_journal_stop() by just calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix the potential memory leak in jbd2_journal_stop() and use proper handle refcounting before we attempt to free it to avoid use-after-free issues. And finally remove all WARN_ON(!transaction) from the code so that we do not get random traces when something goes wrong because when journal restart fails we will get to some of those functions. Cc: stable@vger.kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* jbd2: complain about descriptor block checksum errorsDarrick J. Wong2015-01-191-0/+3
| | | | | | | | | We should complain in dmesg when journal recovery fails on account of the descriptor block being corrupt, so that the diagnostic data can be recovered. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Merge tag 'ext4_for_linus' of ↵Linus Torvalds2014-12-121-2/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Lots of bugs fixes, including Zheng and Jan's extent status shrinker fixes, which should improve CPU utilization and potential soft lockups under heavy memory pressure, and Eric Whitney's bigalloc fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits) ext4: ext4_da_convert_inline_data_to_extent drop locked page after error ext4: fix suboptimal seek_{data,hole} extents traversial ext4: ext4_inline_data_fiemap should respect callers argument ext4: prevent fsreentrance deadlock for inline_data ext4: forbid journal_async_commit in data=ordered mode jbd2: remove unnecessary NULL check before iput() ext4: Remove an unnecessary check for NULL before iput() ext4: remove unneeded code in ext4_unlink ext4: don't count external journal blocks as overhead ext4: remove never taken branch from ext4_ext_shift_path_extents() ext4: create nojournal_checksum mount option ext4: update comments regarding ext4_delete_inode() ext4: cleanup GFP flags inside resize path ext4: introduce aging to extent status tree ext4: cleanup flag definitions for extent status tree ext4: limit number of scanned extents in status tree shrinker ext4: move handling of list of shrinkable inodes into extent status code ext4: change LRU to round-robin in extent status tree shrinker ext4: cache extent hole in extent status tree for ext4_da_map_blocks() ext4: fix block reservation for bigalloc filesystems ...
| * jbd2: remove unnecessary NULL check before iput()Theodore Ts'o2014-11-261-2/+1
| | | | | | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | jbd2: fix regression where we fail to initialize checksum seed when loadingDarrick J. Wong2014-12-021-3/+2
|/ | | | | | | | | | | | | | | | | | | | When we're enabling journal features, we cannot use the predicate jbd2_journal_has_csum_v2or3() because we haven't yet set the sb feature flag fields! Moreover, we just finished loading the shash driver, so the test is unnecessary; calculate the seed always. Without this patch, we fail to initialize the checksum seed the first time we turn on journal_checksum, which means that all journal blocks written during that first mount are corrupt. Transactions written after the second mount will be fine, since the feature flag will be set in the journal superblock. xfstests generic/{034,321,322} are the regression tests. (This is important for 3.18.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM> Reported-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: use a better hash function for the revoke tableTheodore Ts'o2014-10-301-8/+2
| | | | | | | | | The old hash function didn't work well for 64-bit block numbers, and used undefined (negative) shift right behavior. Use the generic 64-bit hash function instead. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
* jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_listJan Kara2014-09-181-32/+24
| | | | | | | | | | __jbd2_journal_clean_checkpoint_list() returns number of buffers it freed but noone was using the value so just stop doing that. This also allows for simplifying the calling convention for journal_clean_once_cp_list(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: avoid pointless scanning of checkpoint listsJan Kara2014-09-181-14/+18
| | | | | | | | | | | | | | | | | | Yuanhan has reported that when he is running fsync(2) heavy workload creating new files over ramdisk, significant amount of time is spent in __jbd2_journal_clean_checkpoint_list() trying to clean old transactions (but they cannot be cleaned up because flusher hasn't yet checkpointed those buffers). The workload can be generated by: fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096 Reduce the amount of scanning by stopping to scan the transaction list once we find a transaction that cannot be checkpointed. Note that this way of cleaning is still enough to keep freeing space in the journal after fully checkpointed transactions. Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: jbd2_log_wait_for_space improve error detetcionDmitry Monakhov2014-09-161-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If EIO happens after we have dropped j_state_lock, we won't notice that the journal has been aborted. So it is reasonable to move this check after we have grabbed the j_checkpoint_mutex and re-grabbed the j_state_lock. This patch helps to prevent false positive complain after EIO. #DMESG: __jbd2_log_wait_for_space: needed 8448 blocks and only had 8386 space available __jbd2_log_wait_for_space: no way to get more journal space in ram1-8 ------------[ cut here ]------------ WARNING: CPU: 15 PID: 6739 at fs/jbd2/checkpoint.c:168 __jbd2_log_wait_for_space+0x188/0x200() Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 15 PID: 6739 Comm: fsstress Tainted: G W 3.17.0-rc2-00429-g684de57 #139 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 00000000000000a8 ffff88077aaab878 ffffffff815c1a8c 00000000000000a8 0000000000000000 ffff88077aaab8b8 ffffffff8106ce8c ffff88077aaab898 ffff8807c57e6000 ffff8807c57e6028 0000000000002100 ffff8807c57e62f0 Call Trace: [<ffffffff815c1a8c>] dump_stack+0x51/0x6d [<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0 [<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20 [<ffffffff812419f8>] __jbd2_log_wait_for_space+0x188/0x200 [<ffffffff8123be9a>] start_this_handle+0x4da/0x7b0 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810aba87>] ? lockdep_init_map+0xe7/0x180 [<ffffffff8123c5bc>] jbd2__journal_start+0xdc/0x1d0 [<ffffffff811f2414>] ? __ext4_new_inode+0x7f4/0x1330 [<ffffffff81222a38>] __ext4_journal_start_sb+0xf8/0x110 [<ffffffff811f2414>] __ext4_new_inode+0x7f4/0x1330 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff812025bb>] ext4_create+0x8b/0x150 [<ffffffff8117fe3b>] vfs_create+0x7b/0xb0 [<ffffffff8118097b>] do_last+0x7db/0xcf0 [<ffffffff8117e31d>] ? inode_permission+0x4d/0x50 [<ffffffff811845d2>] path_openat+0x242/0x590 [<ffffffff81191a76>] ? __alloc_fd+0x36/0x140 [<ffffffff81184a6a>] do_filp_open+0x4a/0xb0 [<ffffffff81191b61>] ? __alloc_fd+0x121/0x140 [<ffffffff81172f20>] do_sys_open+0x170/0x220 [<ffffffff8117300e>] SyS_open+0x1e/0x20 [<ffffffff811715d6>] SyS_creat+0x16/0x20 [<ffffffff815c7e12>] system_call_fastpath+0x16/0x1b ---[ end trace cd71c831f82059db ]--- Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: free bh when descriptor block checksum failsDarrick J. Wong2014-09-161-0/+1
| | | | | | | | | | | | | Free the buffer head if the journal descriptor block fails checksum verification. This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum verify error in do_one_pass". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Cc: stable@vger.kernel.org
* jbd2: fix journal checksum feature flag handlingDarrick J. Wong2014-09-111-8/+8
| | | | | | | | | | Clear all three journal checksum feature flags before turning on whichever journal checksum options we want. Rearrange the error checking so that newer flags get complained about first. Reported-by: TR Reardon <thomas_reardon@hotmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd/jbd2: use non-movable memory for the jbd superblockGioh Kim2014-09-051-1/+1
| | | | | | | | | | Sicne the jbd/jbd2 superblock is not released until the file system is unmounted, allocate the buffer cache from the non-moveable area to allow page migration and CMA allocations to more easily succeed. Signed-off-by: Gioh Kim <gioh.kim@lge.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* jbd2: optimize jbd2_log_do_checkpoint() a bitJan Kara2014-09-051-3/+4
| | | | | | | | | | | When we discover written out buffer in transaction checkpoint list we don't have to recheck validity of a transaction. Either this is the last buffer in a transaction - and then we are done - or this isn't and then we can just take another buffer from the checkpoint list without dropping j_list_lock. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()Theodore Ts'o2014-09-051-14/+5
| | | | | | | | | | | The __jbd2_journal_remove_checkpoint() doesn't require an elevated b_count; indeed, until the jh structure gets released by the call to jbd2_journal_put_journal_head(), the bh's b_count is elevated by virtue of the existence of the jh structure. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: fold __wait_cp_io into jbd2_log_do_checkpoint()Theodore Ts'o2014-09-021-56/+31
| | | | | | | __wait_cp_io() is only called by jbd2_log_do_checkpoint(). Fold it in to make it a bit easier to understand. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()Theodore Ts'o2014-09-021-111/+84
| | | | | | | | | | | | | | | | | | __process_buffer() is only called by jbd2_log_do_checkpoint(), and it had a very complex locking protocol where it would be called with the j_list_lock, and sometimes exit with the lock held (if the return code was 0), or release the lock. This was confusing both to humans and to smatch (which erronously complained that the lock was taken twice). Folding __process_buffer() to the caller allows us to simplify the control flow, making the resulting function easier to read and reason about, and dropping the compiled size of fs/jbd2/checkpoint.c by 150 bytes (over 4% of the text size). Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* jbd2: fix descriptor block size handling errors with journal_csumDarrick J. Wong2014-08-294-42/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out that there are some serious problems with the on-disk format of journal checksum v2. The foremost is that the function to calculate descriptor tag size returns sizes that are too big. This causes alignment issues on some architectures and is compounded by the fact that some parts of jbd2 use the structure size (incorrectly) to determine the presence of a 64bit journal instead of checking the feature flags. Therefore, introduce journal checksum v3, which enlarges the descriptor block tag format to allow for full 32-bit checksums of journal blocks, fix the journal tag function to return the correct sizes, and fix the jbd2 recovery code to use feature flags to determine 64bitness. Add a few function helpers so we don't have to open-code quite so many pieces. Switching to a 16-byte block size was found to increase journal size overhead by a maximum of 0.1%, to convert a 32-bit journal with no checksumming to a 32-bit journal with checksum v3 enabled. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: TR Reardon <thomas_reardon@hotmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* jbd2: fix infinite loop when recovering corrupt journal blocksDarrick J. Wong2014-08-291-2/+5
| | | | | | | | | | | When recovering the journal, don't fall into an infinite loop if we encounter a corrupt journal block. Instead, just skip the block and return an error, which fails the mount and thus forces the user to run a full filesystem fsck. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* sched: Remove proliferation of wait_on_bit() action functionsNeilBrown2014-07-161-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current "wait_on_bit" interface requires an 'action' function to be provided which does the actual waiting. There are over 20 such functions, many of them identical. Most cases can be satisfied by one of just two functions, one which uses io_schedule() and one which just uses schedule(). So: Rename wait_on_bit and wait_on_bit_lock to wait_on_bit_action and wait_on_bit_lock_action to make it explicit that they need an action function. Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io which are *not* given an action function but implicitly use a standard one. The decision to error-out if a signal is pending is now made based on the 'mode' argument rather than being encoded in the action function. All instances of the old wait_on_bit and wait_on_bit_lock which can use the new version have been changed accordingly and their action functions have been discarded. wait_on_bit{_lock} does not return any specific error code in the event of a signal so the caller must check for non-zero and interpolate their own error code as appropriate. The wait_on_bit() call in __fscache_wait_on_invalidate() was ambiguous as it specified TASK_UNINTERRUPTIBLE but used fscache_wait_bit_interruptible as an action function. David Howells confirms this should be uniformly "uninterruptible" The main remaining user of wait_on_bit{,_lock}_action is NFS which needs to use a freezer-aware schedule() call. A comment in fs/gfs2/glock.c notes that having multiple 'action' functions is useful as they display differently in the 'wchan' field of 'ps'. (and /proc/$PID/wchan). As the new bit_wait{,_io} functions are tagged "__sched", they will not show up at all, but something higher in the stack. So the distinction will still be visible, only with different function names (gds2_glock_wait versus gfs2_glock_dq_wait in the gfs2/glock.c case). Since first version of this patch (against 3.15) two new action functions appeared, on in NFS and one in CIFS. CIFS also now uses an action function that makes the same freezer aware schedule call as NFS. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: David Howells <dhowells@redhat.com> (fscache, keys) Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2) Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>
* ext4: disable synchronous transaction batching if max_batch_time==0Eric Sandeen2014-07-061-1/+4
| | | | | | | | | | | | | | The mount manpage says of the max_batch_time option, This optimization can be turned off entirely by setting max_batch_time to 0. But the code doesn't do that. So fix the code to do that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* arch: Mass conversion of smp_mb__*()Peter Zijlstra2014-04-181-3/+3
| | | | | | | | | | | Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* jbd2: improve error messages for inconsistent journal headsTheodore Ts'o2014-03-121-19/+14
| | | | | | | | | | Fix up error messages printed when the transaction pointers in a journal head are inconsistent. This improves the error messages which are printed when running xfstests generic/068 in data=journal mode. See the bug report at: https://bugzilla.kernel.org/show_bug.cgi?id=60786 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()Theodore Ts'o2014-03-091-2/+4
| | | | | | | | It's not needed until we start trying to modifying fields in the journal_head which are protected by j_list_lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: minimize region locked by j_list_lock in journal_get_create_access()Theodore Ts'o2014-03-091-1/+2
| | | | | | | It's not needed until we start trying to modifying fields in the journal_head which are protected by j_list_lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: check jh->b_transaction without taking j_list_lockTheodore Ts'o2014-03-091-2/+2
| | | | | | | | jh->b_transaction is adequately protected for reading by the jbd_lock_bh_state(bh), so we don't need to take j_list_lock in __journal_try_to_free_buffer(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: add transaction to checkpoint list earlierTheodore Ts'o2014-03-091-19/+20
| | | | | | | | | We don't otherwise need j_list_lock during the rest of commit phase #7, so add the transaction to the checkpoint list at the very end of commit phase #6. This allows us to drop j_list_lock earlier, which is a good thing since it is a super hot lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: calculate statistics without holding j_state_lock and j_list_lockTheodore Ts'o2014-03-091-18/+18
| | | | | | | | | | | | | | | The two hottest locks, and thus the biggest scalability bottlenecks, in the jbd2 layer, are the j_list_lock and j_state_lock. This has inspired some people to do some truly unnatural things[1]. [1] https://www.usenix.org/system/files/conference/fast14/fast14-paper_kang.pdf We don't need to be holding both j_state_lock and j_list_lock while calculating the journal statistics, so move those calculations to the very end of jbd2_journal_commit_transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: don't hold j_state_lock while calling wake_up()Theodore Ts'o2014-03-091-2/+2
| | | | | | | | | | | The j_state_lock is one of the hottest locks in the jbd2 layer and thus one of its scalability bottlenecks. We don't need to be holding the j_state_lock while we are calling wake_up(&journal->j_wait_commit), so release the lock a little bit earlier. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: don't unplug after writing revoke recordsTheodore Ts'o2014-03-091-2/+0
| | | | | | | | | | During commit process, keep the block device plugged after we are done writing the revoke records, until we are finished writing the rest of the commit records in the journal. This will allow most of the journal blocks to be written in a single I/O operation, instead of separating the the revoke blocks from the rest of the journal blocks. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: mark file-local functions as staticRashika Kheria2014-02-181-3/+3
| | | | | | | | | | | | | | | | Mark functions as static in jbd2/journal.c because they are not used outside this file. This eliminates the following warning in jbd2/journal.c: fs/jbd2/journal.c:125:5: warning: no previous prototype for ‘jbd2_verify_csum_type’ [-Wmissing-prototypes] fs/jbd2/journal.c:146:5: warning: no previous prototype for ‘jbd2_superblock_csum_verify’ [-Wmissing-prototypes] fs/jbd2/journal.c:154:6: warning: no previous prototype for ‘jbd2_superblock_csum_set’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
* jbd2: fix use after free in jbd2_journal_start_reserved()Dan Carpenter2014-02-181-2/+4
| | | | | | | | | | If start_this_handle() fails then it leads to a use after free of "handle". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* jbd2: rename obsoleted msg JBD->JBD2Dmitry Monakhov2013-12-093-9/+9
| | | | | | | | Rename performed via: perl -pi -e 's/JBD:/JBD2:/g' fs/jbd2/*.c Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>