summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* lib/int_sqrt: optimize initial value computePeter Zijlstra2017-11-181-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The initial value (@m) compute is: m = 1UL << (BITS_PER_LONG - 2); while (m > x) m >>= 2; Which is a linear search for the highest even bit smaller or equal to @x We can implement this using a binary search using __fls() (or better when its hardware implemented). m = 1UL << (__fls(x) & ~1UL); Especially for small values of @x; which are the more common arguments when doing a CDF on idle times; the linear search is near to worst case, while the binary search of __fls() is a constant 6 (or 5 on 32bit) branches. cycles: branches: branch-misses: PRE: hot: 43.633557 +- 0.034373 45.333132 +- 0.002277 0.023529 +- 0.000681 cold: 207.438411 +- 0.125840 45.333132 +- 0.002277 6.976486 +- 0.004219 SOFTWARE FLS: hot: 29.576176 +- 0.028850 26.666730 +- 0.004511 0.019463 +- 0.000663 cold: 165.947136 +- 0.188406 26.666746 +- 0.004511 6.133897 +- 0.004386 HARDWARE FLS: hot: 24.720922 +- 0.025161 20.666784 +- 0.004509 0.020836 +- 0.000677 cold: 132.777197 +- 0.127471 20.666776 +- 0.004509 5.080285 +- 0.003874 Averages computed over all values <128k using a LFSR to generate order. Cold numbers have a LFSR based branch trace buffer 'confuser' ran between each int_sqrt() invocation. Link: http://lkml.kernel.org/r/20171020164644.936577234@infradead.org Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Joe Perches <joe@perches.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Anshul Garg <aksgarg1989@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Michael Davidson <md@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/int_sqrt: optimize small argumentPeter Zijlstra2017-11-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current int_sqrt() computation is sub-optimal for the case of small @x. Which is the interesting case when we're going to do cumulative distribution functions on idle times, which we assume to be a random variable, where the target residency of the deepest idle state gives an upper bound on the variable (5e6ns on recent Intel chips). In the case of small @x, the compute loop: while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } can be reduced to: while (m > x) m >>= 2; Because y==0, b==m and until x>=m y will remain 0. And while this is computationally equivalent, it runs much faster because there's less code, in particular less branches. cycles: branches: branch-misses: OLD: hot: 45.109444 +- 0.044117 44.333392 +- 0.002254 0.018723 +- 0.000593 cold: 187.737379 +- 0.156678 44.333407 +- 0.002254 6.272844 +- 0.004305 PRE: hot: 67.937492 +- 0.064124 66.999535 +- 0.000488 0.066720 +- 0.001113 cold: 232.004379 +- 0.332811 66.999527 +- 0.000488 6.914634 +- 0.006568 POST: hot: 43.633557 +- 0.034373 45.333132 +- 0.002277 0.023529 +- 0.000681 cold: 207.438411 +- 0.125840 45.333132 +- 0.002277 6.976486 +- 0.004219 Averages computed over all values <128k using a LFSR to generate order. Cold numbers have a LFSR based branch trace buffer 'confuser' ran between each int_sqrt() invocation. Link: http://lkml.kernel.org/r/20171020164644.876503355@infradead.org Fixes: 30493cc9dddb ("lib/int_sqrt.c: optimize square root algorithm") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Anshul Garg <aksgarg1989@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Joe Perches <joe@perches.com> Cc: David Miller <davem@davemloft.net> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Davidson <md@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/test: delete five error messages for failed memory allocationsMarkus Elfring2017-11-183-15/+7
| | | | | | | | | | | | Omit extra messages for a memory allocation failure in these functions. This issue was detected by using the Coccinelle software. Link: http://lkml.kernel.org/r/410a4c5a-4ee0-6fcc-969c-103d8e496b78@users.sourceforge.net Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib: add module support to string testsGeert Uytterhoeven2017-11-184-142/+143
| | | | | | | | | | | | | Extract the string test code into its own source file, to allow compiling it either to a loadable module, or built into the kernel. Fixes: 03270c13c5ffaa6a ("lib/string.c: add testcases for memset16/32/64") Link: http://lkml.kernel.org/r/1505397744-3387-1-git-send-email-geert@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/radix-tree.h: remove unneeded #include <linux/bug.h>Masahiro Yamada2017-11-181-1/+0
| | | | | | | | | | | | | | | | | | | | | | This include was added by commit 187f1882b5b0 ("BUG: headers with BUG/BUG_ON etc. need linux/bug.h") because BUG_ON() was used in this header at that time. Some time later, commit 6d75f366b924 ("lib: radix-tree: check accounting of existing slot replacement users") removed the use of BUG_ON() from this header. Since then, there is no reason to include <linux/bug.h>. Link: http://lkml.kernel.org/r/1505660151-4383-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Mi <chrism@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/bitfield.h: include <linux/build_bug.h> instead of <linux/bug.h>Masahiro Yamada2017-11-181-1/+1
| | | | | | | | | | | | | | | Since commit bc6245e5efd7 ("bug: split BUILD_BUG stuff out into <linux/build_bug.h>"), #include <linux/build_bug.h> is better to pull minimal headers needed for BUILG_BUG() family. Link: http://lkml.kernel.org/r/1505700775-19826-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Dinan Gunawardena <dinan.gunawardena@netronome.com> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: Ian Abbott <abbotti@mev.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* get_maintainer: add more --self-test optionsJoe Perches2017-11-181-17/+132
| | | | | | | | | | | | | | | | | | | | Add tests for duplicate section headers, missing section content, link and scm reachability. Miscellanea: o Add --self-test=<foo> options (a comma separated list of any of sections, patterns, links or scm) where the default without options is all tests o Rename check_maintainers_patterns to self_test o Rename self_test_pattern_info to self_test_info [tom.saeger@oracle.com: improvements] Link: http://lkml.kernel.org/r/13e3986c374902fcf08ae947e36c5c608bbe3b79.1510075301.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Tom Saeger <tom.saeger@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* get_maintainer: add --self-test for internal consistency testsTom Saeger2017-11-181-17/+77
| | | | | | | | | | | | | | Add "--self-test" option to get_maintainer.pl to show potential issues in MAINTAINERS file(s) content. Pattern check warnings are shown for "F" and "X" patterns found in MAINTAINERS file(s) which do not match any files known by git. Link: http://lkml.kernel.org/r/64994f911b3510d0f4c8ac2e113501dfcec1f3c9.1509559540.git.tom.saeger@oracle.com Signed-off-by: Tom Saeger <tom.saeger@oracle.com> Acked-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dynamic_debug documentation: minor fixesRandy Dunlap2017-11-181-3/+3
| | | | | | | | | | | | Fix minor typo. Fix missing words in explaining parsing of last line number. Link: http://lkml.kernel.org/r/ebb7ff42-4945-103f-d5b4-f07a6f3343a7@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dynamic-debug-howto: fix optional/omitted ending line number to be LARGE ↵Randy Dunlap2017-11-181-0/+4
| | | | | | | | | | | | | | | | | | instead of 0 line-range is supposed to treat "1-" as "1-endoffile", so handle the special case by setting last_lineno to UINT_MAX. Fixes this error: dynamic_debug:ddebug_parse_query: last-line:0 < 1st-line:1 dynamic_debug:ddebug_exec_query: query parse failed Link: http://lkml.kernel.org/r/10a6a101-e2be-209f-1f41-54637824788e@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel/umh.c: optimize 'proc_cap_handler()'Christophe JAILLET2017-11-181-2/+2
| | | | | | | | | | If 'write' is 0, we can avoid a call to spin_lock/spin_unlock. Link: http://lkml.kernel.org/r/20171020193331.7233-1-christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Luis R. Rodriguez <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/compiler-clang.h: handle randomizable anonymous structsSandipan Das2017-11-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The GCC randomize layout plugin can randomize the member offsets of sensitive kernel data structures. To use this feature, certain annotations and members are added to the structures which affect the member offsets even if this plugin is not used. All of these structures are completely randomized, except for task_struct which leaves out some of its members. All the other members are wrapped within an anonymous struct with the __randomize_layout attribute. This is done using the randomized_struct_fields_start and randomized_struct_fields_end defines. When the plugin is disabled, the behaviour of this attribute can vary based on the GCC version. For GCC 5.1+, this attribute maps to __designated_init otherwise it is just an empty define but the anonymous structure is still present. For other compilers, both randomized_struct_fields_start and randomized_struct_fields_end default to empty defines meaning the anonymous structure is not introduced at all. So, if a module compiled with Clang, such as a BPF program, needs to access task_struct fields such as pid and comm, the offsets of these members as recognized by Clang are different from those recognized by modules compiled with GCC. If GCC 4.6+ is used to build the kernel, this can be solved by introducing appropriate defines for Clang so that the anonymous structure is seen when determining the offsets for the members. Link: http://lkml.kernel.org/r/20171109064645.25581-1-sandipan@linux.vnet.ibm.com Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Alexei Starovoitov <ast@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bug: fix "cut here" location for __WARN_TAINT architecturesKees Cook2017-11-182-3/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to v4.11, x86 used warn_slowpath_fmt() for handling WARN()s. After WARN() was moved to using UD0 on x86, the warning text started appearing _before_ the "cut here" line. This appears to have been a long-standing bug on architectures that used __WARN_TAINT, but it didn't get fixed. v4.11 and earlier on x86: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 2956 at drivers/misc/lkdtm_bugs.c:65 lkdtm_WARNING+0x21/0x30 This is a warning message Modules linked in: v4.12 and later on x86: This is a warning message ------------[ cut here ]------------ WARNING: CPU: 1 PID: 2982 at drivers/misc/lkdtm_bugs.c:68 lkdtm_WARNING+0x15/0x20 Modules linked in: With this fix: ------------[ cut here ]------------ This is a warning message WARNING: CPU: 3 PID: 3009 at drivers/misc/lkdtm_bugs.c:67 lkdtm_WARNING+0x15/0x20 Since the __FILE__ reporting happens as part of the UD0 handler, it isn't trivial to move the message to after the WARNING line, but at least we can fix the position of the "cut here" line so all the various logging tools will start including the actual runtime warning message again, when they follow the instruction and "cut here". Link: http://lkml.kernel.org/r/1510100869-73751-4-git-send-email-keescook@chromium.org Fixes: 9a93848fe787 ("x86/debug: Implement __WARN() using UD0") Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bug: define the "cut here" string in a single placeKees Cook2017-11-184-3/+5
| | | | | | | | | | | | | | | The "cut here" string is used in a few paths. Define it in a single place. Link: http://lkml.kernel.org/r/1510100869-73751-3-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lkdtm: include WARN format stringKees Cook2017-11-181-1/+3
| | | | | | | | | | | | | | | In order to test the ordering of WARN format strings, actually include one in LKDTM. Link: http://lkml.kernel.org/r/1510100869-73751-2-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* iopoll: avoid -Wint-in-bool-context warningArnd Bergmann2017-11-181-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we pass the result of a multiplication as the timeout or the delay, we can get a warning from gcc-7: drivers/mmc/host/bcm2835.c:596:149: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context] drivers/mfd/arizona-core.c:247:195: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context] drivers/gpu/drm/sun4i/sun4i_hdmi_i2c.c:49:27: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context] The warning is a bit questionable inside of a macro, but this is intentional on the side of the gcc developers. It is also an indication of another problem: we evaluate the timeout and sleep arguments multiple times, which can have undesired side-effects when those are complex expressions. This changes the two iopoll variants to use local variables for storing copies of the timeouts. This adds some more type safety, and avoids both the double-evaluation and the gcc warning. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81484 Link: http://lkml.kernel.org/r/20170726133756.2161367-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20171102114048.1526955-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* parse-maintainers: add ability to specify filenamesJoe Perches2017-11-181-5/+47
| | | | | | | | | | | | parse-maintainers.pl is convenient, but currently hard-codes the filenames that are used. Allow user-specified filenames to simplify the use of the script. Link: http://lkml.kernel.org/r/48703c068b3235223ffa3b2eb268fa0a125b25e0.1502251549.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel debug: support resetting WARN_ONCE for all architecturesAndi Kleen2017-11-183-1/+30
| | | | | | | | | | | | | | | | | | | | | Some architectures store the WARN_ONCE state in the flags field of the bug_entry. Clear that one too when resetting once state through /sys/kernel/debug/clear_warn_once Pointed out by Michael Ellerman Improves the earlier patch that add clear_warn_once. [ak@linux.intel.com: add a missing ifdef CONFIG_MODULES] Link: http://lkml.kernel.org/r/20171020170633.9593-1-andi@firstfloor.org [akpm@linux-foundation.org: fix unused var warning] [akpm@linux-foundation.org: Use 0200 for clear_warn_once file, per mpe] [akpm@linux-foundation.org: clear BUGFLAG_DONE in clear_once_table(), per mpe] Link: http://lkml.kernel.org/r/20171019204642.7404-1-andi@firstfloor.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel debug: support resetting WARN*_ONCEAndi Kleen2017-11-185-3/+42
| | | | | | | | | | | | | | | | | | | | | | | | I like _ONCE warnings because it's guaranteed that they don't flood the log. During testing I find it useful to reset the state of the once warnings, so that I can rerun tests and see if they trigger again, or can guarantee that a test run always hits the same warnings. This patch adds a debugfs interface to reset all the _ONCE warnings so that they appear again: echo 1 > /sys/kernel/debug/clear_warn_once This is implemented by putting all the warning booleans into a special section, and clearing it. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20171017221455.6740-1-andi@firstfloor.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sh/boot: add static stack-protector to pre-kernelKees Cook2017-11-181-0/+14
| | | | | | | | | | | | | | | | | | | | | The sh decompressor code triggers stack-protector code generation when using CONFIG_CC_STACKPROTECTOR_STRONG. As done for arm and mips, add a simple static stack-protector canary. As this wasn't protected before, the risk of using a weak canary is minimized. Once the kernel is actually up, a better canary is chosen. Link: http://lkml.kernel.org/r/1506972007-80614-2-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Michal Marek <mmarek@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* spelling.txt: add "unnecessary" typo variantsJoe Perches2017-11-181-0/+4
| | | | | | | | | Add unnecessary typos by copying the necessary typos. Link: http://lkml.kernel.org/r/1505074722.22023.6.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: use do-while in name_to_int()Alexey Dobriyan2017-11-181-2/+2
| | | | | | | | | | | | | | Gcc doesn't know that "len" is guaranteed to be >=1 by dcache and generates standard while-loop prologue duplicating loop condition. add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27) function old new delta name_to_int 104 77 -27 Link: http://lkml.kernel.org/r/20170912195213.GB17730@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: : uninline name_to_int()Alexey Dobriyan2017-11-183-22/+25
| | | | | | | | | | | | | | | | | Save ~360 bytes. add/remove: 1/0 grow/shrink: 0/4 up/down: 104/-463 (-359) function old new delta name_to_int - 104 +104 proc_pid_lookup 217 126 -91 proc_lookupfd_common 212 121 -91 proc_task_lookup 289 194 -95 __proc_create 588 402 -186 Link: http://lkml.kernel.org/r/20170912194850.GA17730@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc, coredump: add CoreDumping flag to /proc/pid/statusRoman Gushchin2017-11-182-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now there is no convenient way to check if a process is being coredumped at the moment. It might be necessary to recognize such state to prevent killing the process and getting a broken coredump. Writing a large core might take significant time, and the process is unresponsive during it, so it might be killed by timeout, if another process is monitoring and killing/restarting hanging tasks. We're getting a significant number of corrupted coredump files on machines in our fleet, just because processes are being killed by timeout in the middle of the core writing process. We do have a process health check, and some agent is responsible for restarting processes which are not responding for health check requests. Writing a large coredump to the disk can easily exceed the reasonable timeout (especially on an overloaded machine). This flag will allow the agent to distinguish processes which are being coredumped, extend the timeout for them, and let them produce a full coredump file. To provide an ability to detect if a process is in the state of being coredumped, we can expose a boolean CoreDumping flag in /proc/pid/status. Example: $ cat core.sh #!/bin/sh echo "|/usr/bin/sleep 10" > /proc/sys/kernel/core_pattern sleep 1000 & PID=$! cat /proc/$PID/status | grep CoreDumping kill -ABRT $PID sleep 1 cat /proc/$PID/status | grep CoreDumping $ ./core.sh CoreDumping: 0 CoreDumping: 1 [guro@fb.com: document CoreDumping flag in /proc/<pid>/status] Link: http://lkml.kernel.org/r/20170928135357.GA8470@castle.DHCP.thefacebook.com Link: http://lkml.kernel.org/r/20170920230634.31572-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: remove unneeded pageblock_skip_persistent() checksVlastimil Babka2017-11-181-15/+3
| | | | | | | | | | | | | | | | | | | | | | Commit f3c931633a59 ("mm, compaction: persistently skip hugetlbfs pageblocks") has introduced pageblock_skip_persistent() checks into migration and free scanners, to make sure pageblocks that should be persistently skipped are marked as such, regardless of the ignore_skip_hint flag. Since the previous patch introduced a new no_set_skip_hint flag, the ignore flag no longer prevents marking pageblocks as skipped. Therefore we can remove the special cases. The relevant pageblocks will be marked as skipped by the common logic which marks each pageblock where no page could be isolated. This makes the code simpler. Link: http://lkml.kernel.org/r/20171102121706.21504-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: split off flag for not updating skip hintsVlastimil Babka2017-11-183-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | Pageblock skip hints were added as a heuristic for compaction, which shares core code with CMA. Since CMA reliability would suffer from the heuristics, compact_control flag ignore_skip_hint was added for the CMA use case. Since 6815bf3f233e ("mm/compaction: respect ignore_skip_hint in update_pageblock_skip") the flag also means that CMA won't *update* the skip hints in addition to ignoring them. Today, direct compaction can also ignore the skip hints in the last resort attempt, but there's no reason not to set them when isolation fails in such case. Thus, this patch splits off a new no_set_skip_hint flag to avoid the updating, which only CMA sets. This should improve the heuristics a bit, and allow us to simplify the persistent skip bit handling as the next step. Link: http://lkml.kernel.org/r/20171102121706.21504-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: extend pageblock_skip_persistent() to all compound pagesVlastimil Babka2017-11-181-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | pageblock_skip_persistent() checks for HugeTLB pages of pageblock order. When clearing pageblock skip bits for compaction, the bits are not cleared for such pageblocks, because they cannot contain base pages suitable for migration, nor free pages to use as migration targets. This optimization can be simply extended to all compound pages of order equal or larger than pageblock order, because migrating such pages (if they support it) cannot help sub-pageblock fragmentation. This includes THP's and also gigantic HugeTLB pages, which the current implementation doesn't persistently skip due to a strict pageblock_order equality check and not recognizing tail pages. While THP pages are generally less "persistent" than HugeTLB, we can still expect that if a THP exists at the point of __reset_isolation_suitable(), it will exist also during the subsequent compaction run. The time difference here could be actually smaller than between a compaction run that sets a (non-persistent) skip bit on a THP, and the next compaction run that observes it. Link: http://lkml.kernel.org/r/20171102121706.21504-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: persistently skip hugetlbfs pageblocksDavid Rientjes2017-11-182-12/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | It is pointless to migrate hugetlb memory as part of memory compaction if the hugetlb size is equal to the pageblock order. No defragmentation is occurring in this condition. It is also pointless to for the freeing scanner to scan a pageblock where a hugetlb page is pinned. Unconditionally skip these pageblocks, and do so peristently so that they are not rescanned until it is observed that these hugepages are no longer pinned. It would also be possible to do this by involving the hugetlb subsystem in marking pageblocks to no longer be skipped when they hugetlb pages are freed. This is a simple solution that doesn't involve any additional subsystems in pageblock skip manipulation. [rientjes@google.com: fix build] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708201734390.117182@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708151639130.106658@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Tested-by: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: kcompactd should not ignore pageblock skipDavid Rientjes2017-11-181-2/+1
| | | | | | | | | | | | | | | | | | | | Kcompactd is needlessly ignoring pageblock skip information. It is doing MIGRATE_SYNC_LIGHT compaction, which is no more powerful than MIGRATE_SYNC compaction. If compaction recently failed to isolate memory from a set of pageblocks, there is nothing to indicate that kcompactd will be able to do so, or that it is beneficial from attempting to isolate memory. Use the pageblock skip hint to avoid rescanning pageblocks needlessly until that information is reset. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708151638550.106658@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: shmem: remove unused info variableCorentin Labbe2017-11-181-2/+0
| | | | | | | | | | | | | Fix the following warning by removing the unused variable: mm/shmem.c:3205:27: warning: variable 'info' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1510774029-30652-1-git-send-email-clabbe@baylibre.com Signed-off-by: Corentin Labbe <clabbe@baylibre.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/dma-debug.c: fix incorrect pfn calculationMiles Chen2017-11-181-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dma-debug reports the following warning: WARNING: CPU: 3 PID: 298 at kernel-4.4/lib/dma-debug.c:604 debug _dma_assert_idle+0x1a8/0x230() DMA-API: cpu touching an active dma mapped cacheline [cln=0x00000882300] CPU: 3 PID: 298 Comm: vold Tainted: G W O 4.4.22+ #1 Hardware name: MT6739 (DT) Call trace: debug_dma_assert_idle+0x1a8/0x230 wp_page_copy.isra.96+0x118/0x520 do_wp_page+0x4fc/0x534 handle_mm_fault+0xd4c/0x1310 do_page_fault+0x1c8/0x394 do_mem_abort+0x50/0xec I found that debug_dma_alloc_coherent() and debug_dma_free_coherent() assume that dma_alloc_coherent() always returns a linear address. However it's possible that dma_alloc_coherent() returns a non-linear address. In this case, page_to_pfn(virt_to_page(virt)) will return an incorrect pfn. If the pfn is valid and mapped as a COW page, we will hit the warning when doing wp_page_copy(). Fix this by calculating pfn for linear and non-linear addresses. [miles.chen@mediatek.com: v4] Link: http://lkml.kernel.org/r/1510872972-23919-1-git-send-email-miles.chen@mediatek.com Link: http://lkml.kernel.org/r/1506484087-1177-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/z3fold.c: use kref to prevent page free/compact raceVitaly Wool2017-11-181-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | There is a race in the current z3fold implementation between do_compact() called in a work queue context and the page release procedure when page's kref goes to 0. do_compact() may be waiting for page lock, which is released by release_z3fold_page_locked right before putting the page onto the "stale" list, and then the page may be freed as do_compact() modifies its contents. The mechanism currently implemented to handle that (checking the PAGE_STALE flag) is not reliable enough. Instead, we'll use page's kref counter to guarantee that the page is not released if its compaction is scheduled. It then becomes compaction function's responsibility to decrease the counter and quit immediately if the page was actually freed. Link: http://lkml.kernel.org/r/20171117092032.00ea56f42affbed19f4fcc6c@gmail.com Signed-off-by: Vitaly Wool <vitaly.wool@sonymobile.com> Cc: <Oleksiy.Avramchenko@sony.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix nodemask printingArnd Bergmann2017-11-181-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | The cleanup caused build warnings for constant mask pointers: mm/mempolicy.c: In function `mpol_to_str': ./include/linux/nodemask.h:108:11: warning: the comparison will always evaluate as `true' for the address of `nodes' will never be NULL [-Waddress] An earlier workaround I suggested was incorporated in the version that got merged, but that only solved the problem for gcc-7 and higher, while gcc-4.6 through gcc-6.x still warn. This changes the printing again to use inline functions that make it clear to the compiler that the line that does the NULL check has no idea whether the argument is a constant NULL. Link: http://lkml.kernel.org/r/20171117101545.119689-1-arnd@arndb.de Fixes: 0205f75571e3 ("mm: simplify nodemask printing") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Zhangshaokun <zhangshaokun@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'libnvdimm-for-4.15' of ↵Linus Torvalds2017-11-1748-560/+1405
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and dax updates from Dan Williams: "Save for a few late fixes, all of these commits have shipped in -next releases since before the merge window opened, and 0day has given a build success notification. The ext4 touches came from Jan, and the xfs touches have Darrick's reviewed-by. An xfstest for the MAP_SYNC feature has been through a few round of reviews and is on track to be merged. - Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable 'userspace flush' of persistent memory updates via filesystem-dax mappings. It arranges for any filesystem metadata updates that may be required to satisfy a write fault to also be flushed ("on disk") before the kernel returns to userspace from the fault handler. Effectively every write-fault that dirties metadata completes an fsync() before returning from the fault handler. The new MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag is validated as supported by the filesystem's ->mmap() file operation. - Add support for the standard ACPI 6.2 label access methods that replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods. This enables interoperability with environments that only implement the standardized methods. - Add support for the ACPI 6.2 NVDIMM media error injection methods. - Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for latch last shutdown status, firmware update, SMART error injection, and SMART alarm threshold control. - Cleanup physical address information disclosures to be root-only. - Fix revalidation of the DIMM "locked label area" status to support dynamic unlock of the label area. - Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA (system-physical-address) command and error injection commands. Acknowledgements that came after the commits were pushed to -next: - 957ac8c421ad ("dax: fix PMD faults on zero-length files"): Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> - a39e596baa07 ("xfs: support for synchronous DAX faults") and 7b565c9f965b ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()") Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>" * tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits) acpi, nfit: add 'Enable Latch System Shutdown Status' command support dax: fix general protection fault in dax_alloc_inode dax: fix PMD faults on zero-length files dax: stop requiring a live device for dax_flush() brd: remove dax support dax: quiet bdev_dax_supported() fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core tools/testing/nvdimm: unit test clear-error commands acpi, nfit: validate commands against the device type tools/testing/nvdimm: stricter bounds checking for error injection commands xfs: support for synchronous DAX faults xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault() ext4: Support for synchronous DAX faults ext4: Simplify error handling in ext4_dax_huge_fault() dax: Implement dax_finish_sync_fault() dax, iomap: Add support for synchronous faults mm: Define MAP_SYNC and VM_SYNC flags dax: Allow tuning whether dax_insert_mapping_entry() dirties entry dax: Allow dax_iomap_fault() to return pfn dax: Fix comment describing dax_iomap_fault() ...
| * Merge branch 'for-4.15/dax' into libnvdimm-for-nextDan Williams2017-11-16658-4541/+7338
| |\
| | * dax: fix general protection fault in dax_alloc_inodeMikulas Patocka2017-11-151-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't crash in case of allocation failure in dax_alloc_inode. syzkaller hit the following crash on e4880bc5dfb1 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access [..] RIP: 0010:dax_alloc_inode+0x3b/0x70 drivers/dax/super.c:348 Call Trace: alloc_inode+0x65/0x180 fs/inode.c:208 new_inode_pseudo+0x69/0x190 fs/inode.c:890 new_inode+0x1c/0x40 fs/inode.c:919 mount_pseudo_xattr+0x288/0x560 fs/libfs.c:261 mount_pseudo include/linux/fs.h:2137 [inline] dax_mount+0x2e/0x40 drivers/dax/super.c:388 mount_fs+0x66/0x2d0 fs/super.c:1223 Cc: <stable@vger.kernel.org> Fixes: 7b6be8444e0f ("dax: refactor dax-fs into a generic provider...") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * dax: fix PMD faults on zero-length filesJeff Moyer2017-11-151-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PMD faults on a zero length file on a file system mounted with -o dax will not generate SIGBUS as expected. fd = open(...O_TRUNC); addr = mmap(NULL, 2*1024*1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); *addr = 'a'; <expect SIGBUS> The problem is this code in dax_iomap_pmd_fault: max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT; If the inode size is zero, we end up with a max_pgoff that is way larger than 0. :) Fix it by using DIV_ROUND_UP, as is done elsewhere in the kernel. I tested this with some simple test code that ensured that SIGBUS was received where expected. Cc: <stable@vger.kernel.org> Fixes: 642261ac995e ("dax: add struct iomap based DAX PMD support") Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * dax: stop requiring a live device for dax_flush()Dan Williams2017-11-151-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that dax_flush() is no longer a driver callback (commit c3ca015fab6d "dax: remove the pmem_dax_ops->flush abstraction"), stop requiring the dax_read_lock() to be held and the device to be alive. This is in preparation for switching filesystem-dax to store pfns instead of sectors in the radix. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * brd: remove dax supportDan Williams2017-11-152-77/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DAX support in brd is awkward because its backing page frames are distinct from the ones provided by pmem, dcssblk, or axonram. We need pfn_t_devmap() entries to fully support DAX, and the limited DAX support for pfn_t_special() page frames is not interesting for brd when pmem is already a superset of brd. Lastly, brd is the only dax capable driver that may sleep in its ->direct_access() implementation. So it causes a global burden with no net gain of kernel functionality. For all these reasons, remove DAX support. Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * dax: quiet bdev_dax_supported()Dan Williams2017-11-151-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before we add another failure reason, quiet the existing log messages. Leave it to the caller to decide if bdev_dax_supported() failures are errors worth emitting to the log. Reported-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax coreDan Williams2017-11-143-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While reviewing whether MAP_SYNC should strengthen its current guarantee of syncing writes from the initiating process to also include third-party readers observing dirty metadata, Dave pointed out that the check of IOMAP_WRITE is misplaced. The policy of what to with IOMAP_F_DIRTY should be separated from the generic filesystem mechanism of reporting dirty metadata. Move this policy to the fs-dax core to simplify the per-filesystem iomap handlers, and further centralize code that implements the MAP_SYNC policy. This otherwise should not change behavior, it just makes it easier to change behavior in the future. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * xfs: support for synchronous DAX faultsChristoph Hellwig2017-11-032-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare blocks for writing and the inode is pinned, and has dirty fields other than the timestamps. In __xfs_filemap_fault() we then detect this case and call dax_finish_sync_fault() to make sure all metadata is committed, and to insert the page table entry. Note that this will also dirty corresponding radix tree entry which is what we want - fsync(2) will still provide data integrity guarantees for applications not using userspace flushing. And applications using userspace flushing can avoid calling fsync(2) and thus avoid the performance overhead. [JK: Added VM_SYNC flag handling] Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()Jan Kara2017-11-032-27/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_filemap_pfn_mkwrite() duplicates a lot of __xfs_filemap_fault(). It will also need to handle flushing for synchronous page faults. So just make that function use __xfs_filemap_fault(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * ext4: Support for synchronous DAX faultsJan Kara2017-11-034-1/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We return IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to prepare blocks for writing and the inode has some uncommitted metadata changes. In the fault handler ext4_dax_fault() we then detect this case (through VM_FAULT_NEEDDSYNC return value) and call helper dax_finish_sync_fault() to flush metadata changes and insert page table entry. Note that this will also dirty corresponding radix tree entry which is what we want - fsync(2) will still provide data integrity guarantees for applications not using userspace flushing. And applications using userspace flushing can avoid calling fsync(2) and thus avoid the performance overhead. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * ext4: Simplify error handling in ext4_dax_huge_fault()Jan Kara2017-11-031-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | If transaction starting fails, just bail out of the function immediately instead of checking for that condition throughout the function. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * dax: Implement dax_finish_sync_fault()Jan Kara2017-11-033-0/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement a function that filesystems can call to finish handling of synchronous page faults. It takes care of syncing appropriare file range and insertion of page table entry. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * dax, iomap: Add support for synchronous faultsJan Kara2017-11-033-3/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a flag to iomap interface informing the caller that inode needs fdstasync(2) for returned extent to become persistent and use it in DAX fault code so that we don't map such extents into page tables immediately. Instead we propagate the information that fdatasync(2) is necessary from dax_iomap_fault() with a new VM_FAULT_NEEDDSYNC flag. Filesystem fault handler is then responsible for calling fdatasync(2) and inserting pfn into page tables. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * mm: Define MAP_SYNC and VM_SYNC flagsJan Kara2017-11-034-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Define new MAP_SYNC flag and corresponding VMA VM_SYNC flag. As the MAP_SYNC flag is not part of LEGACY_MAP_MASK, currently it will be refused by all MAP_SHARED_VALIDATE map attempts and silently ignored for everything else. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * dax: Allow tuning whether dax_insert_mapping_entry() dirties entryJan Kara2017-11-031-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we dirty radix tree entry whenever dax_insert_mapping_entry() gets called for a write fault. With synchronous page faults we would like to insert clean radix tree entry and dirty it only once we call fdatasync() and update page tables to save some unnecessary cache flushing. Add 'dirty' argument to dax_insert_mapping_entry() for that. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| | * dax: Allow dax_iomap_fault() to return pfnJan Kara2017-11-035-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For synchronous page fault dax_iomap_fault() will need to return PFN which will then need to be inserted into page tables after fsync() completes. Add necessary parameter to dax_iomap_fault(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>