summaryrefslogtreecommitdiffstats
path: root/kernel (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Make wait_for_device_probe() also do scsi_complete_async_scans()Linus Torvalds2012-07-192-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit a7a20d103994 ("sd: limit the scope of the async probe domain") make the SCSI device probing run device discovery in it's own async domain. However, as a result, the partition detection was no longer synchronized by async_synchronize_full() (which, despite the name, only synchronizes the global async space, not all of them). Which in turn meant that "wait_for_device_probe()" would not wait for the SCSI partitions to be parsed. And "wait_for_device_probe()" was what the boot time init code relied on for mounting the root filesystem. Now, most people never noticed this, because not only is it timing-dependent, but modern distributions all use initrd. So the root filesystem isn't actually on a disk at all. And then before they actually mount the final disk filesystem, they will have loaded the scsi-wait-scan module, which not only does the expected wait_for_device_probe(), but also does scsi_complete_async_scans(). [ Side note: scsi_complete_async_scans() had also been partially broken, but that was fixed in commit 43a8d39d0137 ("fix async probe regression"), so that same commit a7a20d103994 had actually broken setups even if you used scsi-wait-scan explicitly ] Solve this problem by just moving the scsi_complete_async_scans() call into wait_for_device_probe(). Everybody who wants to wait for device probing to finish really wants the SCSI probing to complete, so there's no reason not to do this. So now "wait_for_device_probe()" really does what the name implies, and properly waits for device probing to finish. This also removes the now unnecessary extra calls to scsi_complete_async_scans(). Reported-and-tested-by: Artem S. Tashkinov <t.artem@mailcity.com> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: James Bottomley <jbottomley@parallels.com> Cc: Borislav Petkov <bp@amd64.org> Cc: linux-scsi <linux-scsi@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2012-07-181-2/+6
|\ | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip One more time/ntp fix pulled from Ingo Molnar. * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ntp: Fix STA_INS/DEL clearing bug
| * ntp: Fix STA_INS/DEL clearing bugJohn Stultz2012-07-151-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d, I introduced a bug that kept the STA_INS or STA_DEL bit from being cleared from time_status via adjtimex() without forcing STA_PLL first. Usually once the STA_INS is set, it isn't cleared until the leap second is applied, so its unlikely this affected anyone. However during testing I noticed it took some effort to cancel a leap second once STA_INS was set. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> CC: stable@vger.kernel.org # 3.4 Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | timekeeping: Add missing update call in timekeeping_resume()Thomas Gleixner2012-07-161-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | The leap second rework unearthed another issue of inconsistent data. On timekeeping_resume() the timekeeper data is updated, but nothing calls timekeeping_update(), so now the update code in the timer interrupt sees stale values. This has been the case before those changes, but then the timer interrupt was using stale data as well so this went unnoticed for quite some time. Add the missing update call, so all the data is consistent everywhere. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Reported-and-tested-by: "Rafael J. Wysocki" <rjw@sisk.pl> Reported-and-tested-by: Martin Steigerwald <Martin@lichtvoll.de> Cc: LKML <linux-kernel@vger.kernel.org> Cc: Linux PM list <linux-pm@vger.kernel.org> Cc: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>, Cc: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*---. Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and ↵Linus Torvalds2012-07-149-85/+229
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU, perf, and scheduler fixes from Ingo Molnar. The RCU fix is a revert for an optimization that could cause deadlocks. One of the scheduler commits (164c33c6adee "sched: Fix fork() error path to not crash") is correct but not complete (some architectures like Tile are not covered yet) - the resulting additional fixes are still WIP and Ingo did not want to delay these pending fixes. See this thread on lkml: [PATCH] fork: fix error handling in dup_task() The perf fixes are just trivial oneliners. * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf kvm: Fix segfault with report and mixed guestmount use perf kvm: Fix regression with guest machine creation perf script: Fix format regression due to libtraceevent merge ring-buffer: Fix accounting of entries when removing pages ring-buffer: Fix crash due to uninitialized new_pages list head * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS/sched: Update scheduler file pattern sched/nohz: Rewrite and fix load-avg computation -- again sched: Fix fork() error path to not crash
| | | * sched/nohz: Rewrite and fix load-avg computation -- againPeter Zijlstra2012-07-054-75/+205
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <dsmythies@telus.net> Reported-and-tested-by: Charles Wang <muming.wq@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | | * sched: Fix fork() error path to not crashSalman Qazi2012-07-051-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In dup_task_struct(), if arch_dup_task_struct() fails, the clean up code fails to clean up correctly. That's because the clean up code depends on unininitalized ti->task pointer. We fix this by making sure that the task and thread_info know about each other before we attempt to take the error path. Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120626011815.11323.5533.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | ring-buffer: Fix accounting of entries when removing pagesVaibhav Nagarnaik2012-06-291-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When removing pages from the ring buffer, its state is not reset. This means that the counters need to be correctly updated to account for the pages removed. Update the overrun counter to reflect the removed events from the pages. Link: http://lkml.kernel.org/r/1340998301-1715-1-git-send-email-vnagarnaik@google.com Cc: Justin Teravest <teravest@google.com> Cc: David Sharp <dhsharp@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | ring-buffer: Fix crash due to uninitialized new_pages list headVaibhav Nagarnaik2012-06-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new_pages list head in the cpu_buffer is not initialized. When adding pages to the ring buffer, if the memory allocation fails in ring_buffer_resize, the clean up handler tries to free up the allocated pages from all the cpu buffers. The panic is caused by referencing the uninitialized new_pages list head. Initializing the new_pages list head in rb_allocate_cpu_buffer fixes this. Link: http://lkml.kernel.org/r/1340391005-10880-1-git-send-email-vnagarnaik@google.com Cc: Justin Teravest <teravest@google.com> Cc: David Sharp <dhsharp@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | Merge branch 'rcu/urgent' of ↵Ingo Molnar2012-07-064-4/+13
| |\ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent Pull low probability CONFIG_RCU_BOOST=y deadlock fix from Paul E. McKenney. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"Paul E. McKenney2012-07-024-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 616c310e83b872024271c915c1b9ab505b9efad9. (Move PREEMPT_RCU preemption to switch_to() invocation). Testing by Sasha Levin <levinsasha928@gmail.com> showed that this can result in deadlock due to invoking the scheduler when one of the runqueue locks is held. Because this commit was simply a performance optimization, revert it. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Sasha Levin <levinsasha928@gmail.com>
* | | | Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2012-07-142-18/+98
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull the leap second fixes from Thomas Gleixner: "It's a rather large series, but well discussed, refined and reviewed. It got a massive testing by John, Prarit and tip. In theory we could split it into two parts. The first two patches f55a6faa3843: hrtimer: Provide clock_was_set_delayed() 4873fa070ae8: timekeeping: Fix leapsecond triggered load spike issue are merely preventing the stuff loops forever issues, which people have observed. But there is no point in delaying the other 4 commits which achieve full correctness into 3.6 as they are tagged for stable anyway. And I rather prefer to have the full fixes merged in bulk than a "prevent the observable wreckage and deal with the hidden fallout later" approach." * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Update hrtimer base offsets each hrtimer_interrupt timekeeping: Provide hrtimer update function hrtimers: Move lock held region in hrtimer_interrupt() timekeeping: Maintain ktime_t based offsets for hrtimers timekeeping: Fix leapsecond triggered load spike issue hrtimer: Provide clock_was_set_delayed()
| * | | | hrtimer: Update hrtimer base offsets each hrtimer_interruptJohn Stultz2012-07-111-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The update of the hrtimer base offsets on all cpus cannot be made atomically from the timekeeper.lock held and interrupt disabled region as smp function calls are not allowed there. clock_was_set(), which enforces the update on all cpus, is called either from preemptible process context in case of do_settimeofday() or from the softirq context when the offset modification happened in the timer interrupt itself due to a leap second. In both cases there is a race window for an hrtimer interrupt between dropping timekeeper lock, enabling interrupts and clock_was_set() issuing the updates. Any interrupt which arrives in that window will see the new time but operate on stale offsets. So we need to make sure that an hrtimer interrupt always sees a consistent state of time and offsets. ktime_get_update_offsets() allows us to get the current monotonic time and update the per cpu hrtimer base offsets from hrtimer_interrupt() to capture a consistent state of monotonic time and the offsets. The function replaces the existing ktime_get() calls in hrtimer_interrupt(). The overhead of the new function vs. ktime_get() is minimal as it just adds two store operations. This ensures that any changes to realtime or boottime offsets are noticed and stored into the per-cpu hrtimer base structures, prior to any hrtimer expiration and guarantees that timers are not expired early. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | timekeeping: Provide hrtimer update functionThomas Gleixner2012-07-111-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To finally fix the infamous leap second issue and other race windows caused by functions which change the offsets between the various time bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a function which atomically gets the current monotonic time and updates the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic overhead. The previous patch which provides ktime_t offsets allows us to make this function almost as cheap as ktime_get() which is going to be replaced in hrtimer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | hrtimers: Move lock held region in hrtimer_interrupt()Thomas Gleixner2012-07-111-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to update the base offsets from this code and we need to do that under base->lock. Move the lock held region around the ktime_get() calls. The ktime_get() calls are going to be replaced with a function which gets the time and the offsets atomically. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | timekeeping: Maintain ktime_t based offsets for hrtimersThomas Gleixner2012-07-111-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to update the hrtimer clock offsets from the hrtimer interrupt context. To avoid conversions from timespec to ktime_t maintain a ktime_t based representation of those offsets in the timekeeper. This puts the conversion overhead into the code which updates the underlying offsets and provides fast accessible values in the hrtimer interrupt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | timekeeping: Fix leapsecond triggered load spike issueJohn Stultz2012-07-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The timekeeping code misses an update of the hrtimer subsystem after a leap second happened. Due to that timers based on CLOCK_REALTIME are either expiring a second early or late depending on whether a leap second has been inserted or deleted until an operation is initiated which causes that update. Unless the update happens by some other means this discrepancy between the timekeeping and the hrtimer data stays forever and timers are expired either early or late. The reported immediate workaround - $ data -s "`date`" - is causing a call to clock_was_set() which updates the hrtimer data structures. See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix Add the missing clock_was_set() call to update_wall_time() in case of a leap second event. The actual update is deferred to softirq context as the necessary smp function call cannot be invoked from hard interrupt context. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | hrtimer: Provide clock_was_set_delayed()John Stultz2012-07-111-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clock_was_set() cannot be called from hard interrupt context because it calls on_each_cpu(). For fixing the widely reported leap seconds issue it is necessary to call it from hard interrupt context, i.e. the timer tick code, which does the timekeeping updates. Provide a new function which denotes it in the hrtimer cpu base structure of the cpu on which it is called and raise the hrtimer softirq. We then execute the clock_was_set() notificiation from softirq context in run_hrtimer_softirq(). The hrtimer softirq is rarely used, so polling the flag there is not a performance issue. [ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get rid of all this ifdeffery ASAP ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | | | Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds2012-07-121-6/+10
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge random patches from Andrew Morton. * Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (32 commits) memblock: free allocated memblock_reserved_regions later mm: sparse: fix usemap allocation above node descriptor section mm: sparse: fix section usemap placement calculation xtensa: fix incorrect memset shmem: cleanup shmem_add_to_page_cache shmem: fix negative rss in memcg memory.stat tmpfs: revert SEEK_DATA and SEEK_HOLE drivers/rtc/rtc-twl.c: fix threaded IRQ to use IRQF_ONESHOT fat: fix non-atomic NFS i_pos read MAINTAINERS: add OMAP CPUfreq driver to OMAP Power Management section sgi-xp: nested calls to spin_lock_irqsave() fs: ramfs: file-nommu: add SetPageUptodate() drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning mm/memory_hotplug.c: release memory resources if hotadd_new_pgdat() fails h8300/uaccess: add mising __clear_user() h8300/uaccess: remove assignment to __gu_val in unhandled case of get_user() h8300/time: add missing #include <asm/irq_regs.h> h8300/signal: fix typo "statis" h8300/pgtable: add missing #include <asm-generic/pgtable.h> drivers/rtc/rtc-ab8500.c: ensure correct probing of the AB8500 RTC when Device Tree is enabled ...
| * | | | | c/r: prctl: less paranoid prctl_set_mm_exe_file()Konstantin Khlebnikov2012-07-121-6/+10
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "no other files mapped" requirement from my previous patch (c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too paranoid, it forbids operation even if there mapped one shared-anon vma. Let's check that current mm->exe_file already unmapped, in this case exe_file symlink already outdated and its changing is reasonable. Plus, this patch fixes exit code in case operation success. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Kees Cook <keescook@chromium.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge tag 'driver-core-3.5-rc6' of ↵Linus Torvalds2012-07-111-76/+126
|\ \ \ \ \ | |_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull printk fixes from Greg Kroah-Hartman: "Here are some more printk fixes for 3.5-rc6. They resolve all known outstanding issues with the printk changes that have been happening. They have been tested by the people reporting the problems. This hopefully should be it for the printk stuff for 3.5-final. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'driver-core-3.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kmsg: merge continuation records while printing kmsg: /proc/kmsg - support reading of partial log records kmsg: make sure all messages reach a newly registered boot console kmsg: properly handle concurrent non-blocking read() from /proc/kmsg kmsg: add the facility number to the syslog prefix kmsg: escape the backslash character while exporting data printk: replacing the raw_spin_lock/unlock with raw_spin_lock/unlock_irq
| * | | | kmsg: merge continuation records while printingKay Sievers2012-07-091-42/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In (the unlikely) case our continuation merge buffer is busy, we unfortunately can not merge further continuation printk()s into a single record and have to store them separately, which leads to split-up output of these lines when they are printed. Add some flags about newlines and prefix existence to these records and try to reconstruct the full line again, when the separated records are printed. Reported-By: Michael Neuling <mikey@neuling.org> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Tested-By: Michael Neuling <mikey@neuling.org> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | kmsg: /proc/kmsg - support reading of partial log recordsKay Sievers2012-07-091-8/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Restore support for partial reads of any size on /proc/kmsg, in case the supplied read buffer is smaller than the record size. Some people seem to think is is ia good idea to run: $ dd if=/proc/kmsg bs=1 of=... as a klog bridge. Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44211 Reported-by: Jukka Ollila <jiiksteri@gmail.com> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | kmsg: make sure all messages reach a newly registered boot consoleKay Sievers2012-07-061-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We suppress printing kmsg records to the console, which are already printed immediately while we have received their fragments. Newly registered boot consoles print the entire kmsg buffer during registration. Clear the console-suppress flag after we skipped the record during its first storage, so any later print will see these records as usual. Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | kmsg: properly handle concurrent non-blocking read() from /proc/kmsgKay Sievers2012-07-061-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The /proc/kmsg read() interface is internally simply wired up to a sequence of syslog() syscalls, which might are racy between their checks and actions, regarding concurrency. In the (very uncommon) case of concurrent readers of /dev/kmsg, relying on usual O_NONBLOCK behavior, the recently introduced mutex might block an O_NONBLOCK reader in read(), when poll() returns for it, but another process has already read the data in the meantime. We've seen that while running artificial test setups and tools that "fight" about /proc/kmsg data. This restores the original /proc/kmsg behavior, where in case of concurrent read()s, poll() might wake up but the read() syscall will just return 0 to the caller, while another process has "stolen" the data. This is in the general case not the expected behavior, but it is the exact same one, that can easily be triggered with a 3.4 kernel, and some tools might just rely on it. The mutex is not needed, the original integrity issue which introduced it, is in the meantime covered by: "fill buffer with more than a single message for SYSLOG_ACTION_READ" 116e90b23f74d303e8d607c7a7d54f60f14ab9f2 Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | kmsg: add the facility number to the syslog prefixKay Sievers2012-07-061-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the recent split of facility and level into separate variables, we miss the facility value (always 0 for kernel-originated messages) in the syslog prefix. On Tue, Jul 3, 2012 at 12:45 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote: > Static checkers complain about the impossible condition here. > > In 084681d14e ('printk: flush continuation lines immediately to > console'), we changed msg->level from being a u16 to being an unsigned > 3 bit bitfield. Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | kmsg: escape the backslash character while exporting dataKay Sievers2012-07-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Non-printable characters in the log data are hex-escaped to ensure safe post processing. We need to escape a backslash we find in the data, to be able to distinguish it from a backslash we add for the escaping. Also escape the non-printable character 127. Thanks to Miloslav Trmac for the heads up. Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | printk: replacing the raw_spin_lock/unlock with raw_spin_lock/unlock_irqliu chuansheng2012-07-061-12/+12
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In function devkmsg_read/writev/llseek/poll/open()..., the function raw_spin_lock/unlock is used, there is potential deadlock case happening. CPU1: thread1 doing the cat /dev/kmsg: raw_spin_lock(&logbuf_lock); while (user->seq == log_next_seq) { when thread1 run here, at this time one interrupt is coming on CPU1 and running based on this thread,if the interrupt handle called the printk which need the logbuf_lock spin also, it will cause deadlock. So we should use raw_spin_lock/unlock_irq here. Acked-by: Kay Sievers <kay@vrfy.org> Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | cgroup: fix cgroup hierarchy umount raceTejun Heo2012-07-081-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal optional" allowed a css to linger after the associated cgroup is removed. As a css holds a reference on the cgroup's dentry, it means that cgroup dentries may linger for a while. Destroying a superblock which has dentries with positive refcnts is a critical bug and triggers BUG() in vfs code. As each cgroup dentry holds an s_active reference, any lingering cgroup has both its dentry and the superblock pinned and thus preventing premature release of superblock. Unfortunately, after 48ddbe1946, there's a small window while releasing a cgroup which is directly under the root of the hierarchy. When a cgroup directory is released, vfs layer first deletes the corresponding dentry and then invokes dput() on the parent, which may recurse further, so when a cgroup directly below root cgroup is released, the cgroup is first destroyed - which releases the s_active it was holding - and then the dentry for the root cgroup is dput(). This creates a window where the root dentry's refcnt isn't zero but superblock's s_active is. If umount happens before or during this window, vfs will see the root dentry with non-zero refcnt and trigger BUG(). Before 48ddbe1946, this problem didn't exist because the last dentry reference was guaranteed to be put synchronously from rmdir(2) invocation which holds s_active around the whole process. Fix it by holding an extra superblock->s_active reference across dput() from css release, which is the dput() path added by 48ddbe1946 and the only one which doesn't hold an extra s_active ref across the final cgroup dput(). Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4FEEA5CB.8070809@huawei.com> Reported-by: shyju pv <shyju.pv@huawei.com> Tested-by: shyju pv <shyju.pv@huawei.com> Cc: Sasha Levin <levinsasha928@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com>
* | | | Revert "cgroup: superblock can't be released with active dentries"Tejun Heo2012-07-081-14/+3
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit fa980ca87d15bb8a1317853f257a505990f3ffde. The commit was an attempt to fix a race condition where a cgroup hierarchy may be unmounted with positive dentry reference on root cgroup. While the commit made the race condition slightly more difficult to trigger, the race was still there and could be reliably triggered using a different test case. Revert the incorrect fix. The next commit will describe the race and fix it correctly. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4FEEA5CB.8070809@huawei.com> Reported-by: shyju pv <shyju.pv@huawei.com> Cc: Sasha Levin <levinsasha928@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com>
* | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2012-07-042-4/+7
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block bits from Jens Axboe: "As vacation is coming up, thought I'd better get rid of my pending changes in my for-linus branch for this iteration. It contains: - Two patches for mtip32xx. Killing a non-compliant sysfs interface and moving it to debugfs, where it belongs. - A few patches from Asias. Two legit bug fixes, and one killing an interface that is no longer in use. - A patch from Jan, making the annoying partition ioctl warning a bit less annoying, by restricting it to !CAP_SYS_RAWIO only. - Three bug fixes for drbd from Lars Ellenberg. - A fix for an old regression for umem, it hasn't really worked since the plugging scheme was changed in 3.0. - A few fixes from Tejun. - A splice fix from Eric Dumazet, fixing an issue with pipe resizing." * 'for-linus' of git://git.kernel.dk/linux-block: scsi: Silence unnecessary warnings about ioctl to partition block: Drop dead function blk_abort_queue() block: Mitigate lock unbalance caused by lock switching block: Avoid missed wakeup in request waitqueue umem: fix up unplugging splice: fix racy pipe->buffers uses drbd: fix null pointer dereference with on-congestion policy when diskless drbd: fix list corruption by failing but already aborted reads drbd: fix access of unallocated pages and kernel panic xen/blkfront: Add WARN to deal with misbehaving backends. blkcg: drop local variable @q from blkg_destroy() mtip32xx: Create debugfs entries for troubleshooting mtip32xx: Remove 'registers' and 'flags' from sysfs blkcg: fix blkg_alloc() failure path block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED block: fix return value on cfq_init() failure mtip32xx: Remove version.h header file inclusion xen/blkback: Copy id field when doing BLKIF_DISCARD.
| * | splice: fix racy pipe->buffers usesEric Dumazet2012-06-132-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered by splice_shrink_spd() called from vmsplice_to_pipe() commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes) added capability to adjust pipe->buffers. Problem is some paths don't hold pipe mutex and assume pipe->buffers doesn't change for their duration. Fix this by adding nr_pages_max field in struct splice_pipe_desc, and use it in place of pipe->buffers where appropriate. splice_shrink_spd() loses its struct pipe_inode_info argument. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Tom Herbert <therbert@google.com> Cc: stable <stable@vger.kernel.org> # 2.6.35 Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | printk.c: fix kernel-doc warningsRandy Dunlap2012-07-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix kernel-doc warnings in printk.c: use correct parameter name. Warning(kernel/printk.c:2429): No description found for parameter 'buf' Warning(kernel/printk.c:2429): Excess function parameter 'line' description in 'kmsg_dump_get_buffer' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'driver-core-3.5-rc5' of ↵Linus Torvalds2012-06-301-84/+213
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver Core fixes from Greg Kroah-Hartman: "Here is a number of printk() fixes, specifically a few reported by the crazy blog program that ships in SUSE releases (that's "boot log" and not "web log", it predates the general "blog" terminology by many years), and the restoration of the continuation line functionality reported by Stephen and others. Yes, the changes seem a bit big this late in the cycle, but I've been beating on them for a while now, and Stephen has even optimized it a bit, so all looks good to me. The other change in here is a Documentation update for the stable kernel rules describing how some distro patches should be backported, to hopefully drive a bit more response from the distros to the stable kernel releases. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'driver-core-3.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: printk: Optimize if statement logic where newline exists printk: flush continuation lines immediately to console syslog: fill buffer with more than a single message for SYSLOG_ACTION_READ Revert "printk: return -EINVAL if the message len is bigger than the buf size" printk: fix regression in SYSLOG_ACTION_CLEAR stable: Allow merging of backports for serious user-visible performance issues
| * | | printk: Optimize if statement logic where newline existsSteven Rostedt2012-06-291-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In reviewing Kay's fix up patch: "printk: Have printk() never buffer its data", I found two if statements that could be combined and optimized. Put together the two 'cont.len && cont.owner == current' if statements into a single one, and check if we need to call cont_add(). This also removes the unneeded double cont_flush() calls. Link: http://lkml.kernel.org/r/1340869133.876.10.camel@mop Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | printk: flush continuation lines immediately to consoleKay Sievers2012-06-291-68/+176
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Continuation lines are buffered internally, intended to merge the chunked printk()s into a single record, and to isolate potentially racy continuation users from usual terminated line users. This though, has the effect that partial lines are not printed to the console in the moment they are emitted. In case the kernel crashes in the meantime, the potentially interesting printed information would never reach the consoles. Here we share the continuation buffer with the console copy logic, and partial lines are always immediately flushed to the available consoles. They are still buffered internally to improve the readability and integrity of the messages and minimize the amount of needed record headers to store. Signed-off-by: Kay Sievers <kay@vrfy.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | syslog: fill buffer with more than a single message for SYSLOG_ACTION_READJan Beulich2012-06-261-14/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent changes to the printk buffer management resulted in SYSLOG_ACTION_READ to only return a single message, whereas previously the buffer would get filled as much as possible. As, when too small to fit everything, filling it to the last byte would be pretty ugly with the new code, the patch arranges for as many messages as possible to get returned in a single invocation. User space tools in at least all SLES versions depend on the old behavior. This at once addresses the issue attempted to get fixed with commit b56a39ac263e5b8cafedd551a49c2105e68b98c2 ("printk: return -EINVAL if the message len is bigger than the buf size"), and since that commit widened the possibility for losing a message altogether, the patch here assumes that this other commit would get reverted first (otherwise the patch here won't apply). Furthermore, this patch also addresses the problem dealt with in commit 4a77a5a06ec66ed05199b301e7c25f42f979afdc ("printk: use mutex lock to stop syslog_seq from going wild"), so I'd recommend reverting that one too (albeit there's no direct collision between the two). Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kay Sievers <kay@vrfy.org> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | Revert "printk: return -EINVAL if the message len is bigger than the buf size"Greg Kroah-Hartman2012-06-261-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit b56a39ac263e5b8cafedd551a49c2105e68b98c2. A better patch from Jan will follow this to resolve the issue. Acked-by: Kay Sievers <kay@vrfy.org> Cc: Fengguang Wu <wfg@linux.intel.com> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: Jan Beulich <JBeulich@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | printk: fix regression in SYSLOG_ACTION_CLEARAlan Stern2012-06-251-0/+1
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7ff9554bb578ba02166071d2d487b7fc7d860d62 (printk: convert byte-buffer to variable-length record buffer) introduced a regression by accidentally removing a "break" statement from inside the big switch in printk's do_syslog(). The symptom of this bug is that the "dmesg -C" command doesn't only clear the kernel's log buffer; it also disables console logging. This patch (as1561) fixes the regression by adding the missing "break". Signed-off-by: Alan Stern <stern@rowland.harvard.edu> CC: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* / | rcu: Stop rcu_do_batch() from multiplexing the "count" variablePaul E. McKenney2012-06-251-7/+7
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b1420f1c (Make rcu_barrier() less disruptive) rearranged the code in rcu_do_batch(), moving the ->qlen manipulation to follow the requeueing of the callbacks. Unfortunately, this rearrangement clobbered the value of the "count" local variable before the value of rdp->qlen was adjusted, resulting in the value of rdp->qlen being inaccurate. This commit therefore introduces an index variable "i", avoiding the inadvertent multiplexing. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2012-06-221-3/+7
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar. * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ftrace: Make all inline tags also include notrace perf: Use css_tryget() to avoid propping up css refcount perf tools: Fix synthesizing tracepoint names from the perf.data headers perf stat: Fix default output file perf tools: Fix endianity swapping for adds_features bitmask
| * | perf: Use css_tryget() to avoid propping up css refcountSalman Qazi2012-06-181-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An rmdir pushes css's ref count to zero. However, if the associated directory is open at the time, the dentry ref count is non-zero. If the fd for this directory is then passed into perf_event_open, it does a css_get(). This bounces the ref count back up from zero. This is a problem by itself. But what makes it turn into a crash is the fact that we end up doing an extra dput, since we perform a dput when css_put sees the ref count go down to zero. css_tryget() does not fall into that trap. So, we use that instead. Reproduction test-case for the bug: #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <linux/unistd.h> #include <linux/perf_event.h> #include <string.h> #include <errno.h> #include <stdio.h> #define PERF_FLAG_PID_CGROUP (1U << 2) int perf_event_open(struct perf_event_attr *hw_event_uptr, pid_t pid, int cpu, int group_fd, unsigned long flags) { return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu, group_fd, flags); } /* * Directly poke at the perf_event bug, since it's proving hard to repro * depending on where in the kernel tree. what moved? */ int main(int argc, char **argv) { int fd; struct perf_event_attr attr; memset(&attr, 0, sizeof(attr)); attr.exclude_kernel = 1; attr.size = sizeof(attr); mkdir("/dev/cgroup/perf_event/blah", 0777); fd = open("/dev/cgroup/perf_event/blah", O_RDONLY); perror("open"); rmdir("/dev/cgroup/perf_event/blah"); sleep(2); perf_event_open(&attr, fd, 0, -1, PERF_FLAG_PID_CGROUP); perror("perf_event_open"); close(fd); return 0; } Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | Merge branch 'for-3.5-fixes' of ↵Linus Torvalds2012-06-211-3/+10
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull two cgroup fixes from Tejun Heo: "This containes two patches fixing a refcnt race bug during css_put(). Decrementing and checking the value weren't atomic and two tasks could think that they both pushed the counter to zero." * 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroups: Account for CSS_DEACT_BIAS in __css_put cgroup: make sure that decisions in __css_put are atomic
| * | | cgroups: Account for CSS_DEACT_BIAS in __css_putSalman Qazi2012-06-191-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we fixed the race between atomic_dec and css_refcnt, we missed the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get the actual reference count. This can potentially cause a refcount leak if __css_put races with cgroup_clear_css_refs. Signed-off-by: Salman Qazi <sqazi@google.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: make sure that decisions in __css_put are atomicSalman Qazi2012-06-071-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __css_put is using atomic_dec on the ref count, and then looking at the ref count to make decisions. This is prone to races, as someone else may decrement ref count between our decrement and our decision. Instead, we should base our decisions on the value that we decremented the ref count to. (This results in an actual race on Google's kernel which I haven't been able to reproduce on the upstream kernel. Having said that, it's still incorrect by inspection). Signed-off-by: Salman Qazi <sqazi@google.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
* | | | Merge tag 'driver-core-3.5-rc4' of ↵Linus Torvalds2012-06-211-33/+208
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and printk fixes from Greg Kroah-Hartman: "Here are some fixes for 3.5-rc4 that resolve the kmsg problems that people have reported showing up after the printk and kmsg changes went into 3.5-rc1. There are also a smattering of other tiny fixes for the extcon and hyper-v drivers that people have reported. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'driver-core-3.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: extcon: max8997: Add missing kfree for info->edev in max8997_muic_remove() extcon: Set platform drvdata in gpio_extcon_probe() and fix irq leak extcon: Fix wrong index in max8997_extcon_cable[] kmsg - kmsg_dump() fix CONFIG_PRINTK=n compilation printk: return -EINVAL if the message len is bigger than the buf size printk: use mutex lock to stop syslog_seq from going wild kmsg - kmsg_dump() use iterator to receive log buffer content vme: change maintainer e-mail address Extcon: Don't try to create duplicate link names driver core: fixup reversed deferred probe order printk: Fix alignment of buf causing crash on ARM EABI Tools: hv: verify origin of netlink connector message
| * | | | printk: return -EINVAL if the message len is bigger than the buf sizeYuanhan Liu2012-06-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like what devkmsg_read() does, return -EINVAL if the message len is bigger than the buf size, or it will trigger a segfault error. Acked-by: Kay Sievers <kay@vrfy.org> Acked-by: Fengguang Wu <wfg@linux.intel.com> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | printk: use mutex lock to stop syslog_seq from going wildYuanhan Liu2012-06-161-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although syslog_seq and log_next_seq stuff are protected by logbuf_lock spin log, it's not enough. Say we have two processes A and B, and let syslog_seq = N, while log_next_seq = N + 1, and the two processes both come to syslog_print at almost the same time. And No matter which process get the spin lock first, it will increase syslog_seq by one, then release spin lock; thus later, another process increase syslog_seq by one again. In this case, syslog_seq is bigger than syslog_next_seq. And latter, it would make: wait_event_interruptiable(log_wait, syslog != log_next_seq) don't wait any more even there is no new write comes. Thus it introduce a infinite loop reading. I can easily see this kind of issue by the following steps: # cat /proc/kmsg # at meantime, I don't kill rsyslog # So they are the two processes. # xinit # I added drm.debug=6 in the kernel parameter line, # so that it will produce lots of message and let that # issue happen It's 100% reproducable on my side. And my disk will be filled up by /var/log/messages in a quite short time. So, introduce a mutex_lock to stop syslog_seq from going wild just like what devkmsg_read() does. It does fix this issue as expected. v2: use mutex_lock_interruptiable() instead (comments from Kay) Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Acked-By: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | kmsg - kmsg_dump() use iterator to receive log buffer contentKay Sievers2012-06-151-28/+192
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide an iterator to receive the log buffer content, and convert all kmsg_dump() users to it. The structured data in the kmsg buffer now contains binary data, which should no longer be copied verbatim to the kmsg_dump() users. The iterator should provide reliable access to the buffer data, and also supports proper log line-aware chunking of data while iterating. Signed-off-by: Kay Sievers <kay@vrfy.org> Tested-by: Tony Luck <tony.luck@intel.com> Reported-by: Anton Vorontsov <anton.vorontsov@linaro.org> Tested-by: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | printk: Fix alignment of buf causing crash on ARM EABIAndrew Lunn2012-06-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7ff9554bb578ba02166071d2d487b7fc7d860d62, printk: convert byte-buffer to variable-length record buffer, causes systems using EABI to crash very early in the boot cycle. The first entry in struct log is a u64, which for EABI must be 8 byte aligned. Make use of __alignof__() so the compiler to decide the alignment, but allow it to be overridden using CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, for systems which can perform unaligned access and want to save a few bytes of space. Tested on Orion5x and Kirkwood. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Stephen Warren <swarren@wwwdotorg.org> Acked-by: Stephen Warren <swarren@wwwdotorg.org> Acked-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>