summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* MD: Move macros from raid1*.h to raid1*.cJonathan Brassow2012-07-314-29/+29
| | | | | | | | | | | | | | | | | MD RAID1/RAID10: Move some macros from .h file to .c file There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in there respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other. The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD RAID1: rename mirror_info structureJonathan Brassow2012-07-312-9/+9
| | | | | | | | | | | | | | MD RAID1: Rename the structure 'mirror_info' to 'raid1_info' The same structure name ('mirror_info') is used by raid10. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. While only one of these structure names needs to change, this patch adds consistency to the naming of the structure. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD RAID10: rename mirror_info structureJonathan Brassow2012-07-312-12/+12
| | | | | | | | | | | | MD RAID10: Rename the structure 'mirror_info' to 'raid10_info' The same structure name ('mirror_info') is used by raid1. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD RAID10: Fix compiler warning.Jonathan Brassow2012-07-311-1/+1
| | | | | | | | | MD RAID10: Fix compiler warning. Initialize variable to prevent compiler warning. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* raid5: add a per-stripe lockShaohua Li2012-07-192-16/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a per-stripe lock to protect stripe specific data. The purpose is to reduce lock contention of conf->device_lock. stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio list of the stripe is always serialized by this lock, so adding bio to the lists (add_stripe_bio()) and removing bio from the lists (like ops_run_biofill()) not race. If bio in ->read, ->written ... list are not shared by multiple stripes, we don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will protect them. If the bio are shared, there are two protections: 1. bi_phys_segments acts as a reference count 2. traverse the list uses r5_next_bio, which makes traverse never access bio not belonging to the stripe Let's have an example: | stripe1 | stripe2 | stripe3 | ...bio1......|bio2|bio3|....bio4..... stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2 because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and stripe3 will end_bio bio4. before add_stripe_bio() addes a bio to a stripe, we already increament the bio bi_phys_segments, so don't worry other stripes release the bio. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
* raid5: remove unnecessary bitmap write optimizationShaohua Li2012-07-191-12/+9
| | | | | | | | | | | Neil pointed out the bitmap write optimization in handle_stripe_clean_event() is unnecessary, because the chance one stripe gets written twice in the mean time is rare. We can always do a bitmap_startwrite when a write request is added to a stripe and bitmap_endwrite after write request is done. Delete the optimization. With it, we can delete some cases of device_lock. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
* raid5: lockless access raid5 overrided bi_phys_segmentsShaohua Li2012-07-191-30/+32
| | | | | | | | Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold, which is unnecessary, We can make it lockless actually. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
* raid5: reduce chance release_stripe() taking device_lockShaohua Li2012-07-191-34/+41
| | | | | | | | release_stripe() is a place conf->device_lock is heavily contended. We take the lock even stripe count isn't 1, which isn't required. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid1: close some possible races on write errors during resyncNeilBrown2012-07-191-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | commit 4367af556133723d0f443e14ca8170d9447317cb md/raid1: clear bad-block record when write succeeds. Added a 'reschedule_retry' call possibility at the end of end_sync_write, but didn't add matching code at the end of sync_request_write. So if the writes complete very quickly, or scheduling makes it seem that way, then we can miss rescheduling the request and the resync could hang. Also commit 73d5c38a9536142e062c35997b044e89166e063b md: avoid races when stopping resync. Fix a race condition in this same code in end_sync_write but didn't make the change in sync_request_write. This patch updates sync_request_write to fix both of those. Patch is suitable for 3.1 and later kernels. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
* md: avoid crash when stopping md array races with closing other open fds.NeilBrown2012-07-191-13/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md will refuse to stop an array if any other fd (or mounted fs) is using it. When any fs is unmounted of when the last open fd is closed all pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put) so there will be no pending IO to worry about when the array is stopped. However in order to send the STOP_ARRAY ioctl to stop the array one must first get and open fd on the block device. If some fd is being used to write to the block device and it is closed after mdadm open the block device, but before mdadm issues the STOP_ARRAY ioctl, then there will be no last-close on the md device so __blkdev_put will not call sync_blockdev. If this happens, then IO can still be in-flight while md tears down the array and bad things can happen (use-after-free and subsequent havoc). So in the case where do_md_stop is being called from an open file descriptor, call sync_block after taking the mutex to ensure there will be no new openers. This is needed when setting a read-write device to read-only too. Cc: stable@vger.kernel.org Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: fix bug in handling of new_data_offsetNeilBrown2012-07-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit c6563a8c38fde3c1c7fc925a10bde3ca20799301 md: add possibility to change data-offset for devices. introduced a 'new_data_offset' attribute which should normally be the same as 'data_offset', but can be explicitly set to a different value to allow a reshape operation to move the data. Unfortunately when the 'data_offset' is explicitly set through sysfs, the new_data_offset is not also set, so the two would become out-of-sync incorrectly. One result of this is that trying to set the 'size' after the 'data_offset' would fail because it is not permitted to set the size when the 'data_offset' and 'new_data_offset' are different - as that can be confusing. Consequently when mdadm tried to do this while assembling an IMSM array it would fail. This bug was introduced in 3.5-rc1. Reported-by: Brian Downing <bdowning@lavos.net> Bisected-by: Brian Downing <bdowning@lavos.net> Tested-by: Brian Downing <bdowning@lavos.net> Signed-off-by: NeilBrown <neilb@suse.de>
* Linux 3.5-rc7v3.5-rc7Linus Torvalds2012-07-151-1/+1
|
* blk: fix wrong idr_pre_get() error check in loop.cSilva Paulo2012-07-151-5/+3
| | | | | | | | | The idr_pre_get() function never returns a value < 0. It returns 0 (no memory) or 1 (OK). Reported-by: Silva Paulo <psdasilva@yahoo.com> [ Rewrote Silva's patch, but attributing it to Silva anyway - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'sound-3.5' of ↵Linus Torvalds2012-07-142-91/+43
|\ | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "Containing the regression fixes for USB-audio due to the transition to the new streaming logic, mostly found on Logitech webcams." * tag 'sound-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: snd-usb: move calls to usb_set_interface ALSA: usb-audio: Fix the first PCM interface assignment
| * ALSA: snd-usb: move calls to usb_set_interfaceDaniel Mack2012-07-132-89/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rework of the snd-usb endpoint logic moved the calls to snd_usb_set_interface() into the snd_usb_endpoint implemenation. This changed the order in which these calls are issued to the device, and thereby caused regressions for some webcams. Fix this by moving the calls back to pcm.c for now to make it work again and use snd_usb_endpoint_activate() to really tear down all remaining URBs in the flight, consequently fixing another regression caused by USB packets on the wire after altsetting 0 has been selected. Signed-off-by: Daniel Mack <zonque@gmail.com> Reported-and-tested-by: Philipp Dreimann <philipp@dreimann.net> Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * ALSA: usb-audio: Fix the first PCM interface assignmentTakashi Iwai2012-07-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | In the new PCM streaming logic, the interface number is assigned to usb stream instance (subs->interface) after the format and rate setups are succeeded, but some codes are still passing subs->interface as the reference to helper functions. This leads to initializing with an invalid iface number (-1). This patch replaces the wrong references with the ones from the target fmt correctly. Signed-off-by: Takashi Iwai <tiwai@suse.de>
* | Merge branch 'release' of ↵Linus Torvalds2012-07-141-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull ACPI patch from Len Brown. * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: ACPICA: Fix possible fault in return package object repair code
| * | ACPICA: Fix possible fault in return package object repair codeBob Moore2012-07-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes a problem that can occur when a lone package object is wrapped with an outer package object in order to conform to the ACPI specification. Can affect these predefined names: _ALR,_MLS,_PSS,_TRT,_TSS,_PRT,_HPX,_DLM,_CSD,_PSD,_TSD https://bugzilla.kernel.org/show_bug.cgi?id=44171 This problem was introduced in 3.4-rc1 by commit 6a99b1c94d053b3420eaa4a4bc8b2883dd90a2f9 (ACPICA: Object repair code: Support to add Package wrappers) Reported-by: Vlastimil Babka <caster@gentoo.org> Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: <stable@vger.kernel.org> # 3.4 Signed-off-by: Len Brown <len.brown@intel.com>
* | | vsyscall_64: add missing ifdef CONFIG_SECCOMPWill Drewry2012-07-141-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | vsyscall_seccomp introduced a dependency on __secure_computing. On configurations with CONFIG_SECCOMP disabled, compilation will fail. Reported-by: feng xiangjun <fengxj325@gmail.com> Signed-off-by: Will Drewry <wad@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'cpufreq-for-3.5-rc7' of ↵Linus Torvalds2012-07-141-2/+4
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull cpufreq fix from Rafael Wysocki: "This fixes a regression preventing the ACPI cpufreq driver from loading on some systems where it worked previously without any problems." * tag 'cpufreq-for-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq / ACPI: Fix not loading acpi-cpufreq driver regression
| * | | cpufreq / ACPI: Fix not loading acpi-cpufreq driver regressionThomas Renninger2012-07-121-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit d640113fe80e45ebd4a5b420b introduced a regression on SMP systems where the processor core with ACPI id zero is disabled (typically should be the case because of hyperthreading). The regression got spread through stable kernels. On 3.0.X it got introduced via 3.0.18. Such platforms may be rare, but do exist. Look out for a disabled processor with acpi_id 0 in dmesg: ACPI: LAPIC (acpi_id[0x00] lapic_id[0x10] disabled) This problem has been observed on a: HP Proliant BL280c G6 blade This patch restricts the introduced workaround to platforms with nr_cpu_ids <= 1. Signed-off-by: Thomas Renninger <trenn@suse.de> CC: stable@vger.kernel.org Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
* | | | Merge tag 'fixes-for-linus' of ↵Linus Torvalds2012-07-145-9/+18
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM Samsung SoC fixes from Arnd Bergmann. * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: S3C24XX: Correct CAMIF interrupt definitions ARM: S3C24XX: Correct AC97 clock control bit for S3C2440 ARM: SAMSUNG: fix race in s3c_adc_start for ADC ARM: SAMSUNG: Update default rate for xusbxti clock ARM: EXYNOS: register devices in 'need_restore' state for pm_domains ARM: EXYNOS: read initial state of power domain from hw registers
| * \ \ \ Merge branch 'v3.5-samsung-fixes-2' of ↵Arnd Bergmann2012-07-145-9/+18
| |\ \ \ \ | | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung into fixes From Kukjin Kim <kgene.kim@samsung.com>: * 'v3.5-samsung-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung: ARM: S3C24XX: Correct CAMIF interrupt definitions ARM: S3C24XX: Correct AC97 clock control bit for S3C2440 ARM: SAMSUNG: fix race in s3c_adc_start for ADC ARM: SAMSUNG: Update default rate for xusbxti clock ARM: EXYNOS: register devices in 'need_restore' state for pm_domains ARM: EXYNOS: read initial state of power domain from hw registers Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| | * | | ARM: S3C24XX: Correct CAMIF interrupt definitionsSylwester Nawrocki2012-07-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Properly define the CAMIF interrupt resources. This device have two interrupts - corresponding to the "codec" and "preview" data paths. IRQ_CAM is handled internally by the architecture and demultiplexed to IRQ_S3C2440_CAM_C and IRQ_S3C2440_CAM_P - these interrupts only should be handled in the driver. Signed-off-by: Sylwester Nawrocki <sylvester.nawrocki@gmail.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
| | * | | ARM: S3C24XX: Correct AC97 clock control bit for S3C2440Sylwester Nawrocki2012-07-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use correct gate control bit for AC97 clock which is S3C2440_CLKCON_AC97, not S3C2440_CLKCON_CAMERA. Signed-off-by: Sylwester Nawrocki <sylvester.nawrocki@gmail.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
| | * | | ARM: SAMSUNG: fix race in s3c_adc_start for ADCTodd Poynor2012-07-131-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Checking for adc->ts_pend already claimed should be done with the lock held. Signed-off-by: Todd Poynor <toddpoynor@google.com> Acked-by: Ben Dooks <ben-linux@fluff.org> Cc: Stable <stable@vger.kernel.org> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
| | * | | ARM: SAMSUNG: Update default rate for xusbxti clockTushar Behera2012-07-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rate of xusbxti clock is set in individual machine files. The default value should be defined at the clock definition and individual machine files should modify it if required. Division by zero in kernel. [<c0011849>] (unwind_backtrace+0x1/0x9c) from [<c022c663>] (Ldiv0+0x9/0x12) [<c022c663>] (Ldiv0+0x9/0x12) from [<c001a3c3>] (s3c_setrate_clksrc+0x33/0x78) [<c001a3c3>] (s3c_setrate_clksrc+0x33/0x78) from [<c0019e67>] (clk_set_rate+0x2f/0x78) Signed-off-by: Tushar Behera <tushar.behera@linaro.org> Cc: Stable <stable@vger.kernel.org> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
| | * | | ARM: EXYNOS: register devices in 'need_restore' state for pm_domainsMarek Szyprowski2012-07-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ca1d72f033 ('PM / Domains: Make it possible to add devices to inactive domains') introduced possibility to add devices to inactive power domains and added pm_genpd_dev_need_restore() function which lets platform core to notify power domain core that the specified device must be restored (with its runtime_resume() callback) before first use. This patch adds the pm_genpd_dev_need_restore() call what brings back the suspend/resume behaviour for the client devices known from the previous power domain driver (removed by commit 91cfbd4ee0 - 'ARM: EXYNOS: Hook up power domains to generic power domain infrastructure'). Client device drivers relay on that suspend/resume behaviour, thus this patch fixes runtime pm operation for client devices. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
| | * | | ARM: EXYNOS: read initial state of power domain from hw registersMarek Szyprowski2012-07-121-3/+6
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some bootloaders disable unused power domains to reduce power consuption. Power domain driver can easily read the actual state from the hardware registers instead of assuming that their initial state is always 'on'. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
| | | |
| \ \ \
| \ \ \
| \ \ \
*---. \ \ \ Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and ↵Linus Torvalds2012-07-1417-115/+261
|\ \ \ \ \ \ | |_|_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU, perf, and scheduler fixes from Ingo Molnar. The RCU fix is a revert for an optimization that could cause deadlocks. One of the scheduler commits (164c33c6adee "sched: Fix fork() error path to not crash") is correct but not complete (some architectures like Tile are not covered yet) - the resulting additional fixes are still WIP and Ingo did not want to delay these pending fixes. See this thread on lkml: [PATCH] fork: fix error handling in dup_task() The perf fixes are just trivial oneliners. * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf kvm: Fix segfault with report and mixed guestmount use perf kvm: Fix regression with guest machine creation perf script: Fix format regression due to libtraceevent merge ring-buffer: Fix accounting of entries when removing pages ring-buffer: Fix crash due to uninitialized new_pages list head * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS/sched: Update scheduler file pattern sched/nohz: Rewrite and fix load-avg computation -- again sched: Fix fork() error path to not crash
| | | * | | MAINTAINERS/sched: Update scheduler file patternNamhyung Kim2012-07-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 391e43da797a ("sched: Move all scheduler bits into kernel/sched/") moved all scheduler codes to the kernel/sched/ directory, but missed the MAINTAINERS. Since it still expects files from kernel/ directory, get_maintainer script has to rely on the git (log) fallback mechanism. $ scripts/get_maintainer.pl -f kernel/sched/core.c --nogit-fallback linux-kernel@vger.kernel.org (open list) With this patch: $ scripts/get_maintainer.pl -f kernel/sched/core.c --nogit-fallback Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER) Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER) linux-kernel@vger.kernel.org (open list) Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1341326251-4140-1-git-send-email-namhyung@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | | * | | sched/nohz: Rewrite and fix load-avg computation -- againPeter Zijlstra2012-07-055-75/+213
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <dsmythies@telus.net> Reported-and-tested-by: Charles Wang <muming.wq@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | | * | | sched: Fix fork() error path to not crashSalman Qazi2012-07-051-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In dup_task_struct(), if arch_dup_task_struct() fails, the clean up code fails to clean up correctly. That's because the clean up code depends on unininitalized ti->task pointer. We fix this by making sure that the task and thread_info know about each other before we attempt to take the error path. Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120626011815.11323.5533.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | Merge branch 'tip/perf/urgent' of ↵Ingo Molnar2012-07-101-3/+3
| | |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent Pull ftrace ring-buffer fixes from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | | * | | | ring-buffer: Fix accounting of entries when removing pagesVaibhav Nagarnaik2012-06-291-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When removing pages from the ring buffer, its state is not reset. This means that the counters need to be correctly updated to account for the pages removed. Update the overrun counter to reflect the removed events from the pages. Link: http://lkml.kernel.org/r/1340998301-1715-1-git-send-email-vnagarnaik@google.com Cc: Justin Teravest <teravest@google.com> Cc: David Sharp <dhsharp@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | | * | | | ring-buffer: Fix crash due to uninitialized new_pages list headVaibhav Nagarnaik2012-06-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new_pages list head in the cpu_buffer is not initialized. When adding pages to the ring buffer, if the memory allocation fails in ring_buffer_resize, the clean up handler tries to free up the allocated pages from all the cpu buffers. The panic is caused by referencing the uninitialized new_pages list head. Initializing the new_pages list head in rb_allocate_cpu_buffer fixes this. Link: http://lkml.kernel.org/r/1340391005-10880-1-git-send-email-vnagarnaik@google.com Cc: Justin Teravest <teravest@google.com> Cc: David Sharp <dhsharp@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | | perf kvm: Fix segfault with report and mixed guestmount useDavid Ahern2012-07-021-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using the guestmount option on record: $ perf kvm --guest --host --guestmount=/tmp/guest-mount record -ag But not the subsequent report: $ perf kvm report causes a SEGFAULT in the usual place: (gdb) bt 0 0x0000000000470356 in machine__mmap_name (self=0x0, bf=0x7fffffffbdb0 " z\370\367\377\177", size= 4096) at util/map.c:712 1 0x00000000004453e8 in perf_event__process_kernel_mmap (tool=0x7fffffffde10, event=0x7ffff7f87e38, machine=0x0) at util/event.c:550 2 0x00000000004458c9 in perf_event__process_mmap (tool=0x7fffffffde10, event=0x7ffff7f87e38, sample= 0x7fffffffd2a0, machine=0x0) at util/event.c:656 3 0x00000000004733e0 in perf_session_deliver_event (session=0x91aca0, event=0x7ffff7f87e38, sample= 0x7fffffffd2a0, tool=0x7fffffffde10, file_offset=7736) at util/session.c:979 ... The MMAP events in this case already contain the full path to the module. No need to require it for the report path to. Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1341241977-71535-1-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf kvm: Fix regression with guest machine creationDavid Ahern2012-07-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 743eb868657bdb1b26c7b24077ca21c67c82c777 reworked when the machines were created. Prior to this commit guest machines could be created in perf_event__process_kernel_mmap() while processing kernel MMAP events. This commit assumes that the machines exist by the time perf_session_deliver_event is called (e.g., during processing of build id events) - which is not always correct. One example is the use of default guest args (--guestkallsyms and --guestmodules) for short times where no samples hit within a guest module. For this case no build id is added to the file header. No build id == no machine created. That leads to the next example -- the use of no-buildid (-B) on the record for all perf-kvm invocations. In both cases perf report dies with a SEGFAULT of the form: (gdb) bt 0 0x000000000046dd7b in machine__mmap_name (self=0x0, bf=0x7fffffffbd20 "q\021", size=4096) at util/map.c:715 1 0x0000000000444161 in perf_event__process_kernel_mmap (tool=0x7fffffffdd80, event=0x7ffff7fb4120, machine=0x0) at util/event.c:562 2 0x0000000000444642 in perf_event__process_mmap (tool=0x7fffffffdd80, event=0x7ffff7fb4120, sample=0x7fffffffd210, machine=0x0) at util/event.c:668 3 0x0000000000470e0b in perf_session_deliver_event (session=0x915ca0, event=0x7ffff7fb4120, sample=0x7fffffffd210, tool=0x7fffffffdd80, file_offset=8480) at util/session.c:979 4 0x000000000047032e in flush_sample_queue (s=0x915ca0, tool=0x7fffffffdd80) at util/session.c:679 5 0x0000000000471c8d in __perf_session__process_events (session=0x915ca0, data_offset=400, data_size=150448, file_size=150848, tool= 0x7fffffffdd80) at util/session.c:1363 6 0x0000000000471d42 in perf_session__process_events (self=0x915ca0, tool=0x7fffffffdd80) at util/session.c:1379 7 0x000000000042484a in __cmd_report (rep=0x7fffffffdd80) at builtin-report.c:368 8 0x0000000000425bf1 in cmd_report (argc=0, argv=0x915b00, prefix=0x0) at builtin-report.c:756 9 0x0000000000438505 in __cmd_report (argc=4, argv=0x7fffffffe260) at builtin-kvm.c:84 10 0x000000000043882a in cmd_kvm (argc=4, argv=0x7fffffffe260, prefix=0x0) at builtin-kvm.c:131 11 0x00000000004152cd in run_builtin (p=0x7a54e8, argc=9, argv=0x7fffffffe260) at perf.c:273 12 0x00000000004154c7 in handle_internal_command (argc=9, argv=0x7fffffffe260) at perf.c:345 13 0x0000000000415613 in run_argv (argcp=0x7fffffffe14c, argv=0x7fffffffe140) at perf.c:389 14 0x0000000000415899 in main (argc=9, argv=0x7fffffffe260) at perf.c:487 Fix by allowing the machine to be created in perf_session_deliver_event. Tested with --guestmount option and default guest args, with and without -B arg on record for both and for short (10 seconds) and long (10 minutes) windows. Reported-by: Pradeep Kumar Surisetty <psuriset@linux.vnet.ibm.com> Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pradeep Kumar Surisetty <psuriset@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1341180697-64515-1-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf script: Fix format regression due to libtraceevent mergeDavid Ahern2012-07-021-2/+1
| | |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consider the commands: perf record -e sched:sched_switch -fo /tmp/perf.data -a -- sleep 1 perf script -i /tmp/perf.data In v3.4 the output has the form (lines wrapped here) perf 29214 [005] 821043.582596: sched_switch: prev_comm=perf prev_pid=29214 prev_prio=120 prev_state=S ==> next_comm=swapper/5 next_pid=0 next_prio=120 In 3.5 that same line has become: perf 29214 [005] 821043.582596: sched_switch: <...>-29214 [005] 0.000000000: sched_switch: prev_comm=perf prev_pid=29214 prev_prio=120 prev_state=S ==> next_comm=swapper/5 next_pid=0 next_prio=120 Note the duplicates in the output -- pid, cpu, event name. With this patch the v3.4 output is restored: perf 29214 [005] 821043.582596: sched_switch: prev_comm=perf prev_pid=29214 prev_prio=120 prev_state=S ==> next_comm=swapper/5 next_pid=0 next_prio=120 v3: Remove that pesky newline too. Output now matches v3.4 (pre-libtracevent). v2: Change print_trace_event function local to perf per Steve's comments. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1339698977-68962-1-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| * | | | | Merge branch 'rcu/urgent' of ↵Ingo Molnar2012-07-068-16/+19
| |\ \ \ \ \ | | |_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent Pull low probability CONFIG_RCU_BOOST=y deadlock fix from Paul E. McKenney. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"Paul E. McKenney2012-07-028-16/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 616c310e83b872024271c915c1b9ab505b9efad9. (Move PREEMPT_RCU preemption to switch_to() invocation). Testing by Sasha Levin <levinsasha928@gmail.com> showed that this can result in deadlock due to invoking the scheduler when one of the runqueue locks is held. Because this commit was simply a performance optimization, revert it. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Sasha Levin <levinsasha928@gmail.com>
* | | | | | Merge tag 'md-3.5-fixes' of git://neil.brown.name/mdLinus Torvalds2012-07-141-1/+2
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull use-after-free RAID1 bugfix from NeilBrown. * tag 'md-3.5-fixes' of git://neil.brown.name/md: md/raid1: fix use-after-free bug in RAID1 data-check code.
| * | | | | | md/raid1: fix use-after-free bug in RAID1 data-check code.NeilBrown2012-07-091-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This bug has been present ever since data-check was introduce in 2.6.16. However it would only fire if a data-check were done on a degraded array, which was only possible if the array has 3 or more devices. This is certainly possible, but is quite uncommon. Since hot-replace was added in 3.3 it can happen more often as the same condition can arise if not all possible replacements are present. The problem is that as soon as we submit the last read request, the 'r1_bio' structure could be freed at any time, so we really should stop looking at it. If the last device is being read from we will stop looking at it. However if the last device is not due to be read from, we will still check the bio pointer in the r1_bio, but the r1_bio might already be free. So use the read_targets counter to make sure we stop looking for bios to submit as soon as we have submitted them all. This fix is suitable for any -stable kernel since 2.6.16. Cc: stable@vger.kernel.org Reported-by: Arnold Schulz <arnysch@gmx.net> Signed-off-by: NeilBrown <neilb@suse.de>
* | | | | | | Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2012-07-143-19/+107
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull the leap second fixes from Thomas Gleixner: "It's a rather large series, but well discussed, refined and reviewed. It got a massive testing by John, Prarit and tip. In theory we could split it into two parts. The first two patches f55a6faa3843: hrtimer: Provide clock_was_set_delayed() 4873fa070ae8: timekeeping: Fix leapsecond triggered load spike issue are merely preventing the stuff loops forever issues, which people have observed. But there is no point in delaying the other 4 commits which achieve full correctness into 3.6 as they are tagged for stable anyway. And I rather prefer to have the full fixes merged in bulk than a "prevent the observable wreckage and deal with the hidden fallout later" approach." * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Update hrtimer base offsets each hrtimer_interrupt timekeeping: Provide hrtimer update function hrtimers: Move lock held region in hrtimer_interrupt() timekeeping: Maintain ktime_t based offsets for hrtimers timekeeping: Fix leapsecond triggered load spike issue hrtimer: Provide clock_was_set_delayed()
| * | | | | | | hrtimer: Update hrtimer base offsets each hrtimer_interruptJohn Stultz2012-07-111-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The update of the hrtimer base offsets on all cpus cannot be made atomically from the timekeeper.lock held and interrupt disabled region as smp function calls are not allowed there. clock_was_set(), which enforces the update on all cpus, is called either from preemptible process context in case of do_settimeofday() or from the softirq context when the offset modification happened in the timer interrupt itself due to a leap second. In both cases there is a race window for an hrtimer interrupt between dropping timekeeper lock, enabling interrupts and clock_was_set() issuing the updates. Any interrupt which arrives in that window will see the new time but operate on stale offsets. So we need to make sure that an hrtimer interrupt always sees a consistent state of time and offsets. ktime_get_update_offsets() allows us to get the current monotonic time and update the per cpu hrtimer base offsets from hrtimer_interrupt() to capture a consistent state of monotonic time and the offsets. The function replaces the existing ktime_get() calls in hrtimer_interrupt(). The overhead of the new function vs. ktime_get() is minimal as it just adds two store operations. This ensures that any changes to realtime or boottime offsets are noticed and stored into the per-cpu hrtimer base structures, prior to any hrtimer expiration and guarantees that timers are not expired early. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | timekeeping: Provide hrtimer update functionThomas Gleixner2012-07-112-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To finally fix the infamous leap second issue and other race windows caused by functions which change the offsets between the various time bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a function which atomically gets the current monotonic time and updates the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic overhead. The previous patch which provides ktime_t offsets allows us to make this function almost as cheap as ktime_get() which is going to be replaced in hrtimer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | hrtimers: Move lock held region in hrtimer_interrupt()Thomas Gleixner2012-07-111-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to update the base offsets from this code and we need to do that under base->lock. Move the lock held region around the ktime_get() calls. The ktime_get() calls are going to be replaced with a function which gets the time and the offsets atomically. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | timekeeping: Maintain ktime_t based offsets for hrtimersThomas Gleixner2012-07-111-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to update the hrtimer clock offsets from the hrtimer interrupt context. To avoid conversions from timespec to ktime_t maintain a ktime_t based representation of those offsets in the timekeeper. This puts the conversion overhead into the code which updates the underlying offsets and provides fast accessible values in the hrtimer interrupt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | timekeeping: Fix leapsecond triggered load spike issueJohn Stultz2012-07-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The timekeeping code misses an update of the hrtimer subsystem after a leap second happened. Due to that timers based on CLOCK_REALTIME are either expiring a second early or late depending on whether a leap second has been inserted or deleted until an operation is initiated which causes that update. Unless the update happens by some other means this discrepancy between the timekeeping and the hrtimer data stays forever and timers are expired either early or late. The reported immediate workaround - $ data -s "`date`" - is causing a call to clock_was_set() which updates the hrtimer data structures. See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix Add the missing clock_was_set() call to update_wall_time() in case of a leap second event. The actual update is deferred to softirq context as the necessary smp function call cannot be invoked from hard interrupt context. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | hrtimer: Provide clock_was_set_delayed()John Stultz2012-07-112-1/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clock_was_set() cannot be called from hard interrupt context because it calls on_each_cpu(). For fixing the widely reported leap seconds issue it is necessary to call it from hard interrupt context, i.e. the timer tick code, which does the timekeeping updates. Provide a new function which denotes it in the hrtimer cpu base structure of the cpu on which it is called and raise the hrtimer softirq. We then execute the clock_was_set() notificiation from softirq context in run_hrtimer_softirq(). The hrtimer softirq is rarely used, so polling the flag there is not a performance issue. [ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get rid of all this ifdeffery ASAP ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>