summaryrefslogtreecommitdiffstats
path: root/init (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'tracing-core-for-linus' of ↵Linus Torvalds2008-12-282-15/+21
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits) sched, trace: update trace_sched_wakeup() tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3 Revert "x86: disable X86_PTRACE_BTS" ring-buffer: prevent false positive warning ring-buffer: fix dangling commit race ftrace: enable format arguments checking x86, bts: memory accounting x86, bts: add fork and exit handling ftrace: introduce tracing_reset_online_cpus() helper tracing: fix warnings in kernel/trace/trace_sched_switch.c tracing: fix warning in kernel/trace/trace.c tracing/ring-buffer: remove unused ring_buffer size trace: fix task state printout ftrace: add not to regex on filtering functions trace: better use of stack_trace_enabled for boot up code trace: add a way to enable or disable the stack tracer x86: entry_64 - introduce FTRACE_ frame macro v2 tracing/ftrace: add the printk-msg-only option tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp() x86, bts: correctly report invalid bts records ... Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits being already partly merged by the SH merge.
| * init/main.c: use ktime accessor function in initcall_debug codeWill Newton2008-11-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix initcall debug output on non-scalar ktime platforms (32-bit embedded) The initcall_debug code access the tv64 member of ktime. This won't work correctly for large deltas on platforms that don't use the scalar ktime implementation. Signed-off-by: Will Newton <will.newton@gmail.com> Acked-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| *-. Merge branches 'tracing/ftrace' and 'tracing/urgent' into tracing/coreIngo Molnar2008-11-191-4/+4
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: kernel/trace/ftrace.c [ We conflicted here because we backported a few fixes to tracing/urgent - which has different internal APIs. ]
| * | | markers: auto enable tracepoints (new API : trace_mark_tp())Mathieu Desnoyers2008-11-161-0/+1
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: new API Add a new API trace_mark_tp(), which declares a marker within a tracepoint probe. When the marker is activated, the tracepoint is automatically enabled. No branch test is used at the marker site, because it would be a duplicate of the branch already present in the tracepoint. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | tracing/fastboot: Use the ring-buffer timestamp for initcall entriesFrederic Weisbecker2008-11-121-15/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: Split the boot tracer entries in two parts: call and return Now that we are using the sched tracer from the boot tracer, we want to use the same timestamp than the ring-buffer to have consistent time captures between sched events and initcall events. So we get rid of the old time capture by the boot tracer and split the initcall events in two parts: call and return. This way we have the ring buffer timestamp of both. An example trace: [ 27.904149584] calling net_ns_init+0x0/0x1c0 @ 1 [ 27.904429624] initcall net_ns_init+0x0/0x1c0 returned 0 after 0 msecs [ 27.904575926] calling reboot_init+0x0/0x20 @ 1 [ 27.904655399] initcall reboot_init+0x0/0x20 returned 0 after 0 msecs [ 27.904800228] calling sysctl_init+0x0/0x30 @ 1 [ 27.905142914] initcall sysctl_init+0x0/0x30 returned 0 after 0 msecs [ 27.905287211] calling ksysfs_init+0x0/0xb0 @ 1 ##### CPU 0 buffer started #### init-1 [000] 27.905395: 1:120:R + [001] 11:115:S ##### CPU 1 buffer started #### <idle>-0 [001] 27.905425: 0:140:R ==> [001] 11:115:R init-1 [000] 27.905426: 1:120:D ==> [000] 0:140:R <idle>-0 [000] 27.905431: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.905451: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905456: 4:115:S ==> [000] 0:140:R udevd-11 [001] 27.905458: 11:115:R + [001] 14:115:R <idle>-0 [000] 27.905459: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.905462: 0:140:R ==> [000] 4:115:R udevd-11 [001] 27.905462: 11:115:R ==> [001] 14:115:R ksoftirqd/0-4 [000] 27.905467: 4:115:S ==> [000] 0:140:R <idle>-0 [000] 27.905470: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.905473: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905476: 4:115:S ==> [000] 0:140:R <idle>-0 [000] 27.905479: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.905482: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905486: 4:115:S ==> [000] 0:140:R udevd-14 [001] 27.905499: 14:120:X ==> [001] 11:115:R udevd-11 [001] 27.905506: 11:115:R + [000] 1:120:D <idle>-0 [000] 27.905515: 0:140:R ==> [000] 1:120:R udevd-11 [001] 27.905517: 11:115:S ==> [001] 0:140:R [ 27.905557107] initcall ksysfs_init+0x0/0xb0 returned 0 after 3906 msecs [ 27.905705736] calling init_jiffies_clocksource+0x0/0x10 @ 1 [ 27.905779239] initcall init_jiffies_clocksource+0x0/0x10 returned 0 after 0 msecs [ 27.906769814] calling pm_init+0x0/0x30 @ 1 [ 27.906853627] initcall pm_init+0x0/0x30 returned 0 after 0 msecs [ 27.906997803] calling pm_disk_init+0x0/0x20 @ 1 [ 27.907076946] initcall pm_disk_init+0x0/0x20 returned 0 after 0 msecs [ 27.907222556] calling swsusp_header_init+0x0/0x30 @ 1 [ 27.907294325] initcall swsusp_header_init+0x0/0x30 returned 0 after 0 msecs [ 27.907439620] calling stop_machine_init+0x0/0x50 @ 1 init-1 [000] 27.907485: 1:120:R + [000] 2:115:S init-1 [000] 27.907490: 1:120:D ==> [000] 2:115:R kthreadd-2 [000] 27.907507: 2:115:R + [001] 15:115:R <idle>-0 [001] 27.907517: 0:140:R ==> [001] 15:115:R kthreadd-2 [000] 27.907517: 2:115:D ==> [000] 0:140:R <idle>-0 [000] 27.907521: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.907524: 0:140:R ==> [000] 4:115:R udevd-15 [001] 27.907527: 15:115:D + [000] 2:115:D ksoftirqd/0-4 [000] 27.907537: 4:115:S ==> [000] 2:115:R udevd-15 [001] 27.907537: 15:115:D ==> [001] 0:140:R kthreadd-2 [000] 27.907546: 2:115:R + [000] 1:120:D kthreadd-2 [000] 27.907550: 2:115:S ==> [000] 1:120:R init-1 [000] 27.907584: 1:120:R + [000] 15: 0:D init-1 [000] 27.907589: 1:120:R + [000] 2:115:S init-1 [000] 27.907593: 1:120:D ==> [000] 15: 0:R udevd-15 [000] 27.907601: 15: 0:S ==> [000] 2:115:R ##### CPU 0 buffer started #### kthreadd-2 [000] 27.907616: 2:115:R + [001] 16:115:R ##### CPU 1 buffer started #### <idle>-0 [001] 27.907620: 0:140:R ==> [001] 16:115:R kthreadd-2 [000] 27.907621: 2:115:D ==> [000] 0:140:R udevd-16 [001] 27.907625: 16:115:D + [000] 2:115:D <idle>-0 [000] 27.907628: 0:140:R + [000] 4:115:S udevd-16 [001] 27.907629: 16:115:D ==> [001] 0:140:R <idle>-0 [000] 27.907631: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.907636: 4:115:S ==> [000] 2:115:R kthreadd-2 [000] 27.907644: 2:115:R + [000] 1:120:D kthreadd-2 [000] 27.907647: 2:115:S ==> [000] 1:120:R init-1 [000] 27.907657: 1:120:R + [001] 16: 0:D <idle>-0 [001] 27.907666: 0:140:R ==> [001] 16: 0:R [ 27.907703862] initcall stop_machine_init+0x0/0x50 returned 0 after 0 msecs [ 27.907850704] calling filelock_init+0x0/0x30 @ 1 [ 27.907926573] initcall filelock_init+0x0/0x30 returned 0 after 0 msecs [ 27.908071327] calling init_script_binfmt+0x0/0x10 @ 1 [ 27.908165195] initcall init_script_binfmt+0x0/0x10 returned 0 after 0 msecs [ 27.908309461] calling init_elf_binfmt+0x0/0x10 @ 1 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | tracing/fastboot: move boot tracer structs and funcs into their own header.Frederic Weisbecker2008-11-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: Cleanups on the boot tracer and ftrace This patch bring some cleanups about the boot tracer headers. The functions and structures of this tracer have nothing related to ftrace and should have so their own header file. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | tracing/fastboot: Enable boot tracing only during initcallsFrederic Weisbecker2008-11-041-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: modify boot tracer We used to disable the initcall tracing at a specified time (IE: end of builtin initcalls). But we don't need it anymore. It will be stopped when initcalls are finished. However we want two things: _Start this tracing only after pre-smp initcalls are finished. _Since we are planning to trace sched_switches at the same time, we want to enable them only during the initcall execution. For this purpose, this patch introduce two functions to enable/disable the sched_switch tracing during boot. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | Merge branch 'master' into nextJames Morris2008-12-041-1/+1
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | Conflicts: fs/nfsd/nfs4recover.c Manually fixed above to use new creds API functions, e.g. nfs4_save_creds(). Signed-off-by: James Morris <jmorris@namei.org>
| * | Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds2008-11-181-1/+1
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: cpuset: fix regression when failed to generate sched domains sched, signals: fix the racy usage of ->signal in account_group_xxx/run_posix_cpu_timers sched: fix kernel warning on /proc/sched_debug access sched: correct sched-rt-group.txt pathname in init/Kconfig
| | * | sched: correct sched-rt-group.txt pathname in init/KconfigAdrian Knoth2008-11-131-1/+1
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | init/Kconfig directs the user to Documentation/sched-rt-group.txt, but the file is actually in Documentation/scheduler/sched-rt-group.txt. This patch corrects the pathname mentioned in init/Kconfig. Signed-off-by: Adrian Knoth <adi@drcomp.erfurt.thur.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | Merge branch 'master' into nextJames Morris2008-11-141-3/+3
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: security/keys/internal.h security/keys/process_keys.c security/keys/request_key.c Fixed conflicts above by using the non 'tsk' versions. Signed-off-by: James Morris <jmorris@namei.org>
| * | Kconfig: SLUB is the default slab allocatorSimon Arlott2008-11-131-3/+3
| |/ | | | | | | | | | | | | | | | | In 2007, a0acd820807680d2ccc4ef3448387fcdbf152c73 changed the default slab allocator to SLUB, but the SLAB help text still says SLAB is the default. This change fixes that. Signed-off-by: Simon Arlott <simon@fire.lp0.eu> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
* / CRED: Inaugurate COW credentialsDavid Howells2008-11-141-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inaugurate copy-on-write credentials management. This uses RCU to manage the credentials pointer in the task_struct with respect to accesses by other tasks. A process may only modify its own credentials, and so does not need locking to access or modify its own credentials. A mutex (cred_replace_mutex) is added to the task_struct to control the effect of PTRACE_ATTACHED on credential calculations, particularly with respect to execve(). With this patch, the contents of an active credentials struct may not be changed directly; rather a new set of credentials must be prepared, modified and committed using something like the following sequence of events: struct cred *new = prepare_creds(); int ret = blah(new); if (ret < 0) { abort_creds(new); return ret; } return commit_creds(new); There are some exceptions to this rule: the keyrings pointed to by the active credentials may be instantiated - keyrings violate the COW rule as managing COW keyrings is tricky, given that it is possible for a task to directly alter the keys in a keyring in use by another task. To help enforce this, various pointers to sets of credentials, such as those in the task_struct, are declared const. The purpose of this is compile-time discouragement of altering credentials through those pointers. Once a set of credentials has been made public through one of these pointers, it may not be modified, except under special circumstances: (1) Its reference count may incremented and decremented. (2) The keyrings to which it points may be modified, but not replaced. The only safe way to modify anything else is to create a replacement and commit using the functions described in Documentation/credentials.txt (which will be added by a later patch). This patch and the preceding patches have been tested with the LTP SELinux testsuite. This patch makes several logical sets of alteration: (1) execve(). This now prepares and commits credentials in various places in the security code rather than altering the current creds directly. (2) Temporary credential overrides. do_coredump() and sys_faccessat() now prepare their own credentials and temporarily override the ones currently on the acting thread, whilst preventing interference from other threads by holding cred_replace_mutex on the thread being dumped. This will be replaced in a future patch by something that hands down the credentials directly to the functions being called, rather than altering the task's objective credentials. (3) LSM interface. A number of functions have been changed, added or removed: (*) security_capset_check(), ->capset_check() (*) security_capset_set(), ->capset_set() Removed in favour of security_capset(). (*) security_capset(), ->capset() New. This is passed a pointer to the new creds, a pointer to the old creds and the proposed capability sets. It should fill in the new creds or return an error. All pointers, barring the pointer to the new creds, are now const. (*) security_bprm_apply_creds(), ->bprm_apply_creds() Changed; now returns a value, which will cause the process to be killed if it's an error. (*) security_task_alloc(), ->task_alloc_security() Removed in favour of security_prepare_creds(). (*) security_cred_free(), ->cred_free() New. Free security data attached to cred->security. (*) security_prepare_creds(), ->cred_prepare() New. Duplicate any security data attached to cred->security. (*) security_commit_creds(), ->cred_commit() New. Apply any security effects for the upcoming installation of new security by commit_creds(). (*) security_task_post_setuid(), ->task_post_setuid() Removed in favour of security_task_fix_setuid(). (*) security_task_fix_setuid(), ->task_fix_setuid() Fix up the proposed new credentials for setuid(). This is used by cap_set_fix_setuid() to implicitly adjust capabilities in line with setuid() changes. Changes are made to the new credentials, rather than the task itself as in security_task_post_setuid(). (*) security_task_reparent_to_init(), ->task_reparent_to_init() Removed. Instead the task being reparented to init is referred directly to init's credentials. NOTE! This results in the loss of some state: SELinux's osid no longer records the sid of the thread that forked it. (*) security_key_alloc(), ->key_alloc() (*) security_key_permission(), ->key_permission() Changed. These now take cred pointers rather than task pointers to refer to the security context. (4) sys_capset(). This has been simplified and uses less locking. The LSM functions it calls have been merged. (5) reparent_to_kthreadd(). This gives the current thread the same credentials as init by simply using commit_thread() to point that way. (6) __sigqueue_alloc() and switch_uid() __sigqueue_alloc() can't stop the target task from changing its creds beneath it, so this function gets a reference to the currently applicable user_struct which it then passes into the sigqueue struct it returns if successful. switch_uid() is now called from commit_creds(), and possibly should be folded into that. commit_creds() should take care of protecting __sigqueue_alloc(). (7) [sg]et[ug]id() and co and [sg]et_current_groups. The set functions now all use prepare_creds(), commit_creds() and abort_creds() to build and check a new set of credentials before applying it. security_task_set[ug]id() is called inside the prepared section. This guarantees that nothing else will affect the creds until we've finished. The calling of set_dumpable() has been moved into commit_creds(). Much of the functionality of set_user() has been moved into commit_creds(). The get functions all simply access the data directly. (8) security_task_prctl() and cap_task_prctl(). security_task_prctl() has been modified to return -ENOSYS if it doesn't want to handle a function, or otherwise return the return value directly rather than through an argument. Additionally, cap_task_prctl() now prepares a new set of credentials, even if it doesn't end up using it. (9) Keyrings. A number of changes have been made to the keyrings code: (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have all been dropped and built in to the credentials functions directly. They may want separating out again later. (b) key_alloc() and search_process_keyrings() now take a cred pointer rather than a task pointer to specify the security context. (c) copy_creds() gives a new thread within the same thread group a new thread keyring if its parent had one, otherwise it discards the thread keyring. (d) The authorisation key now points directly to the credentials to extend the search into rather pointing to the task that carries them. (e) Installing thread, process or session keyrings causes a new set of credentials to be created, even though it's not strictly necessary for process or session keyrings (they're shared). (10) Usermode helper. The usermode helper code now carries a cred struct pointer in its subprocess_info struct instead of a new session keyring pointer. This set of credentials is derived from init_cred and installed on the new process after it has been cloned. call_usermodehelper_setup() allocates the new credentials and call_usermodehelper_freeinfo() discards them if they haven't been used. A special cred function (prepare_usermodeinfo_creds()) is provided specifically for call_usermodehelper_setup() to call. call_usermodehelper_setkeys() adjusts the credentials to sport the supplied keyring as the new session keyring. (11) SELinux. SELinux has a number of changes, in addition to those to support the LSM interface changes mentioned above: (a) selinux_setprocattr() no longer does its check for whether the current ptracer can access processes with the new SID inside the lock that covers getting the ptracer's SID. Whilst this lock ensures that the check is done with the ptracer pinned, the result is only valid until the lock is released, so there's no point doing it inside the lock. (12) is_single_threaded(). This function has been extracted from selinux_setprocattr() and put into a file of its own in the lib/ directory as join_session_keyring() now wants to use it too. The code in SELinux just checked to see whether a task shared mm_structs with other tasks (CLONE_VM), but that isn't good enough. We really want to know if they're part of the same thread group (CLONE_THREAD). (13) nfsd. The NFS server daemon now has to use the COW credentials to set the credentials it is going to use. It really needs to pass the credentials down to the functions it calls, but it can't do that until other patches in this series have been applied. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
* init/do_mounts_md.c: remove duplicated #includeHuang Weiyi2008-11-011-1/+0
| | | | | | | | | | | | | | | | | | | | Removed duplicated #include <linux/delay.h> in init/do_mounts_md.c. The same compile error ("error: implicit declaration of function 'msleep'") got fixed twice: - f8b77d39397e1510b1a3bcfd385ebd1a45aae77f ("init/do_mounts_md.c: msleep compile fix") - 73b4a24f5ff09389ba6277c53a266b142f655ed2 ("init/do_mounts_md.c must #include <linux/delay.h>") by people adding the <linux/delay.h> include in two slightly different places. Andrew's quilt scripts happily ignore the fuzz, and will re-apply the patch even though they had conflicts. Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: update menuconfig help textKAMEZAWA Hiroyuki2008-10-301-6/+10
| | | | | | | | | | page_cgroup is now allocated at boot and memmap doesn't includes pointer for page_cgroup. Fix the menu help text. Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* init/do_mounts_md.c: msleep compile fixAlexey Dobriyan2008-10-301-1/+1
| | | | | | | | init/do_mounts_md.c:285: error: implicit declaration of function 'msleep' Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "Call init_workqueues before pre smp initcalls."Linus Torvalds2008-10-261-2/+1
| | | | | | | | | | | | | | | This reverts commit a802dd0eb5fc97a50cf1abb1f788a8f6cc5db635 by moving the call to init_workqueues() back where it belongs - after SMP has been initialized. It also moves stop_machine_init() - which needs workqueues - to a later phase using a core_initcall() instead of early_initcall(). That should satisfy all ordering requirements, and was apparently the reason why init_workqueues() was moved to be too early. Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2008-10-231-1/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits) [PATCH] fs: add a sanity check in d_free [PATCH] i_version: remount support [patch] vfs: make security_inode_setattr() calling consistent [patch 1/3] FS_MBCACHE: don't needlessly make it built-in [PATCH] move executable checking into ->permission() [PATCH] fs/dcache.c: update comment of d_validate() [RFC PATCH] touch_mnt_namespace when the mount flags change [PATCH] reiserfs: add missing llseek method [PATCH] fix ->llseek for more directories [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate() [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper [PATCH vfs-2.6 2/6] vfs: add d_ancestor() [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT() [PATCH] get rid of on-stack dentry in udf [PATCH 2/2] anondev: switch to IDA [PATCH 1/2] anondev: init IDR statically [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup() [PATCH] Optimise NFS readdir hack slightly. ...
| * [PATCH 1/2] anondev: init IDR staticallyAlexey Dobriyan2008-10-231-1/+0
| | | | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
* | Merge branch 'linux-next' of ↵Linus Torvalds2008-10-231-1/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (32 commits) PCI hotplug: fix logic in Compaq hotplug controller bus speed setup PCI: don't export linux/io.h from pci.h PCI: PCI_QUIRKS depends on PCI PCI hotplug: pciehp: poll data link layer link active PCI hotplug: pciehp: fix possible memory leak in pcie_init PCI: Workaround invalid P2P bridge bus numbers PCI Hotplug: fakephp: add duplicate slot name debugging PCI: Hotplug core: remove 'name' PCI: shcphp: remove 'name' parameter PCI: SGI Hotplug: stop managing bss_hotplug_slot->name PCI: rpaphp: kmalloc/kfree slot->name directly PCI: pciehp: remove 'name' parameter PCI: ibmphp: stop managing hotplug_slot->name PCI: fakephp: remove 'name' parameter PCI, PCI Hotplug: introduce slot_name helpers PCI: cpqphp: stop managing hotplug_slot->name PCI: cpci_hotplug: stop managing hotplug_slot->name PCI: acpiphp: remove 'name' parameter PCI: prevent duplicate slot names PCI Hotplug: serialize pci_hp_register and pci_hp_deregister ...
| * | PCI: PCI_QUIRKS depends on PCIGeert Uytterhoeven2008-10-231-1/+2
| |/ | | | | | | | | | | | | | | | | commit 3d137310245e4cdc3e8c8ba1bea2e145a87ae8e3 ("PCI: allow quirks to be compiled out") introduced CONFIG_PCI_QUIRKS, which now shows up in each and every .config. Fix this by making it depend on PCI. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linusLinus Torvalds2008-10-231-9/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: stop_machine: fix error code handling on multiple cpus stop_machine: use workqueues instead of kernel threads workqueue: introduce create_rt_workqueue Call init_workqueues before pre smp initcalls. Make panic= and panic_on_oops into core_params Make initcall_debug a core_param core_param() for genuinely core kernel parameters param: Fix duplicate module prefixes module: check kernel param length at compile time, not runtime Remove stop_machine during module load v2 module: simplify load_module.
| * | Call init_workqueues before pre smp initcalls.Heiko Carstens2008-10-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | This allows to create workqueues from within the context of a pre smp initcall (aka early_initcall). Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | Make initcall_debug a core_paramRusty Russell2008-10-221-7/+1
| |/ | | | | | | | | | | | | | | This is the one I really wanted: now it effects module loading, it makes sense to be able to flip it after boot. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Arjan van de Ven <arjan@linux.intel.com>
* / memcg: fix page_cgroup allocationKAMEZAWA Hiroyuki2008-10-231-0/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | page_cgroup_init() is called from mem_cgroup_init(). But at this point, we cannot call alloc_bootmem(). (and this caused panic at boot.) This patch moves page_cgroup_init() to init/main.c. Time table is following: == parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this. .... cgroup_init_early() # "early" init of cgroup. .... setup_arch() # memmap is allocated. ... page_cgroup_init(); mem_init(); # we cannot call alloc_bootmem after this. .... cgroup_init() # mem_cgroup is initialized. == Before page_cgroup_init(), mem_map must be initialized. So, I added page_cgroup_init() to init/main.c directly. (*) maybe this is not very clean but - cgroup_init_early() is too early - in cgroup_init(), we have to use vmalloc instead of alloc_bootmem(). use of vmalloc area in x86-32 is important and we should avoid very large vmalloc() in x86-32. So, we want to use alloc_bootmem() and added page_cgroup_init() directly to init/main.c [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration] [akpm@linux-foundation.org: fix build] Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'linux-next' of ↵Linus Torvalds2008-10-201-0/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (41 commits) PCI: fix pci_ioremap_bar() on s390 PCI: fix AER capability check PCI: use pci_find_ext_capability everywhere PCI: remove #ifdef DEBUG around dev_dbg call PCI hotplug: fix get_##name return value problem PCI: document the pcie_aspm kernel parameter PCI: introduce an pci_ioremap(pdev, barnr) function powerpc/PCI: Add legacy PCI access via sysfs PCI: Add ability to mmap legacy_io on some platforms PCI: probing debug message uniformization PCI: support PCIe ARI capability PCI: centralize the capabilities code in probe.c PCI: centralize the capabilities code in pci-sysfs.c PCI: fix 64-vbit prefetchable memory resource BARs PCI: replace cfg space size (256/4096) by macros. PCI: use resource_size() everywhere. PCI: use same arg names in PCI_VDEVICE comment PCI hotplug: rpaphp: make debug var unique PCI: use %pF instead of print_fn_descriptor_symbol() in quirks.c PCI: fix hotplug get_##name return value problem ...
| * PCI: allow quirks to be compiled outThomas Petazzoni2008-10-201-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the CONFIG_PCI_QUIRKS option which allows to remove all the PCI quirks, which are not necessarily used on embedded systems when PCI is working properly. As this is a size-reduction option, it depends on CONFIG_EMBEDDED. It allows to save almost 12 kilobytes of kernel code: text data bss dec hex filename 1287806 123596 212992 1624394 18c94a vmlinux.old 1275854 123596 212992 1612442 189a9a vmlinux -11952 0 0 -11952 -2EB0 +/- This patch has originally been written by Zwane Mwaikambo <zwane@arm.linux.org.uk> and is part of the Linux Tiny project. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
* | Merge branch 'tracing-v28-for-linus' of ↵Linus Torvalds2008-10-202-14/+27
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (131 commits) tracing/fastboot: improve help text tracing/stacktrace: improve help text tracing/fastboot: fix initcalls disposition in bootgraph.pl tracing/fastboot: fix bootgraph.pl initcall name regexp tracing/fastboot: fix issues and improve output of bootgraph.pl tracepoints: synchronize unregister static inline tracepoints: tracepoint_synchronize_unregister() ftrace: make ftrace_test_p6nop disassembler-friendly markers: fix synchronize marker unregister static inline tracing/fastboot: add better resolution to initcall debug/tracing trace: add build-time check to avoid overrunning hex buffer ftrace: fix hex output mode of ftrace tracing/fastboot: fix initcalls disposition in bootgraph.pl tracing/fastboot: fix printk format typo in boot tracer ftrace: return an error when setting a nonexistent tracer ftrace: make some tracers reentrant ring-buffer: make reentrant ring-buffer: move page indexes into page headers tracing/fastboot: only trace non-module initcalls ftrace: move pc counter in irqtrace ... Manually fix conflicts: - init/main.c: initcall tracing - kernel/module.c: verbose level vs tracepoints - scripts/bootgraph.pl: fallout from cherry-picking commits.
| * tracing/fastboot: add better resolution to initcall debug/tracingTim Bird2008-10-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the time resolution for initcall_debug to microseconds, from milliseconds. This is handy to determine which initcalls you want to work on for faster booting. One one of my test machines, over 90% of the initcalls are less than a millisecond and (without this patch) these are all reported as 0 msecs. Working on the 900 us ones is more important than the 4 us ones. With 'quiet' on the kernel command line, this adds no significant overhead to kernel boot time. Signed-off-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing/fastboot: only trace non-module initcallsFrederic Weisbecker2008-10-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | At this time, only built-in initcalls interest us. We can't really produce a relevant graph if we include the modules initcall too. I had good results after this patch (see svg in attachment). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing/fastboot: get the initcall name before it disappearsFrederic Weisbecker2008-10-141-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | After some initcall traces, some initcall names may be inconsistent. That's because these functions will disappear from the .init section and also their name from the symbols table. So we have to copy the name of the function in a buffer large enough during the trace appending. It is not costly for the ring_buffer because the number of initcall entries is commonly not really large. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing/fastboot: change the printing of boot tracer according to bootgraph.plFrederic Weisbecker2008-10-141-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | Change the boot tracer printing to make it parsable for the scripts/bootgraph.pl script. We have now to output two lines for each initcall, according to the printk in do_one_initcall() in init/main.c We need now the call's time and the return's time. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing/ftrace: launch boot tracing after pre-smp initcallsFrédéric Weisbecker2008-10-141-6/+10
| | | | | | | | | | | | | | | | | | | | Launch the boot tracing inside the initcall_debug area. Old printk have not been removed to keep the old way of initcall tracing for backward compatibility. [ mingo@elte.hu: resolved conflicts ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing/fastboot: add a script to visualize the kernel boot process / timeArjan van de Ven2008-10-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When optimizing the kernel boot time, it's very valuable to visualize what is going on at which time. In addition, with the fastboot asynchronous initcall level, it's very valuable to see which initcall gets run where and when. This patch adds a script to turn a dmesg into a SVG graph (that can be shown with tools such as InkScape, Gimp or Firefox) and a small change to the initcall code to print the PID of the thread calling the initcall (so that the script can work out the parallelism). Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
| * ftrace: mcount call site on boot nops coreSteven Rostedt2008-10-141-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the infrastructure to the converting the mcount call sites recorded by the __mcount_loc section into nops on boot. It also allows for using these sites to enable tracing as normal. When the __mcount_loc section is used, the "ftraced" kernel thread is disabled. This uses the current infrastructure to record the mcount call sites as well as convert them to nops. The mcount function is kept as a stub on boot up and not converted to the ftrace_record_ip function. We use the ftrace_record_ip to only record from the table. This patch does not handle modules. That comes with a later patch. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing: clean up tracepoints kconfig structureIngo Molnar2008-10-141-4/+5
| | | | | | | | | | | | | | | | | | do not expose users to CONFIG_TRACEPOINTS - tracers can select it just fine. update ftrace to select CONFIG_TRACEPOINTS. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing: disable tracepoints by defaultIngo Molnar2008-10-141-1/+0
| | | | | | | | | | | | while it's arguably low overhead, we dont enable new features by default. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing: Kernel TracepointsMathieu Desnoyers2008-10-141-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers. Allows complete typing verification by declaring both tracing statement inline functions and probe registration/unregistration static inline functions within the same macro "DEFINE_TRACE". No format string is required. See the tracepoint Documentation and Samples patches for usage examples. Taken from the documentation patch : "A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or "off" (no probe is attached). When a tracepoint is "off" it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a tracepoint is "on", the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site). You can put tracepoints at important locations in the code. They are lightweight hooks that can pass an arbitrary number of parameters, which prototypes are described in a tracepoint declaration placed in a header file." Addition and removal of tracepoints is synchronized by RCU using the scheduler (and preempt_disable) as guarantees to find a quiescent state (this is really RCU "classic"). The update side uses rcu_barrier_sched() with call_rcu_sched() and the read/execute side uses "preempt_disable()/preempt_enable()". We make sure the previous array containing probes, which has been scheduled for deletion by the rcu callback, is indeed freed before we proceed to the next update. It therefore limits the rate of modification of a single tracepoint to one update per RCU period. The objective here is to permit fast batch add/removal of probes on _different_ tracepoints. Changelog : - Use #name ":" #proto as string to identify the tracepoint in the tracepoint table. This will make sure not type mismatch happens due to connexion of a probe with the wrong type to a tracepoint declared with the same name in a different header. - Add tracepoint_entry_free_old. - Change __TO_TRACE to get rid of the 'i' iterator. Masami Hiramatsu <mhiramat@redhat.com> : Tested on x86-64. Performance impact of a tracepoint : same as markers, except that it adds about 70 bytes of instructions in an unlikely branch of each instrumented function (the for loop, the stack setup and the function call). It currently adds a memory read, a test and a conditional branch at the instrumentation site (in the hot path). Immediate values will eventually change this into a load immediate, test and branch, which removes the memory read which will make the i-cache impact smaller (changing the memory read for a load immediate removes 3-4 bytes per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also saves the d-cache hit). About the performance impact of tracepoints (which is comparable to markers), even without immediate values optimizations, tests done by Hideo Aoki on ia64 show no regression. His test case was using hackbench on a kernel where scheduler instrumentation (about 5 events in code scheduler code) was added. Quoting Hideo Aoki about Markers : I evaluated overhead of kernel marker using linux-2.6-sched-fixes git tree, which includes several markers for LTTng, using an ia64 server. While the immediate trace mark feature isn't implemented on ia64, there is no major performance regression. So, I think that we don't have any issues to propose merging marker point patches into Linus's tree from the viewpoint of performance impact. I prepared two kernels to evaluate. The first one was compiled without CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS. I downloaded the original hackbench from the following URL: http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c I ran hackbench 5 times in each condition and calculated the average and difference between the kernels. The parameter of hackbench: every 50 from 50 to 800 The number of CPUs of the server: 2, 4, and 8 Below is the results. As you can see, major performance regression wasn't found in any case. Even if number of processes increases, differences between marker-enabled kernel and marker- disabled kernel doesn't increase. Moreover, if number of CPUs increases, the differences doesn't increase either. Curiously, marker-enabled kernel is better than marker-disabled kernel in more than half cases, although I guess it comes from the difference of memory access pattern. * 2 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 4.811 | 4.872 | +0.061 | +1.27 | 100 | 9.854 | 10.309 | +0.454 | +4.61 | 150 | 15.602 | 15.040 | -0.562 | -3.6 | 200 | 20.489 | 20.380 | -0.109 | -0.53 | 250 | 25.798 | 25.652 | -0.146 | -0.56 | 300 | 31.260 | 30.797 | -0.463 | -1.48 | 350 | 36.121 | 35.770 | -0.351 | -0.97 | 400 | 42.288 | 42.102 | -0.186 | -0.44 | 450 | 47.778 | 47.253 | -0.526 | -1.1 | 500 | 51.953 | 52.278 | +0.325 | +0.63 | 550 | 58.401 | 57.700 | -0.701 | -1.2 | 600 | 63.334 | 63.222 | -0.112 | -0.18 | 650 | 68.816 | 68.511 | -0.306 | -0.44 | 700 | 74.667 | 74.088 | -0.579 | -0.78 | 750 | 78.612 | 79.582 | +0.970 | +1.23 | 800 | 85.431 | 85.263 | -0.168 | -0.2 | -------------------------------------------------------------- * 4 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.586 | 2.584 | -0.003 | -0.1 | 100 | 5.254 | 5.283 | +0.030 | +0.56 | 150 | 8.012 | 8.074 | +0.061 | +0.76 | 200 | 11.172 | 11.000 | -0.172 | -1.54 | 250 | 13.917 | 14.036 | +0.119 | +0.86 | 300 | 16.905 | 16.543 | -0.362 | -2.14 | 350 | 19.901 | 20.036 | +0.135 | +0.68 | 400 | 22.908 | 23.094 | +0.186 | +0.81 | 450 | 26.273 | 26.101 | -0.172 | -0.66 | 500 | 29.554 | 29.092 | -0.461 | -1.56 | 550 | 32.377 | 32.274 | -0.103 | -0.32 | 600 | 35.855 | 35.322 | -0.533 | -1.49 | 650 | 39.192 | 38.388 | -0.804 | -2.05 | 700 | 41.744 | 41.719 | -0.025 | -0.06 | 750 | 45.016 | 44.496 | -0.520 | -1.16 | 800 | 48.212 | 47.603 | -0.609 | -1.26 | -------------------------------------------------------------- * 8 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.094 | 2.072 | -0.022 | -1.07 | 100 | 4.162 | 4.273 | +0.111 | +2.66 | 150 | 6.485 | 6.540 | +0.055 | +0.84 | 200 | 8.556 | 8.478 | -0.078 | -0.91 | 250 | 10.458 | 10.258 | -0.200 | -1.91 | 300 | 12.425 | 12.750 | +0.325 | +2.62 | 350 | 14.807 | 14.839 | +0.032 | +0.22 | 400 | 16.801 | 16.959 | +0.158 | +0.94 | 450 | 19.478 | 19.009 | -0.470 | -2.41 | 500 | 21.296 | 21.504 | +0.208 | +0.98 | 550 | 23.842 | 23.979 | +0.137 | +0.57 | 600 | 26.309 | 26.111 | -0.198 | -0.75 | 650 | 28.705 | 28.446 | -0.259 | -0.9 | 700 | 31.233 | 31.394 | +0.161 | +0.52 | 750 | 34.064 | 33.720 | -0.344 | -1.01 | 800 | 36.320 | 36.114 | -0.206 | -0.57 | -------------------------------------------------------------- Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: 'Peter Zijlstra' <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | container freezer: implement freezer cgroup subsystemMatt Helsley2008-10-201-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements a new freezer subsystem in the control groups framework. It provides a way to stop and resume execution of all tasks in a cgroup by writing in the cgroup filesystem. The freezer subsystem in the container filesystem defines a file named freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in the cgroup. Reading will return the current state. * Examples of usage : # mkdir /containers/freezer # mount -t cgroup -ofreezer freezer /containers # mkdir /containers/0 # echo $some_pid > /containers/0/tasks to get status of the freezer subsystem : # cat /containers/0/freezer.state RUNNING to freeze all tasks in the container : # echo FROZEN > /containers/0/freezer.state # cat /containers/0/freezer.state FREEZING # cat /containers/0/freezer.state FROZEN to unfreeze all tasks in the container : # echo RUNNING > /containers/0/freezer.state # cat /containers/0/freezer.state RUNNING This is the basic mechanism which should do the right thing for user space task in a simple scenario. It's important to note that freezing can be incomplete. In that case we return EBUSY. This means that some tasks in the cgroup are busy doing something that prevents us from completely freezing the cgroup at this time. After EBUSY, the cgroup will remain partially frozen -- reflected by freezer.state reporting "FREEZING" when read. The state will remain "FREEZING" until one of these things happens: 1) Userspace cancels the freezing operation by writing "RUNNING" to the freezer.state file 2) Userspace retries the freezing operation by writing "FROZEN" to the freezer.state file (writing "FREEZING" is not legal and returns EIO) 3) The tasks that blocked the cgroup from entering the "FROZEN" state disappear from the cgroup's set of tasks. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: export thaw_process] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: rewrite vmap layerNick Piggin2008-10-201-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and provide a fast, scalable percpu frontend for small vmaps (requires a slightly different API, though). The biggest problem with vmap is actually vunmap. Presently this requires a global kernel TLB flush, which on most architectures is a broadcast IPI to all CPUs to flush the cache. This is all done under a global lock. As the number of CPUs increases, so will the number of vunmaps a scaled workload will want to perform, and so will the cost of a global TLB flush. This gives terrible quadratic scalability characteristics. Another problem is that the entire vmap subsystem works under a single lock. It is a rwlock, but it is actually taken for write in all the fast paths, and the read locking would likely never be run concurrently anyway, so it's just pointless. This is a rewrite of vmap subsystem to solve those problems. The existing vmalloc API is implemented on top of the rewritten subsystem. The TLB flushing problem is solved by using lazy TLB unmapping. vmap addresses do not have to be flushed immediately when they are vunmapped, because the kernel will not reuse them again (would be a use-after-free) until they are reallocated. So the addresses aren't allocated again until a subsequent TLB flush. A single TLB flush then can flush multiple vunmaps from each CPU. XEN and PAT and such do not like deferred TLB flushing because they can't always handle multiple aliasing virtual addresses to a physical address. They now call vm_unmap_aliases() in order to flush any deferred mappings. That call is very expensive (well, actually not a lot more expensive than a single vunmap under the old scheme), however it should be OK if not called too often. The virtual memory extent information is stored in an rbtree rather than a linked list to improve the algorithmic scalability. There is a per-CPU allocator for small vmaps, which amortizes or avoids global locking. To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces must be used in place of vmap and vunmap. Vmalloc does not use these interfaces at the moment, so it will not be quite so scalable (although it will use lazy TLB flushing). As a quick test of performance, I ran a test that loops in the kernel, linearly mapping then touching then unmapping 4 pages. Different numbers of tests were run in parallel on an 4 core, 2 socket opteron. Results are in nanoseconds per map+touch+unmap. threads vanilla vmap rewrite 1 14700 2900 2 33600 3000 4 49500 2800 8 70631 2900 So with a 8 cores, the rewritten version is already 25x faster. In a slightly more realistic test (although with an older and less scalable version of the patch), I ripped the not-very-good vunmap batching code out of XFS, and implemented the large buffer mapping with vm_map_ram and vm_unmap_ram... along with a couple of other tricks, I was able to speed up a large directory workload by 20x on a 64 CPU system. I believe vmap/vunmap is actually sped up a lot more than 20x on such a system, but I'm running into other locks now. vmap is pretty well blown off the profiles. Before: 1352059 total 0.1401 798784 _write_lock 8320.6667 <- vmlist_lock 529313 default_idle 1181.5022 15242 smp_call_function 15.8771 <- vmap tlb flushing 2472 __get_vm_area_node 1.9312 <- vmap 1762 remove_vm_area 4.5885 <- vunmap 316 map_vm_area 0.2297 <- vmap 312 kfree 0.1950 300 _spin_lock 3.1250 252 sn_send_IPI_phys 0.4375 <- tlb flushing 238 vmap 0.8264 <- vmap 216 find_lock_page 0.5192 196 find_next_bit 0.3603 136 sn2_send_IPI 0.2024 130 pio_phys_write_mmr 2.0312 118 unmap_kernel_range 0.1229 After: 78406 total 0.0081 40053 default_idle 89.4040 33576 ia64_spinlock_contention 349.7500 1650 _spin_lock 17.1875 319 __reg_op 0.5538 281 _atomic_dec_and_lock 1.0977 153 mutex_unlock 1.5938 123 iget_locked 0.1671 117 xfs_dir_lookup 0.1662 117 dput 0.1406 114 xfs_iget_core 0.0268 92 xfs_da_hashname 0.1917 75 d_alloc 0.0670 68 vmap_page_range 0.0462 <- vmap 58 kmem_cache_alloc 0.0604 57 memset 0.0540 52 rb_next 0.1625 50 __copy_user 0.0208 49 bitmap_find_free_region 0.2188 <- vmap 46 ia64_sn_udelay 0.1106 45 find_inode_fast 0.1406 42 memcmp 0.2188 42 finish_task_switch 0.1094 42 __d_lookup 0.0410 40 radix_tree_lookup_slot 0.1250 37 _spin_unlock_irqrestore 0.3854 36 xfs_bmapi 0.0050 36 kmem_cache_free 0.0256 35 xfs_vn_getattr 0.0322 34 radix_tree_lookup 0.1062 33 __link_path_walk 0.0035 31 xfs_da_do_buf 0.0091 30 _xfs_buf_find 0.0204 28 find_get_page 0.0875 27 xfs_iread 0.0241 27 __strncpy_from_user 0.2812 26 _xfs_buf_initialize 0.0406 24 _xfs_buf_lookup_pages 0.0179 24 vunmap_page_range 0.0250 <- vunmap 23 find_lock_page 0.0799 22 vm_map_ram 0.0087 <- vmap 20 kfree 0.0125 19 put_page 0.0330 18 __kmalloc 0.0176 17 xfs_da_node_lookup_int 0.0086 17 _read_lock 0.0885 17 page_waitqueue 0.0664 vmap has gone from being the top 5 on the profiles and flushing the crap out of all TLBs, to using less than 1% of kernel time. [akpm@linux-foundation.org: cleanups, section fix] [akpm@linux-foundation.org: fix build on alpha] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | init/do_mounts_md.c must #include <linux/delay.h>Adrian Bunk2008-10-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the following compile error caused by commit 589f800bb12c5cd6c9167bbf9bf3cb70cd8e422c ("fastboot: make the raid autodetect code wait for all devices to init"): CC init/do_mounts_md.o init/do_mounts_md.c: In function 'autodetect_raid': init/do_mounts_md.c:285: error: implicit declaration of function 'msleep' make[2]: *** [init/do_mounts_md.o] Error 1 Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Configure out AIO supportThomas Petazzoni2008-10-161-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchs adds the CONFIG_AIO option which allows to remove support for asynchronous I/O operations, that are not necessarly used by applications, particularly on embedded devices. As this is a size-reduction option, it depends on CONFIG_EMBEDDED. It allows to save ~7 kilobytes of kernel code/data: text data bss dec hex filename 1115067 119180 217088 1451335 162547 vmlinux 1108025 119048 217088 1444161 160941 vmlinux.new -7042 -132 0 -7174 -1C06 +/- This patch has been originally written by Matt Mackall <mpm@selenic.com>, and is part of the Linux Tiny project. [randy.dunlap@oracle.com: build fix] Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | initramfs: add option to preserve mtime from initramfs cpio imagesNye Liu2008-10-161-0/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When unpacking the cpio into the initramfs, mtimes are not preserved by default. This patch adds an INITRAMFS_PRESERVE_MTIME option that allows mtimes stored in the cpio image to be used when constructing the initramfs. For embedded applications that run exclusively out of the initramfs, this is invaluable: When building embedded application initramfs images, its nice to know when the files were actually created during the build process - that makes it easier to see what files were modified when so we can compare the files that are being used on the image with the files used during the build process. This might help (for example) to determine if the target system has all the updated files you expect to see w/o having to check MD5s etc. In our environment, the whole system runs off the initramfs partition, and seeing the modified times of the shared libraries (for example) helps us find bugs that may have been introduced by the build system incorrectly propogating outdated shared libraries into the image. Similarly, many of the initializion/configuration files in /etc might be dynamically built by the build system, and knowing when they were modified helps us sanity check whether the target system has the "latest" files etc. Finally, we might use last modified times to determine whether a hot fix should be applied or not to the running ramfs. Signed-off-by: Nye Liu <nyet@nyet.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | identify_ramdisk_image(): correct typo about return value in commentGeert Uytterhoeven2008-10-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | identify_ramdisk_image() returns 0 (not -1) if a gzipped ramdisk is found: if (buf[0] == 037 && ((buf[1] == 0213) || (buf[1] == 0236))) { printk(KERN_NOTICE "RAMDISK: Compressed image found at block %d\n", start_block); nblocks = 0; ^^^^^^^^^^^ goto done; } ... done: sys_lseek(fd, start_block * BLOCK_SIZE, 0); kfree(buf); return nblocks; ^^^^^^^^^^^^^^ Hence correct the typo in the comment, which has existed since the addition of compressed ramdisk support in 1.3.48. Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2008-10-142-10/+32
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-fastboot * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-fastboot: raid, fastboot: hide RAID autodetect option if MD is compiled as a module raid: make RAID autodetect default a KConfig option warning: fix init do_mounts_md c fastboot: make the RAID autostart code print a message just before waiting fastboot: make the raid autodetect code wait for all devices to init fastboot: Fix bootgraph.pl initcall name regexp fastboot: fix issues and improve output of bootgraph.pl Add a script to visualize the kernel boot process / time
| * | raid: make RAID autodetect default a KConfig optionArjan van de Ven2008-10-121-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RAID autodetect has the side effect of requiring synchronisation of all device drivers, which can make the boot several seconds longer (I've measured 7 on one of my laptops).... even for systems that don't have RAID setup for the root filesystem (the only FS where this matters). This patch makes the default for autodetect a config option; either way the user can always override via the kernel command line. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: NeilBrown <neilb@suse.de>
| * | warning: fix init do_mounts_md cIngo Molnar2008-10-121-15/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fix warning: init/do_mounts_md.c: In function ‘md_run_setup’: init/do_mounts_md.c:282: warning: ISO C90 forbids mixed declarations and code also, use the opportunity to put the RAID autodetection code into a separate function - this also solves a checkpatch style warning. No code changed: md5: aa36a35faef371b05f1974ad583bdbbd do_mounts_md.o.before.asm aa36a35faef371b05f1974ad583bdbbd do_mounts_md.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | fastboot: make the RAID autostart code print a message just before waitingArjan van de Ven2008-10-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | As requested/suggested by Neil Brown: make the raid code print that it's about to wait for probing to be done as well as give a suggestion on how to disable the probing if the user doesn't use raid. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com
| * | fastboot: make the raid autodetect code wait for all devices to initArjan van de Ven2008-10-121-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The raid autodetect code really needs to have all devices probed before it can detect raid arrays; not doing so would give rather messy situations where arrays would get detected as degraded while they shouldn't be etc. This is in preparation of removing the "wait for everything to init" code that makes everyone pay, not just raid users. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
| * | Add a script to visualize the kernel boot process / timeArjan van de Ven2008-10-121-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | When optimizing the kernel boot time, it's very valuable to visualize what is going on at which time. In addition, with some of the initializing going asynchronous soon, it's valuable to track/print which worker thread is executing the initialization. This patch adds a script to turn a dmesg into a SVG graph (that can be shown with tools such as InkScape, Gimp or Firefox) and a small change to the initcall code to print the PID of the thread calling the initcall (so that the script can work out the parallelism). Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>