summaryrefslogtreecommitdiffstats
path: root/mm (follow)
Commit message (Collapse)AuthorAgeFilesLines
* [PATCH] slab: clean up leak tracking ifdefs a little bitChristoph Hellwig2006-10-042-8/+11
| | | | | | | | | | | | | | | | - rename ____kmalloc to kmalloc_track_caller so that people have a chance to guess what it does just from it's name. Add a comment describing it for those who don't. Also move it after kmalloc in slab.h so people get less confused when they are just looking for kmalloc - move things around in slab.c a little to reduce the ifdef mess. [penberg@cs.helsinki.fi: Fix up reversed #ifdef] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] page_alloc: fix kernel-doc and func. declarationRandy Dunlap2006-10-041-25/+25
| | | | | | | | | | | | | | Fix kernel-doc and function declaration (missing "void") in mm/page_alloc.c. Add mm/page_alloc.c to kernel-api.tmpl in DocBook. mm/page_alloc.c:2589:38: warning: non-ANSI function declaration of function 'remove_all_active_ranges' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] enforce proper tlb flush in unmap_hugepage_rangeChen, Kenneth W2006-10-041-1/+7
| | | | | | | | | | | | | | | | Spotted by Hugh that hugetlb page is free'ed back to global pool before performing any TLB flush in unmap_hugepage_range(). This potentially allow threads to abuse free-alloc race condition. The generic tlb gather code is unsuitable to use by hugetlb, I just open coded a page gathering list and delayed put_page until tlb flush is performed. Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: micro optimise zone_watermark_okNick Piggin2006-10-041-1/+2
| | | | | | | | | | | | | Having min be a signed quantity means gcc can't turn high latency divides into shifts. There happen to be two such divides for GFP_ATOMIC (ie. networking, ie. important) allocations, one of which depends on the other. Fixing this makes code smaller as a bonus. Shame on somebody (probably me). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: fix in kerneldocHenrik Kretzschmar2006-10-041-2/+2
| | | | | | | | | | Fixes an kerneldoc error. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Spelling fix: "control" instead of "cotrol"Michael Opdenacker2006-10-032-4/+4
| | | | | | | | This patch against fixes a spelling mistake ("control" instead of "cotrol"). Signed-off-by: Michael Opdenacker <michael@free-electrons.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>
* fix file specification in commentsUwe Zeisberger2006-10-031-1/+1
| | | | | | | Many files include the filename at the beginning, serveral used a wrong one. Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>
* Fix "can not" in Documentation and KconfigMatt LaPlante2006-10-031-1/+1
| | | | | | | | | | | Randy brought it to my attention that in proper english "can not" should always be written "cannot". I donot see any reason to argue, even if I mightnot understand why this rule exists. This patch fixes "can not" in several Documentation files as well as three Kconfigs. Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
* more misc typo fixesMatt LaPlante2006-10-031-1/+1
| | | | Signed-off-by: Adrian Bunk <bunk@stusta.de>
* [PATCH] paravirt: lazy mmu mode hooks.patchZachary Amsden2006-10-013-0/+12
| | | | | | | | | | | | | | | | | | | | | | | Implement lazy MMU update hooks which are SMP safe for both direct and shadow page tables. The idea is that PTE updates and page invalidations while in lazy mode can be batched into a single hypercall. We use this in VMI for shadow page table synchronization, and it is a win. It also can be used by PPC and for direct page tables on Xen. For SMP, the enter / leave must happen under protection of the page table locks for page tables which are being modified. This is because otherwise, you end up with stale state in the batched hypercall, which other CPUs can race ahead of. Doing this under the protection of the locks guarantees the synchronization is correct, and also means that spurious faults which are generated during this window by remote CPUs are properly handled, as the page fault handler must re-check the PTE under protection of the same lock. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] paravirt: pte clear not presentZachary Amsden2006-10-012-2/+2
| | | | | | | | | | | | | | Change pte_clear_full to a more appropriately named pte_clear_not_present, allowing optimizations when not-present mapping changes need not be reflected in the hardware TLB for protected page table modes. There is also another case that can use it in the fremap code. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] paravirt: remove read hazard from cowZachary Amsden2006-10-011-1/+1
| | | | | | | | | | | | | We don't want to read PTEs directly like this after they have been modified, as a lazy MMU implementation of direct page tables may not have written the updated PTE back to memory yet. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] invalidate_inode_pages2(): ignore page refcountsAndrew Morton2006-10-011-2/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | The recent fix to invalidate_inode_pages() (git commit 016eb4a) managed to unfix invalidate_inode_pages2(). The problem is that various bits of code in the kernel can take transient refs on pages: the page scanner will do this when inspecting a batch of pages, and the lru_cache_add() batching pagevecs also hold a ref. Net result is transient failures in invalidate_inode_pages2(). This affects NFS directory invalidation (observed) and presumably also block-backed direct-io (not yet reported). Fix it by reverting invalidate_inode_pages2() back to the old version which ignores the page refcounts. We may come up with something more clever later, but for now we need a 2.6.18 fix for NFS. Cc: Chuck Lever <cel@citi.umich.edu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] r/o bind mount prepwork: inc_nlink() helperDave Hansen2006-10-011-4/+4
| | | | | | | | | | | This is mostly included for parity with dec_nlink(), where we will have some more hooks. This one should stay pretty darn straightforward for now. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] r/o bind mounts: unlink: monitor i_nlinkDave Hansen2006-10-011-5/+5
| | | | | | | | | | | | | | | | | When a filesystem decrements i_nlink to zero, it means that a write must be performed in order to drop the inode from the filesystem. We're shortly going to have keep filesystems from being remounted r/o between the time that this i_nlink decrement and that write occurs. So, add a little helper function to do the decrements. We'll tie into it in a bit to note when i_nlink hits zero. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Streamline generic_file_* interfaces and filemap cleanupsBadari Pulavarty2006-10-011-83/+4
| | | | | | | | | | | | | | | | | | | | | | This patch cleans up generic_file_*_read/write() interfaces. Christoph Hellwig gave me the idea for this clean ups. In a nutshell, all filesystems should set .aio_read/.aio_write methods and use do_sync_read/ do_sync_write() as their .read/.write methods. This allows us to cleanup all variants of generic_file_* routines. Final available interfaces: generic_file_aio_read() - read handler generic_file_aio_write() - write handler generic_file_aio_write_nolock() - no lock write handler __generic_file_aio_write_nolock() - internal worker routine Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Remove readv/writev methods and use aio_read/aio_write insteadBadari Pulavarty2006-10-011-36/+0
| | | | | | | | | | This patch removes readv() and writev() methods and replaces them with aio_read()/aio_write() methods. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Vectorize aio_read/aio_write fileop methodsBadari Pulavarty2006-10-011-19/+20
| | | | | | | | | | | | This patch vectorizes aio_read() and aio_write() methods to prepare for collapsing all aio & vectored operations into one interface - which is aio_read()/aio_write(). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Michael Holzheu <HOLZHEU@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kmemdup: some usersAlexey Dobriyan2006-10-011-2/+1
| | | | | Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kmemdup: introduceAlexey Dobriyan2006-10-011-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of idiomatic ways to duplicate a region of memory is dst = kmalloc(len, GFP_KERNEL); if (!dst) return -ENOMEM; memcpy(dst, src, len); which is neat code except a programmer needs to write size twice. Which sometimes leads to mistakes. If len passed to kmalloc is smaller that len passed to memcpy, it's straight overwrite-beyond-end. If len passed to memcpy is smaller than len passed to kmalloc, it's either a) legit behaviour ;-), or b) cloned buffer will contain garbage in second half. Slight trolling of commit lists shows several duplications bugs done exactly because of diverged lenghts: Linux: [CRYPTO]: Fix memcpy/memset args. [PATCH] memcpy/memset fixes OpenBSD: kerberosV/src/lib/asn1: der_copy.c:1.4 If programmer is given only one place to play with lengths, I believe, such mistakes could be avoided. With kmemdup, the snippet above will be rewritten as: dst = kmemdup(src, len, GFP_KERNEL); if (!dst) return -ENOMEM; This also leads to smaller code (kzalloc effect). Quick grep shows 200+ places where kmemdup() can be used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hot-add-mem x86_64: use CONFIG_MEMORY_HOTPLUG_RESERVEKeith Mannthey2006-10-011-30/+30
| | | | | | | | | | | | | The api for hot-add memory already has a construct for finding nodes based on an address, memory_add_physaddr_to_nid. This patch allows the fucntion to do something besides return 0. It uses the nodes_add infomation to lookup to node info for a hot add event. Signed-off-by: Keith Mannthey <kmannth@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hot-add-mem x86_64: use CONFIG_MEMORY_HOTPLUG_SPARSEKeith Mannthey2006-10-011-0/+2
| | | | | | | | | | Migate CONFIG_MEMORY_HOTPLUG to CONFIG_MEMORY_HOTPLUG_SPARSE where needed. Signed-off-by: Keith Mannthey <kmannth@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hot-add-mem x86_64: Kconfig changesKeith Mannthey2006-10-011-1/+6
| | | | | | | | | | | | | | Create Kconfig namespace for MEMORY_HOTPLUG_RESERVE and MEMORY_HOTPLUG_SPARSE. This is needed to create a disticiton between the 2 paths. Selecting the high level opiton of MEMORY_HOTPLUG will get you MEMORY_HOTPLUG_SPARSE if you have sparsemem enabled or MEMORY_HOTPLUG_RESERVE if you are x86_64 with discontig and ACPI numa support. Signed-off-by: Keith Mannthey <kmannth@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hot-add-mem x86_64: fixup externsKeith Mannthey2006-10-011-4/+0
| | | | | | | | | | Fix up externs in memory_hotplug.c. Cleanup. Signed-off-by: Keith Mannthey <kmannth@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NOMMU: don't try and give NULL to fput()Gavin Lambert2006-10-011-1/+2
| | | | | | | | | Don't try and give NULL to fput() in the error handling in do_mmap_pgoff() as it'll cause an oops. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] BLOCK: Make it possible to disable the block layer [try #6]David Howells2006-09-305-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it possible to disable the block layer. Not all embedded devices require it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require the block layer to be present. This patch does the following: (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev support. (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls an item that uses the block layer. This includes: (*) Block I/O tracing. (*) Disk partition code. (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS. (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the block layer to do scheduling. Some drivers that use SCSI facilities - such as USB storage - end up disabled indirectly from this. (*) Various block-based device drivers, such as IDE and the old CDROM drivers. (*) MTD blockdev handling and FTL. (*) JFFS - which uses set_bdev_super(), something it could avoid doing by taking a leaf out of JFFS2's book. (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is, however, still used in places, and so is still available. (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and parts of linux/fs.h. (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK. (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK. (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK is not enabled. (*) fs/no-block.c is created to hold out-of-line stubs and things that are required when CONFIG_BLOCK is not set: (*) Default blockdev file operations (to give error ENODEV on opening). (*) Makes some /proc changes: (*) /proc/devices does not list any blockdevs. (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK. (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK. (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if given command other than Q_SYNC or if a special device is specified. (*) In init/do_mounts.c, no reference is made to the blockdev routines if CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2. (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return error ENOSYS by way of cond_syscall if so). (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if CONFIG_BLOCK is not set, since they can't then happen. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* [PATCH] BLOCK: Dissociate generic_writepages() from mpage stuff [try #6]David Howells2006-09-301-0/+134
| | | | | | | | | | | | | | Dissociate the generic_writepages() function from the mpage stuff, moving its declaration to linux/mm.h and actually emitting a full implementation into mm/page-writeback.c. The implementation is a partial duplicate of mpage_writepages() with all BIO references removed. It is used by NFS to do writeback. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* [PATCH] BLOCK: Separate the bounce buffering code from the highmem code [try #6]David Howells2006-09-303-281/+305
| | | | | | | | | | | Move the bounce buffer code from mm/highmem.c to mm/bounce.c so that it can be more easily disabled when the block layer is disabled. !!!NOTE!!! There may be a bug in this code: Should init_emergency_pool() be contingent on CONFIG_HIGHMEM? Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* [PATCH] BLOCK: Stop fallback_migrate_page() from using page_has_buffers() ↵David Howells2006-09-301-1/+1
| | | | | | | | | | [try #6] Stop fallback_migrate_page() from using page_has_buffers() since that might not be available. Use PagePrivate() instead since that's more general. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* [PATCH] BLOCK: Move functions out of buffer code [try #6]David Howells2006-09-303-0/+55
| | | | | | | | | | | | | | | | | | | | | | | | | Move some functions out of the buffering code that aren't strictly buffering specific. This is a precursor to being able to disable the block layer. (*) Moved some stuff out of fs/buffer.c: (*) The file sync and general sync stuff moved to fs/sync.c. (*) The superblock sync stuff moved to fs/super.c. (*) do_invalidatepage() moved to mm/truncate.c. (*) try_to_release_page() moved to mm/filemap.c. (*) Moved some related declarations between header files: (*) declarations for do_invalidatepage() and try_to_release_page() moved to linux/mm.h. (*) __set_page_dirty_buffers() moved to linux/buffer_head.h. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* [PATCH] Access Control Lists for tmpfsAndreas Gruenbacher2006-09-293-2/+295
| | | | | | | | | Add access control lists for tmpfs. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] valid_swaphandles() fixHugh Dickins2006-09-291-3/+4
| | | | | | | | | | | akpm draws my attention to the fact that sysctl(VM_PAGE_CLUSTER) might conceivably change page_cluster to 0 while valid_swaphandles() is in the middle of using it, leading to an embarrassingly long loop: take a local snapshot of page_cluster and work with that. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] call mm/page-writeback.c:set_ratelimit() when new pages are hot-addedChandra Seetharaman2006-09-292-3/+5
| | | | | | | | | | | | | | ratelimit_pages in page-writeback.c is recalculated (in set_ratelimit()) every time a CPU is hot-added/removed. But this value is not recalculated when new pages are hot-added. This patch fixes that problem by calling set_ratelimit() when new pages are hot-added. [akpm@osdl.org: cleanups] Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove static variable mm/page-writeback.c:total_pagesChandra Seetharaman2006-09-291-7/+4
| | | | | | | | | | | | | | | | | | | | | page-writeback.c has a static local variable "total_pages", which is the total number of pages in the system. There is a global variable "vm_total_pages", which is the total number of pages the VM controls. Both are assigned from the return value of nr_free_pagecache_pages(). This patch removes the local variable and uses the global variable in that place. One more issue with the local static variable "total_pages" is that it is not updated when new pages are hot-added. Since vm_total_pages is updated when new pages are hot-added, this patch fixes that problem too. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_mapPaul Jackson2006-09-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the list of memory nodes allowed to tasks in the top (root) nodeset to dynamically track what cpus are online, using a call to a cpuset hook from the memory hotplug code. Make this top cpus file read-only. On systems that have cpusets configured in their kernel, but that aren't actively using cpusets (for some distros, this covers the majority of systems) all tasks end up in the top cpuset. If that system does support memory hotplug, then these tasks cannot make use of memory nodes that are added after system boot, because the memory nodes are not allowed in the top cpuset. This is a surprising regression over earlier kernels that didn't have cpusets enabled. One key motivation for this change is to remain consistent with the behaviour for the top_cpuset's 'cpus', which is also read-only, and which automatically tracks the cpu_online_map. This change also has the minor benefit that it fixes a long standing, little noticed, minor bug in cpusets. The cpuset performance tweak to short circuit the cpuset_zone_allowed() check on systems with just a single cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply changing the 'mems' of the top_cpuset had no affect, even though the change (the write system call) appeared to succeed. With the following change, that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly refuses to be changed via user space writes. Thus no one should be mislead into thinking they've changed the top_cpusets's 'mems' when in affect they haven't. In order to keep the behaviour of cpusets consistent between systems actively making use of them and systems not using them, this patch changes the behaviour of the 'mems' file in the top (root) cpuset, making it read only, and making it automatically track the value of node_online_map. Thus tasks in the top cpuset will have automatic use of hot plugged memory nodes allowed by their cpuset. [akpm@osdl.org: build fix] [bunk@stusta.de: build fix] Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom: don't kill current when another OOM in progressNick Piggin2006-09-291-6/+17
| | | | | | | | | | | | | | | A previous patch to allow an exiting task to OOM kill itself (and thereby avoid a little deadlock) introduced a problem. We don't want the PF_EXITING task, even if it is 'current', to access mem reserves if there is already a TIF_MEMDIE process in the system sucking up reserves. Also make the commenting a little bit clearer, and note that our current scheme of effectively single threading the OOM killer is not itself perfect. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] oom_kill_task(): cleanup ->mm checksOleg Nesterov2006-09-291-5/+2
| | | | | | | | | | | - It is not possible to have task->mm == &init_mm. - task_lock() buys nothing for 'if (!p->mm)' check. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] select_bad_process(): cleanup 'releasing' checkOleg Nesterov2006-09-291-10/+9
| | | | | | | | | No logic changes, but imho easier to read. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD checkOleg Nesterov2006-09-291-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only one usage of TASK_DEAD outside of last schedule path, select_bad_process: for_each_task(p) { if (!p->mm) continue; ... if (p->state == TASK_DEAD) continue; ... TASK_DEAD state is set at the end of do_exit(), this means that p->mm was already set == NULL by exit_mm(), so this task was already rejected by 'if (!p->mm)' above. Note also that the caller holds tasklist_lock, this means that p can't pass exit_notify() and then set TASK_DEAD when p->mm != NULL. Also, remove open-coded is_init(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] introduce TASK_DEAD stateOleg Nesterov2006-09-291-1/+1
| | | | | | | | | | | | | | | | I am not sure about this patch, I am asking Ingo to take a decision. task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h. Note that this state is not visible to user-space, get_task_state() masks off unsuitable states. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kill PF_DEAD flagOleg Nesterov2006-09-291-2/+2
| | | | | | | | | | After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we don't need PF_DEAD any longer. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pidspace: is_init()Sukadev Bhattiprolu2006-09-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] single bit flip detectorDave Jones2006-09-291-1/+23
| | | | | | | | | | | | | | | | | | In cases where we detect a single bit has been flipped, we spew the usual slab corruption message, which users instantly think is a kernel bug. In a lot of cases, single bit errors are down to bad memory, or other hardware failure. This patch adds an extra line to the slab debug messages in those cases, in the hope that users will try memtest before they report a bug. 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Single bit error detected. Possibly bad RAM. Run memtest86. [akpm@osdl.org: cleanups] Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: make filemap_nopage use NOPAGE_SIGBUSAdam Litke2006-09-291-3/+3
| | | | | | | | | Don't open-code NOPAGE_SIGBUS. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: fix a race condition under SMC + COWSiddha, Suresh B2006-09-291-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Failing context is a multi threaded process context and the failing sequence is as follows. One thread T0 doing self modifying code on page X on processor P0 and another thread T1 doing COW (breaking the COW setup as part of just happened fork() in another thread T2) on the same page X on processor P1. T0 doing SMC can endup modifying the new page Y (allocated by the T1 doing COW on P1) but because of different I/D TLB's, P0 ITLB will not see the new mapping till the flush TLB IPI from P1 is received. During this interval, if T0 executes the code created by SMC it can result in an app error (as ITLB still points to old page X and endup executing the content in page X rather than using the content in page Y). Fix this issue by first clearing the PTE and flushing it, before updating it with new entry. Hugh sayeth: I was a bit sceptical, in the habit of thinking that Self Modifying Code must look such issues itself: but I guess there's nothing it can do to avoid this one. Fair enough, what you're changing it to is pretty much what powerpc and s390 were already doing, and is a more robust way of proceeding, consistent with how ptes are set everywhere else. The ptep_clear_flush is a bit heavy-handed (it's anxious to return the pte that was atomically cleared), but we'd have to wander through lots of arches to get the right minimal behaviour. It'd also be nice to eliminate ptep_establish completely, now only used to define other macros/inlines: it always seemed obfuscation to me, what you've got there now is clearer. Let's put those cleanups on a TODO list. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] inode-diet: Eliminate i_blksize from the inode structureTheodore Ts'o2006-09-271-1/+0
| | | | | | | | | | | | | | | | This eliminates the i_blksize field from struct inode. Filesystems that want to provide a per-inode st_blksize can do so by providing their own getattr routine instead of using the generic_fillattr() function. Note that some filesystems were providing pretty much random (and incorrect) values for i_blksize. [bunk@stusta.de: cleanup] [akpm@osdl.org: generic_fillattr() fix] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NOMMU: Make futexes work under NOMMU conditionsDavid Howells2006-09-271-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make futexes work under NOMMU conditions. This can be tested by running this in one shell: #define SYSERROR(X, Y) \ do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0) int main() { int shmid, tmp, *f, n; shmid = shmget(23, 4, IPC_CREAT|0666); SYSERROR(shmid, "shmget"); f = shmat(shmid, NULL, 0); SYSERROR(f, "shmat"); n = *f; printf("WAIT: %p{%x}\n", f, n); tmp = futex(f, FUTEX_WAIT, n, NULL, NULL, 0); SYSERROR(tmp, "futex"); printf("WAITED: %d\n", tmp); tmp = shmdt(f); SYSERROR(tmp, "shmdt"); exit(0); } And then this in the other shell: #define SYSERROR(X, Y) \ do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0) int main() { int shmid, tmp, *f; shmid = shmget(23, 4, IPC_CREAT|0666); SYSERROR(shmid, "shmget"); f = shmat(shmid, NULL, 0); SYSERROR(f, "shmat"); (*f)++; printf("WAKE: %p{%x}\n", f, *f); tmp = futex(f, FUTEX_WAKE, 1, NULL, NULL, 0); SYSERROR(tmp, "futex"); printf("WOKE: %d\n", tmp); tmp = shmdt(f); SYSERROR(tmp, "shmdt"); exit(0); } The first program will set up a SYSV IPC SHM segment and wait on a futex in it for the number at the start to change. The program will increment that number and wake the first program up. This leads to output of the form: SHELL 1 SHELL 2 ======================= ======================= # /dowait WAIT: 0xc32ac000{0} # /dowake WAKE: 0xc32ac000{1} WAITED: 0 WOKE: 1 Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NOMMU: Make mremap() partially work for NOMMU kernelsDavid Howells2006-09-271-17/+46
| | | | | | | | | | | | | | | Make mremap() partially work for NOMMU kernels. It may resize a VMA provided that it doesn't exceed the size of the slab object in which the storage is allocated that the VMA refers to. Shareable VMAs may not be resized. Moving VMAs (as permitted by MREMAP_MAYMOVE) is not currently supported. This patch also makes use of the fact that the VMA list is now ordered to cut it short when possible. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NOMMU: Order the per-mm_struct VMA listDavid Howells2006-09-271-32/+72
| | | | | | | | | Order the per-mm_struct VMA list by address so that searching it can be cut short when the appropriate address has been exceeded. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NOMMU: Permit ptrace to ignore non-PROT_WRITE VMAs in NOMMU modeDavid Howells2006-09-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Permit ptrace to modify a section that's non-shared but is marked unwritable, such as is obtained by mapping the text segment of an ELF-FDPIC executable binary with into a binary that's being ptraced[*]. [*] Under NOMMU conditions ptrace causes read-only MAP_PRIVATE mmaps to become totally private copies because if a private mapping was actually shared then the debugging setting breakpoints in it would potentially crash other processes. This is done by using the VM_MAYWRITE flag rather than the VM_WRITE flag when deciding whether to permit a write. Without this patch a debugger can't set breakpoints in the mapped text sections of executables that are mapped read-only private, even if the mmap() syscall has taken a private copy because PT_PTRACED is set. In addition, VM_MAYREAD is used instead of VM_READ for similar reasons. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>