Diffstat (limited to 'Documentation')
73 files changed, 4937 insertions, 916 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 5f7f7d7f77d2..02457ec9c94f 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -184,6 +184,8 @@ mtrr.txt - how to use PPro Memory Type Range Registers to increase performance. nbd.txt - info on a TCP implementation of a network block device. +netlabel/ + - directory with information on the NetLabel subsystem. networking/ - directory with info on various aspects of networking with Linux. nfsroot.txt diff --git a/Documentation/ABI/obsolete/devfs b/Documentation/ABI/removed/devfs index b8b87399bc8f..8195c4e0d0a1 100644 --- a/Documentation/ABI/obsolete/devfs +++ b/Documentation/ABI/removed/devfs @@ -1,13 +1,12 @@ What: devfs -Date: July 2005 +Date: July 2005 (scheduled), finally removed in kernel v2.6.18 Contact: Greg Kroah-Hartman <gregkh@suse.de> Description: devfs has been unmaintained for a number of years, has unfixable races, contains a naming policy within the kernel that is against the LSB, and can be replaced by using udev. - The files fs/devfs/*, include/linux/devfs_fs*.h will be removed, + The files fs/devfs/*, include/linux/devfs_fs*.h were removed, along with the the assorted devfs function calls throughout the kernel tree. Users: - diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power new file mode 100644 index 000000000000..d882f8093871 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-power @@ -0,0 +1,88 @@ +What: /sys/power/ +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power directory will contain files that will + provide a unified interface to the power management + subsystem. + +What: /sys/power/state +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power/state file controls the system power state. 
+ Reading from this file returns what states are supported, + which is hard-coded to 'standby' (Power-On Suspend), 'mem' + (Suspend-to-RAM), and 'disk' (Suspend-to-Disk). + + Writing to this file one of these strings causes the system to + transition into that state. Please see the file + Documentation/power/states.txt for a description of each of + these states. + +What: /sys/power/disk +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power/disk file controls the operating mode of the + suspend-to-disk mechanism. Reading from this file returns + the name of the method by which the system will be put to + sleep on the next suspend. There are four methods supported: + 'firmware' - means that the memory image will be saved to disk + by some firmware, in which case we also assume that the + firmware will handle the system suspend. + 'platform' - the memory image will be saved by the kernel and + the system will be put to sleep by the platform driver (e.g. + ACPI or other PM registers). + 'shutdown' - the memory image will be saved by the kernel and + the system will be powered off. + 'reboot' - the memory image will be saved by the kernel and + the system will be rebooted. + + The suspend-to-disk method may be chosen by writing to this + file one of the accepted strings: + + 'firmware' + 'platform' + 'shutdown' + 'reboot' + + It will only change to 'firmware' or 'platform' if the system + supports that. + +What: /sys/power/image_size +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power/image_size file controls the size of the image + created by the suspend-to-disk mechanism. It can be written a + string representing a non-negative integer that will be used + as an upper limit of the image size, in bytes. The kernel's + suspend-to-disk code will do its best to ensure the image size + will not exceed this number. 
However, if it turns out to be + impossible, the kernel will try to suspend anyway using the + smallest image possible. In particular, if "0" is written to + this file, the suspend image will be as small as possible. + + Reading from this file will display the current image size + limit, which is set to 500 MB by default. + +What: /sys/power/pm_trace +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power/pm_trace file controls the code which saves the + last PM event point in the RTC across reboots, so that you can + debug a machine that just hangs during suspend (or more + commonly, during resume). Namely, the RTC is only used to save + the last PM event point if this file contains '1'. Initially + it contains '0' which may be changed to '1' by writing a + string representing a nonzero integer into it. + + To use this debugging feature you should attempt to suspend + the machine, then reboot it and run + + dmesg -s 1000000 | grep 'hash matches' + + CAUTION: Using it will cause your machine's real-time (CMOS) + clock to be set to a random invalid time after a resume. diff --git a/Documentation/Changes b/Documentation/Changes index 488272074c36..abee7f58c1ed 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -37,15 +37,14 @@ o e2fsprogs 1.29 # tune2fs o jfsutils 1.1.3 # fsck.jfs -V o reiserfsprogs 3.6.3 # reiserfsck -V 2>&1|grep reiserfsprogs o xfsprogs 2.6.0 # xfs_db -V -o pcmciautils 004 -o pcmcia-cs 3.1.21 # cardmgr -V +o pcmciautils 004 # pccardctl -V o quota-tools 3.09 # quota -V o PPP 2.4.0 # pppd --version o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version o nfs-utils 1.0.5 # showmount --version o procps 3.2.0 # ps --version o oprofile 0.9 # oprofiled --version -o udev 071 # udevinfo -V +o udev 081 # udevinfo -V Kernel compilation ================== @@ -268,7 +267,7 @@ active clients. 
To enable this new functionality, you need to: - mount -t nfsd nfsd /proc/fs/nfs + mount -t nfsd nfsd /proc/fs/nfsd before running exportfs or mountd. It is recommended that all NFS services be protected from the internet-at-large by a firewall where diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index 6d2412ec91ed..29c18966b050 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -532,6 +532,40 @@ appears outweighs the potential value of the hint that tells gcc to do something it would have done anyway. + Chapter 16: Function return values and names + +Functions can return values of many different kinds, and one of the +most common is a value indicating whether the function succeeded or +failed. Such a value can be represented as an error-code integer +(-Exxx = failure, 0 = success) or a "succeeded" boolean (0 = failure, +non-zero = success). + +Mixing up these two sorts of representations is a fertile source of +difficult-to-find bugs. If the C language included a strong distinction +between integers and booleans then the compiler would find these mistakes +for us... but it doesn't. To help prevent such bugs, always follow this +convention: + + If the name of a function is an action or an imperative command, + the function should return an error-code integer. If the name + is a predicate, the function should return a "succeeded" boolean. + +For example, "add work" is a command, and the add_work() function returns 0 +for success or -EBUSY for failure. In the same way, "PCI device present" is +a predicate, and the pci_dev_present() function returns 1 if it succeeds in +finding a matching device or 0 if it doesn't. + +All EXPORTed functions must respect this convention, and so should all +public functions. Private (static) functions need not, but it is +recommended that they do. 
+ +Functions whose return value is the actual result of a computation, rather +than an indication of whether the computation succeeded, are not subject to +this rule. Generally they indicate failure by returning some out-of-range +result. Typical examples would be functions that return pointers; they use +NULL or the ERR_PTR mechanism to report failure. + + Appendix I: References diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index f8fe882e33dc..6d4b1ef5b6f1 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -181,27 +181,6 @@ X!Ilib/string.c </sect1> </chapter> - <chapter id="proc"> - <title>The proc filesystem</title> - - <sect1><title>sysctl interface</title> -!Ekernel/sysctl.c - </sect1> - - <sect1><title>proc filesystem interface</title> -!Ifs/proc/base.c - </sect1> - </chapter> - - <chapter id="debugfs"> - <title>The debugfs filesystem</title> - - <sect1><title>debugfs interface</title> -!Efs/debugfs/inode.c -!Efs/debugfs/file.c - </sect1> - </chapter> - <chapter id="vfs"> <title>The Linux VFS</title> <sect1><title>The Filesystem types</title> @@ -234,6 +213,50 @@ X!Ilib/string.c </sect1> </chapter> + <chapter id="proc"> + <title>The proc filesystem</title> + + <sect1><title>sysctl interface</title> +!Ekernel/sysctl.c + </sect1> + + <sect1><title>proc filesystem interface</title> +!Ifs/proc/base.c + </sect1> + </chapter> + + <chapter id="sysfs"> + <title>The Filesystem for Exporting Kernel Objects</title> +!Efs/sysfs/file.c +!Efs/sysfs/symlink.c +!Efs/sysfs/bin.c + </chapter> + + <chapter id="debugfs"> + <title>The debugfs filesystem</title> + + <sect1><title>debugfs interface</title> +!Efs/debugfs/inode.c +!Efs/debugfs/file.c + </sect1> + </chapter> + + <chapter id="relayfs"> + <title>relay interface support</title> + + <para> + Relay interface support + is designed to provide an efficient mechanism for tools and + facilities to relay large amounts of data from kernel 
space to + user space. + </para> + + <sect1><title>relay interface</title> +!Ekernel/relay.c +!Ikernel/relay.c + </sect1> + </chapter> + <chapter id="netcore"> <title>Linux Networking</title> <sect1><title>Networking Base Types</title> @@ -349,13 +372,6 @@ X!Earch/i386/kernel/mca.c </sect1> </chapter> - <chapter id="sysfs"> - <title>The Filesystem for Exporting Kernel Objects</title> -!Efs/sysfs/file.c -!Efs/sysfs/symlink.c -!Efs/sysfs/bin.c - </chapter> - <chapter id="security"> <title>Security Framework</title> !Esecurity/security.c @@ -386,6 +402,7 @@ X!Iinclude/linux/device.h --> !Edrivers/base/driver.c !Edrivers/base/core.c +!Edrivers/base/class.c !Edrivers/base/firmware_class.c !Edrivers/base/transport_class.c !Edrivers/base/dmapool.c @@ -437,6 +454,11 @@ X!Edrivers/pnp/system.c !Eblock/ll_rw_blk.c </chapter> + <chapter id="chrdev"> + <title>Char devices</title> +!Efs/char_dev.c + </chapter> + <chapter id="miscdev"> <title>Miscellaneous Devices</title> !Edrivers/char/misc.c diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl index e97c32314541..065e8dc23e3a 100644 --- a/Documentation/DocBook/libata.tmpl +++ b/Documentation/DocBook/libata.tmpl @@ -868,18 +868,18 @@ and other resources, etc. <chapter id="libataExt"> <title>libata Library</title> -!Edrivers/scsi/libata-core.c +!Edrivers/ata/libata-core.c </chapter> <chapter id="libataInt"> <title>libata Core Internals</title> -!Idrivers/scsi/libata-core.c +!Idrivers/ata/libata-core.c </chapter> <chapter id="libataScsiInt"> <title>libata SCSI translation/emulation</title> -!Edrivers/scsi/libata-scsi.c -!Idrivers/scsi/libata-scsi.c +!Edrivers/ata/libata-scsi.c +!Idrivers/ata/libata-scsi.c </chapter> <chapter id="ataExceptions"> @@ -1600,12 +1600,12 @@ and other resources, etc. 
<chapter id="PiixInt"> <title>ata_piix Internals</title> -!Idrivers/scsi/ata_piix.c +!Idrivers/ata/ata_piix.c </chapter> <chapter id="SILInt"> <title>sata_sil Internals</title> -!Idrivers/scsi/sata_sil.c +!Idrivers/ata/sata_sil.c </chapter> <chapter id="libataThanks"> diff --git a/Documentation/DocBook/usb.tmpl b/Documentation/DocBook/usb.tmpl index 320af25de3a2..3608472d7b74 100644 --- a/Documentation/DocBook/usb.tmpl +++ b/Documentation/DocBook/usb.tmpl @@ -43,59 +43,52 @@ <para>A Universal Serial Bus (USB) is used to connect a host, such as a PC or workstation, to a number of peripheral - devices. USB uses a tree structure, with the host at the + devices. USB uses a tree structure, with the host as the root (the system's master), hubs as interior nodes, and - peripheral devices as leaves (and slaves). + peripherals as leaves (and slaves). Modern PCs support several such trees of USB devices, usually one USB 2.0 tree (480 Mbit/sec each) with a few USB 1.1 trees (12 Mbit/sec each) that are used when you connect a USB 1.1 device directly to the machine's "root hub". </para> - <para>That master/slave asymmetry was designed in part for - ease of use. It is not physically possible to assemble - (legal) USB cables incorrectly: all upstream "to-the-host" - connectors are the rectangular type, matching the sockets on - root hubs, and the downstream type are the squarish type - (or they are built in to the peripheral). - Software doesn't need to deal with distributed autoconfiguration - since the pre-designated master node manages all that. - At the electrical level, bus protocol overhead is reduced by - eliminating arbitration and moving scheduling into host software. + <para>That master/slave asymmetry was designed-in for a number of + reasons, one being ease of use. 
It is not physically possible to + assemble (legal) USB cables incorrectly: all upstream "to the host" + connectors are the rectangular type (matching the sockets on + root hubs), and all downstream connectors are the squarish type + (or they are built into the peripheral). + Also, the host software doesn't need to deal with distributed + auto-configuration since the pre-designated master node manages all that. + And finally, at the electrical level, bus protocol overhead is reduced by + eliminating arbitration and moving scheduling into the host software. </para> - <para>USB 1.0 was announced in January 1996, and was revised + <para>USB 1.0 was announced in January 1996 and was revised as USB 1.1 (with improvements in hub specification and support for interrupt-out transfers) in September 1998. - USB 2.0 was released in April 2000, including high speed - transfers and transaction translating hubs (used for USB 1.1 + USB 2.0 was released in April 2000, adding high-speed + transfers and transaction-translating hubs (used for USB 1.1 and 1.0 backward compatibility). </para> - <para>USB support was added to Linux early in the 2.2 kernel series - shortly before the 2.3 development forked off. Updates - from 2.3 were regularly folded back into 2.2 releases, bringing - new features such as <filename>/sbin/hotplug</filename> support, - more drivers, and more robustness. - The 2.5 kernel series continued such improvements, and also - worked on USB 2.0 support, - higher performance, - better consistency between host controller drivers, - API simplification (to make bugs less likely), - and providing internal "kerneldoc" documentation. + <para>Kernel developers added USB support to Linux early in the 2.2 kernel + series, shortly before 2.3 development forked. Updates from 2.3 were + regularly folded back into 2.2 releases, which improved reliability and + brought <filename>/sbin/hotplug</filename> support as well more drivers. 
+ Such improvements were continued in the 2.5 kernel series, where they added + USB 2.0 support, improved performance, and made the host controller drivers + (HCDs) more consistent. They also simplified the API (to make bugs less + likely) and added internal "kerneldoc" documentation. </para> <para>Linux can run inside USB devices as well as on the hosts that control the devices. - Because the Linux 2.x USB support evolved to support mass market - platforms such as Apple Macintosh or PC-compatible systems, - it didn't address design concerns for those types of USB systems. - So it can't be used inside mass-market PDAs, or other peripherals. - USB device drivers running inside those Linux peripherals + But USB device drivers running inside those peripherals don't do the same things as the ones running inside hosts, - and so they've been given a different name: - they're called <emphasis>gadget drivers</emphasis>. - This document does not present gadget drivers. + so they've been given a different name: + <emphasis>gadget drivers</emphasis>. + This document does not cover gadget drivers. </para> </chapter> @@ -103,17 +96,14 @@ <chapter id="host"> <title>USB Host-Side API Model</title> - <para>Within the kernel, - host-side drivers for USB devices talk to the "usbcore" APIs. - There are two types of public "usbcore" APIs, targetted at two different - layers of USB driver. Those are - <emphasis>general purpose</emphasis> drivers, exposed through - driver frameworks such as block, character, or network devices; - and drivers that are <emphasis>part of the core</emphasis>, - which are involved in managing a USB bus. - Such core drivers include the <emphasis>hub</emphasis> driver, - which manages trees of USB devices, and several different kinds - of <emphasis>host controller driver (HCD)</emphasis>, + <para>Host-side drivers for USB devices talk to the "usbcore" APIs. + There are two. 
One is intended for + <emphasis>general-purpose</emphasis> drivers (exposed through + driver frameworks), and the other is for drivers that are + <emphasis>part of the core</emphasis>. + Such core drivers include the <emphasis>hub</emphasis> driver + (which manages trees of USB devices) and several different kinds + of <emphasis>host controller drivers</emphasis>, which control individual busses. </para> @@ -122,21 +112,21 @@ <itemizedlist> - <listitem><para>USB supports four kinds of data transfer - (control, bulk, interrupt, and isochronous). Two transfer - types use bandwidth as it's available (control and bulk), - while the other two types of transfer (interrupt and isochronous) + <listitem><para>USB supports four kinds of data transfers + (control, bulk, interrupt, and isochronous). Two of them (control + and bulk) use bandwidth as it's available, + while the other two (interrupt and isochronous) are scheduled to provide guaranteed bandwidth. </para></listitem> <listitem><para>The device description model includes one or more "configurations" per device, only one of which is active at a time. - Devices that are capable of high speed operation must also support - full speed configurations, along with a way to ask about the - "other speed" configurations that might be used. + Devices that are capable of high-speed operation must also support + full-speed configurations, along with a way to ask about the + "other speed" configurations which might be used. </para></listitem> - <listitem><para>Configurations have one or more "interface", each + <listitem><para>Configurations have one or more "interfaces", each of which may have "alternate settings". Interfaces may be standardized by USB "Class" specifications, or may be specific to a vendor or device.</para> @@ -162,7 +152,7 @@ </para></listitem> <listitem><para>The Linux USB API supports synchronous calls for - control and bulk messaging. + control and bulk messages. 
It also supports asynchnous calls for all kinds of data transfer, using request structures called "URBs" (USB Request Blocks). </para></listitem> @@ -463,14 +453,25 @@ file in your Linux kernel sources. </para> - <para>Otherwise the main use for this file from programs - is to poll() it to get notifications of usb devices - as they're plugged or unplugged. - To see what changed, you'd need to read the file and - compare "before" and "after" contents, scan the filesystem, - or see its hotplug event. + <para>This file, in combination with the poll() system call, can + also be used to detect when devices are added or removed: +<programlisting>int fd; +struct pollfd pfd; + +fd = open("/proc/bus/usb/devices", O_RDONLY); +pfd = { fd, POLLIN, 0 }; +for (;;) { + /* The first time through, this call will return immediately. */ + poll(&pfd, 1, -1); + + /* To see what's changed, compare the file's previous and current + contents or scan the filesystem. (Scanning is more precise.) */ +}</programlisting> + Note that this behavior is intended to be used for informational + and debug purposes. It would be more appropriate to use programs + such as udev or HAL to initialize a device or start a user-mode + helper program, for instance. </para> - </sect1> <sect1> diff --git a/Documentation/HOWTO b/Documentation/HOWTO index 915ae8c986c6..d6f3dd1a3464 100644 --- a/Documentation/HOWTO +++ b/Documentation/HOWTO @@ -358,7 +358,8 @@ Here is a list of some of the different kernel trees available: quilt trees: - USB, PCI, Driver Core, and I2C, Greg Kroah-Hartman <gregkh@suse.de> kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/ - + - x86-64, partly i386, Andi Kleen <ak@suse.de> + ftp.firstfloor.org:/pub/ak/x86_64/quilt/ Bug Reporting ------------- @@ -374,6 +375,26 @@ of information is needed by the kernel developers to help track down the problem. 
+Managing bug reports +-------------------- + +One of the best ways to put into practice your hacking skills is by fixing +bugs reported by other people. Not only you will help to make the kernel +more stable, you'll learn to fix real world problems and you will improve +your skills, and other developers will be aware of your presence. Fixing +bugs is one of the best ways to earn merit amongst the developers, because +not many people like wasting time fixing other people's bugs. + +To work in the already reported bug reports, go to http://bugzilla.kernel.org. +If you want to be advised of the future bug reports, you can subscribe to the +bugme-new mailing list (only new bug reports are mailed here) or to the +bugme-janitor mailing list (every change in the bugzilla is mailed here) + + http://lists.osdl.org/mailman/listinfo/bugme-new + http://lists.osdl.org/mailman/listinfo/bugme-janitors + + + Mailing lists ------------- diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt index 0256805b548f..7756e09ea759 100644 --- a/Documentation/IPMI.txt +++ b/Documentation/IPMI.txt @@ -326,9 +326,12 @@ for events, they will all receive all events that come in. For receiving commands, you have to individually register commands you want to receive. Call ipmi_register_for_cmd() and supply the netfn -and command name for each command you want to receive. Only one user -may be registered for each netfn/cmd, but different users may register -for different commands. +and command name for each command you want to receive. You also +specify a bitmask of the channels you want to receive the command from +(or use IPMI_CHAN_ALL for all channels if you don't care). Only one +user may be registered for each netfn/cmd/channel, but different users +may register for different commands, or the same command if the +channel bitmasks do not overlap. From userland, equivalent IOCTLs are provided to do these functions. 
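The registration rule described in the IPMI.txt hunk above (one user per netfn/cmd/channel, with the same command allowed twice only when the channel bitmasks do not overlap) can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the driver's code: the table, the `sketch_` names, `SKETCH_CHAN_ALL` (standing in for IPMI_CHAN_ALL), and the error codes chosen here are all hypothetical.

```c
#include <errno.h>

/* Hypothetical model of a command-receiver table; not the kernel's data structure. */
#define SKETCH_MAX_RCVRS 8
#define SKETCH_CHAN_ALL  (~0u)	/* stands in for IPMI_CHAN_ALL */

struct sketch_cmd_rcvr {
	int           user;	/* hypothetical user handle */
	unsigned char netfn;
	unsigned char cmd;
	unsigned int  chans;	/* bitmask: bit n set means channel n */
};

static struct sketch_cmd_rcvr sketch_rcvrs[SKETCH_MAX_RCVRS];
static int sketch_nr_rcvrs;

/*
 * Register 'user' for netfn/cmd on the channels in 'chans'.
 * Returns 0 on success, -EINVAL for an empty mask, -EBUSY if another
 * registration for the same netfn/cmd shares at least one channel,
 * and -ENOMEM if the (fixed-size, illustrative) table is full.
 */
static int sketch_register_for_cmd(int user, unsigned char netfn,
				   unsigned char cmd, unsigned int chans)
{
	int i;

	if (!chans)
		return -EINVAL;

	for (i = 0; i < sketch_nr_rcvrs; i++) {
		const struct sketch_cmd_rcvr *r = &sketch_rcvrs[i];

		/* Same command and any shared channel bit: reject. */
		if (r->netfn == netfn && r->cmd == cmd && (r->chans & chans))
			return -EBUSY;
	}

	if (sketch_nr_rcvrs == SKETCH_MAX_RCVRS)
		return -ENOMEM;

	sketch_rcvrs[sketch_nr_rcvrs].user = user;
	sketch_rcvrs[sketch_nr_rcvrs].netfn = netfn;
	sketch_rcvrs[sketch_nr_rcvrs].cmd = cmd;
	sketch_rcvrs[sketch_nr_rcvrs].chans = chans;
	sketch_nr_rcvrs++;
	return 0;
}
```

In this model, two users can both claim netfn 0x06 / cmd 0x01 if one passes `1u << 0` and the other `1u << 1`, while a third request covering channel 1 is refused. A request with `SKETCH_CHAN_ALL` conflicts with any existing registration for the same netfn/cmd. The function name is an action, so it follows the error-code return convention from the CodingStyle hunk earlier in this diff.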
diff --git a/Documentation/SubmitChecklist b/Documentation/SubmitChecklist index a10bfb6ecd9f..7ac61f60037a 100644 --- a/Documentation/SubmitChecklist +++ b/Documentation/SubmitChecklist @@ -61,3 +61,8 @@ kernel patches. Documentation/kernel-parameters.txt. 18: All new module parameters are documented with MODULE_PARM_DESC() + +19: All new userspace interfaces are documented in Documentation/ABI/. + See Documentation/ABI/README for more information. + +20: Check that it all passes `make headers_check'. diff --git a/Documentation/SubmittingDrivers b/Documentation/SubmittingDrivers index 6bd30fdd0786..58bead05eabb 100644 --- a/Documentation/SubmittingDrivers +++ b/Documentation/SubmittingDrivers @@ -59,11 +59,11 @@ Copyright: The copyright owner must agree to use of GPL. are the same person/entity. If not, the name of the person/entity authorizing use of GPL should be listed in case it's necessary to verify the will of - the copright owner. + the copyright owner. Interfaces: If your driver uses existing interfaces and behaves like other drivers in the same class it will be much more likely - to be accepted than if it invents gratuitous new ones. + to be accepted than if it invents gratuitous new ones. If you need to implement a common API over Linux and NT drivers do it in userspace. @@ -88,7 +88,7 @@ Clarity: It helps if anyone can see how to fix the driver. It helps it will go in the bitbucket. Control: In general if there is active maintainance of a driver by - the author then patches will be redirected to them unless + the author then patches will be redirected to them unless they are totally obvious and without need of checking. If you want to be the contact and update point for the driver it is a good idea to state this in the comments, @@ -100,7 +100,7 @@ What Criteria Do Not Determine Acceptance Vendor: Being the hardware vendor and maintaining the driver is often a good thing. 
If there is a stable working driver from other people already in the tree don't expect 'we are the - vendor' to get your driver chosen. Ideally work with the + vendor' to get your driver chosen. Ideally work with the existing driver author to build a single perfect driver. Author: It doesn't matter if a large Linux company wrote the driver, @@ -116,17 +116,13 @@ Linux kernel master tree: ftp.??.kernel.org:/pub/linux/kernel/... ?? == your country code, such as "us", "uk", "fr", etc. -Linux kernel mailing list: +Linux kernel mailing list: linux-kernel@vger.kernel.org [mail majordomo@vger.kernel.org to subscribe] Linux Device Drivers, Third Edition (covers 2.6.10): http://lwn.net/Kernel/LDD3/ (free version) -Kernel traffic: - Weekly summary of kernel list activity (much easier to read) - http://www.kerneltraffic.org/kernel-traffic/ - LWN.net: Weekly summary of kernel development activity - http://lwn.net/ 2.6 API changes: @@ -145,11 +141,8 @@ KernelNewbies: Linux USB project: http://www.linux-usb.org/ -How to NOT write kernel driver by arjanv@redhat.com - http://people.redhat.com/arjanv/olspaper.pdf +How to NOT write kernel driver by Arjan van de Ven: + http://www.fenrus.org/how-to-not-write-a-device-driver-paper.pdf Kernel Janitor: http://janitor.kernelnewbies.org/ - --- -Last updated on 17 Nov 2005. diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index d42ab4c9e893..302d148c2e18 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -173,15 +173,15 @@ For small patches you may want to CC the Trivial Patch Monkey trivial@kernel.org managed by Adrian Bunk; which collects "trivial" patches. Trivial patches must qualify for one of the following rules: Spelling fixes in documentation - Spelling fixes which could break grep(1). 
+ Spelling fixes which could break grep(1) Warning fixes (cluttering with useless warnings is bad) Compilation fixes (only if they are actually correct) Runtime fixes (only if they actually fix things) - Removing use of deprecated functions/macros (eg. check_region). + Removing use of deprecated functions/macros (eg. check_region) Contact detail and documentation fixes Non-portable code replaced by portable code (even in arch-specific, since people copy, as long as it's trivial) - Any fix by the author/maintainer of the file. (ie. patch monkey + Any fix by the author/maintainer of the file (ie. patch monkey in re-transmission mode) URL: <http://www.kernel.org/pub/linux/kernel/people/bunk/trivial/> @@ -209,6 +209,19 @@ Exception: If your mailer is mangling patches then someone may ask you to re-send them using MIME. +WARNING: Some mailers like Mozilla send your messages with +---- message header ---- +Content-Type: text/plain; charset=us-ascii; format=flowed +---- message header ---- +The problem is that "format=flowed" makes some of the mailers +on receiving side to replace TABs with spaces and do similar +changes. Thus the patches from you can look corrupted. + +To fix this just make your mozilla defaults/pref/mailnews.js file to look like: +pref("mailnews.send_plaintext_flowed", false); // RFC 2646======= +pref("mailnews.display.disable_format_flowed_support", true); + + 7) E-mail size. @@ -245,13 +258,13 @@ updated change. It is quite common for Linus to "drop" your patch without comment. That's the nature of the system. If he drops your patch, it could be due to -* Your patch did not apply cleanly to the latest kernel version +* Your patch did not apply cleanly to the latest kernel version. * Your patch was not sufficiently discussed on linux-kernel. 
-* A style issue (see section 2), -* An e-mail formatting issue (re-read this section) -* A technical problem with your change -* He gets tons of e-mail, and yours got lost in the shuffle -* You are being annoying (See Figure 1) +* A style issue (see section 2). +* An e-mail formatting issue (re-read this section). +* A technical problem with your change. +* He gets tons of e-mail, and yours got lost in the shuffle. +* You are being annoying. When in doubt, solicit comments on linux-kernel mailing list. @@ -476,10 +489,10 @@ SECTION 3 - REFERENCES Andrew Morton, "The perfect patch" (tpp). <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt> -Jeff Garzik, "Linux kernel patch submission format." +Jeff Garzik, "Linux kernel patch submission format". <http://linux.yyz.us/patch-format.html> -Greg Kroah-Hartman "How to piss off a kernel subsystem maintainer". +Greg Kroah-Hartman, "How to piss off a kernel subsystem maintainer". <http://www.kroah.com/log/2005/03/31/> <http://www.kroah.com/log/2005/07/08/> <http://www.kroah.com/log/2005/10/19/> @@ -488,9 +501,9 @@ Greg Kroah-Hartman "How to piss off a kernel subsystem maintainer". NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people! 
<http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2> -Kernel Documentation/CodingStyle +Kernel Documentation/CodingStyle: <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle> -Linus Torvald's mail on the canonical patch format: +Linus Torvalds's mail on the canonical patch format: <http://lkml.org/lkml/2005/4/7/183> -- diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c index 795ca3911cc5..b11792abd6b6 100644 --- a/Documentation/accounting/getdelays.c +++ b/Documentation/accounting/getdelays.c @@ -285,7 +285,7 @@ int main(int argc, char *argv[]) if (maskset) { rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, TASKSTATS_CMD_ATTR_REGISTER_CPUMASK, - &cpumask, sizeof(cpumask)); + &cpumask, strlen(cpumask) + 1); PRINTF("Sent register cpumask, retval %d\n", rc); if (rc < 0) { printf("error sending register cpumask\n"); @@ -315,7 +315,8 @@ int main(int argc, char *argv[]) } if (msg.n.nlmsg_type == NLMSG_ERROR || !NLMSG_OK((&msg.n), rep_len)) { - printf("fatal reply error, errno %d\n", errno); + struct nlmsgerr *err = NLMSG_DATA(&msg); + printf("fatal reply error, errno %d\n", err->error); goto done; } @@ -383,7 +384,7 @@ done: if (maskset) { rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK, - &cpumask, sizeof(cpumask)); + &cpumask, strlen(cpumask) + 1); printf("Sent deregister mask, retval %d\n", rc); if (rc < 0) err(rc, "error sending deregister cpumask\n"); diff --git a/Documentation/accounting/taskstats-struct.txt b/Documentation/accounting/taskstats-struct.txt new file mode 100644 index 000000000000..661c797eaf79 --- /dev/null +++ b/Documentation/accounting/taskstats-struct.txt @@ -0,0 +1,161 @@ +The struct taskstats +-------------------- + +This document contains an explanation of the struct taskstats fields. 
+ +There are three different groups of fields in the struct taskstats: + +1) Common and basic accounting fields + If CONFIG_TASKSTATS is set, the taskstats inteface is enabled and + the common fields and basic accounting fields are collected for + delivery at do_exit() of a task. +2) Delay accounting fields + These fields are placed between + /* Delay accounting fields start */ + and + /* Delay accounting fields end */ + Their values are collected if CONFIG_TASK_DELAY_ACCT is set. +3) Extended accounting fields + These fields are placed between + /* Extended accounting fields start */ + and + /* Extended accounting fields end */ + Their values are collected if CONFIG_TASK_XACCT is set. + +Future extension should add fields to the end of the taskstats struct, and +should not change the relative position of each field within the struct. + + +struct taskstats { + +1) Common and basic accounting fields: + /* The version number of this struct. This field is always set to + * TAKSTATS_VERSION, which is defined in <linux/taskstats.h>. + * Each time the struct is changed, the value should be incremented. + */ + __u16 version; + + /* The exit code of a task. */ + __u32 ac_exitcode; /* Exit status */ + + /* The accounting flags of a task as defined in <linux/acct.h> + * Defined values are AFORK, ASU, ACOMPAT, ACORE, and AXSIG. + */ + __u8 ac_flag; /* Record flags */ + + /* The value of task_nice() of a task. */ + __u8 ac_nice; /* task_nice */ + + /* The name of the command that started this task. */ + char ac_comm[TS_COMM_LEN]; /* Command name */ + + /* The scheduling discipline as set in task->policy field. */ + __u8 ac_sched; /* Scheduling discipline */ + + __u8 ac_pad[3]; + __u32 ac_uid; /* User ID */ + __u32 ac_gid; /* Group ID */ + __u32 ac_pid; /* Process ID */ + __u32 ac_ppid; /* Parent process ID */ + + /* The time when a task begins, in [secs] since 1970. */ + __u32 ac_btime; /* Begin time [sec since 1970] */ + + /* The elapsed time of a task, in [usec]. 
*/ + __u64 ac_etime; /* Elapsed time [usec] */ + + /* The user CPU time of a task, in [usec]. */ + __u64 ac_utime; /* User CPU time [usec] */ + + /* The system CPU time of a task, in [usec]. */ + __u64 ac_stime; /* System CPU time [usec] */ + + /* The minor page fault count of a task, as set in task->min_flt. */ + __u64 ac_minflt; /* Minor Page Fault Count */ + + /* The major page fault count of a task, as set in task->maj_flt. */ + __u64 ac_majflt; /* Major Page Fault Count */ + + +2) Delay accounting fields: + /* Delay accounting fields start + * + * All values, until the comment "Delay accounting fields end" are + * available only if delay accounting is enabled, even though the last + * few fields are not delays + * + * xxx_count is the number of delay values recorded + * xxx_delay_total is the corresponding cumulative delay in nanoseconds + * + * xxx_delay_total wraps around to zero on overflow + * xxx_count incremented regardless of overflow + */ + + /* Delay waiting for cpu, while runnable + * count, delay_total NOT updated atomically + */ + __u64 cpu_count; + __u64 cpu_delay_total; + + /* Following four fields atomically updated using task->delays->lock */ + + /* Delay waiting for synchronous block I/O to complete + * does not account for delays in I/O submission + */ + __u64 blkio_count; + __u64 blkio_delay_total; + + /* Delay waiting for page fault I/O (swap in only) */ + __u64 swapin_count; + __u64 swapin_delay_total; + + /* cpu "wall-clock" running time + * On some architectures, value will adjust for cpu time stolen + * from the kernel in involuntary waits due to virtualization. + * Value is cumulative, in nanoseconds, without a corresponding count + * and wraps around to zero silently on overflow + */ + __u64 cpu_run_real_total; + + /* cpu "virtual" running time + * Uses time intervals seen by the kernel i.e. no adjustment + * for kernel's involuntary waits due to virtualization. 
+ * Value is cumulative, in nanoseconds, without a corresponding count + * and wraps around to zero silently on overflow + */ + __u64 cpu_run_virtual_total; + /* Delay accounting fields end */ + /* version 1 ends here */ + + +3) Extended accounting fields + /* Extended accounting fields start */ + + /* Accumulated RSS usage in duration of a task, in MBytes-usecs. + * The current rss usage is added to this counter every time + * a tick is charged to a task's system time. So, at the end we + * will have memory usage multiplied by system time. Thus an + * average usage per system time unit can be calculated. + */ + __u64 coremem; /* accumulated RSS usage in MB-usec */ + + /* Accumulated virtual memory usage in duration of a task. + * Same as acct_rss_mem1 above except that we keep track of VM usage. + */ + __u64 virtmem; /* accumulated VM usage in MB-usec */ + + /* High watermark of RSS usage in duration of a task, in KBytes. */ + __u64 hiwater_rss; /* High-watermark of RSS usage */ + + /* High watermark of VM usage in duration of a task, in KBytes. */ + __u64 hiwater_vm; /* High-water virtual memory usage */ + + /* The following four fields are I/O statistics of a task. */ + __u64 read_char; /* bytes read */ + __u64 write_char; /* bytes written */ + __u64 read_syscalls; /* read syscalls */ + __u64 write_syscalls; /* write syscalls */ + + /* Extended accounting fields end */ + +} diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 76b44290c154..842f0d1ab216 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -217,11 +217,11 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs) to represent the cpuset hierarchy provides for a familiar permission and name space for cpusets, with a minimum of additional kernel code. -The cpus file in the root (top_cpuset) cpuset is read-only. -It automatically tracks the value of cpu_online_map, using a CPU -hotplug notifier. 
If and when memory nodes can be hotplugged, -we expect to make the mems file in the root cpuset read-only -as well, and have it track the value of node_online_map. +The cpus and mems files in the root (top_cpuset) cpuset are +read-only. The cpus file automatically tracks the value of +cpu_online_map using a CPU hotplug notifier, and the mems file +automatically tracks the value of node_online_map using the +cpuset_track_online_nodes() hook. 1.4 What are exclusive cpusets ? diff --git a/Documentation/crypto/api-intro.txt b/Documentation/crypto/api-intro.txt index 74dffc68ff9f..5a03a2801d67 100644 --- a/Documentation/crypto/api-intro.txt +++ b/Documentation/crypto/api-intro.txt @@ -19,15 +19,14 @@ At the lowest level are algorithms, which register dynamically with the API. 'Transforms' are user-instantiated objects, which maintain state, handle all -of the implementation logic (e.g. manipulating page vectors), provide an -abstraction to the underlying algorithms, and handle common logical -operations (e.g. cipher modes, HMAC for digests). However, at the user +of the implementation logic (e.g. manipulating page vectors) and provide an +abstraction to the underlying algorithms. However, at the user level they are very simple. Conceptually, the API layering looks like this: [transform api] (user interface) - [transform ops] (per-type logic glue e.g. cipher.c, digest.c) + [transform ops] (per-type logic glue e.g. cipher.c, compress.c) [algorithm api] (for registering algorithms) The idea is to make the user interface and algorithm registration API @@ -44,22 +43,27 @@ under development. Here's an example of how to use the API: #include <linux/crypto.h> + #include <linux/err.h> + #include <linux/scatterlist.h> struct scatterlist sg[2]; char result[128]; - struct crypto_tfm *tfm; + struct crypto_hash *tfm; + struct hash_desc desc; - tfm = crypto_alloc_tfm("md5", 0); - if (tfm == NULL) + tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC); + if (IS_ERR(tfm)) fail(); /* ... 
set up the scatterlists ... */ + + desc.tfm = tfm; + desc.flags = 0; - crypto_digest_init(tfm); - crypto_digest_update(tfm, &sg, 2); - crypto_digest_final(tfm, result); + if (crypto_hash_digest(&desc, &sg, 2, result)) + fail(); - crypto_free_tfm(tfm); + crypto_free_hash(tfm); Many real examples are available in the regression test module (tcrypt.c). @@ -126,7 +130,7 @@ might already be working on. BUGS Send bug reports to: -James Morris <jmorris@redhat.com> +Herbert Xu <herbert@gondor.apana.org.au> Cc: David S. Miller <davem@redhat.com> @@ -134,13 +138,14 @@ FURTHER INFORMATION For further patches and various updates, including the current TODO list, see: -http://samba.org/~jamesm/crypto/ +http://gondor.apana.org.au/~herbert/crypto/ AUTHORS James Morris David S. Miller +Herbert Xu CREDITS @@ -238,8 +243,11 @@ Anubis algorithm contributors: Tiger algorithm contributors: Aaron Grothe +VIA PadLock contributors: + Michal Ludvig + Generic scatterwalk code by Adam J. Richter <adam@yggdrasil.com> Please send any credits updates or corrections to: -James Morris <jmorris@redhat.com> +Herbert Xu <herbert@gondor.apana.org.au> diff --git a/Documentation/devices.txt b/Documentation/devices.txt index 66c725f530f3..addc67b1d770 100644 --- a/Documentation/devices.txt +++ b/Documentation/devices.txt @@ -2543,6 +2543,9 @@ Your cooperation is appreciated. 64 = /dev/usb/rio500 Diamond Rio 500 65 = /dev/usb/usblcd USBLCD Interface (info@usblcd.de) 66 = /dev/usb/cpad0 Synaptics cPad (mouse/LCD) + 67 = /dev/usb/adutux0 1st Ontrak ADU device + ... + 76 = /dev/usb/adutux10 10th Ontrak ADU device 96 = /dev/usb/hiddev0 1st USB HID device ... 
111 = /dev/usb/hiddev15 16th USB HID device diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 24adfe9af3ca..63c2d0c55aa2 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -135,6 +135,7 @@ tags times.h* tkparse trix_boot.h +utsrelease.h* version.h* vmlinux vmlinux-* diff --git a/Documentation/fb/intelfb.txt b/Documentation/fb/intelfb.txt index c12d39a23c3d..aa0d322db171 100644 --- a/Documentation/fb/intelfb.txt +++ b/Documentation/fb/intelfb.txt @@ -1,16 +1,19 @@ -Intel 830M/845G/852GM/855GM/865G/915G Framebuffer driver +Intel 830M/845G/852GM/855GM/865G/915G/945G Framebuffer driver ================================================================ A. Introduction - This is a framebuffer driver for various Intel 810/815 compatible + This is a framebuffer driver for various Intel 8xx/9xx compatible graphics devices. These would include: Intel 830M - Intel 810E845G + Intel 845G Intel 852GM Intel 855GM Intel 865G Intel 915G + Intel 915GM + Intel 945G + Intel 945GM B. List of available options @@ -78,7 +81,7 @@ C. Kernel booting Separate each option/option-pair by commas (,) and the option from its value with an equals sign (=) as in the following: -video=i810fb:option1,option2=value2 +video=intelfb:option1,option2=value2 Sample Usage ------------ diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 552507fe9a7e..9364f47c7116 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -6,6 +6,21 @@ be removed from this file. 
--------------------------- +What: /sys/devices/.../power/state + dev->power.power_state + dpm_runtime_{suspend,resume}() +When: July 2007 +Why: Broken design for runtime control over driver power states, confusing + driver-internal runtime power management with: mechanisms to support + system-wide sleep state transitions; event codes that distinguish + different phases of swsusp "sleep" transitions; and userspace policy + inputs. This framework was never widely used, and most attempts to + use it were broken. Drivers should instead be exposing domain-specific + interfaces either to kernel or to userspace. +Who: Pavel Machek <pavel@suse.cz> + +--------------------------- + What: RAW driver (CONFIG_RAW_DRIVER) When: December 2005 Why: declared obsolete since kernel 2.6.3 @@ -31,17 +46,8 @@ Who: Jody McIntyre <scjody@modernduck.com> --------------------------- -What: sbp2: module parameter "force_inquiry_hack" -When: July 2006 -Why: Superceded by parameter "workarounds". Both parameters are meant to be - used ad-hoc and for single devices only, i.e. not in modprobe.conf, - therefore the impact of this feature replacement should be low. -Who: Stefan Richter <stefanr@s5r6.in-berlin.de> - ---------------------------- - What: Video4Linux API 1 ioctls and video_decoder.h from Video devices. -When: July 2006 +When: December 2006 Why: V4L1 AP1 was replaced by V4L2 API. during migration from 2.4 to 2.6 series. The old API have lots of drawbacks and don't provide enough means to work with all video and audio standards. The newer API is @@ -55,6 +61,18 @@ Who: Mauro Carvalho Chehab <mchehab@brturbo.com.br> --------------------------- +What: sys_sysctl +When: January 2007 +Why: The same information is available through /proc/sys and that is the + interface user space prefers to use. And there do not appear to be + any existing users in user space of sys_sysctl.
The additional + maintenance overhead of keeping a set of binary names gets + in the way of doing a good job of maintaining this interface. + +Who: Eric Biederman <ebiederm@xmission.com> + +--------------------------- + What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) When: November 2005 Files: drivers/pcmcia/: pcmcia_ioctl.c @@ -202,14 +220,6 @@ Who: Nick Piggin <npiggin@suse.de> --------------------------- -What: Support for the MIPS EV96100 evaluation board -When: September 2006 -Why: Does no longer build since at least November 15, 2003, apparently - no userbase left. -Who: Ralf Baechle <ralf@linux-mips.org> - ---------------------------- - What: Support for the Momentum / PMC-Sierra Jaguar ATX evaluation board When: September 2006 Why: Does no longer build since quite some time, and was never popular, @@ -294,3 +304,24 @@ Why: The frame diverter is included in most distribution kernels, but is It is not clear if anyone is still using it. Who: Stephen Hemminger <shemminger@osdl.org> +--------------------------- + + +What: PHYSDEVPATH, PHYSDEVBUS, PHYSDEVDRIVER in the uevent environment +When: October 2008 +Why: The stacking of class devices makes these values misleading and + inconsistent. + Class devices should not carry any of these properties, and bus + devices have SUBSYSTEM and DRIVER as a replacement. +Who: Kay Sievers <kay.sievers@suse.de> + +--------------------------- + +What: i2c-isa +When: December 2006 +Why: i2c-isa is nonsense and doesn't fit in the device driver + model. Drivers relying on it are better implemented as platform + drivers. +Who: Jean Delvare <khali@linux-fr.org> + +--------------------------- diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 247d7f619aa2..eb1a6cad21e6 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -356,10 +356,9 @@ The last two are called only from check_disk_change().
prototypes: loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, - loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 99902ae6804e..7240ee7515de 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -39,6 +39,8 @@ Table of Contents 2.9 Appletalk 2.10 IPX 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem + 2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score + 2.13 /proc/<pid>/oom_score - Display current oom-killer score ------------------------------------------------------------------------------ Preface @@ -1124,11 +1126,15 @@ debugging information is displayed on console. NMI switch that most IA32 servers have fires unknown NMI up, for example. If a system hangs up, try pressing the NMI switch. -[NOTE] - This function and oprofile share a NMI callback. Therefore this function - cannot be enabled when oprofile is activated. - And NMI watchdog will be disabled when the value in this file is set to - non-zero. +nmi_watchdog +------------ + +Enables/Disables the NMI watchdog on x86 systems. When the value is non-zero +the NMI watchdog is enabled and will continuously test all online cpus to +determine whether or not they are still functioning properly. 
+ +Because the NMI watchdog shares registers with oprofile, by disabling the NMI +watchdog, oprofile may have more registers to utilize. 2.4 /proc/sys/vm - The virtual memory subsystem @@ -1958,6 +1964,22 @@ a queue must be less or equal then msg_max. maximum message size value (it is every message queue's attribute set during its creation). +2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score +------------------------------------------------------ + +This file can be used to adjust the score used to select which processes +should be killed in an out-of-memory situation. Giving it a high score will +increase the likelihood of this process being killed by the oom-killer. Valid +values are in the range -16 to +15, plus the special value -17, which disables +oom-killing altogether for this process. + +2.13 /proc/<pid>/oom_score - Display current oom-killer score +------------------------------------------------------------- + +------------------------------------------------------------------------------ +This file can be used to check the current score used by the oom-killer for +any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which +process should be killed in an out-of-memory situation. ------------------------------------------------------------------------------ Summary diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 1cb7e8be927a..cd07c21b8400 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -699,9 +699,9 @@ This describes how the VFS can manipulate an open file.
As of kernel struct file_operations { loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); diff --git a/Documentation/hwmon/it87 b/Documentation/hwmon/it87 index 9555be1ed999..e783fd62e308 100644 --- a/Documentation/hwmon/it87 +++ b/Documentation/hwmon/it87 @@ -13,12 +13,25 @@ Supported chips: from Super I/O config space (8 I/O ports) Datasheet: Publicly available at the ITE website http://www.ite.com.tw/ + * IT8716F + Prefix: 'it8716' + Addresses scanned: from Super I/O config space (8 I/O ports) + Datasheet: Publicly available at the ITE website + http://www.ite.com.tw/product_info/file/pc/IT8716F_V0.3.ZIP + * IT8718F + Prefix: 'it8718' + Addresses scanned: from Super I/O config space (8 I/O ports) + Datasheet: Publicly available at the ITE website + http://www.ite.com.tw/product_info/file/pc/IT8718F_V0.2.zip + http://www.ite.com.tw/product_info/file/pc/IT8718F_V0%203_(for%20C%20version).zip * SiS950 [clone of IT8705F] Prefix: 'it87' Addresses scanned: from Super I/O config space (8 I/O ports) Datasheet: No longer be available -Author: Christophe Gauthron <chrisg@0-in.com> +Authors: + Christophe Gauthron <chrisg@0-in.com> + Jean Delvare <khali@linux-fr.org> Module Parameters @@ -43,26 +56,46 @@ Module Parameters Description ----------- -This driver implements support for the IT8705F, IT8712F and SiS950 chips. 
- -This driver also supports IT8712F, which adds SMBus access, and a VID -input, used to report the Vcore voltage of the Pentium processor. -The IT8712F additionally features VID inputs. +This driver implements support for the IT8705F, IT8712F, IT8716F, +IT8718F and SiS950 chips. These chips are 'Super I/O chips', supporting floppy disks, infrared ports, joysticks and other miscellaneous stuff. For hardware monitoring, they include an 'environment controller' with 3 temperature sensors, 3 fan rotation speed sensors, 8 voltage sensors, and associated alarms. +The IT8712F and IT8716F additionally feature VID inputs, used to report +the Vcore voltage of the processor. The early IT8712F have 5 VID pins, +the IT8716F and late IT8712F have 6. They are shared with other functions +though, so the functionality may not be available on a given system. +The driver dumbly assumes it is there. + +The IT8718F also features VID inputs (up to 8 pins) but the value is +stored in the Super-I/O configuration space. Due to technical limitations, +this value can currently only be read once at initialization time, so +the driver won't notice and report changes in the VID value. The two +upper VID bits share their pins with voltage inputs (in5 and in6) so you +can't have both on a given board. + +The IT8716F, IT8718F and later IT8712F revisions have support for +2 additional fans. They are not yet supported by the driver. + +The IT8716F and IT8718F, and late IT8712F and IT8705F also have optional +16-bit tachometer counters for fans 1 to 3. This is better (no more fan +clock divider mess) but not compatible with the older chips and +revisions. For now, the driver only uses the 16-bit mode on the +IT8716F and IT8718F. + Temperatures are measured in degrees Celsius. An alarm is triggered once when the Overtemperature Shutdown limit is crossed. Fan rotation speeds are reported in RPM (rotations per minute). An alarm is -triggered if the rotation speed has dropped below a programmable limit.
Fan -readings can be divided by a programmable divider (1, 2, 4 or 8) to give the -readings more range or accuracy. Not all RPM values can accurately be -represented, so some rounding is done. With a divider of 2, the lowest -representable value is around 2600 RPM. +triggered if the rotation speed has dropped below a programmable limit. When +16-bit tachometer counters aren't used, fan readings can be divided by +a programmable divider (1, 2, 4 or 8) to give the readings more range or +accuracy. With a divider of 2, the lowest representable value is around +2600 RPM. Not all RPM values can accurately be represented, so some rounding +is done. Voltage sensors (also known as IN sensors) report their values in volts. An alarm is triggered if the voltage has crossed a programmable minimum or @@ -71,9 +104,9 @@ zero'; this is important for negative voltage measurements. All voltage inputs can measure voltages between 0 and 4.08 volts, with a resolution of 0.016 volt. The battery voltage in8 does not have limit registers. -The VID lines (IT8712F only) encode the core voltage value: the voltage -level your processor should work with. This is hardcoded by the mainboard -and/or processor itself. It is a value in volts. +The VID lines (IT8712F/IT8716F/IT8718F) encode the core voltage value: +the voltage level your processor should work with. This is hardcoded by +the mainboard and/or processor itself. It is a value in volts. If an alarm triggers, it will remain triggered until the hardware register is read at least once. 
This means that the cause for the alarm may already diff --git a/Documentation/hwmon/k8temp b/Documentation/hwmon/k8temp new file mode 100644 index 000000000000..bab445ab0f52 --- /dev/null +++ b/Documentation/hwmon/k8temp @@ -0,0 +1,52 @@ +Kernel driver k8temp +==================== + +Supported chips: + * AMD K8 CPU + Prefix: 'k8temp' + Addresses scanned: PCI space + Datasheet: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/32559.pdf + +Author: Rudolf Marek +Contact: Rudolf Marek <r.marek@sh.cvut.cz> + +Description +----------- + +This driver permits reading temperature sensor(s) embedded inside AMD K8 CPUs. +Official documentation says that it works from revision F of K8 core, but +in fact it seems to be implemented for all revisions of K8 except the first +two revisions (SH-B0 and SH-B3). + +There can be up to four temperature sensors inside a single CPU. The driver +will auto-detect the sensors and will display only temperatures from +implemented sensors. + +Mapping of /sys files is as follows: + +temp1_input - temperature of Core 0 and "place" 0 +temp2_input - temperature of Core 0 and "place" 1 +temp3_input - temperature of Core 1 and "place" 0 +temp4_input - temperature of Core 1 and "place" 1 + +Temperatures are measured in degrees Celsius and measurement resolution is +1 degree C. It is expected that future CPUs will have better resolution. The +temperature is updated once a second. Valid temperatures are from -49 to +206 degrees C. + +Temperature known as TCaseMax was specified for processors up to revision E. +This temperature is defined as temperature between heat-spreader and CPU +case, so the internal CPU temperature supplied by this driver can be higher. +There is no easy way to measure a temperature that correlates +with the TCaseMax temperature. + +For newer revisions of CPU (rev F, socket AM2) there is a mathematically +computed temperature called TControl, which must be lower than TControlMax.
+ +The relationship is as follows: + +temp1_input - TjOffset*2 < TControlMax, + +TjOffset is not yet exported by the driver; TControlMax is usually +70 degrees C. As a rule of thumb, the CPU temperature should not go much +above 60 degrees C. diff --git a/Documentation/hwmon/vt1211 b/Documentation/hwmon/vt1211 new file mode 100644 index 000000000000..77fa633b97a8 --- /dev/null +++ b/Documentation/hwmon/vt1211 @@ -0,0 +1,206 @@ +Kernel driver vt1211 +==================== + +Supported chips: + * VIA VT1211 + Prefix: 'vt1211' + Addresses scanned: none, address read from Super-I/O config space + Datasheet: Provided by VIA upon request and under NDA + +Authors: Juerg Haefliger <juergh@gmail.com> + +This driver is based on the driver for kernel 2.4 by Mark D. Studebaker and +its port to kernel 2.6 by Lars Ekman. + +Thanks to Joseph Chan and Fiona Gatt from VIA for providing documentation and +technical support. + + +Module Parameters +----------------- + +* uch_config: int Override the BIOS default universal channel (UCH) + configuration for channels 1-5. + Legal values are in the range of 0-31. Bit 0 maps to + UCH1, bit 1 maps to UCH2 and so on. Setting a bit to 1 + enables the thermal input of that particular UCH and + setting a bit to 0 enables the voltage input. + +* int_mode: int Override the BIOS default temperature interrupt mode. + The only possible value is 0 which forces interrupt + mode 0. In this mode, any pending interrupt is cleared + when the status register is read but is regenerated as + long as the temperature stays above the hysteresis + limit. + +Be aware that overriding BIOS defaults might cause some unwanted side effects! + + +Description +----------- + +The VIA VT1211 Super-I/O chip includes complete hardware monitoring +capabilities. It monitors 2 dedicated temperature sensor inputs (temp1 and +temp2), 1 dedicated voltage (in5) and 2 fans.
Additionally, the chip +implements 5 universal input channels (UCH1-5) that can be individually +programmed to either monitor a voltage or a temperature. + +This chip also provides manual and automatic control of fan speeds (according +to the datasheet). The driver only supports automatic control since the manual +mode doesn't seem to work as advertised in the datasheet. In fact I couldn't +get manual mode to work at all! Be aware that automatic mode hasn't been +tested very well (due to the fact that my EPIA M10000 doesn't have the fans +connected to the PWM outputs of the VT1211 :-(). + +The following table shows the relationship between the vt1211 inputs and the +sysfs nodes. + +Sensor Voltage Mode Temp Mode Default Use (from the datasheet) +------ ------------ --------- -------------------------------- +Reading 1 temp1 Intel thermal diode +Reading 3 temp2 Internal thermal diode +UCH1/Reading2 in0 temp3 NTC type thermistor +UCH2 in1 temp4 +2.5V +UCH3 in2 temp5 VccP (processor core) +UCH4 in3 temp6 +5V +UCH5 in4 temp7 +12V ++3.3V in5 Internal VCC (+3.3V) + + +Voltage Monitoring +------------------ + +Voltages are sampled by an 8-bit ADC with a LSB of ~10mV. The supported input +range is thus from 0 to 2.60V. Voltage values outside of this range need +external scaling resistors. This external scaling needs to be compensated for +via compute lines in sensors.conf, like: + +compute inx @*(1+R1/R2), @/(1+R1/R2) + +The board level scaling resistors according to VIA's recommendation are as +follows. And this is of course totally dependent on the actual board +implementation :-) You will have to find documentation for your own +motherboard and edit sensors.conf accordingly. 
+ + Expected +Voltage R1 R2 Divider Raw Value +----------------------------------------------- ++2.5V 2K 10K 1.2 2083 mV +VccP --- --- 1.0 1400 mV (1) ++5V 14K 10K 2.4 2083 mV ++12V 47K 10K 5.7 2105 mV ++3.3V (int) 2K 3.4K 1.588 3300 mV (2) ++3.3V (ext) 6.8K 10K 1.68 1964 mV + +(1) Depending on the CPU (1.4V is for a VIA C3 Nehemiah). +(2) R1 and R2 for 3.3V (int) are internal to the VT1211 chip and the driver + performs the scaling and returns the properly scaled voltage value. + +Each measured voltage has an associated low and high limit which triggers an +alarm when crossed. + + +Temperature Monitoring +---------------------- + +Temperatures are reported in millidegree Celsius. Each measured temperature +has a high limit which triggers an alarm if crossed. There is an associated +hysteresis value with each temperature below which the temperature has to drop +before the alarm is cleared (this is only true for interrupt mode 0). The +interrupt mode can be forced to 0 in case the BIOS doesn't do it +automatically. See the 'Module Parameters' section for details. + +All temperature channels except temp2 are external. Temp2 is the VT1211 +internal thermal diode and the driver does all the scaling for temp2 and +returns the temperature in millidegree Celsius. For the external channels +temp1 and temp3-temp7, scaling depends on the board implementation and needs +to be performed in userspace via sensors.conf. + +Temp1 is an Intel-type thermal diode which requires the following formula to +convert between sysfs readings and real temperatures: + +compute temp1 (@-Offset)/Gain, (@*Gain)+Offset + +According to the VIA VT1211 BIOS porting guide, the following gain and offset +values should be used: + +Diode Type Offset Gain +---------- ------ ---- +Intel CPU 88.638 0.9528 + 65.000 0.9686 *) +VIA C3 Ezra 83.869 0.9528 +VIA C3 Ezra-T 73.869 0.9528 + +*) This is the formula from the lm_sensors 2.10.0 sensors.conf file. 
I don't +know where it comes from or how it was derived, it's just listed here for +completeness. + +Temp3-temp7 support NTC thermistors. For these channels, the driver returns +the voltages as seen at the individual pins of UCH1-UCH5. The voltage at the +pin (Vpin) is formed by a voltage divider made of the thermistor (Rth) and a +scaling resistor (Rs): + +Vpin = 2200 * Rth / (Rs + Rth) (2200 is the ADC max limit of 2200 mV) + +The equation for the thermistor is as follows (google it if you want to know +more about it): + +Rth = Ro * exp(B * (1 / T - 1 / To)) (To is 298.15K (25C) and Ro is the + nominal resistance at 25C) + +Mingling the above two equations and assuming Rs = Ro and B = 3435 yields the +following formula for sensors.conf: + +compute tempx 1 / (1 / 298.15 - (` (2200 / @ - 1)) / 3435) - 273.15, + 2200 / (1 + (^ (3435 / 298.15 - 3435 / (273.15 + @)))) + + +Fan Speed Control +----------------- + +The VT1211 provides 2 programmable PWM outputs to control the speeds of 2 +fans. Writing a 2 to any of the two pwm[1-2]_enable sysfs nodes will put the +PWM controller in automatic mode. There is only a single controller that +controls both PWM outputs but each PWM output can be individually enabled and +disabled. + +Each PWM has 4 associated distinct output duty-cycles: full, high, low and +off. Full and off are internally hard-wired to 255 (100%) and 0 (0%), +respectively. High and low can be programmed via +pwm[1-2]_auto_point[2-3]_pwm. Each PWM output can be associated with a +different thermal input but - and here's the weird part - only one set of +thermal thresholds exist that controls both PWMs output duty-cycles. The +thermal thresholds are accessible via pwm[1-2]_auto_point[1-4]_temp. Note +that even though there are 2 sets of 4 auto points each, they map to the same +registers in the VT1211 and programming one set is sufficient (actually only +the first set pwm1_auto_point[1-4]_temp is writable, the second set is +read-only). 
+ +PWM Auto Point PWM Output Duty-Cycle +------------------------------------------------ +pwm[1-2]_auto_point4_pwm full speed duty-cycle (hard-wired to 255) +pwm[1-2]_auto_point3_pwm high speed duty-cycle +pwm[1-2]_auto_point2_pwm low speed duty-cycle +pwm[1-2]_auto_point1_pwm off duty-cycle (hard-wired to 0) + +Temp Auto Point Thermal Threshold +--------------------------------------------- +pwm[1-2]_auto_point4_temp full speed temp +pwm[1-2]_auto_point3_temp high speed temp +pwm[1-2]_auto_point2_temp low speed temp +pwm[1-2]_auto_point1_temp off temp + +Long story short, the controller implements the following algorithm to set the +PWM output duty-cycle based on the input temperature: + +Thermal Threshold Output Duty-Cycle + (Rising Temp) (Falling Temp) +---------------------------------------------------------- + full speed duty-cycle full speed duty-cycle +full speed temp + high speed duty-cycle full speed duty-cycle +high speed temp + low speed duty-cycle high speed duty-cycle +low speed temp + off duty-cycle low speed duty-cycle +off temp diff --git a/Documentation/hwmon/w83627ehf b/Documentation/hwmon/w83627ehf new file mode 100644 index 000000000000..fae3b781d82d --- /dev/null +++ b/Documentation/hwmon/w83627ehf @@ -0,0 +1,85 @@ +Kernel driver w83627ehf +======================= + +Supported chips: + * Winbond W83627EHF/EHG (ISA access ONLY) + Prefix: 'w83627ehf' + Addresses scanned: ISA address retrieved from Super I/O registers + Datasheet: http://www.winbond-usa.com/products/winbond_products/pdfs/PCIC/W83627EHF_%20W83627EHGb.pdf + +Authors: + Jean Delvare <khali@linux-fr.org> + Yuan Mu (Winbond) + Rudolf Marek <r.marek@sh.cvut.cz> + +Description +----------- + +This driver implements support for the Winbond W83627EHF and W83627EHG +super I/O chips. We will refer to them collectively as Winbond chips. 
+ +The chips implement three temperature sensors, five fan rotation +speed sensors, ten analog voltage sensors, alarms with beep warnings (control +unimplemented), and some automatic fan regulation strategies (plus manual +fan control mode). + +Temperatures are measured in degrees Celsius and measurement resolution is 1 +degC for temp1 and 0.5 degC for temp2 and temp3. An alarm is triggered when +the temperature gets higher than high limit; it stays on until the temperature +falls below the Hysteresis value. + +Fan rotation speeds are reported in RPM (rotations per minute). An alarm is +triggered if the rotation speed has dropped below a programmable limit. Fan +readings can be divided by a programmable divider (1, 2, 4, 8, 16, 32, 64 or +128) to give the readings more range or accuracy. The driver sets the most +suitable fan divisor itself. Some fans might not be present because they +share pins with other functions. + +Voltage sensors (also known as IN sensors) report their values in millivolts. +An alarm is triggered if the voltage has crossed a programmable minimum +or maximum limit. + +The driver supports automatic fan control mode known as Thermal Cruise. +In this mode, the chip attempts to keep the measured temperature in a +predefined temperature range. If the temperature goes out of range, fan +is driven slower/faster to reach the predefined range again. + +The mode works for fan1-fan4. 
Mapping of temperatures to pwm outputs is as +follows: + +temp1 -> pwm1 +temp2 -> pwm2 +temp3 -> pwm3 +prog -> pwm4 (the programmable setting is not supported by the driver) + +/sys files +---------- + +pwm[1-4] - this file stores PWM duty cycle or DC value (fan speed) in range: + 0 (stop) to 255 (full) + +pwm[1-4]_enable - this file controls mode of fan/temperature control: + * 1 Manual Mode, write to pwm file any value 0-255 (full speed) + * 2 Thermal Cruise + +Thermal Cruise mode +------------------- + +If the temperature is in the range defined by: + +pwm[1-4]_target - set target temperature, unit millidegree Celsius + (range 0 - 127000) +pwm[1-4]_tolerance - tolerance, unit millidegree Celsius (range 0 - 15000) + +there are no changes to fan speed. Once the temperature leaves the interval, +fan speed increases (temp is higher) or decreases if lower than desired. +There are defined steps and times, but not exported by the driver yet. + +pwm[1-4]_min_output - minimum fan speed (range 1 - 255), when the temperature + is below defined range. +pwm[1-4]_stop_time - how many milliseconds [ms] must elapse to switch + corresponding fan off. (when the temperature was below + defined range). + +Note: last two functions are influenced by other control bits, not yet exported + by the driver, so a change might not have any effect.
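The Thermal Cruise behaviour described above for the w83627ehf can be
modelled roughly as follows. This is only an illustrative sketch, not the
chip's or the driver's implementation: the regulation happens in hardware,
the function name and the step size are hypothetical (the text says the real
steps and times are not exported), and the interval is assumed to be
symmetric around the target (target +/- tolerance). The stop_time delay is
not modelled.

```python
def thermal_cruise_step(temp_mc, target_mc, tolerance_mc, pwm,
                        min_output=1, step=1):
    """Return the next PWM duty cycle (0-255) after one regulation step.

    temp_mc, target_mc and tolerance_mc are in millidegrees Celsius,
    matching the units of pwm[1-4]_target and pwm[1-4]_tolerance.
    'step' is a hypothetical increment; the real chip uses its own
    step sizes and timings.
    """
    low = target_mc - tolerance_mc    # assumed lower bound of the interval
    high = target_mc + tolerance_mc   # assumed upper bound of the interval

    if temp_mc > high:
        # Too warm: speed the fan up, saturating at full speed (255).
        return min(255, pwm + step)
    if temp_mc < low:
        # Too cool: slow the fan down, but not below pwm[1-4]_min_output.
        return max(min_output, pwm - step)
    # Inside the interval: leave the fan speed unchanged.
    return pwm
```

With a target of 45000 and a tolerance of 2000, a reading of 46000 leaves
the duty cycle alone, while 50000 raises it and 40000 lowers it.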
diff --git a/Documentation/hwmon/w83791d b/Documentation/hwmon/w83791d index 83a3836289c2..19b2ed739fa1 100644 --- a/Documentation/hwmon/w83791d +++ b/Documentation/hwmon/w83791d @@ -5,7 +5,7 @@ Supported chips: * Winbond W83791D Prefix: 'w83791d' Addresses scanned: I2C 0x2c - 0x2f - Datasheet: http://www.winbond-usa.com/products/winbond_products/pdfs/PCIC/W83791Da.pdf + Datasheet: http://www.winbond-usa.com/products/winbond_products/pdfs/PCIC/W83791D_W83791Gb.pdf Author: Charles Spirakis <bezaur@gmail.com> @@ -20,6 +20,9 @@ Credits: Chunhao Huang <DZShen@Winbond.com.tw>, Rudolf Marek <r.marek@sh.cvut.cz> +Additional contributors: + Sven Anders <anders@anduras.de> + Module Parameters ----------------- @@ -46,7 +49,8 @@ Module Parameters Description ----------- -This driver implements support for the Winbond W83791D chip. +This driver implements support for the Winbond W83791D chip. The W83791G +chip appears to be the same as the W83791D but is lead free. Detection of the chip can sometimes be foiled because it can be in an internal state that allows no clean access (Bank with ID register is not @@ -71,34 +75,36 @@ Voltage sensors (also known as IN sensors) report their values in millivolts. An alarm is triggered if the voltage has crossed a programmable minimum or maximum limit. -Alarms are provided as output from a "realtime status register". The -following bits are defined: - -bit - alarm on: -0 - Vcore -1 - VINR0 -2 - +3.3VIN -3 - 5VDD -4 - temp1 -5 - temp2 -6 - fan1 -7 - fan2 -8 - +12VIN -9 - -12VIN -10 - -5VIN -11 - fan3 -12 - chassis -13 - temp3 -14 - VINR1 -15 - reserved -16 - tart1 -17 - tart2 -18 - tart3 -19 - VSB -20 - VBAT -21 - fan4 -22 - fan5 -23 - reserved +The bit ordering for the alarm "realtime status register" and the +"beep enable registers" are different. 
+ +in0 (VCORE) : alarms: 0x000001 beep_enable: 0x000001 +in1 (VINR0) : alarms: 0x000002 beep_enable: 0x002000 <== mismatch +in2 (+3.3VIN): alarms: 0x000004 beep_enable: 0x000004 +in3 (5VDD) : alarms: 0x000008 beep_enable: 0x000008 +in4 (+12VIN) : alarms: 0x000100 beep_enable: 0x000100 +in5 (-12VIN) : alarms: 0x000200 beep_enable: 0x000200 +in6 (-5VIN) : alarms: 0x000400 beep_enable: 0x000400 +in7 (VSB) : alarms: 0x080000 beep_enable: 0x010000 <== mismatch +in8 (VBAT) : alarms: 0x100000 beep_enable: 0x020000 <== mismatch +in9 (VINR1) : alarms: 0x004000 beep_enable: 0x004000 +temp1 : alarms: 0x000010 beep_enable: 0x000010 +temp2 : alarms: 0x000020 beep_enable: 0x000020 +temp3 : alarms: 0x002000 beep_enable: 0x000002 <== mismatch +fan1 : alarms: 0x000040 beep_enable: 0x000040 +fan2 : alarms: 0x000080 beep_enable: 0x000080 +fan3 : alarms: 0x000800 beep_enable: 0x000800 +fan4 : alarms: 0x200000 beep_enable: 0x200000 +fan5 : alarms: 0x400000 beep_enable: 0x400000 +tart1 : alarms: 0x010000 beep_enable: 0x040000 <== mismatch +tart2 : alarms: 0x020000 beep_enable: 0x080000 <== mismatch +tart3 : alarms: 0x040000 beep_enable: 0x100000 <== mismatch +case_open : alarms: 0x001000 beep_enable: 0x001000 +user_enable : alarms: -------- beep_enable: 0x800000 + +*** NOTE: It is the responsibility of user-space code to handle the fact +that the beep enable and alarm bits are in different positions when using that +feature of the chip. When an alarm goes off, you can be warned by a beeping signal through your computer speaker. It is possible to enable all beeping globally, or only @@ -109,5 +115,6 @@ often will do no harm, but will return 'old' values. 
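Since the w83791d alarm bits and beep enable bits do not line up, user-space
code has to translate between them. A minimal sketch of such a translation
is shown here, using the masks from the table above. The names W83791D_BITS
and alarms_to_beep_mask are hypothetical and chosen for illustration only;
they are not part of the driver or of any library.

```python
# (channel, alarm mask, beep_enable mask), copied from the table above.
W83791D_BITS = [
    ("in0",       0x000001, 0x000001),
    ("in1",       0x000002, 0x002000),  # mismatch
    ("in2",       0x000004, 0x000004),
    ("in3",       0x000008, 0x000008),
    ("in4",       0x000100, 0x000100),
    ("in5",       0x000200, 0x000200),
    ("in6",       0x000400, 0x000400),
    ("in7",       0x080000, 0x010000),  # mismatch
    ("in8",       0x100000, 0x020000),  # mismatch
    ("in9",       0x004000, 0x004000),
    ("temp1",     0x000010, 0x000010),
    ("temp2",     0x000020, 0x000020),
    ("temp3",     0x002000, 0x000002),  # mismatch
    ("fan1",      0x000040, 0x000040),
    ("fan2",      0x000080, 0x000080),
    ("fan3",      0x000800, 0x000800),
    ("fan4",      0x200000, 0x200000),
    ("fan5",      0x400000, 0x400000),
    ("tart1",     0x010000, 0x040000),  # mismatch
    ("tart2",     0x020000, 0x080000),  # mismatch
    ("tart3",     0x040000, 0x100000),  # mismatch
    ("case_open", 0x001000, 0x001000),
]


def alarms_to_beep_mask(alarms):
    """Return the beep_enable bits for every channel flagged in 'alarms'."""
    beep = 0
    for _name, alarm_bit, beep_bit in W83791D_BITS:
        if alarms & alarm_bit:
            beep |= beep_bit
    return beep
```

The user_enable bit (0x800000) has no alarm counterpart and is therefore
left out of the table; it would have to be set separately.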
W83791D TODO: --------------- -Provide a patch for per-file alarms as discussed on the mailing list +Provide a patch for per-file alarms and beep enables as defined in the hwmon + documentation (Documentation/hwmon/sysfs-interface) Provide a patch for smart-fan control (still need appropriate motherboard/fans) diff --git a/Documentation/i2c/busses/i2c-viapro b/Documentation/i2c/busses/i2c-viapro index 16775663b9f5..25680346e0ac 100644 --- a/Documentation/i2c/busses/i2c-viapro +++ b/Documentation/i2c/busses/i2c-viapro @@ -7,9 +7,12 @@ Supported adapters: * VIA Technologies, Inc. VT82C686A/B Datasheet: Sometimes available at the VIA website - * VIA Technologies, Inc. VT8231, VT8233, VT8233A, VT8235, VT8237R + * VIA Technologies, Inc. VT8231, VT8233, VT8233A Datasheet: available on request from VIA + * VIA Technologies, Inc. VT8235, VT8237R, VT8237A, VT8251 + Datasheet: available on request and under NDA from VIA + Authors: Kyösti Mälkki <kmalkki@cc.hut.fi>, Mark D. Studebaker <mdsxyz123@yahoo.com>, @@ -39,6 +42,8 @@ Your lspci -n listing must show one of these : device 1106:8235 (VT8231 function 4) device 1106:3177 (VT8235) device 1106:3227 (VT8237R) + device 1106:3337 (VT8237A) + device 1106:3287 (VT8251) If none of these show up, you should look in the BIOS for settings like enable ACPI / SMBus or even USB. diff --git a/Documentation/i2c/i2c-stub b/Documentation/i2c/i2c-stub index d6dcb138abf5..9cc081e69764 100644 --- a/Documentation/i2c/i2c-stub +++ b/Documentation/i2c/i2c-stub @@ -6,9 +6,12 @@ This module is a very simple fake I2C/SMBus driver. It implements four types of SMBus commands: write quick, (r/w) byte, (r/w) byte data, and (r/w) word data. +You need to provide a chip address as a module parameter when loading +this driver, which will then only react to SMBus commands to this address. + No hardware is needed nor associated with this module. 
It will accept write -quick commands to all addresses; it will respond to the other commands (also -to all addresses) by reading from or writing to an array in memory. It will +quick commands to one address; it will respond to the other commands (also +to one address) by reading from or writing to an array in memory. It will also spam the kernel logs for every command it handles. A pointer register with auto-increment is implemented for all byte @@ -21,6 +24,11 @@ The typical use-case is like this: 3. load the target sensors chip driver module 4. observe its behavior in the kernel log +PARAMETERS: + +int chip_addr: + The SMBus address to emulate a chip at. + CAVEATS: There are independent arrays for byte/data and word/data commands. Depending @@ -33,6 +41,9 @@ If the hardware for your driver has banked registers (e.g. Winbond sensors chips) this module will not work well - although it could be extended to support that pretty easily. +Only one chip address is supported - although this module could be +extended to support more. + If you spam it hard enough, printk can be lossy. This module really wants something like relayfs. diff --git a/Documentation/kbuild/kconfig-language.txt b/Documentation/kbuild/kconfig-language.txt index ca1967f36423..003fccc14d24 100644 --- a/Documentation/kbuild/kconfig-language.txt +++ b/Documentation/kbuild/kconfig-language.txt @@ -67,19 +67,19 @@ applicable everywhere (see syntax). - default value: "default" <expr> ["if" <expr>] A config option can have any number of default values. If multiple default values are visible, only the first defined one is active. - Default values are not limited to the menu entry, where they are - defined, this means the default can be defined somewhere else or be + Default values are not limited to the menu entry where they are + defined. This means the default can be defined somewhere else or be overridden by an earlier definition. 
The default value is only assigned to the config symbol if no other value was set by the user (via the input prompt above). If an input prompt is visible the default value is presented to the user and can be overridden by him. - Optionally dependencies only for this default value can be added with + Optionally, dependencies only for this default value can be added with "if". - dependencies: "depends on"/"requires" <expr> This defines a dependency for this menu entry. If multiple - dependencies are defined they are connected with '&&'. Dependencies + dependencies are defined, they are connected with '&&'. Dependencies are applied to all other options within this menu entry (which also accept an "if" expression), so these two examples are equivalent: @@ -153,7 +153,7 @@ Nonconstant symbols are the most common ones and are defined with the 'config' statement. Nonconstant symbols consist entirely of alphanumeric characters or underscores. Constant symbols are only part of expressions. Constant symbols are -always surrounded by single or double quotes. Within the quote any +always surrounded by single or double quotes. Within the quote, any other character is allowed and the quotes can be escaped using '\'. Menu structure @@ -237,7 +237,7 @@ choices: <choice block> "endchoice" -This defines a choice group and accepts any of above attributes as +This defines a choice group and accepts any of the above attributes as options. A choice can only be of type bool or tristate, while a boolean choice only allows a single config entry to be selected, a tristate choice also allows any number of config entries to be set to 'm'. This diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index 0706699c9da9..e2cbd59cf2d0 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -22,7 +22,7 @@ This document describes the Linux kernel Makefiles. 
=== 4 Host Program support --- 4.1 Simple Host Program --- 4.2 Composite Host Programs - --- 4.3 Defining shared libraries + --- 4.3 Defining shared libraries --- 4.4 Using C++ for host programs --- 4.5 Controlling compiler options for host programs --- 4.6 When host programs are actually built @@ -69,7 +69,7 @@ architecture-specific information to the top Makefile. Each subdirectory has a kbuild Makefile which carries out the commands passed down from above. The kbuild Makefile uses information from the -.config file to construct various file lists used by kbuild to build +.config file to construct various file lists used by kbuild to build any built-in or modular targets. scripts/Makefile.* contains all the definitions/rules etc. that @@ -86,7 +86,7 @@ any kernel Makefiles (or any other source files). *Normal developers* are people who work on features such as device drivers, file systems, and network protocols. These people need to -maintain the kbuild Makefiles for the subsystem that they are +maintain the kbuild Makefiles for the subsystem they are working on. In order to do this effectively, they need some overall knowledge about the kernel Makefiles, plus detailed knowledge about the public interface for kbuild. @@ -104,10 +104,10 @@ This document is aimed towards normal developers and arch developers. === 3 The kbuild files Most Makefiles within the kernel are kbuild Makefiles that use the -kbuild infrastructure. This chapter introduce the syntax used in the +kbuild infrastructure. This chapter introduces the syntax used in the kbuild makefiles. The preferred name for the kbuild files are 'Makefile' but 'Kbuild' can -be used and if both a 'Makefile' and a 'Kbuild' file exists then the 'Kbuild' +be used and if both a 'Makefile' and a 'Kbuild' file exists, then the 'Kbuild' file will be used. Section 3.1 "Goal definitions" is a quick intro, further chapters provide @@ -124,7 +124,7 @@ more details, with real examples. 
Example: obj-y += foo.o - This tell kbuild that there is one object in that directory named + This tell kbuild that there is one object in that directory, named foo.o. foo.o will be built from foo.c or foo.S. If foo.o shall be built as a module, the variable obj-m is used. @@ -140,7 +140,7 @@ more details, with real examples. --- 3.2 Built-in object goals - obj-y The kbuild Makefile specifies object files for vmlinux - in the lists $(obj-y). These lists depend on the kernel + in the $(obj-y) lists. These lists depend on the kernel configuration. Kbuild compiles all the $(obj-y) files. It then calls @@ -154,8 +154,8 @@ more details, with real examples. Link order is significant, because certain functions (module_init() / __initcall) will be called during boot in the order they appear. So keep in mind that changing the link - order may e.g. change the order in which your SCSI - controllers are detected, and thus you disks are renumbered. + order may e.g. change the order in which your SCSI + controllers are detected, and thus your disks are renumbered. Example: #drivers/isdn/i4l/Makefile @@ -203,11 +203,11 @@ more details, with real examples. Example: #fs/ext2/Makefile obj-$(CONFIG_EXT2_FS) += ext2.o - ext2-y := balloc.o bitmap.o + ext2-y := balloc.o bitmap.o ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o - - In this example xattr.o is only part of the composite object - ext2.o, if $(CONFIG_EXT2_FS_XATTR) evaluates to 'y'. + + In this example, xattr.o is only part of the composite object + ext2.o if $(CONFIG_EXT2_FS_XATTR) evaluates to 'y'. Note: Of course, when you are building objects into the kernel, the syntax above will also work. So, if you have CONFIG_EXT2_FS=y, @@ -221,16 +221,16 @@ more details, with real examples. --- 3.5 Library file goals - lib-y - Objects listed with obj-* are used for modules or + Objects listed with obj-* are used for modules, or combined in a built-in.o for that specific directory. 
There is also the possibility to list objects that will be included in a library, lib.a. All objects listed with lib-y are combined in a single library for that directory. - Objects that are listed in obj-y and additional listed in + Objects that are listed in obj-y and additionaly listed in lib-y will not be included in the library, since they will anyway be accessible. - For consistency objects listed in lib-m will be included in lib.a. + For consistency, objects listed in lib-m will be included in lib.a. Note that the same kbuild makefile may list files to be built-in and to be part of a library. Therefore the same directory @@ -241,11 +241,11 @@ more details, with real examples. lib-y := checksum.o delay.o This will create a library lib.a based on checksum.o and delay.o. - For kbuild to actually recognize that there is a lib.a being build + For kbuild to actually recognize that there is a lib.a being built, the directory shall be listed in libs-y. See also "6.3 List directories to visit when descending". - - Usage of lib-y is normally restricted to lib/ and arch/*/lib. + + Use of lib-y is normally restricted to lib/ and arch/*/lib. --- 3.6 Descending down in directories @@ -255,7 +255,7 @@ more details, with real examples. invoke make recursively in subdirectories, provided you let it know of them. - To do so obj-y and obj-m are used. + To do so, obj-y and obj-m are used. ext2 lives in a separate directory, and the Makefile present in fs/ tells kbuild to descend down using the following assignment. @@ -353,8 +353,8 @@ more details, with real examples. Special rules are used when the kbuild infrastructure does not provide the required support. A typical example is header files generated during the build process. - Another example is the architecture specific Makefiles which - needs special rules to prepare boot images etc. + Another example are the architecture specific Makefiles which + need special rules to prepare boot images etc. 
Special rules are written as normal Make rules. Kbuild is not executing in the directory where the Makefile is @@ -387,28 +387,28 @@ more details, with real examples. --- 3.11 $(CC) support functions - The kernel may be build with several different versions of + The kernel may be built with several different versions of $(CC), each supporting a unique set of features and options. kbuild provide basic support to check for valid options for $(CC). $(CC) is useally the gcc compiler, but other alternatives are available. as-option - as-option is used to check if $(CC) when used to compile - assembler (*.S) files supports the given option. An optional - second option may be specified if first option are not supported. + as-option is used to check if $(CC) -- when used to compile + assembler (*.S) files -- supports the given option. An optional + second option may be specified if the first option is not supported. Example: #arch/sh/Makefile cflags-y += $(call as-option,-Wa$(comma)-isa=$(isa-y),) - In the above example cflags-y will be assinged the the option + In the above example, cflags-y will be assigned the option -Wa$(comma)-isa=$(isa-y) if it is supported by $(CC). The second argument is optional, and if supplied will be used if first argument is not supported. ld-option - ld-option is used to check if $(CC) when used to link object files + ld-option is used to check if $(CC) when used to link object files supports the given option. An optional second option may be specified if first option are not supported. @@ -421,8 +421,13 @@ more details, with real examples. The second argument is optional, and if supplied will be used if first argument is not supported. 
+ as-instr + as-instr checks if the assembler reports a specific instruction + and then outputs either option1 or option2 + C escapes are supported in the test instruction + cc-option - cc-option is used to check if $(CC) support a given option, and not + cc-option is used to check if $(CC) supports a given option, and not supported to use an optional second option. Example: @@ -430,12 +435,12 @@ more details, with real examples. cflags-y += $(call cc-option,-march=pentium-mmx,-march=i586) In the above example cflags-y will be assigned the option - -march=pentium-mmx if supported by $(CC), otherwise -march-i586. - The second argument to cc-option is optional, and if omitted + -march=pentium-mmx if supported by $(CC), otherwise -march=i586. + The second argument to cc-option is optional, and if omitted, cflags-y will be assigned no value if first option is not supported. cc-option-yn - cc-option-yn is used to check if gcc supports a given option + cc-option-yn is used to check if gcc supports a given option and return 'y' if supported, otherwise 'n'. Example: @@ -443,32 +448,33 @@ more details, with real examples. biarch := $(call cc-option-yn, -m32) aflags-$(biarch) += -a32 cflags-$(biarch) += -m32 - - In the above example $(biarch) is set to y if $(CC) supports the -m32 - option. When $(biarch) equals to y the expanded variables $(aflags-y) - and $(cflags-y) will be assigned the values -a32 and -m32. + + In the above example, $(biarch) is set to y if $(CC) supports the -m32 + option. When $(biarch) equals 'y', the expanded variables $(aflags-y) + and $(cflags-y) will be assigned the values -a32 and -m32, + respectively. cc-option-align - gcc version >= 3.0 shifted type of options used to speify - alignment of functions, loops etc. $(cc-option-align) whrn used - as prefix to the align options will select the right prefix: + gcc versions >= 3.0 changed the type of options used to specify + alignment of functions, loops etc. 
$(cc-option-align), when used + as prefix to the align options, will select the right prefix: gcc < 3.00 cc-option-align = -malign gcc >= 3.00 cc-option-align = -falign - + Example: CFLAGS += $(cc-option-align)-functions=4 - In the above example the option -falign-functions=4 is used for - gcc >= 3.00. For gcc < 3.00 -malign-functions=4 is used. - + In the above example, the option -falign-functions=4 is used for + gcc >= 3.00. For gcc < 3.00, -malign-functions=4 is used. + cc-version - cc-version return a numerical version of the $(CC) compiler version. + cc-version returns a numerical version of the $(CC) compiler version. The format is <major><minor> where both are two digits. So for example gcc 3.41 would return 0341. cc-version is useful when a specific $(CC) version is faulty in one - area, for example the -mregparm=3 were broken in some gcc version + area, for example -mregparm=3 was broken in some gcc versions even though the option was accepted by gcc. Example: @@ -477,20 +483,20 @@ more details, with real examples. if [ $(call cc-version) -ge 0300 ] ; then \ echo "-mregparm=3"; fi ;) - In the above example -mregparm=3 is only used for gcc version greater + In the above example, -mregparm=3 is only used for gcc version greater than or equal to gcc 3.0. cc-ifversion - cc-ifversion test the version of $(CC) and equals last argument if + cc-ifversion tests the version of $(CC) and equals last argument if version expression is true. Example: #fs/reiserfs/Makefile EXTRA_CFLAGS := $(call cc-ifversion, -lt, 0402, -O1) - In this example EXTRA_CFLAGS will be assigned the value -O1 if the + In this example, EXTRA_CFLAGS will be assigned the value -O1 if the $(CC) version is less than 4.2. - cc-ifversion takes all the shell operators: + cc-ifversion takes all the shell operators: -eq, -ne, -lt, -le, -gt, and -ge The third parameter may be a text as in this example, but it may also be an expanded variable or a macro. 
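The semantics of cc-version and cc-ifversion described above can be
illustrated with a small model. This is only a sketch written in Python for
clarity; kbuild implements these as make macros, and the Python function
names below simply mirror the macro names and are otherwise hypothetical.
It assumes the version is compared as a decimal integer, as the shell test
operators would do.

```python
import operator

# The shell comparison operators accepted by cc-ifversion.
_OPS = {
    "-eq": operator.eq, "-ne": operator.ne,
    "-lt": operator.lt, "-le": operator.le,
    "-gt": operator.gt, "-ge": operator.ge,
}


def cc_version(major, minor):
    """Format a compiler version as <major><minor>, two digits each."""
    return f"{major:02d}{minor:02d}"


def cc_ifversion(current, op, reference, text):
    """Return 'text' if 'current <op> reference' holds, else an empty string.

    'current' stands in for the value cc-version would report for $(CC).
    """
    return text if _OPS[op](int(current), int(reference)) else ""
```

For instance, cc_ifversion(cc_version(4, 1), "-lt", "0402", "-O1") models
the reiserfs example for a gcc 4.1 compiler and yields "-O1", while the same
call for gcc 4.2 yields an empty string.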
@@ -506,7 +512,7 @@ The first step is to tell kbuild that a host program exists. This is done utilising the variable hostprogs-y. The second step is to add an explicit dependency to the executable. -This can be done in two ways. Either add the dependency in a rule, +This can be done in two ways. Either add the dependency in a rule, or utilise the variable $(always). Both possibilities are described in the following. @@ -523,28 +529,28 @@ Both possibilities are described in the following. Kbuild assumes in the above example that bin2hex is made from a single c-source file named bin2hex.c located in the same directory as the Makefile. - + --- 4.2 Composite Host Programs Host programs can be made up based on composite objects. The syntax used to define composite objects for host programs is similar to the syntax used for kernel objects. - $(<executeable>-objs) list all objects used to link the final + $(<executeable>-objs) lists all objects used to link the final executable. Example: #scripts/lxdialog/Makefile - hostprogs-y := lxdialog + hostprogs-y := lxdialog lxdialog-objs := checklist.o lxdialog.o Objects with extension .o are compiled from the corresponding .c - files. In the above example checklist.c is compiled to checklist.o + files. In the above example, checklist.c is compiled to checklist.o and lxdialog.c is compiled to lxdialog.o. - Finally the two .o files are linked to the executable, lxdialog. + Finally, the two .o files are linked to the executable, lxdialog. Note: The syntax <executable>-y is not permitted for host-programs. ---- 4.3 Defining shared libraries - +--- 4.3 Defining shared libraries + Objects with extension .so are considered shared libraries, and will be compiled as position independent objects. Kbuild provides support for shared libraries, but the usage @@ -557,7 +563,7 @@ Both possibilities are described in the following. 
hostprogs-y := conf conf-objs := conf.o libkconfig.so libkconfig-objs := expr.o type.o - + Shared libraries always require a corresponding -objs line, and in the example above the shared library libkconfig is composed by the two objects expr.o and type.o. @@ -578,7 +584,7 @@ Both possibilities are described in the following. In the example above the executable is composed of the C++ file qconf.cc - identified by $(qconf-cxxobjs). - + If qconf is composed by a mixture of .c and .cc files, then an additional line can be used to identify this. @@ -587,34 +593,35 @@ Both possibilities are described in the following. hostprogs-y := qconf qconf-cxxobjs := qconf.o qconf-objs := check.o - + --- 4.5 Controlling compiler options for host programs When compiling host programs, it is possible to set specific flags. The programs will always be compiled utilising $(HOSTCC) passed the options specified in $(HOSTCFLAGS). To set flags that will take effect for all host programs created - in that Makefile use the variable HOST_EXTRACFLAGS. + in that Makefile, use the variable HOST_EXTRACFLAGS. Example: #scripts/lxdialog/Makefile HOST_EXTRACFLAGS += -I/usr/include/ncurses - + To set specific flags for a single file the following construction is used: Example: #arch/ppc64/boot/Makefile HOSTCFLAGS_piggyback.o := -DKERNELBASE=$(KERNELBASE) - + It is also possible to specify additional options to the linker. - + Example: #scripts/kconfig/Makefile HOSTLOADLIBES_qconf := -L$(QTDIR)/lib - When linking qconf it will be passed the extra option "-L$(QTDIR)/lib". - + When linking qconf, it will be passed the extra option + "-L$(QTDIR)/lib". + --- 4.6 When host programs are actually built Kbuild will only build host-programs when they are referenced @@ -629,7 +636,7 @@ Both possibilities are described in the following. 
$(obj)/devlist.h: $(src)/pci.ids $(obj)/gen-devlist ( cd $(obj); ./gen-devlist ) < $< - The target $(obj)/devlist.h will not be built before + The target $(obj)/devlist.h will not be built before $(obj)/gen-devlist is updated. Note that references to the host programs in special rules must be prefixed with $(obj). @@ -648,7 +655,7 @@ Both possibilities are described in the following. --- 4.7 Using hostprogs-$(CONFIG_FOO) - A typcal pattern in a Kbuild file lok like this: + A typical pattern in a Kbuild file looks like this: Example: #scripts/Makefile @@ -656,13 +663,13 @@ Both possibilities are described in the following. Kbuild knows about both 'y' for built-in and 'm' for module. So if a config symbol evaluate to 'm', kbuild will still build - the binary. In other words Kbuild handle hostprogs-m exactly - like hostprogs-y. But only hostprogs-y is recommend used - when no CONFIG symbol are involved. + the binary. In other words, Kbuild handles hostprogs-m exactly + like hostprogs-y. But only hostprogs-y is recommended to be used + when no CONFIG symbols are involved. === 5 Kbuild clean infrastructure -"make clean" deletes most generated files in the src tree where the kernel +"make clean" deletes most generated files in the obj tree where the kernel is compiled. This includes generated files such as host programs. Kbuild knows targets listed in $(hostprogs-y), $(hostprogs-m), $(always), $(extra-y) and $(targets). They are all deleted during "make clean". @@ -680,7 +687,8 @@ When executing "make clean", the two files "devlist.h classlist.h" will be deleted. Kbuild will assume files to be in same relative directory as the Makefile except if an absolute path is specified (path starting with '/'). -To delete a directory hirachy use: +To delete a directory hierarchy use: + Example: #scripts/package/Makefile clean-dirs := $(objtree)/debian/ @@ -723,29 +731,29 @@ be visited during "make clean". 
The top level Makefile sets up the environment and does the preparation, before starting to descend down in the individual directories. -The top level makefile contains the generic part, whereas the -arch/$(ARCH)/Makefile contains what is required to set-up kbuild -to the said architecture. -To do so arch/$(ARCH)/Makefile sets a number of variables, and defines +The top level makefile contains the generic part, whereas +arch/$(ARCH)/Makefile contains what is required to set up kbuild +for said architecture. +To do so, arch/$(ARCH)/Makefile sets up a number of variables and defines a few targets. -When kbuild executes the following steps are followed (roughly): -1) Configuration of the kernel => produced .config +When kbuild executes, the following steps are followed (roughly): +1) Configuration of the kernel => produce .config 2) Store kernel version in include/linux/version.h 3) Symlink include/asm to include/asm-$(ARCH) 4) Updating all other prerequisites to the target prepare: - Additional prerequisites are specified in arch/$(ARCH)/Makefile 5) Recursively descend down in all directories listed in init-* core* drivers-* net-* libs-* and build all targets. - - The value of the above variables are extended in arch/$(ARCH)/Makefile. -6) All object files are then linked and the resulting file vmlinux is - located at the root of the src tree. + - The values of the above variables are expanded in arch/$(ARCH)/Makefile. +6) All object files are then linked and the resulting file vmlinux is + located at the root of the obj tree. The very first objects linked are listed in head-y, assigned by arch/$(ARCH)/Makefile. -7) Finally the architecture specific part does any required post processing +7) Finally, the architecture specific part does any required post processing and builds the final bootimage. 
- This includes building boot records - - Preparing initrd images and the like + - Preparing initrd images and thelike --- 6.1 Set variables to tweak the build to the architecture @@ -760,7 +768,7 @@ When kbuild executes the following steps are followed (roughly): LDFLAGS := -m elf_s390 Note: EXTRA_LDFLAGS and LDFLAGS_$@ can be used to further customise the flags used. See chapter 7. - + LDFLAGS_MODULE Options for $(LD) when linking modules LDFLAGS_MODULE is used to set specific flags for $(LD) when @@ -770,7 +778,7 @@ When kbuild executes the following steps are followed (roughly): LDFLAGS_vmlinux Options for $(LD) when linking vmlinux LDFLAGS_vmlinux is used to specify additional flags to pass to - the linker when linking the final vmlinux. + the linker when linking the final vmlinux image. LDFLAGS_vmlinux uses the LDFLAGS_$@ support. Example: @@ -780,7 +788,7 @@ When kbuild executes the following steps are followed (roughly): OBJCOPYFLAGS objcopy flags When $(call if_changed,objcopy) is used to translate a .o file, - then the flags specified in OBJCOPYFLAGS will be used. + the flags specified in OBJCOPYFLAGS will be used. $(call if_changed,objcopy) is often used to generate raw binaries on vmlinux. @@ -792,7 +800,7 @@ When kbuild executes the following steps are followed (roughly): $(obj)/image: vmlinux FORCE $(call if_changed,objcopy) - In this example the binary $(obj)/image is a binary version of + In this example, the binary $(obj)/image is a binary version of vmlinux. The usage of $(call if_changed,xxx) will be described later. AFLAGS $(AS) assembler flags @@ -809,7 +817,7 @@ When kbuild executes the following steps are followed (roughly): Default value - see top level Makefile Append or modify as required per architecture. - Often the CFLAGS variable depends on the configuration. + Often, the CFLAGS variable depends on the configuration. Example: #arch/i386/Makefile @@ -830,7 +838,7 @@ When kbuild executes the following steps are followed (roughly): ... 
- The first examples utilises the trick that a config option expands + The first example utilises the trick that a config option expands to 'y' when selected. CFLAGS_KERNEL $(CC) options specific for built-in @@ -843,18 +851,18 @@ When kbuild executes the following steps are followed (roughly): $(CFLAGS_MODULE) contains extra C compiler flags used to compile code for loadable kernel modules. - + --- 6.2 Add prerequisites to archprepare: - The archprepare: rule is used to list prerequisites that needs to be + The archprepare: rule is used to list prerequisites that need to be built before starting to descend down in the subdirectories. - This is usual header files containing assembler constants. + This is usually used for header files containing assembler constants. Example: #arch/arm/Makefile archprepare: maketools - In this example the file target maketools will be processed + In this example, the file target maketools will be processed before descending down in the subdirectories. See also chapter XXX-TODO that describe how kbuild supports generating offset header files. @@ -867,18 +875,19 @@ When kbuild executes the following steps are followed (roughly): corresponding arch-specific section for modules; the module-building machinery is all architecture-independent. - + head-y, init-y, core-y, libs-y, drivers-y, net-y - $(head-y) list objects to be linked first in vmlinux. - $(libs-y) list directories where a lib.a archive can be located. - The rest list directories where a built-in.o object file can be located. + $(head-y) lists objects to be linked first in vmlinux. + $(libs-y) lists directories where a lib.a archive can be located. + The rest lists directories where a built-in.o object file can be + located. $(init-y) objects will be located after $(head-y). Then the rest follows in this order: $(core-y), $(libs-y), $(drivers-y) and $(net-y). 
- The top level Makefile define values for all generic directories, + The top level Makefile defines values for all generic directories, and arch/$(ARCH)/Makefile only adds architecture specific directories. Example: @@ -915,27 +924,27 @@ When kbuild executes the following steps are followed (roughly): "$(Q)$(MAKE) $(build)=<dir>" is the recommended way to invoke make in a subdirectory. - There are no rules for naming of the architecture specific targets, + There are no rules for naming architecture specific targets, but executing "make help" will list all relevant targets. - To support this $(archhelp) must be defined. + To support this, $(archhelp) must be defined. Example: #arch/i386/Makefile define archhelp echo '* bzImage - Image (arch/$(ARCH)/boot/bzImage)' - endef + endif When make is executed without arguments, the first goal encountered will be built. In the top level Makefile the first goal present is all:. - An architecture shall always per default build a bootable image. - In "make help" the default goal is highlighted with a '*'. + An architecture shall always, per default, build a bootable image. + In "make help", the default goal is highlighted with a '*'. Add a new prerequisite to all: to select a default goal different from vmlinux. Example: #arch/i386/Makefile - all: bzImage + all: bzImage When "make" is executed without arguments, bzImage will be built. @@ -955,10 +964,10 @@ When kbuild executes the following steps are followed (roughly): #arch/i386/kernel/Makefile extra-y := head.o init_task.o - In this example extra-y is used to list object files that + In this example, extra-y is used to list object files that shall be built, but shall not be linked as part of built-in.o. 
- + --- 6.6 Commands useful for building a boot image Kbuild provides a few macros that are useful when building a @@ -972,8 +981,8 @@ When kbuild executes the following steps are followed (roughly): target: source(s) FORCE $(call if_changed,ld/objcopy/gzip) - When the rule is evaluated it is checked to see if any files - needs an update, or the commandline has changed since last + When the rule is evaluated, it is checked to see if any files + needs an update, or the command line has changed since the last invocation. The latter will force a rebuild if any options to the executable have changed. Any target that utilises if_changed must be listed in $(targets), @@ -991,8 +1000,8 @@ When kbuild executes the following steps are followed (roughly): #WRONG!# $(call if_changed, ld/objcopy/gzip) ld - Link target. Often LDFLAGS_$@ is used to set specific options to ld. - + Link target. Often, LDFLAGS_$@ is used to set specific options to ld. + objcopy Copy binary. Uses OBJCOPYFLAGS usually specified in arch/$(ARCH)/Makefile. @@ -1010,10 +1019,10 @@ When kbuild executes the following steps are followed (roughly): $(obj)/setup $(obj)/bootsect: %: %.o FORCE $(call if_changed,ld) - In this example there are two possible targets, requiring different - options to the linker. the linker options are specified using the + In this example, there are two possible targets, requiring different + options to the linker. The linker options are specified using the LDFLAGS_$@ syntax - one for each potential target. 
- $(targets) are assinged all potential targets, herby kbuild knows + $(targets) are assinged all potential targets, by which kbuild knows the targets and will: 1) check for commandline changes 2) delete target during make clean @@ -1027,7 +1036,7 @@ When kbuild executes the following steps are followed (roughly): --- 6.7 Custom kbuild commands - When kbuild is executing with KBUILD_VERBOSE=0 then only a shorthand + When kbuild is executing with KBUILD_VERBOSE=0, then only a shorthand of a command is normally displayed. To enable this behaviour for custom commands kbuild requires two variables to be set: @@ -1045,34 +1054,34 @@ When kbuild executes the following steps are followed (roughly): $(call if_changed,image) @echo 'Kernel: $@ is ready' - When updating the $(obj)/bzImage target the line: + When updating the $(obj)/bzImage target, the line BUILD arch/i386/boot/bzImage will be displayed with "make KBUILD_VERBOSE=0". - + --- 6.8 Preprocessing linker scripts - When the vmlinux image is build the linker script: + When the vmlinux image is built, the linker script arch/$(ARCH)/kernel/vmlinux.lds is used. The script is a preprocessed variant of the file vmlinux.lds.S located in the same directory. - kbuild knows .lds file and includes a rule *lds.S -> *lds. - + kbuild knows .lds files and includes a rule *lds.S -> *lds. + Example: #arch/i386/kernel/Makefile always := vmlinux.lds - + #Makefile export CPPFLAGS_vmlinux.lds += -P -C -U$(ARCH) - - The assigment to $(always) is used to tell kbuild to build the - target: vmlinux.lds. - The assignment to $(CPPFLAGS_vmlinux.lds) tell kbuild to use the + + The assignment to $(always) is used to tell kbuild to build the + target vmlinux.lds. + The assignment to $(CPPFLAGS_vmlinux.lds) tells kbuild to use the specified options when building the target vmlinux.lds. 
- - When building the *.lds target kbuild used the variakles: + + When building the *.lds target, kbuild uses the variables: CPPFLAGS : Set in top-level Makefile EXTRA_CPPFLAGS : May be set in the kbuild makefile CPPFLAGS_$(@F) : Target specific flags. @@ -1147,7 +1156,7 @@ The top Makefile exports the following variables: === 8 Makefile language -The kernel Makefiles are designed to run with GNU Make. The Makefiles +The kernel Makefiles are designed to be run with GNU Make. The Makefiles use only the documented features of GNU Make, but they do use many GNU extensions. @@ -1169,10 +1178,13 @@ is the right choice. Original version made by Michael Elizabeth Chastain, <mailto:mec@shout.net> Updates by Kai Germaschewski <kai@tp1.ruhr-uni-bochum.de> Updates by Sam Ravnborg <sam@ravnborg.org> +Language QA by Jan Engelhardt <jengelh@gmx.de> === 10 TODO -- Describe how kbuild support shipped files with _shipped. +- Describe how kbuild supports shipped files with _shipped. - Generating offset header files. - Add more variables to section 7? + + diff --git a/Documentation/kbuild/modules.txt b/Documentation/kbuild/modules.txt index 61fc079eb966..2e7702e94a78 100644 --- a/Documentation/kbuild/modules.txt +++ b/Documentation/kbuild/modules.txt @@ -1,7 +1,7 @@ In this document you will find information about: - how to build external modules -- how to make your module use kbuild infrastructure +- how to make your module use the kbuild infrastructure - how kbuild will install a kernel - how to install modules in a non-standard location @@ -24,7 +24,7 @@ In this document you will find information about: --- 6.1 INSTALL_MOD_PATH --- 6.2 INSTALL_MOD_DIR === 7. Module versioning & Module.symvers - --- 7.1 Symbols fron the kernel (vmlinux + modules) + --- 7.1 Symbols from the kernel (vmlinux + modules) --- 7.2 Symbols and external modules --- 7.3 Symbols from another external module === 8. 
Tips & Tricks @@ -36,13 +36,13 @@ In this document you will find information about: kbuild includes functionality for building modules both within the kernel source tree and outside the kernel source tree. -The latter is usually referred to as external modules and is used -both during development and for modules that are not planned to be -included in the kernel tree. +The latter is usually referred to as external or "out-of-tree" +modules and is used both during development and for modules that +are not planned to be included in the kernel tree. What is covered within this file is mainly information to authors -of modules. The author of an external modules should supply -a makefile that hides most of the complexity so one only has to type +of modules. The author of an external module should supply +a makefile that hides most of the complexity, so one only has to type 'make' to build the module. A complete example will be present in chapter 4, "Creating a kbuild file for an external module". @@ -63,14 +63,15 @@ when building an external module. For the running kernel use: make -C /lib/modules/`uname -r`/build M=`pwd` - For the above command to succeed the kernel must have been built with - modules enabled. + For the above command to succeed, the kernel must have been + built with modules enabled. To install the modules that were just built: make -C <path-to-kernel> M=`pwd` modules_install - More complex examples later, the above should get you going. + More complex examples will be shown later, the above should + be enough to get you started. --- 2.2 Available targets @@ -89,13 +90,13 @@ when building an external module. Same functionality as if no target was specified. See description above. - make -C $KDIR M=$PWD modules_install + make -C $KDIR M=`pwd` modules_install Install the external module(s). Installation default is in /lib/modules/<kernel-version>/extra, but may be prefixed with INSTALL_MOD_PATH - see separate chapter. 
- make -C $KDIR M=$PWD clean + make -C $KDIR M=`pwd` clean Remove all generated files for the module - the kernel source directory is not modified. @@ -129,29 +130,28 @@ when building an external module. To make sure the kernel contains the information required to build external modules the target 'modules_prepare' must be used. - 'module_prepare' solely exists as a simple way to prepare - a kernel for building external modules. + 'module_prepare' exists solely as a simple way to prepare + a kernel source tree for building external modules. Note: modules_prepare will not build Module.symvers even if - CONFIG_MODULEVERSIONING is set. - Therefore a full kernel build needs to be executed to make - module versioning work. + CONFIG_MODULEVERSIONING is set. Therefore a full kernel build + needs to be executed to make module versioning work. --- 2.5 Building separate files for a module - It is possible to build single files which is part of a module. - This works equal for the kernel, a module and even for external - modules. + It is possible to build single files which are part of a module. + This works equally well for the kernel, a module and even for + external modules. Examples (module foo.ko, consist of bar.o, baz.o): make -C $KDIR M=`pwd` bar.lst make -C $KDIR M=`pwd` bar.o make -C $KDIR M=`pwd` foo.ko make -C $KDIR M=`pwd` / - + === 3. Example commands This example shows the actual commands to be executed when building an external module for the currently running kernel. -In the example below the distribution is supposed to use the +In the example below, the distribution is supposed to use the facility to locate output files for a kernel compile in a different directory than the kernel source - but the examples will also work when the source and the output files are mixed in the same directory. 
@@ -170,14 +170,14 @@ the following commands to build the module: O=/lib/modules/`uname-r`/build \ M=`pwd` -Then to install the module use the following command: +Then, to install the module use the following command: make -C /usr/src/`uname -r`/source \ O=/lib/modules/`uname-r`/build \ M=`pwd` \ modules_install -If one looks closely you will see that this is the same commands as +If you look closely you will see that this is the same command as listed before - with the directories spelled out. The above are rather long commands, and the following chapter @@ -230,7 +230,7 @@ following files: endif - In example 1 the check for KERNELRELEASE is used to separate + In example 1, the check for KERNELRELEASE is used to separate the two parts of the Makefile. kbuild will only see the two assignments whereas make will see everything except the two kbuild assignments. @@ -255,7 +255,7 @@ following files: echo "X" > 8123_bin_shipped - In example 2 we are down to two fairly simple files and for simple + In example 2, we are down to two fairly simple files and for simple files as used in this example the split is questionable. But some external modules use Makefiles of several hundred lines and here it really pays off to separate the kbuild part from the rest. @@ -282,9 +282,9 @@ following files: endif - The trick here is to include the Kbuild file from Makefile so - if an older version of kbuild picks up the Makefile the Kbuild - file will be included. + The trick here is to include the Kbuild file from Makefile, so + if an older version of kbuild picks up the Makefile, the Kbuild + file will be included. --- 4.2 Binary blobs included in a module @@ -301,18 +301,19 @@ following files: obj-m := 8123.o 8123-y := 8123_if.o 8123_pci.o 8123_bin.o - In example 4 there is no distinction between the ordinary .c/.h files + In example 4, there is no distinction between the ordinary .c/.h files and the binary file. But kbuild will pick up different rules to create the .o file. === 5. 
Include files -Include files are a necessity when a .c file uses something from another .c -files (not strictly in the sense of .c but if good programming practice is -used). Any module that consist of more than one .c file will have a .h file -for one of the .c files. -- If the .h file only describes a module internal interface then the .h file +Include files are a necessity when a .c file uses something from other .c +files (not strictly in the sense of C, but if good programming practice is +used). Any module that consists of more than one .c file will have a .h file +for one of the .c files. + +- If the .h file only describes a module internal interface, then the .h file shall be placed in the same directory as the .c files. - If the .h files describe an interface used by other parts of the kernel located in different directories, the .h files shall be located in @@ -323,11 +324,11 @@ under include/ such as include/scsi. Another exception is arch-specific .h files which are located under include/asm-$(ARCH)/*. External modules have a tendency to locate include files in a separate include/ -directory and therefore needs to deal with this in their kbuild file. +directory and therefore need to deal with this in their kbuild file. --- 5.1 How to include files from the kernel include dir - When a module needs to include a file from include/linux/ then one + When a module needs to include a file from include/linux/, then one just uses: #include <linux/modules.h> @@ -348,7 +349,7 @@ directory and therefore needs to deal with this in their kbuild file. The trick here is to use either EXTRA_CFLAGS (take effect for all .c files) or CFLAGS_$F.o (take effect only for a single file). 
- In our example if we move 8123_if.h to a subdirectory named include/ + In our example, if we move 8123_if.h to a subdirectory named include/ the resulting Kbuild file would look like: --> filename: Kbuild @@ -362,19 +363,19 @@ directory and therefore needs to deal with this in their kbuild file. --- 5.3 External modules using several directories - If an external module does not follow the usual kernel style but - decide to spread files over several directories then kbuild can - support this too. + If an external module does not follow the usual kernel style, but + decides to spread files over several directories, then kbuild can + handle this too. Consider the following example: - + | +- src/complex_main.c | +- hal/hardwareif.c | +- hal/include/hardwareif.h +- include/complex.h - - To build a single module named complex.ko we then need the following + + To build a single module named complex.ko, we then need the following kbuild file: Kbuild: @@ -387,12 +388,12 @@ directory and therefore needs to deal with this in their kbuild file. kbuild knows how to handle .o files located in another directory - - although this is NOT reccommended practice. The syntax is to specify + although this is NOT recommended practice. The syntax is to specify the directory relative to the directory where the Kbuild file is located. - To find the .h files we have to explicitly tell kbuild where to look - for the .h files. When kbuild executes current directory is always + To find the .h files, we have to explicitly tell kbuild where to look + for the .h files. When kbuild executes, the current directory is always the root of the kernel tree (argument to -C) and therefore we have to tell kbuild how to find the .h files using absolute paths. 
$(src) will specify the absolute path to the directory where the @@ -412,7 +413,7 @@ External modules are installed in the directory: --- 6.1 INSTALL_MOD_PATH - Above are the default directories, but as always some level of + Above are the default directories, but as always, some level of customization is possible. One can prefix the path using the variable INSTALL_MOD_PATH: @@ -420,17 +421,17 @@ External modules are installed in the directory: => Install dir: /frodo/lib/modules/$(KERNELRELEASE)/kernel INSTALL_MOD_PATH may be set as an ordinary shell variable or as in the - example above be specified on the command line when calling make. + example above, can be specified on the command line when calling make. INSTALL_MOD_PATH has effect both when installing modules included in the kernel as well as when installing external modules. --- 6.2 INSTALL_MOD_DIR - When installing external modules they are default installed in a + When installing external modules they are by default installed to a directory under /lib/modules/$(KERNELRELEASE)/extra, but one may wish to locate modules for a specific functionality in a separate - directory. For this purpose one can use INSTALL_MOD_DIR to specify an - alternative name than 'extra'. + directory. For this purpose, one can use INSTALL_MOD_DIR to specify an + alternative name to 'extra'. $ make INSTALL_MOD_DIR=gandalf -C KERNELDIR \ M=`pwd` modules_install @@ -444,16 +445,16 @@ Module versioning is enabled by the CONFIG_MODVERSIONS tag. Module versioning is used as a simple ABI consistency check. The Module versioning creates a CRC value of the full prototype for an exported symbol and when a module is loaded/used then the CRC values contained in the kernel are -compared with similar values in the module. If they are not equal then the +compared with similar values in the module. If they are not equal, then the kernel refuses to load the module. Module.symvers contains a list of all exported symbols from a kernel build. 
--- 7.1 Symbols fron the kernel (vmlinux + modules) - During a kernel build a file named Module.symvers will be generated. + During a kernel build, a file named Module.symvers will be generated. Module.symvers contains all exported symbols from the kernel and - compiled modules. For each symbols the corresponding CRC value + compiled modules. For each symbols, the corresponding CRC value is stored too. The syntax of the Module.symvers file is: @@ -461,27 +462,27 @@ Module.symvers contains a list of all exported symbols from a kernel build. Sample: 0x2d036834 scsi_remove_host drivers/scsi/scsi_mod - For a kernel build without CONFIG_MODVERSIONING enabled the crc + For a kernel build without CONFIG_MODVERSIONS enabled, the crc would read: 0x00000000 - Module.symvers serve two purposes. - 1) It list all exported symbols both from vmlinux and all modules - 2) It list CRC if CONFIG_MODVERSION is enabled + Module.symvers serves two purposes: + 1) It lists all exported symbols both from vmlinux and all modules + 2) It lists the CRC if CONFIG_MODVERSIONS is enabled --- 7.2 Symbols and external modules - When building an external module the build system needs access to + When building an external module, the build system needs access to the symbols from the kernel to check if all external symbols are defined. This is done in the MODPOST step and to obtain all - symbols modpost reads Module.symvers from the kernel. + symbols, modpost reads Module.symvers from the kernel. If a Module.symvers file is present in the directory where - the external module is being build this file will be read too. - During the MODPOST step a new Module.symvers file will be written - containing all exported symbols that was not defined in the kernel. - + the external module is being built, this file will be read too. + During the MODPOST step, a new Module.symvers file will be written + containing all exported symbols that were not defined in the kernel. 
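Editor's illustration (not part of the patch): the Module.symvers layout described in 7.1, with a CRC, a symbol name and a module path on each line, can be read with a few lines of userspace C. This is only a sketch. The names symvers_entry and parse_symvers_line are made up for this example and do not come from kbuild or modpost, and the buffer sizes are arbitrary.

```c
#include <stdio.h>

/* Hypothetical record for one Module.symvers line; not a kbuild type. */
struct symvers_entry {
	unsigned int crc;   /* reads as 0x00000000 without module versioning */
	char symbol[128];   /* exported symbol name */
	char module[256];   /* module path, e.g. drivers/scsi/scsi_mod */
};

/*
 * Parse one "<CRC> <symbol> <module>" line.
 * Fields may be separated by any whitespace.
 * Returns 0 on success, -1 if fewer than three fields were found.
 */
static int parse_symvers_line(const char *line, struct symvers_entry *e)
{
	/* %x accepts an optional 0x prefix; widths guard the buffers. */
	if (sscanf(line, "%x %127s %255s", &e->crc, e->symbol, e->module) != 3)
		return -1;
	return 0;
}
```

Applied to the sample line shown in 7.1, parse_symvers_line() fills in crc 0x2d036834, symbol scsi_remove_host and module drivers/scsi/scsi_mod. A line with fewer than three fields is rejected with -1.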
+ --- 7.3 Symbols from another external module - Sometimes one external module uses exported symbols from another + Sometimes, an external module uses exported symbols from another external module. Kbuild needs to have full knowledge on all symbols to avoid spitting out warnings about undefined symbols. Two solutions exist to let kbuild know all symbols of more than @@ -490,15 +491,15 @@ Module.symvers contains a list of all exported symbols from a kernel build. impractical in certain situations. Use a top-level Kbuild file - If you have two modules: 'foo', 'bar' and 'foo' needs symbols - from 'bar' then one can use a common top-level kbuild file so - both modules are compiled in same build. + If you have two modules: 'foo' and 'bar', and 'foo' needs + symbols from 'bar', then one can use a common top-level kbuild + file so both modules are compiled in same build. Consider following directory layout: ./foo/ <= contains the foo module ./bar/ <= contains the bar module The top-level Kbuild file would then look like: - + #./Kbuild: (this file may also be named Makefile) obj-y := foo/ bar/ @@ -509,23 +510,23 @@ Module.symvers contains a list of all exported symbols from a kernel build. knowledge on symbols from both modules. Use an extra Module.symvers file - When an external module is build a Module.symvers file is + When an external module is built, a Module.symvers file is generated containing all exported symbols which are not defined in the kernel. - To get access to symbols from module 'bar' one can copy the + To get access to symbols from module 'bar', one can copy the Module.symvers file from the compilation of the 'bar' module - to the directory where the 'foo' module is build. - During the module build kbuild will read the Module.symvers + to the directory where the 'foo' module is built. 
+ During the module build, kbuild will read the Module.symvers file in the directory of the external module and when the - build is finished a new Module.symvers file is created + build is finished, a new Module.symvers file is created containing the sum of all symbols defined and not part of the kernel. - + === 8. Tips & Tricks --- 8.1 Testing for CONFIG_FOO_BAR - Modules often needs to check for certain CONFIG_ options to decide if + Modules often need to check for certain CONFIG_ options to decide if a specific feature shall be included in the module. When kbuild is used this is done by referencing the CONFIG_ variable directly. @@ -537,7 +538,7 @@ Module.symvers contains a list of all exported symbols from a kernel build. External modules have traditionally used grep to check for specific CONFIG_ settings directly in .config. This usage is broken. - As introduced before external modules shall use kbuild when building - and therefore can use the same methods as in-kernel modules when testing - for CONFIG_ definitions. + As introduced before, external modules shall use kbuild when building + and therefore can use the same methods as in-kernel modules when + testing for CONFIG_ definitions. diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 87a17337c7f6..137e993f4329 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -110,6 +110,13 @@ be entered as an environment variable, whereas its absence indicates that it will appear as a kernel argument readable via /proc/cmdline by programs running once the system is up. +The number of kernel parameters is not limited, but the length of the +complete command line (parameters including spaces etc.) is limited to +a fixed number of characters. This limit depends on the architecture +and is between 256 and 4096 characters. It is defined in the file +./include/asm/setup.h as COMMAND_LINE_SIZE. 
+ + 53c7xx= [HW,SCSI] Amiga SCSI controllers See header of drivers/scsi/53c7xx.c. See also Documentation/scsi/ncr53c7xx.txt. @@ -573,8 +580,6 @@ running once the system is up. gscd= [HW,CD] Format: <io> - gt96100eth= [NET] MIPS GT96100 Advanced Communication Controller - gus= [HW,OSS] Format: <io>,<irq>,<dma>,<dma16> @@ -1189,8 +1194,6 @@ running once the system is up. Mechanism 2. nommconf [IA-32,X86_64] Disable use of MMCONFIG for PCI Configuration - mmconf [IA-32,X86_64] Force MMCONFIG. This is useful - to override the builtin blacklist. nomsi [MSI] If the PCI_MSI kernel config parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide. @@ -1242,7 +1245,11 @@ running once the system is up. bootloader. This is currently used on IXP2000 systems where the bus has to be configured a certain way for adjunct CPUs. - + noearly [X86] Don't do any early type 1 scanning. + This might help on some broken boards which + machine check when some devices' config space + is read. But various workarounds are disabled + and some IOMMU drivers will not work. pcmv= [HW,PCMCIA] BadgePAD 4 pd. [PARIDE] @@ -1324,7 +1331,7 @@ running once the system is up. pt. [PARIDE] See Documentation/paride.txt. - quiet= [KNL] Disable log messages + quiet [KNL] Disable most log messages r128= [HW,DRM] @@ -1365,6 +1372,14 @@ running once the system is up. reserve= [KNL,BUGS] Force the kernel to ignore some iomem area + reservetop= [IA-32] + Format: nn[KMG] + Reserves a hole at the top of the kernel virtual + address space. + + reset_devices [KNL] Force drivers to reset the underlying device + during initialization. 
+ resume= [SWSUSP] Specify the partition device for software suspend diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt index 2c3b1eae4280..ba26201d5023 100644 --- a/Documentation/kprobes.txt +++ b/Documentation/kprobes.txt @@ -151,9 +151,9 @@ So that you can load and unload Kprobes-based instrumentation modules, make sure "Loadable module support" (CONFIG_MODULES) and "Module unloading" (CONFIG_MODULE_UNLOAD) are set to "y". -You may also want to ensure that CONFIG_KALLSYMS and perhaps even -CONFIG_KALLSYMS_ALL are set to "y", since kallsyms_lookup_name() -is a handy, version-independent way to find a function's address. +Also make sure that CONFIG_KALLSYMS and perhaps even CONFIG_KALLSYMS_ALL +are set to "y", since kallsyms_lookup_name() is used by the in-kernel +kprobe address resolution code. If you need to insert a probe in the middle of a function, you may find it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO), @@ -179,6 +179,27 @@ occurs during execution of kp->pre_handler or kp->post_handler, or during single-stepping of the probed instruction, Kprobes calls kp->fault_handler. Any or all handlers can be NULL. +NOTE: +1. With the introduction of the "symbol_name" field to struct kprobe, +the probepoint address resolution will now be taken care of by the kernel. +The following will now work: + + kp.symbol_name = "symbol_name"; + +(64-bit powerpc intricacies such as function descriptors are handled +transparently) + +2. Use the "offset" field of struct kprobe if the offset into the symbol +to install a probepoint is known. This field is used to calculate the +probepoint. + +3. Specify either the kprobe "symbol_name" OR the "addr". If both are +specified, kprobe registration will fail with -EINVAL. + +4. With CISC architectures (such as i386 and x86_64), the kprobes code +does not validate if the kprobe.addr is at an instruction boundary. +Use "offset" with caution. 
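Editor's illustration (not part of the patch): the rules in notes 1 to 3 can be modelled in userspace roughly as follows. This is a sketch of the documented behaviour, not the kernel's code. struct probe_spec, lookup_symbol() and resolve_probe_addr() are hypothetical names, the symbol table with its addresses is a made-up stand-in for kallsyms, and two points are assumptions of this sketch rather than statements from the text above: that -EINVAL is also returned when nothing can be resolved, and that the offset is added to an explicit address as well.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the struct kprobe fields discussed above. */
struct probe_spec {
	const char *symbol_name;  /* e.g. "do_fork", or NULL */
	unsigned long addr;       /* explicit address, or 0 */
	unsigned int offset;      /* byte offset from the resolved base */
};

/* Toy symbol table standing in for kallsyms; the addresses are invented. */
static const struct {
	const char *name;
	unsigned long addr;
} symtab[] = {
	{ "do_fork",  0x1000 },
	{ "sys_open", 0x2000 },
};

/* Return the address of a known symbol, or 0 if it is not in the table. */
static unsigned long lookup_symbol(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++)
		if (strcmp(symtab[i].name, name) == 0)
			return symtab[i].addr;
	return 0;
}

/*
 * Compute the probepoint for a spec.
 * Returns 0 and stores the address in *out, or -EINVAL on error.
 */
static int resolve_probe_addr(const struct probe_spec *p, unsigned long *out)
{
	unsigned long base = p->addr;

	if (p->symbol_name) {
		if (p->addr)
			return -EINVAL;  /* note 3: both were specified */
		base = lookup_symbol(p->symbol_name);
	}
	if (!base)
		return -EINVAL;          /* assumption: nothing to resolve */

	*out = base + p->offset;         /* note 2: offset gives the probepoint */
	return 0;
}
```

With this toy table, a spec naming "do_fork" with offset 4 resolves to 0x1004, while a spec that sets both symbol_name and addr is rejected with -EINVAL, mirroring note 3.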
+ register_kprobe() returns 0 on success, or a negative errno otherwise. User's pre-handler (kp->pre_handler): @@ -225,6 +246,12 @@ control to Kprobes.) If the probed function is declared asmlinkage, fastcall, or anything else that affects how args are passed, the handler's declaration must match. +NOTE: A macro JPROBE_ENTRY is provided to handle architecture-specific +aliasing of jp->entry. In the interest of portability, it is advised +to use: + + jp->entry = JPROBE_ENTRY(handler); + register_jprobe() returns 0 on success, or a negative errno otherwise. 4.3 register_kretprobe @@ -251,6 +278,11 @@ of interest: - ret_addr: the return address - rp: points to the corresponding kretprobe object - task: points to the corresponding task struct + +The regs_return_value(regs) macro provides a simple abstraction to +extract the return value from the appropriate register as defined by +the architecture's ABI. + The handler's return value is currently ignored. 4.4 unregister_*probe @@ -369,7 +401,6 @@ stack trace and selected i386 registers when do_fork() is called. 
#include <linux/kernel.h> #include <linux/module.h> #include <linux/kprobes.h> -#include <linux/kallsyms.h> #include <linux/sched.h> /*For each probe you need to allocate a kprobe structure*/ @@ -403,18 +434,14 @@ int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr) return 0; } -int init_module(void) +static int __init kprobe_init(void) { int ret; kp.pre_handler = handler_pre; kp.post_handler = handler_post; kp.fault_handler = handler_fault; - kp.addr = (kprobe_opcode_t*) kallsyms_lookup_name("do_fork"); - /* register the kprobe now */ - if (!kp.addr) { - printk("Couldn't find %s to plant kprobe\n", "do_fork"); - return -1; - } + kp.symbol_name = "do_fork"; + if ((ret = register_kprobe(&kp) < 0)) { printk("register_kprobe failed, returned %d\n", ret); return -1; @@ -423,12 +450,14 @@ int init_module(void) return 0; } -void cleanup_module(void) +static void __exit kprobe_exit(void) { unregister_kprobe(&kp); printk("kprobe unregistered\n"); } +module_init(kprobe_init) +module_exit(kprobe_exit) MODULE_LICENSE("GPL"); ----- cut here ----- @@ -463,7 +492,6 @@ the arguments of do_fork(). #include <linux/fs.h> #include <linux/uio.h> #include <linux/kprobes.h> -#include <linux/kallsyms.h> /* * Jumper probe for do_fork. 
@@ -485,17 +513,13 @@ long jdo_fork(unsigned long clone_flags, unsigned long stack_start, } static struct jprobe my_jprobe = { - .entry = (kprobe_opcode_t *) jdo_fork + .entry = JPROBE_ENTRY(jdo_fork) }; -int init_module(void) +static int __init jprobe_init(void) { int ret; - my_jprobe.kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name("do_fork"); - if (!my_jprobe.kp.addr) { - printk("Couldn't find %s to plant jprobe\n", "do_fork"); - return -1; - } + my_jprobe.kp.symbol_name = "do_fork"; if ((ret = register_jprobe(&my_jprobe)) <0) { printk("register_jprobe failed, returned %d\n", ret); @@ -506,12 +530,14 @@ int init_module(void) return 0; } -void cleanup_module(void) +static void __exit jprobe_exit(void) { unregister_jprobe(&my_jprobe); printk("jprobe unregistered\n"); } +module_init(jprobe_init) +module_exit(jprobe_exit) MODULE_LICENSE("GPL"); ----- cut here ----- @@ -530,16 +556,13 @@ report failed calls to sys_open(). #include <linux/kernel.h> #include <linux/module.h> #include <linux/kprobes.h> -#include <linux/kallsyms.h> static const char *probed_func = "sys_open"; /* Return-probe handler: If the probed function fails, log the return value. */ static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs) { - // Substitute the appropriate register name for your architecture -- - // e.g., regs->rax for x86_64, regs->gpr[3] for ppc64. 
- int retval = (int) regs->eax; + int retval = regs_return_value(regs); if (retval < 0) { printk("%s returns %d\n", probed_func, retval); } @@ -552,15 +575,11 @@ static struct kretprobe my_kretprobe = { .maxactive = 20 }; -int init_module(void) +static int __init kretprobe_init(void) { int ret; - my_kretprobe.kp.addr = - (kprobe_opcode_t *) kallsyms_lookup_name(probed_func); - if (!my_kretprobe.kp.addr) { - printk("Couldn't find %s to plant return probe\n", probed_func); - return -1; - } + my_kretprobe.kp.symbol_name = (char *)probed_func; + if ((ret = register_kretprobe(&my_kretprobe)) < 0) { printk("register_kretprobe failed, returned %d\n", ret); return -1; @@ -569,7 +588,7 @@ int init_module(void) return 0; } -void cleanup_module(void) +static void __exit kretprobe_exit(void) { unregister_kretprobe(&my_kretprobe); printk("kretprobe unregistered\n"); @@ -578,6 +597,8 @@ void cleanup_module(void) my_kretprobe.nmissed, probed_func); } +module_init(kretprobe_init) +module_exit(kretprobe_exit) MODULE_LICENSE("GPL"); ----- cut here ----- @@ -590,3 +611,5 @@ messages.) For additional information on Kprobes, refer to the following URLs: http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe http://www.redhat.com/magazine/005mar05/features/kprobes/ +http://www-users.cs.umn.edu/~boutcher/kprobes/ +http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf (pages 101-115) diff --git a/Documentation/lockdep-design.txt b/Documentation/lockdep-design.txt index 00d93605bfd3..55a7e4fa8cc2 100644 --- a/Documentation/lockdep-design.txt +++ b/Documentation/lockdep-design.txt @@ -36,6 +36,28 @@ The validator tracks lock-class usage history into 5 separate state bits: - 'ever used' [ == !unused ] +When locking rules are violated, these 4 state bits are presented in the +locking error messages, inside curlies. 
A contrived example: + + modprobe/2287 is trying to acquire lock: + (&sio_locks[i].lock){--..}, at: [<c02867fd>] mutex_lock+0x21/0x24 + + but task is already holding lock: + (&sio_locks[i].lock){--..}, at: [<c02867fd>] mutex_lock+0x21/0x24 + + +The bit position indicates hardirq, softirq, hardirq-read, +softirq-read respectively, and the character displayed in each +indicates: + + '.' acquired while irqs enabled + '+' acquired in irq context + '-' acquired in process context with irqs disabled + '?' read-acquired both with irqs enabled and in irq context + +Unused mutexes cannot be part of the cause of an error. + + Single-lock state rules: ------------------------ diff --git a/Documentation/netlabel/00-INDEX b/Documentation/netlabel/00-INDEX new file mode 100644 index 000000000000..837bf35990e2 --- /dev/null +++ b/Documentation/netlabel/00-INDEX @@ -0,0 +1,10 @@ +00-INDEX + - this file. +cipso_ipv4.txt + - documentation on the IPv4 CIPSO protocol engine. +draft-ietf-cipso-ipsecurity-01.txt + - IETF draft of the CIPSO protocol, dated 16 July 1992. +introduction.txt + - NetLabel introduction, READ THIS FIRST. +lsm_interface.txt + - documentation on the NetLabel kernel security module API. diff --git a/Documentation/netlabel/cipso_ipv4.txt b/Documentation/netlabel/cipso_ipv4.txt new file mode 100644 index 000000000000..93dacb132c3c --- /dev/null +++ b/Documentation/netlabel/cipso_ipv4.txt @@ -0,0 +1,48 @@ +NetLabel CIPSO/IPv4 Protocol Engine +============================================================================== +Paul Moore, paul.moore@hp.com + +May 17, 2006 + + * Overview + +The NetLabel CIPSO/IPv4 protocol engine is based on the IETF Commercial IP +Security Option (CIPSO) draft from July 16, 1992. A copy of this draft can be +found in this directory, consult '00-INDEX' for the filename. While the IETF +draft never made it to an RFC standard it has become a de-facto standard for +labeled networking and is used in many trusted operating systems. 
+ + * Outbound Packet Processing + +The CIPSO/IPv4 protocol engine applies the CIPSO IP option to packets by +adding the CIPSO label to the socket. This causes all packets leaving the +system through the socket to have the CIPSO IP option applied. The socket's +CIPSO label can be changed at any point in time, however, it is recommended +that it is set upon the socket's creation. The LSM can set the socket's CIPSO +label by using the NetLabel security module API; if the NetLabel "domain" is +configured to use CIPSO for packet labeling then a CIPSO IP option will be +generated and attached to the socket. + + * Inbound Packet Processing + +The CIPSO/IPv4 protocol engine validates every CIPSO IP option it finds at the +IP layer without any special handling required by the LSM. However, in order +to decode and translate the CIPSO label on the packet the LSM must use the +NetLabel security module API to extract the security attributes of the packet. +This is typically done at the socket layer using the 'socket_sock_rcv_skb()' +LSM hook. + + * Label Translation + +The CIPSO/IPv4 protocol engine contains a mechanism to translate CIPSO security +attributes such as sensitivity level and category to values which are +appropriate for the host. These mappings are defined as part of a CIPSO +Domain Of Interpretation (DOI) definition and are configured through the +NetLabel user space communication layer. Each DOI definition can have a +different security attribute mapping table. + + * Label Translation Cache + +The NetLabel system provides a framework for caching security attribute +mappings from the network labels to the corresponding LSM identifiers. The +CIPSO/IPv4 protocol engine supports this caching mechanism. 
diff --git a/Documentation/netlabel/draft-ietf-cipso-ipsecurity-01.txt b/Documentation/netlabel/draft-ietf-cipso-ipsecurity-01.txt new file mode 100644 index 000000000000..256c2c9d4f50 --- /dev/null +++ b/Documentation/netlabel/draft-ietf-cipso-ipsecurity-01.txt @@ -0,0 +1,791 @@ +IETF CIPSO Working Group +16 July, 1992 + + + + COMMERCIAL IP SECURITY OPTION (CIPSO 2.2) + + + +1. Status + +This Internet Draft provides the high level specification for a Commercial +IP Security Option (CIPSO). This draft reflects the version as approved by +the CIPSO IETF Working Group. Distribution of this memo is unlimited. + +This document is an Internet Draft. Internet Drafts are working documents +of the Internet Engineering Task Force (IETF), its Areas, and its Working +Groups. Note that other groups may also distribute working documents as +Internet Drafts. + +Internet Drafts are draft documents valid for a maximum of six months. +Internet Drafts may be updated, replaced, or obsoleted by other documents +at any time. It is not appropriate to use Internet Drafts as reference +material or to cite them other than as a "working draft" or "work in +progress." + +Please check the I-D abstract listing contained in each Internet Draft +directory to learn the current status of this or any other Internet Draft. + + + + +2. Background + +Currently the Internet Protocol includes two security options. One of +these options is the DoD Basic Security Option (BSO) (Type 130) which allows +IP datagrams to be labeled with security classifications. This option +provides sixteen security classifications and a variable number of handling +restrictions. To handle additional security information, such as security +categories or compartments, another security option (Type 133) exists and +is referred to as the DoD Extended Security Option (ESO). The values for +the fixed fields within these two options are administered by the Defense +Information Systems Agency (DISA). 
+ +Computer vendors are now building commercial operating systems with +mandatory access controls and multi-level security. These systems are +no longer built specifically for a particular group in the defense or +intelligence communities. They are generally available commercial systems +for use in a variety of government and civil sector environments. + +The small number of ESO format codes can not support all the possible +applications of a commercial security option. The BSO and ESO were +designed to only support the United States DoD. CIPSO has been designed +to support multiple security policies. This Internet Draft provides the +format and procedures required to support a Mandatory Access Control +security policy. Support for additional security policies shall be +defined in future RFCs. + + + + +Internet Draft, Expires 15 Jan 93 [PAGE 1] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + + +3. CIPSO Format + +Option type: 134 (Class 0, Number 6, Copy on Fragmentation) +Option length: Variable + +This option permits security related information to be passed between +systems within a single Domain of Interpretation (DOI). A DOI is a +collection of systems which agree on the meaning of particular values +in the security option. An authority that has been assigned a DOI +identifier will define a mapping between appropriate CIPSO field values +and their human readable equivalent. This authority will distribute that +mapping to hosts within the authority's domain. These mappings may be +sensitive, therefore a DOI authority is not required to make these +mappings available to anyone other than the systems that are included in +the DOI. + +This option MUST be copied on fragmentation. This option appears at most +once in a datagram. All multi-octet fields in the option are defined to be +transmitted in network byte order. 
The format of this option is as follows: + ++----------+----------+------//------+-----------//---------+ +| 10000110 | LLLLLLLL | DDDDDDDDDDDD | TTTTTTTTTTTTTTTTTTTT | ++----------+----------+------//------+-----------//---------+ + + TYPE=134 OPTION DOMAIN OF TAGS + LENGTH INTERPRETATION + + + Figure 1. CIPSO Format + + +3.1 Type + +This field is 1 octet in length. Its value is 134. + + +3.2 Length + +This field is 1 octet in length. It is the total length of the option +including the type and length fields. With the current IP header length +restriction of 40 octets the value of this field MUST not exceed 40. + + +3.3 Domain of Interpretation Identifier + +This field is an unsigned 32 bit integer. The value 0 is reserved and MUST +not appear as the DOI identifier in any CIPSO option. Implementations +should assume that the DOI identifier field is not aligned on any particular +byte boundary. + +To conserve space in the protocol, security levels and categories are +represented by numbers rather than their ASCII equivalent. This requires +a mapping table within CIPSO hosts to map these numbers to their +corresponding ASCII representations. Non-related groups of systems may + + + +Internet Draft, Expires 15 Jan 93 [PAGE 2] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +have their own unique mappings. For example, one group of systems may +use the number 5 to represent Unclassified while another group may use the +number 1 to represent that same security level. The DOI identifier is used +to identify which mapping was used for the values within the option. + + +3.4 Tag Types + +A common format for passing security related information is necessary +for interoperability. CIPSO uses sets of "tags" to contain the security +information relevant to the data in the IP packet. Each tag begins with +a tag type identifier followed by the length of the tag and ends with the +actual security information to be passed. 
All multi-octet fields in a tag +are defined to be transmitted in network byte order. Like the DOI +identifier field in the CIPSO header, implementations should assume that +all tags, as well as fields within a tag, are not aligned on any particular +octet boundary. The tag types defined in this document contain alignment +bytes to assist alignment of some information, however alignment can not +be guaranteed if CIPSO is not the first IP option. + +CIPSO tag types 0 through 127 are reserved for defining standard tag +formats. Their definitions will be published in RFCs. Tag types whose +identifiers are greater than 127 are defined by the DOI authority and may +only be meaningful in certain Domains of Interpretation. For these tag +types, implementations will require the DOI identifier as well as the tag +number to determine the security policy and the format associated with the +tag. Use of tag types above 127 are restricted to closed networks where +interoperability with other networks will not be an issue. Implementations +that support a tag type greater than 127 MUST support at least one DOI that +requires only tag types 1 to 127. + +Tag type 0 is reserved. Tag types 1, 2, and 5 are defined in this +Internet Draft. Types 3 and 4 are reserved for work in progress. +The standard format for all current and future CIPSO tags is shown below: + ++----------+----------+--------//--------+ +| TTTTTTTT | LLLLLLLL | IIIIIIIIIIIIIIII | ++----------+----------+--------//--------+ + TAG TAG TAG + TYPE LENGTH INFORMATION + + Figure 2: Standard Tag Format + +In the three tag types described in this document, the length and count +restrictions are based on the current IP limitation of 40 octets for all +IP options. If the IP header is later expanded, then the length and count +restrictions specified in this document may increase to use the full area +provided for IP options. 
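The option layout of Figure 1 and the standard tag format of Figure 2 can be walked with a few bounds checks. The following is a minimal sketch in C, not taken from the draft or from the NetLabel code; the names cipso_walk_tags and cipso_tag_fn are hypothetical, and the function only assumes what the text above states: a 6-octet option header (type, length, 32-bit DOI), a 40-octet limit, a reserved DOI of 0, and tags that each carry their own total length.

```c
#include <stddef.h>
#include <stdint.h>

#define CIPSO_OPT_TYPE 134  /* option type from section 3.1 */
#define CIPSO_HDR_LEN  6    /* type (1) + length (1) + DOI (4) */
#define CIPSO_MAX_LEN  40   /* current IP option area limit */

/* Hypothetical callback type; it is not part of the draft or of any kernel API. */
typedef void (*cipso_tag_fn)(uint8_t type, const uint8_t *info,
                             size_t info_len, void *ctx);

/*
 * Walk the tags of one CIPSO option held in opt[0..buf_len).
 * Stores the DOI identifier in *doi and calls fn (if non-NULL) once per tag.
 * Returns the number of tags visited, or -1 if the option is malformed.
 */
int cipso_walk_tags(const uint8_t *opt, size_t buf_len, uint32_t *doi,
                    cipso_tag_fn fn, void *ctx)
{
    size_t opt_len, off;
    int count = 0;

    if (buf_len < CIPSO_HDR_LEN || opt[0] != CIPSO_OPT_TYPE)
        return -1;
    opt_len = opt[1];
    if (opt_len < CIPSO_HDR_LEN || opt_len > CIPSO_MAX_LEN || opt_len > buf_len)
        return -1;

    /* The DOI is in network byte order and may be unaligned: read it bytewise. */
    *doi = ((uint32_t)opt[2] << 24) | ((uint32_t)opt[3] << 16) |
           ((uint32_t)opt[4] << 8) | (uint32_t)opt[5];
    if (*doi == 0)              /* the value 0 is reserved */
        return -1;

    off = CIPSO_HDR_LEN;
    while (off < opt_len) {
        size_t tag_len;

        if (opt_len - off < 2)  /* no room for tag type + tag length */
            return -1;
        tag_len = opt[off + 1]; /* total tag length, including type and length */
        if (tag_len < 2 || tag_len > opt_len - off)
            return -1;
        if (fn)
            fn(opt[off], opt + off + 2, tag_len - 2, ctx);
        off += tag_len;
        count++;
    }
    return count;
}
```

Reading every field one octet at a time avoids any alignment assumption, which matches the advice that neither the DOI nor the tags can be assumed to sit on a particular boundary.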
+ + +3.4.1 Tag Type Classes + +Tag classes consist of tag types that have common processing requirements +and support the same security policy. The three tags defined in this +Internet Draft belong to the Mandatory Access Control (MAC) Sensitivity + + + +Internet Draft, Expires 15 Jan 93 [PAGE 3] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +class and support the MAC Sensitivity security policy. + + +3.4.2 Tag Type 1 + +This is referred to as the "bit-mapped" tag type. Tag type 1 is included +in the MAC Sensitivity tag type class. The format of this tag type is as +follows: + ++----------+----------+----------+----------+--------//---------+ +| 00000001 | LLLLLLLL | 00000000 | LLLLLLLL | CCCCCCCCCCCCCCCCC | ++----------+----------+----------+----------+--------//---------+ + + TAG TAG ALIGNMENT SENSITIVITY BIT MAP OF + TYPE LENGTH OCTET LEVEL CATEGORIES + + Figure 3. Tag Type 1 Format + + +3.4.2.1 Tag Type + +This field is 1 octet in length and has a value of 1. + + +3.4.2.2 Tag Length + +This field is 1 octet in length. It is the total length of the tag type +including the type and length fields. With the current IP header length +restriction of 40 bytes the value within this field is between 4 and 34. + + +3.4.2.3 Alignment Octet + +This field is 1 octet in length and always has the value of 0. Its purpose +is to align the category bitmap field on an even octet boundary. This will +speed many implementations including router implementations. + + +3.4.2.4 Sensitivity Level + +This field is 1 octet in length. Its value is from 0 to 255. The values +are ordered with 0 being the minimum value and 255 representing the maximum +value. + + +3.4.2.5 Bit Map of Categories + +The length of this field is variable and ranges from 0 to 30 octets. This +provides representation of categories 0 to 239. The ordering of the bits +is left to right or MSB to LSB. 
For example category 0 is represented by +the most significant bit of the first byte and category 15 is represented +by the least significant bit of the second byte. Figure 4 graphically +shows this ordering. Bit N is binary 1 if category N is part of the label +for the datagram, and bit N is binary 0 if category N is not part of the +label. Except for the optimized tag 1 format described in the next section, + + + +Internet Draft, Expires 15 Jan 93 [PAGE 4] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +minimal encoding SHOULD be used resulting in no trailing zero octets in the +category bitmap. + + octet 0 octet 1 octet 2 octet 3 octet 4 octet 5 + XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX . . . +bit 01234567 89111111 11112222 22222233 33333333 44444444 +number 012345 67890123 45678901 23456789 01234567 + + Figure 4. Ordering of Bits in Tag 1 Bit Map + + +3.4.2.6 Optimized Tag 1 Format + +Routers work most efficiently when processing fixed length fields. To +support these routers there is an optimized form of tag type 1. The format +does not change. The only change is to the category bitmap which is set to +a constant length of 10 octets. Trailing octets required to fill out the 10 +octets are zero filled. Ten octets, allowing for 80 categories, was chosen +because it makes the total length of the CIPSO option 20 octets. If CIPSO +is the only option then the option will be full word aligned and additional +filler octets will not be required. + + +3.4.3 Tag Type 2 + +This is referred to as the "enumerated" tag type. It is used to describe +large but sparsely populated sets of categories. Tag type 2 is in the MAC +Sensitivity tag type class. 
The format of this tag type is as follows: + ++----------+----------+----------+----------+-------------//-------------+ +| 00000010 | LLLLLLLL | 00000000 | LLLLLLLL | CCCCCCCCCCCCCCCCCCCCCCCCCC | ++----------+----------+----------+----------+-------------//-------------+ + + TAG TAG ALIGNMENT SENSITIVITY ENUMERATED + TYPE LENGTH OCTET LEVEL CATEGORIES + + Figure 5. Tag Type 2 Format + + +3.4.3.1 Tag Type + +This field is one octet in length and has a value of 2. + + +3.4.3.2 Tag Length + +This field is 1 octet in length. It is the total length of the tag type +including the type and length fields. With the current IP header length +restriction of 40 bytes the value within this field is between 4 and 34. + + +3.4.3.3 Alignment Octet + +This field is 1 octet in length and always has the value of 0. Its purpose +is to align the category field on an even octet boundary. This will + + + +Internet Draft, Expires 15 Jan 93 [PAGE 5] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +speed many implementations including router implementations. + + +3.4.3.4 Sensitivity Level + +This field is 1 octet in length. Its value is from 0 to 255. The values +are ordered with 0 being the minimum value and 255 representing the +maximum value. + + +3.4.3.5 Enumerated Categories + +In this tag, categories are represented by their actual value rather than +by their position within a bit field. The length of each category is 2 +octets. Up to 15 categories may be represented by this tag. Valid values +for categories are 0 to 65534. Category 65535 is not a valid category +value. The categories MUST be listed in ascending order within the tag. + + +3.4.4 Tag Type 5 + +This is referred to as the "range" tag type. It is used to represent +labels where all categories in a range, or set of ranges, are included +in the sensitivity label. Tag type 5 is in the MAC Sensitivity tag type +class. 
The format of this tag type is as follows: + ++----------+----------+----------+----------+------------//-------------+ +| 00000101 | LLLLLLLL | 00000000 | LLLLLLLL | Top/Bottom | Top/Bottom | ++----------+----------+----------+----------+------------//-------------+ + + TAG TAG ALIGNMENT SENSITIVITY CATEGORY RANGES + TYPE LENGTH OCTET LEVEL + + Figure 6. Tag Type 5 Format + + +3.4.4.1 Tag Type + +This field is one octet in length and has a value of 5. + + +3.4.4.2 Tag Length + +This field is 1 octet in length. It is the total length of the tag type +including the type and length fields. With the current IP header length +restriction of 40 bytes the value within this field is between 4 and 34. + + +3.4.4.3 Alignment Octet + +This field is 1 octet in length and always has the value of 0. Its purpose +is to align the category range field on an even octet boundary. This will +speed many implementations including router implementations. + + + + + +Internet Draft, Expires 15 Jan 93 [PAGE 6] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +3.4.4.4 Sensitivity Level + +This field is 1 octet in length. Its value is from 0 to 255. The values +are ordered with 0 being the minimum value and 255 representing the maximum +value. + + +3.4.4.5 Category Ranges + +A category range is a 4 octet field comprised of the 2 octet index of the +highest numbered category followed by the 2 octet index of the lowest +numbered category. These range endpoints are inclusive within the range of +categories. All categories within a range are included in the sensitivity +label. This tag may contain a maximum of 7 category pairs. The bottom +category endpoint for the last pair in the tag MAY be omitted and SHOULD be +assumed to be 0. The ranges MUST be non-overlapping and be listed in +descending order. Valid values for categories are 0 to 65534. Category +65535 is not a valid category value. 
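The range rules of section 3.4.4.5 can be checked mechanically. The sketch below, in C, is an illustration under stated assumptions rather than the draft's or the kernel's implementation: struct cat_range and cipso_parse_range_tag are hypothetical names, the input is assumed to start at the tag type octet, and the alignment octet is read past without being validated.

```c
#include <stddef.h>
#include <stdint.h>

#define CIPSO_TAG_RANGE     5   /* tag type from section 3.4.4.1 */
#define CIPSO_RANGE_MAX_CNT 7   /* at most 7 category pairs per tag */

/* Hypothetical holder for one decoded range; both endpoints are inclusive. */
struct cat_range {
    uint16_t top;
    uint16_t bottom;
};

/* Read a 16-bit network byte order value without assuming alignment. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Decode a tag type 5 starting at tag[0] (the tag type octet).
 * Returns the number of ranges written to out[], or -1 if malformed.
 */
int cipso_parse_range_tag(const uint8_t *tag, size_t buf_len, uint8_t *level,
                          struct cat_range out[CIPSO_RANGE_MAX_CNT])
{
    size_t tag_len, body, off;
    int n = 0;

    if (buf_len < 4 || tag[0] != CIPSO_TAG_RANGE)
        return -1;
    tag_len = tag[1];
    if (tag_len < 4 || tag_len > buf_len)
        return -1;
    body = tag_len - 4;         /* octets after type, length, alignment, level */
    if (body % 2 != 0 || body > CIPSO_RANGE_MAX_CNT * 4)
        return -1;
    *level = tag[3];

    for (off = 4; off < tag_len; off += 4) {
        uint16_t top = be16(tag + off);
        uint16_t bottom = 0;    /* an omitted final bottom is assumed to be 0 */

        if (tag_len - off >= 4)
            bottom = be16(tag + off + 2);
        if (top == 0xFFFF || bottom == 0xFFFF || bottom > top)
            return -1;          /* 65535 is not a valid category */
        if (n > 0 && top >= out[n - 1].bottom)
            return -1;          /* must be descending and non-overlapping */
        out[n].top = top;
        out[n].bottom = bottom;
        n++;
    }
    return n;
}
```

A tag with a 6-octet body therefore decodes as one full pair followed by a final top endpoint whose bottom defaults to 0.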
+ + +3.4.5 Minimum Requirements + +A CIPSO implementation MUST be capable of generating at least tag type 1 in +the non-optimized form. In addition, a CIPSO implementation MUST be able +to receive any valid tag type 1 even those using the optimized tag type 1 +format. + + +4. Configuration Parameters + +The configuration parameters defined below are required for all CIPSO hosts, +gateways, and routers that support multiple sensitivity labels. A CIPSO +host is defined to be the origination or destination system for an IP +datagram. A CIPSO gateway provides IP routing services between two or more +IP networks and may be required to perform label translations between +networks. A CIPSO gateway may be an enhanced CIPSO host or it may just +provide gateway services with no end system CIPSO capabilities. A CIPSO +router is a dedicated IP router that routes IP datagrams between two or more +IP networks. + +An implementation of CIPSO on a host MUST have the capability to reject a +datagram for reasons that the information contained can not be adequately +protected by the receiving host or if acceptance may result in violation of +the host or network security policy. In addition, a CIPSO gateway or router +MUST be able to reject datagrams going to networks that can not provide +adequate protection or may violate the network's security policy. To +provide this capability the following minimal set of configuration +parameters are required for CIPSO implementations: + +HOST_LABEL_MAX - This parameter contains the maximum sensitivity label that +a CIPSO host is authorized to handle. All datagrams that have a label +greater than this maximum MUST be rejected by the CIPSO host. This +parameter does not apply to CIPSO gateways or routers. This parameter need +not be defined explicitly as it can be implicitly derived from the +PORT_LABEL_MAX parameters for the associated interfaces. 
+ + + +Internet Draft, Expires 15 Jan 93 [PAGE 7] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + + +HOST_LABEL_MIN - This parameter contains the minimum sensitivity label that +a CIPSO host is authorized to handle. All datagrams that have a label less +than this minimum MUST be rejected by the CIPSO host. This parameter does +not apply to CIPSO gateways or routers. This parameter need not be defined +explicitly as it can be implicitly derived from the PORT_LABEL_MIN +parameters for the associated interfaces. + +PORT_LABEL_MAX - This parameter contains the maximum sensitivity label for +all datagrams that may exit a particular network interface port. All +outgoing datagrams that have a label greater than this maximum MUST be +rejected by the CIPSO system. The label within this parameter MUST be +less than or equal to the label within the HOST_LABEL_MAX parameter. This +parameter does not apply to CIPSO hosts that support only one network port. + +PORT_LABEL_MIN - This parameter contains the minimum sensitivity label for +all datagrams that may exit a particular network interface port. All +outgoing datagrams that have a label less than this minimum MUST be +rejected by the CIPSO system. The label within this parameter MUST be +greater than or equal to the label within the HOST_LABEL_MIN parameter. +This parameter does not apply to CIPSO hosts that support only one network +port. + +PORT_DOI - This parameter is used to assign a DOI identifier value to a +particular network interface port. All CIPSO labels within datagrams +going out this port MUST use the specified DOI identifier. All CIPSO +hosts and gateways MUST support either this parameter, the NET_DOI +parameter, or the HOST_DOI parameter. + +NET_DOI - This parameter is used to assign a DOI identifier value to a +particular IP network address. All CIPSO labels within datagrams destined +for the particular IP network MUST use the specified DOI identifier. 
All +CIPSO hosts and gateways MUST support either this parameter, the PORT_DOI +parameter, or the HOST_DOI parameter. + +HOST_DOI - This parameter is used to assign a DOI identifier value to a +particular IP host address. All CIPSO labels within datagrams destined for +the particular IP host will use the specified DOI identifier. All CIPSO +hosts and gateways MUST support either this parameter, the PORT_DOI +parameter, or the NET_DOI parameter. + +This list represents the minimal set of configuration parameters required +to be compliant. Implementors are encouraged to add to this list to +provide enhanced functionality and control. For example, many security +policies may require both incoming and outgoing datagrams be checked against +the port and host label ranges. + + +4.1 Port Range Parameters + +The labels represented by the PORT_LABEL_MAX and PORT_LABEL_MIN parameters +MAY be in CIPSO or local format. Some CIPSO systems, such as routers, may +want to have the range parameters expressed in CIPSO format so that incoming +labels do not have to be converted to a local format before being compared +against the range. If multiple DOIs are supported by one of these CIPSO + + + +Internet Draft, Expires 15 Jan 93 [PAGE 8] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +systems then multiple port range parameters would be needed, one set for +each DOI supported on a particular port. + +The port range will usually represent the total set of labels that may +exist on the logical network accessed through the corresponding network +interface. It may, however, represent a subset of these labels that are +allowed to enter the CIPSO system. + + +4.2 Single Label CIPSO Hosts + +CIPSO implementations that support only one label are not required to +support the parameters described above. These limited implementations are +only required to support a NET_LABEL parameter. This parameter contains +the CIPSO label that may be inserted in datagrams that exit the host. 
In +addition, the host MUST reject any incoming datagram that has a label which +is not equivalent to the NET_LABEL parameter. + + +5. Handling Procedures + +This section describes the processing requirements for incoming and +outgoing IP datagrams. Just providing the correct CIPSO label format +is not enough. Assumptions will be made by one system on how a +receiving system will handle the CIPSO label. Wrong assumptions may +lead to non-interoperability or even a security incident. The +requirements described below represent the minimal set needed for +interoperability and that provide users some level of confidence. +Many other requirements could be added to increase user confidence, +however at the risk of restricting creativity and limiting vendor +participation. + + +5.1 Input Procedures + +All datagrams received through a network port MUST have a security label +associated with them, either contained in the datagram or assigned to the +receiving port. Without this label the host, gateway, or router will not +have the information it needs to make security decisions. This security +label will be obtained from the CIPSO if the option is present in the +datagram. See section 4.1.2 for handling procedures for unlabeled +datagrams. This label will be compared against the PORT (if appropriate) +and HOST configuration parameters defined in section 3. + +If any field within the CIPSO option, such as the DOI identifier, is not +recognized the IP datagram is discarded and an ICMP "parameter problem" +(type 12) is generated and returned. The ICMP code field is set to "bad +parameter" (code 0) and the pointer is set to the start of the CIPSO field +that is unrecognized. + +If the contents of the CIPSO are valid but the security label is +outside of the configured host or port label range, the datagram is +discarded and an ICMP "destination unreachable" (type 3) is generated +and returned. 
The code field of the ICMP is set to "communication with +destination network administratively prohibited" (code 9) or to + + + +Internet Draft, Expires 15 Jan 93 [PAGE 9] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +"communication with destination host administratively prohibited" +(code 10). The value of the code field used is dependent upon whether +the originator of the ICMP message is acting as a CIPSO host or a CIPSO +gateway. The recipient of the ICMP message MUST be able to handle either +value. The same procedure is performed if a CIPSO can not be added to an +IP packet because it is too large to fit in the IP options area. + +If the error is triggered by receipt of an ICMP message, the message +is discarded and no response is permitted (consistent with general ICMP +processing rules). + + +5.1.1 Unrecognized tag types + +The default condition for any CIPSO implementation is that an +unrecognized tag type MUST be treated as a "parameter problem" and +handled as described in section 4.1. A CIPSO implementation MAY allow +the system administrator to identify tag types that may safely be +ignored. This capability is an allowable enhancement, not a +requirement. + + +5.1.2 Unlabeled Packets + +A network port may be configured to not require a CIPSO label for all +incoming datagrams. For this configuration a CIPSO label must be +assigned to that network port and associated with all unlabeled IP +datagrams. This capability might be used for single level networks or +networks that have CIPSO and non-CIPSO hosts and the non-CIPSO hosts +all operate at the same label. + +If a CIPSO option is required and none is found, the datagram is +discarded and an ICMP "parameter problem" (type 12) is generated and +returned to the originator of the datagram. The code field of the ICMP +is set to "option missing" (code 1) and the ICMP pointer is set to 134 +(the value of the option type for the missing CIPSO option). 
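The ICMP responses described in sections 5.1 through 5.1.2 amount to a small decision table. The C sketch below is only an illustration: the enum, the struct, and the function name are hypothetical, and the mapping of code 9 to a gateway and code 10 to a host is an assumption based on the code names, since the text says only that the choice depends on the role of the originator.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of an input-processing failure. */
enum cipso_error {
    CIPSO_ERR_UNRECOGNIZED, /* unknown DOI, tag type, or other field */
    CIPSO_ERR_OUT_OF_RANGE, /* valid label outside the host or port range */
    CIPSO_ERR_MISSING       /* CIPSO option required but not present */
};

/* Hypothetical description of the ICMP message to generate, if any. */
struct icmp_reply {
    bool    send;    /* false: discard the datagram silently */
    uint8_t type;
    uint8_t code;
    uint8_t pointer; /* only meaningful for "parameter problem" (type 12) */
};

struct icmp_reply cipso_icmp_reply(enum cipso_error err, bool is_gateway,
                                   bool trigger_is_icmp, uint8_t bad_offset)
{
    struct icmp_reply r = { false, 0, 0, 0 };

    /* An error triggered by an ICMP message never produces a response. */
    if (trigger_is_icmp)
        return r;

    r.send = true;
    switch (err) {
    case CIPSO_ERR_UNRECOGNIZED:
        r.type = 12;            /* parameter problem */
        r.code = 0;             /* bad parameter */
        r.pointer = bad_offset; /* start of the unrecognized field */
        break;
    case CIPSO_ERR_OUT_OF_RANGE:
        r.type = 3;             /* destination unreachable */
        /* Assumption: gateways report code 9 (network), hosts code 10 (host). */
        r.code = is_gateway ? 9 : 10;
        break;
    case CIPSO_ERR_MISSING:
        r.type = 12;            /* parameter problem */
        r.code = 1;             /* option missing */
        r.pointer = 134;        /* option type of the missing CIPSO option */
        break;
    }
    return r;
}
```

Whichever code a sender picks for the out-of-range case, a receiver has to accept both 9 and 10, as the text requires.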
+ + +5.2 Output Procedures + +A CIPSO option MUST appear only once in a datagram. Only one tag type +from the MAC Sensitivity class MAY be included in a CIPSO option. Given +the current set of defined tag types, this means that CIPSO labels at +first will contain only one tag. + +All datagrams leaving a CIPSO system MUST meet the following condition: + + PORT_LABEL_MIN <= CIPSO label <= PORT_LABEL_MAX + +If this condition is not satisfied the datagram MUST be discarded. +If the CIPSO system only supports one port, the HOST_LABEL_MIN and the +HOST_LABEL_MAX parameters MAY be substituted for the PORT parameters in +the above condition. + +The DOI identifier to be used for all outgoing datagrams is configured by + + + +Internet Draft, Expires 15 Jan 93 [PAGE 10] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +the administrator. If port level DOI identifier assignment is used, then +the PORT_DOI configuration parameter MUST contain the DOI identifier to +use. If network level DOI assignment is used, then the NET_DOI parameter +MUST contain the DOI identifier to use. And if host level DOI assignment +is employed, then the HOST_DOI parameter MUST contain the DOI identifier +to use. A CIPSO implementation need only support one level of DOI +assignment. + + +5.3 DOI Processing Requirements + +A CIPSO implementation MUST support at least one DOI and SHOULD support +multiple DOIs. System and network administrators are cautioned to +ensure that at least one DOI is common within an IP network to allow for +broadcasting of IP datagrams. + +CIPSO gateways MUST be capable of translating a CIPSO option from one +DOI to another when forwarding datagrams between networks. For +efficiency purposes this capability is only a desired feature for CIPSO +routers. + + +5.4 Label of ICMP Messages + +The CIPSO label to be used on all outgoing ICMP messages MUST be equivalent +to the label of the datagram that caused the ICMP message. 
If the ICMP was +generated due to a problem associated with the original CIPSO label then the +following responses are allowed: + + a. Use the CIPSO label of the original IP datagram + b. Drop the original datagram with no return message generated + +In most cases these options will have the same effect. If you can not +interpret the label or if it is outside the label range of your host or +interface then an ICMP message with the same label will probably not be +able to exit the system. + + +6. Assignment of DOI Identifier Numbers = + +Requests for assignment of a DOI identifier number should be addressed to +the Internet Assigned Numbers Authority (IANA). + + +7. Acknowledgements + +Much of the material in this RFC is based on (and copied from) work +done by Gary Winiger of Sun Microsystems and published as Commercial +IP Security Option at the INTEROP 89, Commercial IPSO Workshop. + + +8. Author's Address + +To submit mail for distribution to members of the IETF CIPSO Working +Group, send mail to: cipso@wdl1.wdl.loral.com. + + + +Internet Draft, Expires 15 Jan 93 [PAGE 11] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + + +To be added to or deleted from this distribution, send mail to: +cipso-request@wdl1.wdl.loral.com. + + +9. References + +RFC 1038, "Draft Revised IP Security Option", M. St. Johns, IETF, January +1988. + +RFC 1108, "U.S. Department of Defense Security Options +for the Internet Protocol", Stephen Kent, IAB, 1 March, 1991. 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Internet Draft, Expires 15 Jan 93 [PAGE 12] + + + diff --git a/Documentation/netlabel/introduction.txt b/Documentation/netlabel/introduction.txt new file mode 100644 index 000000000000..a4ffba1694c8 --- /dev/null +++ b/Documentation/netlabel/introduction.txt @@ -0,0 +1,46 @@ +NetLabel Introduction +============================================================================== +Paul Moore, paul.moore@hp.com + +August 2, 2006 + + * Overview + +NetLabel is a mechanism which can be used by kernel security modules to attach +security attributes to outgoing network packets generated from user space +applications and read security attributes from incoming network packets. It +is composed of three main components: the protocol engines, the communication +layer, and the kernel security module API. + + * Protocol Engines + +The protocol engines are responsible for both applying and retrieving the +network packet's security attributes. If any translation between the network +security attributes and those on the host is required then the protocol +engine will handle those tasks as well. Other kernel subsystems should +refrain from calling the protocol engines directly; instead they should use +the NetLabel kernel security module API described below. + +Detailed information about each NetLabel protocol engine can be found in this +directory; consult '00-INDEX' for filenames. + + * Communication Layer + +The communication layer exists to allow NetLabel configuration and monitoring +from user space. The NetLabel communication layer uses a message based +protocol built on top of the Generic NETLINK transport mechanism. The exact +formatting of these NetLabel messages as well as the Generic NETLINK family +names can be found in the 'net/netlabel/' directory as comments in the +header files as well as in 'include/net/netlabel.h'.
+ + * Security Module API + +The purpose of the NetLabel security module API is to provide a protocol +independent interface to the underlying NetLabel protocol engines. In addition +to protocol independence, the security module API is designed to be completely +LSM independent which should allow multiple LSMs to leverage the same code +base. + +Detailed information about the NetLabel security module API can be found in the +'include/net/netlabel.h' header file as well as the 'lsm_interface.txt' file +found in this directory. diff --git a/Documentation/netlabel/lsm_interface.txt b/Documentation/netlabel/lsm_interface.txt new file mode 100644 index 000000000000..98dd9f7430f2 --- /dev/null +++ b/Documentation/netlabel/lsm_interface.txt @@ -0,0 +1,47 @@ +NetLabel Linux Security Module Interface +============================================================================== +Paul Moore, paul.moore@hp.com + +May 17, 2006 + + * Overview + +NetLabel is a mechanism which can set and retrieve security attributes from +network packets. It is intended to be used by LSM developers who want to make +use of a common code base for several different packet labeling protocols. +The NetLabel security module API is defined in 'include/net/netlabel.h' but a +brief overview is given below. + + * NetLabel Security Attributes + +Since NetLabel supports multiple different packet labeling protocols and LSMs +it uses the concept of security attributes to refer to the packet's security +labels. The NetLabel security attributes are defined by the +'netlbl_lsm_secattr' structure in the NetLabel header file. Internally the +NetLabel subsystem converts the security attributes to and from the correct +low-level packet label depending on the NetLabel build time and run time +configuration. It is up to the LSM developer to translate the NetLabel +security attributes into whatever security identifiers are in use for their +particular LSM. 
+ + * NetLabel LSM Protocol Operations + +These are the functions which allow the LSM developer to manipulate the labels +on outgoing packets as well as read the labels on incoming packets. Functions +exist to operate on sockets as well as on the sk_buffs directly. These high +level functions are translated into low level protocol operations based on how +the administrator has configured the NetLabel subsystem. + + * NetLabel Label Mapping Cache Operations + +Depending on the exact configuration, translation between the network packet +label and the internal LSM security identifier can be time consuming. The +NetLabel label mapping cache is a caching mechanism which can be used to +sidestep much of this overhead once a mapping has been established. Once the +LSM has received a packet, used NetLabel to decode its security attributes, +and translated the security attributes into an LSM internal identifier the LSM +can use the NetLabel caching functions to associate the LSM internal +identifier with the network packet's label. This means that in the future +when an incoming packet matches a cached value not only are the internal +NetLabel translation mechanisms bypassed but the LSM translation mechanisms are +bypassed as well which should result in a significant reduction in overhead. diff --git a/Documentation/networking/LICENSE.qla3xxx b/Documentation/networking/LICENSE.qla3xxx new file mode 100644 index 000000000000..2f2077e34d81 --- /dev/null +++ b/Documentation/networking/LICENSE.qla3xxx @@ -0,0 +1,46 @@ +Copyright (c) 2003-2006 QLogic Corporation +QLogic Linux Networking HBA Driver + +This program includes a device driver for Linux 2.6 that may be +distributed with QLogic hardware specific firmware binary file. +You may modify and redistribute the device driver code under the +GNU General Public License as published by the Free Software +Foundation (version 2 or a later version).
+ +You may redistribute the hardware specific firmware binary file +under the following terms: + + 1. Redistribution of source code (only if applicable), + must retain the above copyright notice, this list of + conditions and the following disclaimer. + + 2. Redistribution in binary form must reproduce the above + copyright notice, this list of conditions and the + following disclaimer in the documentation and/or other + materials provided with the distribution. + + 3. The name of QLogic Corporation may not be used to + endorse or promote products derived from this software + without specific prior written permission + +REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE, +THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY +EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A +PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR +BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED +TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON +ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT +CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR +OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT, +TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN +ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN +COMBINATION WITH THIS PROGRAM. 
+ diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index afac780445cd..dc942eaf490f 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -192,6 +192,17 @@ or, for backwards compatibility, the option value. E.g., arp_interval Specifies the ARP link monitoring frequency in milliseconds. + + The ARP monitor works by periodically checking the slave + devices to determine whether they have sent or received + traffic recently (the precise criteria depends upon the + bonding mode, and the state of the slave). Regular traffic is + generated via ARP probes issued for the addresses specified by + the arp_ip_target option. + + This behavior can be modified by the arp_validate option, + below. + If ARP monitoring is used in an etherchannel compatible mode (modes 0 and 2), the switch should be configured in a mode that evenly distributes packets across all links. If the @@ -213,6 +224,54 @@ arp_ip_target maximum number of targets that can be specified is 16. The default value is no IP addresses. +arp_validate + + Specifies whether or not ARP probes and replies should be + validated in the active-backup mode. This causes the ARP + monitor to examine the incoming ARP requests and replies, and + only consider a slave to be up if it is receiving the + appropriate ARP traffic. + + Possible values are: + + none or 0 + + No validation is performed. This is the default. + + active or 1 + + Validation is performed only for the active slave. + + backup or 2 + + Validation is performed only for backup slaves. + + all or 3 + + Validation is performed for all slaves. + + For the active slave, the validation checks ARP replies to + confirm that they were generated by an arp_ip_target. Since + backup slaves do not typically receive these replies, the + validation performed for backup slaves is on the ARP request + sent out via the active slave. 
It is possible that some + switch or network configurations may result in situations + wherein the backup slaves do not receive the ARP requests; in + such a situation, validation of backup slaves must be + disabled. + + This option is useful in network configurations in which + multiple bonding hosts are concurrently issuing ARPs to one or + more targets beyond a common switch. Should the link between + the switch and target fail (but not the switch itself), the + probe traffic generated by the multiple bonding instances will + fool the standard ARP monitor into considering the links as + still up. Use of the arp_validate option can resolve this, as + the ARP monitor will only consider ARP requests and replies + associated with its own instance of bonding. + + This option was added in bonding version 3.1.0. + downdelay Specifies the time, in milliseconds, to wait before disabling diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index c45daabd3bfe..74563b38ffd9 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -1,7 +1,6 @@ DCCP protocol ============ -Last updated: 10 November 2005 Contents ======== @@ -42,8 +41,11 @@ Socket options DCCP_SOCKOPT_PACKET_SIZE is used for CCID3 to set default packet size for calculations. -DCCP_SOCKOPT_SERVICE sets the service. This is compulsory as per the -specification. If you don't set it you will get EPROTO. +DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of +service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, +the socket will fall back to 0 (which means that no meaningful service code +is present). Connecting sockets set at most one service option; for +listening sockets, multiple service codes can be specified. 
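Setting the service code described in the DCCP_SOCKOPT_SERVICE paragraph might look like the following from user space. This is a minimal sketch, not taken from the patch: the helper name is hypothetical, the fallback constant values are assumptions used only when the system headers do not define them, and it assumes the service code is passed to the kernel in network byte order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>

/* Fallback values, assumed to match the Linux headers; used only if undefined. */
#ifndef SOL_DCCP
#define SOL_DCCP 269
#endif
#ifndef DCCP_SOCKOPT_SERVICE
#define DCCP_SOCKOPT_SERVICE 2
#endif

/*
 * Hypothetical helper: set a single service code on a DCCP socket.
 * Returns the setsockopt() result (0 on success, -1 with errno set).
 */
static int dccp_set_service(int fd, uint32_t service_code)
{
    uint32_t wire = htonl(service_code);  /* assumed: network byte order */

    return setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_SERVICE,
                      &wire, sizeof(wire));
}
```

A connecting socket would call this once before connect(); if it is never called, the text above says the socket falls back to service code 0.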
 Notes ===== diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 90ed78110fd4..935e298f674a 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -375,6 +375,41 @@ tcp_slow_start_after_idle - BOOLEAN be timed out after an idle period. Default: 1 +CIPSOv4 Variables: + +cipso_cache_enable - BOOLEAN + If set, enable additions to and lookups from the CIPSO label mapping + cache. If unset, additions are ignored and lookups always result in a + miss. However, regardless of the setting the cache is still + invalidated when required, which means you can safely toggle this on and + off and the cache will always be "safe". + Default: 1 + +cipso_cache_bucket_size - INTEGER + The CIPSO label cache consists of a fixed size hash table with each + hash bucket containing a number of cache entries. This variable limits + the number of entries in each hash bucket; the larger the value the + more CIPSO label mappings that can be cached. When the number of + entries in a given hash bucket reaches this limit adding new entries + causes the oldest entry in the bucket to be removed to make room. + Default: 10 + +cipso_rbm_optfmt - BOOLEAN + Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of + the CIPSO draft specification (see Documentation/netlabel for details). + This means that when set the CIPSO tag will be padded with empty + categories in order to make the packet data 32-bit aligned. + Default: 0 + +cipso_rbm_structvalid - BOOLEAN + If set, do a very strict check of the CIPSO option when + ip_options_compile() is called. If unset, relax the checks done during + ip_options_compile(). Either way is "safe" as errors are caught + elsewhere in the CIPSO processing code but setting this to 0 (False) should + result in less work (i.e. it should be faster) but could cause problems + with other implementations that require strict checking.
+ Default: 0 + IP Variables: ip_local_port_range - 2 INTEGERS @@ -730,6 +765,9 @@ conf/all/forwarding - BOOLEAN This referred to as global forwarding. +proxy_ndp - BOOLEAN + Do proxy ndp. + conf/interface/*: Change special settings per interface. diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index 44f2f769e865..18d385c068fc 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -100,6 +100,7 @@ Examples: are: IPSRC_RND #IP Source is random (between min/max), IPDST_RND, UDPSRC_RND, UDPDST_RND, MACSRC_RND, MACDST_RND + MPLS_RND, VID_RND, SVID_RND pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then cycle through the port range. @@ -125,6 +126,21 @@ Examples: pgset "mpls 0" turn off mpls (or any invalid argument works too!) + pgset "vlan_id 77" set VLAN ID 0-4095 + pgset "vlan_p 3" set priority bit 0-7 (default 0) + pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "svlan_id 22" set SVLAN ID 0-4095 + pgset "svlan_p 3" set priority bit 0-7 (default 0) + pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "vlan_id 9999" > 4095 remove vlan and svlan tags + pgset "svlan 9999" > 4095 remove svlan tag + + + pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00) + pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00) + pgset stop aborts injection. Also, ^C aborts generator. diff --git a/Documentation/networking/secid.txt b/Documentation/networking/secid.txt new file mode 100644 index 000000000000..95ea06784333 --- /dev/null +++ b/Documentation/networking/secid.txt @@ -0,0 +1,14 @@ +flowi structure: + +The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate +the label of the flow. This label of the flow is currently used in selecting +matching labeled xfrm(s). 
+ +If this is an outbound flow, the label is derived from the socket, if any, or +the incoming packet this flow is being generated as a response to (e.g. tcp +resets, timewait ack, etc.). It is also conceivable that the label could be +derived from other sources such as process context, device, etc., in special +cases, as may be appropriate. + +If this is an inbound flow, the label is derived from the IPSec security +associations, if any, used by the packet. diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index b88ebe4d808c..7714f57caad5 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt @@ -116,6 +116,9 @@ FURTHER NOTES ON NO-MMU MMAP (*) A list of all the mappings on the system is visible through /proc/maps in no-MMU mode. + (*) A list of all the mappings in use by a process is visible through + /proc/<pid>/maps in no-MMU mode. + (*) Supplying MAP_FIXED or a requesting a particular mapping address will result in an error. @@ -125,6 +128,49 @@ FURTHER NOTES ON NO-MMU MMAP error will result if they don't. This is most likely to be encountered with character device files, pipes, fifos and sockets. + +========================== +INTERPROCESS SHARED MEMORY +========================== + +Both SYSV IPC SHM shared memory and POSIX shared memory are supported in NOMMU +mode. The former is provided through the usual mechanism, the latter through files created +on ramfs or tmpfs mounts. + + +======= +FUTEXES +======= + +Futexes are supported in NOMMU mode if the arch supports them. An error will +be given if an address passed to the futex system call lies outside the +mappings made by a process or if the mapping in which the address lies does not +support futexes (such as an I/O chardev mapping). + + +============= +NO-MMU MREMAP +============= + +The mremap() function is partially supported.
It may change the size of a +mapping, and may move it[*] if MREMAP_MAYMOVE is specified and if the new size +of the mapping exceeds the size of the slab object currently occupied by the +memory to which the mapping refers, or if a smaller slab object could be used. + +MREMAP_FIXED is not supported, though it is ignored if there's no change of +address and the object does not need to be moved. + +Shared mappings may not be moved. Shareable mappings may not be moved either, +even if they are not currently shared. + +The mremap() function must be given an exact match for base address and size of +a previously mapped object. It may not be used to create holes in existing +mappings, move parts of existing mappings or resize parts of mappings. It must +act on a complete mapping. + +[*] Not currently supported. + + ============================================ PROVIDING SHAREABLE CHARACTER DEVICE SUPPORT ============================================ diff --git a/Documentation/pcieaer-howto.txt b/Documentation/pcieaer-howto.txt new file mode 100644 index 000000000000..16c251230c82 --- /dev/null +++ b/Documentation/pcieaer-howto.txt @@ -0,0 +1,253 @@ + The PCI Express Advanced Error Reporting Driver Guide HOWTO + T. Long Nguyen <tom.l.nguyen@intel.com> + Yanmin Zhang <yanmin.zhang@intel.com> + 07/29/2006 + + +1. Overview + +1.1 About this guide + +This guide describes the basics of the PCI Express Advanced Error +Reporting (AER) driver and provides information on how to use it, as +well as how to enable the drivers of endpoint devices to conform with +PCI Express AER driver. + +1.2 Copyright © Intel Corporation 2006. + +1.3 What is the PCI Express AER Driver? + +PCI Express error signaling can occur on the PCI Express link itself +or on behalf of transactions initiated on the link. PCI Express +defines two error reporting paradigms: the baseline capability and +the Advanced Error Reporting capability. 
The baseline capability is +required of all PCI Express components providing a minimum defined +set of error reporting requirements. Advanced Error Reporting +capability is implemented with a PCI Express advanced error reporting +extended capability structure providing more robust error reporting. + +The PCI Express AER driver provides the infrastructure to support PCI +Express Advanced Error Reporting capability. The PCI Express AER +driver provides three basic functions: + +- Gathers comprehensive error information when errors occur. +- Reports errors to the users. +- Performs error recovery actions. + +The AER driver only attaches to root ports which support the PCI-Express AER +capability. + + +2. User Guide + +2.1 Include the PCI Express AER Root Driver into the Linux Kernel + +The PCI Express AER Root driver is a Root Port service driver attached +to the PCI Express Port Bus driver. If a user wants to use it, the driver +has to be compiled. Option CONFIG_PCIEAER supports this capability. It +depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and +CONFIG_PCIEAER=y. + +2.2 Load PCI Express AER Root Driver +Some systems have AER support in the BIOS. Enabling the AER +Root driver while the BIOS also handles AER may result in unpredictable +behavior. To avoid this conflict, a successful load of the AER Root driver +requires ACPI _OSC support in the BIOS to allow the AER Root driver to +request native control of AER. See the PCI FW 3.0 Specification for +details regarding _OSC usage. Currently, much firmware does not provide +_OSC support even though it uses PCI Express. To support such firmware, +forceload, a parameter of type bool, lets AER initialization continue +even though the firmware has no _OSC support. To enable the +workaround, please add aerdriver.forceload=y to the kernel boot parameter line +when booting the kernel. Note that forceload=n by default.
+ +2.3 AER error output +When a PCI-E AER error is captured, an error message will be output to the +console. If it's a correctable error, it is output as a warning. +Otherwise, it is printed as an error. Users can therefore choose a different +log level to filter out correctable error messages. + +An example is shown below. ++------ PCI-Express Device Error -----+ +Error Severity : Uncorrected (Fatal) +PCIE Bus Error type : Transaction Layer +Unsupported Request : First +Requester ID : 0500 +VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h +TLB Header: +04000001 00200a03 05010000 00050100 + +In the example, 'Requester ID' means the ID of the device that sends +the error message to the root port. Please refer to the PCI Express specs for +other fields. + + +3. Developer Guide + +Enabling AER-aware support requires a software driver to configure +the AER capability structure within its device and to provide callbacks. + +To support AER well, developers first need to understand how AER +works. + +PCI Express errors are classified into two types: correctable errors +and uncorrectable errors. This classification is based on the impact +of those errors, which may result in degraded performance or function +failure. + +Correctable errors have no impact on the functionality of the +interface. The PCI Express protocol can recover without any software +intervention or any loss of data. These errors are detected and +corrected by hardware. Unlike correctable errors, uncorrectable +errors impact functionality of the interface. Uncorrectable errors +can cause a particular transaction or a particular PCI Express link +to be unreliable. Depending on those error conditions, uncorrectable +errors are further classified into non-fatal errors and fatal errors. +Non-fatal errors cause the particular transaction to be unreliable, +but the PCI Express link itself is fully functional. Fatal errors, on +the other hand, cause the link to be unreliable.
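The three-way classification in the Developer Guide text above (correctable, non-fatal, fatal) can be captured in a small C model. This is only a sketch for illustration: the enum and function names are hypothetical and do not correspond to identifiers in the AER driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; they only mirror the classification in the text. */
enum aer_sketch_severity {
    AER_SKETCH_CORRECTABLE,   /* corrected by hardware, no data loss */
    AER_SKETCH_NONFATAL,      /* transaction unreliable, link still fine */
    AER_SKETCH_FATAL          /* link itself unreliable */
};

/* Only correctable errors leave the affected transaction trustworthy. */
static bool aer_sketch_transaction_reliable(enum aer_sketch_severity sev)
{
    assert(sev >= AER_SKETCH_CORRECTABLE && sev <= AER_SKETCH_FATAL);
    return sev == AER_SKETCH_CORRECTABLE;
}

/* The link stays usable for everything except fatal errors. */
static bool aer_sketch_link_reliable(enum aer_sketch_severity sev)
{
    assert(sev >= AER_SKETCH_CORRECTABLE && sev <= AER_SKETCH_FATAL);
    return sev != AER_SKETCH_FATAL;
}
```

Keeping the two predicates separate reflects the distinction the text draws between an unreliable transaction and an unreliable link.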
+ +When AER is enabled, a PCI Express device will automatically send an +error message to the PCIE root port above it when the device captures +an error. The Root Port, upon receiving an error reporting message, +internally processes and logs the error message in its PCI Express +capability structure. Logging the error includes storing +the error reporting agent's requestor ID into the Error Source +Identification Registers and setting the error bits of the Root Error +Status Register accordingly. If AER error reporting is enabled in the Root +Error Command Register, the Root Port generates an interrupt if an +error is detected. + +Note that the errors as described above are related to the PCI Express +hierarchy and links. These errors do not include any device specific +errors because device specific errors will still get sent directly to +the device driver. + +3.1 Configure the AER capability structure + +AER-aware drivers of PCI Express components need to change the device +control registers to enable AER. They can also change AER registers, +including mask and severity registers. The helper function +pci_enable_pcie_error_reporting can be used to enable AER. See +section 3.3. + +3.2. Provide callbacks + +3.2.1 callback reset_link to reset pci express link + +This callback is used to reset the pci express physical link when a +fatal error happens. The root port aer service driver provides a +default reset_link function, but different upstream ports might +have different specifications to reset pci express link, so all +upstream ports should provide their own reset_link functions. + +In struct pcie_port_service_driver, a new pointer, reset_link, is +added. + +pci_ers_result_t (*reset_link) (struct pci_dev *dev); + +Section 3.2.2.2 provides more detailed info on when to call +reset_link.
+ +3.2.2 PCI error-recovery callbacks + +The PCI Express AER Root driver uses error callbacks to coordinate +with downstream device drivers associated with a hierarchy in question +when performing error recovery actions. + +Data struct pci_driver has a pointer, err_handler, to point to +pci_error_handlers, which consists of a couple of callback function +pointers. The AER driver follows the rules defined in +pci-error-recovery.txt except pci express specific parts (e.g. +reset_link). Please refer to pci-error-recovery.txt for detailed +definitions of the callbacks. + +The sections below specify when to call the error callback functions. + +3.2.2.1 Correctable errors + +Correctable errors have no impact on the functionality of +the interface. The PCI Express protocol can recover without any +software intervention or any loss of data. These errors do not +require any recovery actions. The AER driver clears the device's +correctable error status register accordingly and logs these errors. + +3.2.2.2 Non-correctable (non-fatal and fatal) errors + +If an error message indicates a non-fatal error, performing link reset +at upstream is not required. The AER driver calls error_detected(dev, +pci_channel_io_normal) to all drivers associated within a hierarchy in +question. For example, +EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. +If Upstream port A captures an AER error, the hierarchy consists of +Downstream port B and EndPoint. + +A driver may return PCI_ERS_RESULT_CAN_RECOVER, +PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on +whether it can recover or the AER driver calls mmio_enabled next. + +If an error message indicates a fatal error, the kernel will broadcast +error_detected(dev, pci_channel_io_frozen) to all drivers within +a hierarchy in question. Then, performing link reset at upstream is +necessary.
As different kinds of devices might use different approaches +to reset link, the AER port service driver is required to provide the +function to reset link. First, the kernel checks whether the upstream +component has an aer driver. If it has, the kernel uses the reset_link +callback of the aer driver. If the upstream component has no aer driver +and the port is a downstream port, we will use the aer driver of the +root port that reports the AER error. As for upstream ports, +they should provide their own aer service drivers with a reset_link +function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and +reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes +to mmio_enabled. + +3.3 helper functions + +3.3.1 int pci_find_aer_capability(struct pci_dev *dev); +pci_find_aer_capability locates the PCI Express AER capability +in the device configuration space. If the device doesn't support +PCI-Express AER, the function returns 0. + +3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); +pci_enable_pcie_error_reporting enables the device to send error +messages to the root port when an error is detected. Note that devices +don't enable the error reporting by default, so device drivers need to +call this function to enable it. + +3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); +pci_disable_pcie_error_reporting stops the device from sending error +messages to the root port when an error is detected. + +3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); +pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable +error status register. + +3.4 Frequently Asked Questions + +Q: What happens if a PCI Express device driver does not provide an +error recovery handler (pci_driver->err_handler is equal to NULL)? + +A: The devices bound to the driver won't be recovered. If the +error is fatal, the kernel will print out warning messages. Please refer +to section 3 for more information.
+ +Q: What happens if an upstream port service driver does not provide +the callback reset_link? + +A: Fatal error recovery will fail if the errors are reported by the +upstream ports that the service driver is attached to. + +Q: How does this infrastructure deal with a driver that is not PCI +Express aware? + +A: This infrastructure calls the error callback functions of the +driver when an error happens. But if the driver is not aware of +PCI Express, the device might not report its own errors to the root +port. + +Q: What modifications will that driver need to make it compatible +with the PCI Express AER Root driver? + +A: It could call the helper functions to enable AER in devices and +clean up the uncorrectable status register. Please refer to section 3.3. + diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt index fba1e05c47c7..d0e79d5820a5 100644 --- a/Documentation/power/devices.txt +++ b/Documentation/power/devices.txt @@ -1,208 +1,553 @@ +Most of the code in Linux is device drivers, so most of the Linux power +management code is also driver-specific. Most drivers will do very little; +others, especially for platforms with small batteries (like cell phones), +will do a lot. + +This writeup gives an overview of how drivers interact with system-wide +power management goals, emphasizing the models and interfaces that are +shared by everything that hooks up to the driver model core. Read it as +background for the domain-specific work you'd do with any specific driver. + + +Two Models for Device Power Management +====================================== +Drivers will use one or both of these models to put devices into low-power +states: + + System Sleep model: + Drivers can enter low power states as part of entering system-wide + low-power states like "suspend-to-ram", or (mostly for systems with + disks) "hibernate" (suspend-to-disk).
+ + This is something that device, bus, and class drivers collaborate on + by implementing various role-specific suspend and resume methods to + cleanly power down hardware and software subsystems, then reactivate + them without loss of data. + + Some drivers can manage hardware wakeup events, which make the system + leave that low-power state. This feature may be disabled using the + relevant /sys/devices/.../power/wakeup file; enabling it may cost some + power usage, but let the whole system enter low power states more often. + + Runtime Power Management model: + Drivers may also enter low power states while the system is running, + independently of other power management activity. Upstream drivers + will normally not know (or care) if the device is in some low power + state when issuing requests; the driver will auto-resume anything + that's needed when it gets a request. + + This doesn't have, or need much infrastructure; it's just something you + should do when writing your drivers. For example, clk_disable() unused + clocks as part of minimizing power drain for currently-unused hardware. + Of course, sometimes clusters of drivers will collaborate with each + other, which could involve task-specific power management. + +There's not a lot to be said about those low power states except that they +are very system-specific, and often device-specific. Also, that if enough +drivers put themselves into low power states (at "runtime"), the effect may be +the same as entering some system-wide low-power state (system sleep) ... and +that synergies exist, so that several drivers using runtime pm might put the +system into a state where even deeper power saving options are available. + +Most suspended devices will have quiesced all I/O: no more DMA or irqs, no +more data read or written, and requests from upstream drivers are no longer +accepted. A given bus or platform may have different requirements though. 
+ +Examples of hardware wakeup events include an alarm from a real time clock, +network wake-on-LAN packets, keyboard or mouse activity, and media insertion +or removal (for PCMCIA, MMC/SD, USB, and so on). + + +Interfaces for Entering System Sleep States +=========================================== +Most of the programming interfaces a device driver needs to know about +relate to that first model: entering a system-wide low power state, +rather than just minimizing power consumption by one device. + + +Bus Driver Methods +------------------ +The core methods to suspend and resume devices reside in struct bus_type. +These are mostly of interest to people writing infrastructure for busses +like PCI or USB, or because they define the primitives that device drivers +may need to apply in domain-specific ways to their devices: -Device Power Management +struct bus_type { + ... + int (*suspend)(struct device *dev, pm_message_t state); + int (*suspend_late)(struct device *dev, pm_message_t state); + int (*resume_early)(struct device *dev); + int (*resume)(struct device *dev); +}; -Device power management encompasses two areas - the ability to save -state and transition a device to a low-power state when the system is -entering a low-power state; and the ability to transition a device to -a low-power state while the system is running (and independently of -any other power management activity). +Bus drivers implement those methods as appropriate for the hardware and +the drivers using it; PCI works differently from USB, and so on. Not many +people write bus drivers; most driver code is a "device driver" that +builds on top of bus-specific framework code. + +For more information on these driver calls, see the description later; +they are called in phases for every device, respecting the parent-child +sequencing in the driver model tree. 
Note that as this is being written, +only the suspend() and resume() methods are widely available; not many bus drivers +leverage all of those phases, or pass them down to lower driver levels. + + +/sys/devices/.../power/wakeup files +----------------------------------- +All devices in the driver model have two flags to control handling of +wakeup events, which are hardware signals that can force the device and/or +system out of a low power state. These are initialized by bus or device +driver code using device_init_wakeup(dev,can_wakeup). + +The "can_wakeup" flag just records whether the device (and its driver) can +physically support wakeup events. When that flag is clear, the sysfs +"wakeup" file is empty, and device_may_wakeup() returns false. + +For devices that can issue wakeup events, a separate flag controls whether +that device should try to use its wakeup mechanism. The initial value of +device_may_wakeup() will be true, so that the device's "wakeup" file holds +the value "enabled". Userspace can change that to "disabled" so that +device_may_wakeup() returns false; or change it back to "enabled" (so that +it returns true again). + + +EXAMPLE: PCI Device Driver Methods +----------------------------------- +PCI framework software calls these methods when the PCI device driver bound +to a device has provided them: + +struct pci_driver { + ... + int (*suspend)(struct pci_dev *pdev, pm_message_t state); + int (*suspend_late)(struct pci_dev *pdev, pm_message_t state); + + int (*resume_early)(struct pci_dev *pdev); + int (*resume)(struct pci_dev *pdev); +}; +Drivers will implement those methods, and call PCI-specific procedures +like pci_set_power_state(), pci_enable_wake(), pci_save_state(), and +pci_restore_state() to manage PCI-specific mechanisms. (PCI config space +could be saved during driver probe, if it weren't for the fact that some +systems rely on userspace tweaking using setpci.)
Devices are suspended +before their bridges enter low power states, and likewise bridges resume +before their devices. + + +Upper Layers of Driver Stacks +----------------------------- +Device drivers generally have at least two interfaces, and the methods +sketched above are the ones which apply to the lower level (nearer PCI, USB, +or other bus hardware). The network and block layers are examples of upper +level interfaces, as is a character device talking to userspace. + +Power management requests normally need to flow through those upper levels, +which often use domain-oriented requests like "blank that screen". In +some cases those upper levels will have power management intelligence that +relates to end-user activity, or other devices that work in cooperation. + +When those interfaces are structured using class interfaces, there is a +standard way to have the upper layer stop issuing requests to a given +class device (and restart later): + +struct class { + ... + int (*suspend)(struct device *dev, pm_message_t state); + int (*resume)(struct device *dev); +}; -Methods +Those calls are issued in specific phases of the process by which the +system enters a low power "suspend" state, or resumes from it. + + +Calling Drivers to Enter System Sleep States +============================================ +When the system enters a low power state, each device's driver is asked +to suspend the device by putting it into a state compatible with the target +system state. That's usually some version of "off", but the details are +system-specific. Also, wakeup-enabled devices will usually stay partly +functional in order to wake the system. + +When the system leaves that low power state, the device's driver is asked +to resume it. The suspend and resume operations always go together, and +both are multi-phase operations. + +For simple drivers, suspend might quiesce the device using the class code +and then turn its hardware as "off" as possible with suspend_late().
The +matching resume calls would then completely reinitialize the hardware +before reactivating its class I/O queues. + +More power-aware drivers will use more than one device low power +state, either at runtime or during system sleep states, and might trigger +system wakeup events. + + +Call Sequence Guarantees +------------------------ +To ensure that bridges and similar links needed to talk to a device are +available when the device is suspended or resumed, the device tree is +walked in a bottom-up order to suspend devices. A top-down order is +used to resume those devices. + +The ordering of the device tree is defined by the order in which devices +get registered: a child can never be registered, probed or resumed before +its parent; and can't be removed or suspended after that parent. + +The policy is that the device tree should match hardware bus topology. +(Or at least the control bus, for devices which use multiple busses.) + + +Suspending Devices +------------------ +Suspending a given device is done in several phases. Suspending the +system always includes every phase, executing calls for every device +before the next phase begins. Not all busses or classes support all +these callbacks; and not all drivers use all the callbacks. + +The phases are seen by driver notifications issued in this order: + + 1 class.suspend(dev, message) is called after tasks are frozen, for + devices associated with a class that has such a method. This + method may sleep. + + Since I/O activity usually comes from such higher layers, this is + a good place to quiesce all drivers of a given type (and keep such + code out of those drivers). + + 2 bus.suspend(dev, message) is called next. This method may sleep, + and is often morphed into a device driver call with bus-specific + parameters and/or rules. + + This call should handle parts of device suspend logic that require + sleeping.
It probably does work to quiesce the device which hasn't + been abstracted into class.suspend() or bus.suspend_late(). + + 3 bus.suspend_late(dev, message) is called with IRQs disabled, and + with only one CPU active. Until the bus.resume_early() phase + completes (see later), IRQs are not enabled again. This method + won't be exposed by all busses; for message based busses like USB, + I2C, or SPI, device interactions normally require IRQs. This bus + call may be morphed into a driver call with bus-specific parameters. + + This call might save low level hardware state that might otherwise + be lost in the upcoming low power state, and actually put the + device into a low power state ... so that in some cases the device + may stay partly usable until this late. This "late" call may also + help when coping with hardware that behaves badly. + +The pm_message_t parameter is currently used to refine those semantics +(described later). + +At the end of those phases, drivers should normally have stopped all I/O +transactions (DMA, IRQs), saved enough state that they can re-initialize +or restore previous state (as needed by the hardware), and placed the +device into a low-power state. On many platforms they will also use +clk_disable() to gate off one or more clock sources; sometimes they will +also switch off power supplies, or reduce voltages. Drivers which have +runtime PM support may already have performed some or all of the steps +needed to prepare for the upcoming system sleep state. + +When any driver sees that its device_can_wakeup(dev), it should make sure +to use the relevant hardware signals to trigger a system wakeup event. +For example, enable_irq_wake() might identify GPIO signals hooked up to +a switch or other external hardware, and pci_enable_wake() does something +similar for PCI's PME# signal. 
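As a minimal sketch of how the two wakeup flags described earlier interact,
the userspace C model below mimics the documented behavior of
device_init_wakeup() and device_may_wakeup(). Everything with a demo_ prefix,
including the struct layout and the error returns, is an assumption made for
illustration; it is not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-device wakeup flags. */
struct demo_dev {
    unsigned can_wakeup:1;     /* device and driver can signal wakeup */
    unsigned should_wakeup:1;  /* policy: wakeup should actually be used */
};

/* Models device_init_wakeup(dev, can_wakeup) as described in the text:
 * a wakeup-capable device starts out with wakeup enabled. */
void demo_init_wakeup(struct demo_dev *d, int can)
{
    assert(d != NULL);
    d->can_wakeup = !!can;
    d->should_wakeup = !!can;
}

/* Models device_may_wakeup(): true only when both flags are set. */
int demo_may_wakeup(const struct demo_dev *d)
{
    return d->can_wakeup && d->should_wakeup;
}

/* What reading the sysfs "wakeup" file would return. */
const char *demo_wakeup_show(const struct demo_dev *d)
{
    if (!d->can_wakeup)
        return "";
    return d->should_wakeup ? "enabled" : "disabled";
}

/* What writing the file would do. Rejecting writes for devices that
 * cannot wake the system is an assumption of this sketch. */
int demo_wakeup_store(struct demo_dev *d, const char *buf)
{
    if (!d->can_wakeup)
        return -1;
    if (strcmp(buf, "enabled") == 0)
        d->should_wakeup = 1;
    else if (strcmp(buf, "disabled") == 0)
        d->should_wakeup = 0;
    else
        return -1;
    return 0;
}
```

A driver's suspend path would consult the equivalent of demo_may_wakeup()
before arming its wakeup signal, so that a userspace "disabled" setting is
honored.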
+ +If a driver (or bus, or class) fails its suspend method, the system won't +enter the desired low power state; it will resume all the devices it's +suspended so far. + +Note that drivers may need to perform different actions based on the target +system lowpower/sleep state. At this writing, there are only platform +specific APIs through which drivers could determine those target states. + + +Device Low Power (suspend) States +--------------------------------- +Device low-power states aren't very standard. One device might only handle +"on" and "off", while another might support a dozen different versions of +"on" (how many engines are active?), plus a state that gets back to "on" +faster than from a full "off". + +Some busses define rules about what different suspend states mean. PCI +gives one example: after the suspend sequence completes, a non-legacy +PCI device may not perform DMA or issue IRQs, and any wakeup events it +issues would be issued through the PME# bus signal. Plus, there are +several PCI-standard device states, some of which are optional. + +In contrast, integrated system-on-chip processors often use irqs as the +wakeup event sources (so drivers would call enable_irq_wake) and might +be able to treat DMA completion as a wakeup event (sometimes DMA can stay +active too, it'd only be the CPU and some peripherals that sleep). + +Some details here may be platform-specific. Systems may have devices that +can be fully active in certain sleep states, such as an LCD display that's +refreshed using DMA while most of the system is sleeping lightly ... and +its frame buffer might even be updated by a DSP or other non-Linux CPU while +the Linux control processor stays idle. + +Moreover, the specific actions taken may depend on the target system state. +One target system state might allow a given device to be very operational; +another might require a hard shut down with re-initialization on resume.
+And two different target systems might use the same device in different +ways; the aforementioned LCD might be active in one product's "standby", +but a different product using the same SOC might work differently. + + +Meaning of pm_message_t.event +----------------------------- +Parameters to suspend calls include the device affected and a message of +type pm_message_t, which has one field: the event. If the driver does not +recognize the event code, suspend calls may abort the request and return +a negative errno. However, most drivers will be fine if they implement +PM_EVENT_SUSPEND semantics for all messages. + +The event codes are used to refine the goal of suspending the device, and +mostly matter when creating or resuming system memory image snapshots, as +used with suspend-to-disk: + + PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power + state. When used with system sleep states like "suspend-to-RAM" or + "standby", the upcoming resume() call will often be able to rely on + state kept in hardware, or issue system wakeup events. When used + instead with suspend-to-disk, few devices support this capability; + most are completely powered off. + + PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into + any low power mode. A system snapshot is about to be taken, often + followed by a call to the driver's resume() method. Neither wakeup + events nor DMA are allowed. + + PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume() + will restore a suspend-to-disk snapshot from a different kernel image. + Drivers that are smart enough to look at their hardware state during + resume() processing need that state to be correct ... a PRETHAW could + be used to invalidate that state (by resetting the device), like a + shutdown() invocation would before a kexec() or system halt. Other + drivers might handle this the same way as PM_EVENT_FREEZE. Neither + wakeup events nor DMA are allowed.
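The event codes listed above could be handled by a suspend method roughly as
follows. This is a userspace sketch, not driver code: the pm_message_t
typedef and the PM_EVENT_* values are local stand-ins with illustrative
numbers, and struct demo_dev with its fields is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins; the numeric values are illustrative assumptions. */
enum {
    PM_EVENT_ON = 0,
    PM_EVENT_FREEZE = 1,
    PM_EVENT_SUSPEND = 2,
    PM_EVENT_PRETHAW = 3,
};
typedef struct { int event; } pm_message_t;

/* Hypothetical driver-private state, for illustration only. */
struct demo_dev {
    int may_wakeup;    /* policy says wakeup events are wanted */
    int io_quiesced;   /* no DMA, no IRQs, no new requests */
    int low_power;     /* hardware placed in a low-power state */
    int wakeup_armed;  /* wakeup signal enabled in hardware */
    int was_reset;     /* hardware state invalidated by a reset */
};

int demo_suspend(struct demo_dev *d, pm_message_t msg)
{
    assert(d != NULL);

    switch (msg.event) {
    case PM_EVENT_SUSPEND:
        /* Quiesce, enter low power, and arm wakeup if allowed. */
        d->io_quiesced = 1;
        d->low_power = 1;
        d->wakeup_armed = d->may_wakeup;
        return 0;
    case PM_EVENT_FREEZE:
        /* Quiesce only; neither wakeup events nor DMA are allowed. */
        d->io_quiesced = 1;
        d->wakeup_armed = 0;
        return 0;
    case PM_EVENT_PRETHAW:
        /* Quiesce and reset so resume() cannot trust stale state. */
        d->io_quiesced = 1;
        d->was_reset = 1;
        d->wakeup_armed = 0;
        return 0;
    default:
        /* Unrecognized event: abort with a negative errno. */
        return -EINVAL;
    }
}
```

A simpler driver could instead apply the PM_EVENT_SUSPEND branch to every
message, which the text above says is fine for most drivers.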
+ +To enter "standby" (ACPI S1) or "Suspend to RAM" (STR, ACPI S3) states, or +the similarly named APM states, only PM_EVENT_SUSPEND is used; for "Suspend +to Disk" (STD, hibernate, ACPI S4), all of those event codes are used. + +There's also PM_EVENT_ON, a value which never appears as a suspend event +but is sometimes used to record the "not suspended" device state. + + +Resuming Devices +---------------- +Resuming is done in multiple phases, much like suspending, with all +devices processing each phase's calls before the next phase begins. + +The phases are seen by driver notifications issued in this order: + + 1 bus.resume_early(dev) is called with IRQs disabled, and with + only one CPU active. As with bus.suspend_late(), this method + won't be supported on busses that require IRQs in order to + interact with devices. + + This reverses the effects of bus.suspend_late(). + + 2 bus.resume(dev) is called next. This may be morphed into a device + driver call with bus-specific parameters; implementations may sleep. + + This reverses the effects of bus.suspend(). + + 3 class.resume(dev) is called for devices associated with a class + that has such a method. Implementations may sleep. + + This reverses the effects of class.suspend(), and would usually + reactivate the device's I/O queue. + +At the end of those phases, drivers should normally be as functional as +they were before suspending: I/O can be performed using DMA and IRQs, and +the relevant clocks are gated on. The device need not be "fully on"; it +might be in a runtime lowpower/suspend state that acts as if it were. + +However, the details here may again be platform-specific. For example, +some systems support multiple "run" states, and the mode in effect at +the end of resume() might not be the one which preceded suspension. +That means availability of certain clocks or power supplies changed, +which could easily affect how a driver works. 
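The phase ordering described for suspend and resume can be sketched with a
small userspace model. It only records the order of calls as the text
describes them (each phase runs for every device before the next phase
starts, bottom-up for suspend and top-down for resume); it is not the PM
core's code, and the device names and demo_ helpers are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_NDEV 3

/* Hypothetical devices in registration order: parents come first. */
static const char *demo_devs[DEMO_NDEV] = { "bridge", "controller", "disk" };

/* Space-separated "phase:device" entries, in call order. */
static char demo_trace[512];

static void demo_note(const char *phase, const char *dev)
{
    size_t used = strlen(demo_trace);

    assert(used < sizeof(demo_trace));
    snprintf(demo_trace + used, sizeof(demo_trace) - used,
             "%s:%s ", phase, dev);
}

/* Run one phase for every device before the next phase starts. */
static void demo_walk(const char *phase, int bottom_up)
{
    for (int i = 0; i < DEMO_NDEV; i++)
        demo_note(phase, demo_devs[bottom_up ? DEMO_NDEV - 1 - i : i]);
}

void demo_system_suspend(void)
{
    demo_trace[0] = '\0';
    demo_walk("class.suspend", 1);
    demo_walk("bus.suspend", 1);
    demo_walk("bus.suspend_late", 1);   /* IRQs would be off from here */
}

void demo_system_resume(void)
{
    demo_trace[0] = '\0';
    demo_walk("bus.resume_early", 0);   /* IRQs would still be off */
    demo_walk("bus.resume", 0);
    demo_walk("class.resume", 0);
}
```

Printing demo_trace after each call shows children suspended before their
parents and parents resumed before their children, with the resume phases
mirroring the suspend phases in reverse.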
+ + +Drivers need to be able to handle hardware which has been reset since the +suspend methods were called, for example by complete reinitialization. +This may be the hardest part, and the one most protected by NDA'd documents +and chip errata. It's simplest if the hardware state hasn't changed since +the suspend() was called, but that can't always be guaranteed. + +Drivers must also be prepared to notice that the device has been removed +while the system was powered off, whenever that's physically possible. +PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses +where common Linux platforms will see such removal. Details of how drivers +will notice and handle such removals are currently bus-specific, and often +involve a separate thread. -The methods to suspend and resume devices reside in struct bus_type: -struct bus_type { - ... - int (*suspend)(struct device * dev, pm_message_t state); - int (*resume)(struct device * dev); -}; +Note that the bus-specific runtime PM wakeup mechanism can exist, and might +be defined to share some of the same driver code as for system wakeup. For +example, a bus-specific device driver's resume() method might be used there, +so it wouldn't only be called from bus.resume() during system-wide wakeup. +See bus-specific information about how runtime wakeup events are handled. -Each bus driver is responsible implementing these methods, translating -the call into a bus-specific request and forwarding the call to the -bus-specific drivers. For example, PCI drivers implement suspend() and -resume() methods in struct pci_driver. The PCI core is simply -responsible for translating the pointers to PCI-specific ones and -calling the low-level driver. - -This is done to a) ease transition to the new power management methods -and leverage the existing PM code in various bus drivers; b) allow -buses to implement generic and default PM routines for devices, and c) -make the flow of execution obvious to the reader. 
- - -System Power Management - -When the system enters a low-power state, the device tree is walked in -a depth-first fashion to transition each device into a low-power -state. The ordering of the device tree is guaranteed by the order in -which devices get registered - children are never registered before -their ancestors, and devices are placed at the back of the list when -registered. By walking the list in reverse order, we are guaranteed to -suspend devices in the proper order. - -Devices are suspended once with interrupts enabled. Drivers are -expected to stop I/O transactions, save device state, and place the -device into a low-power state. Drivers may sleep, allocate memory, -etc. at will. - -Some devices are broken and will inevitably have problems powering -down or disabling themselves with interrupts enabled. For these -special cases, they may return -EAGAIN. This will put the device on a -list to be taken care of later. When interrupts are disabled, before -we enter the low-power state, their drivers are called again to put -their device to sleep. - -On resume, the devices that returned -EAGAIN will be called to power -themselves back on with interrupts disabled. Once interrupts have been -re-enabled, the rest of the drivers will be called to resume their -devices. On resume, a driver is responsible for powering back on each -device, restoring state, and re-enabling I/O transactions for that -device. +System Devices +-------------- System devices follow a slightly different API, which can be found in include/linux/sysdev.h drivers/base/sys.c -System devices will only be suspended with interrupts disabled, and -after all other devices have been suspended. On resume, they will be -resumed before any other devices, and also with interrupts disabled. +System devices will only be suspended with interrupts disabled, and after +all other devices have been suspended. On resume, they will be resumed +before any other devices, and also with interrupts disabled. 
+That is, IRQs are disabled, the suspend_late() phase begins, then the +sysdev_driver.suspend() phase, and the system enters a sleep state. Then +the sysdev_driver.resume() phase begins, followed by the resume_early() +phase, after which IRQs are enabled. -Runtime Power Management - -Many devices are able to dynamically power down while the system is -still running. This feature is useful for devices that are not being -used, and can offer significant power savings on a running system. - -In each device's directory, there is a 'power' directory, which -contains at least a 'state' file. Reading from this file displays what -power state the device is currently in. Writing to this file initiates -a transition to the specified power state, which must be a decimal in -the range 1-3, inclusive; or 0 for 'On'. +Code to actually enter and exit the system-wide low power state sometimes +involves hardware details that are only known to the boot firmware, and +may leave a CPU running software (from SRAM or flash memory) that monitors +the system and manages its wakeup sequence. -The PM core will call the ->suspend() method in the bus_type object -that the device belongs to if the specified state is not 0, or -->resume() if it is. -Nothing will happen if the specified state is the same state the -device is currently in. - -If the device is already in a low-power state, and the specified state -is another, but different, low-power state, the ->resume() method will -first be called to power the device back on, then ->suspend() will be -called again with the new state. - -The driver is responsible for saving the working state of the device -and putting it into the low-power state specified. If this was -successful, it returns 0, and the device's power_state field is -updated. - -The driver must take care to know whether or not it is able to -properly resume the device, including all step of reinitialization -necessary. 
(This is the hardest part, and the one most protected by -NDA'd documents). - -The driver must also take care not to suspend a device that is -currently in use. It is their responsibility to provide their own -exclusion mechanisms. - -The runtime power transition happens with interrupts enabled. If a -device cannot support being powered down with interrupts, it may -return -EAGAIN (as it would during a system power management -transition), but it will _not_ be called again, and the transaction -will fail. - -There is currently no way to know what states a device or driver -supports a priori. This will change in the future. - -pm_message_t meaning - -pm_message_t has two fields. event ("major"), and flags. If driver -does not know event code, it aborts the request, returning error. Some -drivers may need to deal with special cases based on the actual type -of suspend operation being done at the system level. This is why -there are flags. - -Event codes are: - -ON -- no need to do anything except special cases like broken -HW. - -# NOTIFICATION -- pretty much same as ON? - -FREEZE -- stop DMA and interrupts, and be prepared to reinit HW from -scratch. That probably means stop accepting upstream requests, the -actual policy of what to do with them being specific to a given -driver. It's acceptable for a network driver to just drop packets -while a block driver is expected to block the queue so no request is -lost. (Use IDE as an example on how to do that). FREEZE requires no -power state change, and it's expected for drivers to be able to -quickly transition back to operating state. - -SUSPEND -- like FREEZE, but also put hardware into low-power state. If -there's need to distinguish several levels of sleep, additional flag -is probably best way to do that. - -Transitions are only from a resumed state to a suspended state, never -between 2 suspended states. (ON -> FREEZE or ON -> SUSPEND can happen, -FREEZE -> SUSPEND or SUSPEND -> FREEZE can not). 
- -All events are: - -[NOTE NOTE NOTE: If you are driver author, you should not care; you -should only look at event, and ignore flags.] - -#Prepare for suspend -- userland is still running but we are going to -#enter suspend state. This gives drivers chance to load firmware from -#disk and store it in memory, or do other activities taht require -#operating userland, ability to kmalloc GFP_KERNEL, etc... All of these -#are forbiden once the suspend dance is started.. event = ON, flags = -#PREPARE_TO_SUSPEND - -Apm standby -- prepare for APM event. Quiesce devices to make life -easier for APM BIOS. event = FREEZE, flags = APM_STANDBY - -Apm suspend -- same as APM_STANDBY, but it we should probably avoid -spinning down disks. event = FREEZE, flags = APM_SUSPEND - -System halt, reboot -- quiesce devices to make life easier for BIOS. event -= FREEZE, flags = SYSTEM_HALT or SYSTEM_REBOOT - -System shutdown -- at least disks need to be spun down, or data may be -lost. Quiesce devices, just to make life easier for BIOS. event = -FREEZE, flags = SYSTEM_SHUTDOWN - -Kexec -- turn off DMAs and put hardware into some state where new -kernel can take over. event = FREEZE, flags = KEXEC - -Powerdown at end of swsusp -- very similar to SYSTEM_SHUTDOWN, except wake -may need to be enabled on some devices. This actually has at least 3 -subtypes, system can reboot, enter S4 and enter S5 at the end of -swsusp. event = FREEZE, flags = SWSUSP and one of SYSTEM_REBOOT, -SYSTEM_SHUTDOWN, SYSTEM_S4 - -Suspend to ram -- put devices into low power state. event = SUSPEND, -flags = SUSPEND_TO_RAM - -Freeze for swsusp snapshot -- stop DMA and interrupts. No need to put -devices into low power mode, but you must be able to reinitialize -device from scratch in resume method. This has two flavors, its done -once on suspending kernel, once on resuming kernel. 
event = FREEZE, -flags = DURING_SUSPEND or DURING_RESUME - -Device detach requested from /sys -- deinitialize device; proably same as -SYSTEM_SHUTDOWN, I do not understand this one too much. probably event -= FREEZE, flags = DEV_DETACH. - -#These are not really events sent: -# -#System fully on -- device is working normally; this is probably never -#passed to suspend() method... event = ON, flags = 0 -# -#Ready after resume -- userland is now running, again. Time to free any -#memory you ate during prepare to suspend... event = ON, flags = -#READY_AFTER_RESUME -# +Runtime Power Management +======================== +Many devices are able to dynamically power down while the system is still +running. This feature is useful for devices that are not being used, and +can offer significant power savings on a running system. These devices +often support a range of runtime power states, which might use names such +as "off", "sleep", "idle", "active", and so on. Those states will in some +cases (like PCI) be partially constrained by a bus the device uses, and will +usually include hardware states that are also used in system sleep states. + +However, note that if a driver puts a device into a runtime low power state +and the system then goes into a system-wide sleep state, it normally ought +to resume into that runtime low power state rather than "full on". Such +distinctions would be part of the driver-internal state machine for that +hardware; the whole point of runtime power management is to be sure that +drivers are decoupled in that way from the state machine governing phases +of the system-wide power/sleep state transitions. 
+ + +Power Saving Techniques +----------------------- +Normally runtime power management is handled by the drivers without specific +userspace or kernel intervention, by device-aware use of techniques like: + + Using information provided by other system layers + - stay deeply "off" except between open() and close() + - if transceiver/PHY indicates "nobody connected", stay "off" + - application protocols may include power commands or hints + + Using fewer CPU cycles + - using DMA instead of PIO + - removing timers, or making them lower frequency + - shortening "hot" code paths + - eliminating cache misses + - (sometimes) offloading work to device firmware + + Reducing other resource costs + - gating off unused clocks in software (or hardware) + - switching off unused power supplies + - eliminating (or delaying/merging) IRQs + - tuning DMA to use word and/or burst modes + + Using device-specific low power states + - using lower voltages + - avoiding needless DMA transfers + +Read your hardware documentation carefully to see the opportunities that +may be available. If you can, measure the actual power usage and check +it against the budget established for your project. + + +Examples: USB hosts, system timer, system CPU +---------------------------------------------- +USB host controllers make interesting, if complex, examples. In many cases +these have no work to do: no USB devices are connected, or all of them are +in the USB "suspend" state. Linux host controller drivers can then disable +periodic DMA transfers that would otherwise be a constant power drain on the +memory subsystem, and enter a suspend state. In power-aware controllers, +entering that suspend state may disable the clock used with USB signaling, +saving a certain amount of power. 
+ +The controller will be woken from that state (with an IRQ) by changes to the +signal state on the data lines of a given port, for example by an existing +peripheral requesting "remote wakeup" or by plugging a new peripheral. The +same wakeup mechanism usually works from "standby" sleep states, and on some +systems also from "suspend to RAM" (or even "suspend to disk") states. +(Except that ACPI may be involved instead of normal IRQs, on some hardware.) + +System devices like timers and CPUs may have special roles in the platform +power management scheme. For example, system timers using a "dynamic tick" +approach don't just save CPU cycles (by eliminating needless timer IRQs), +but they may also open the door to using lower power CPU "idle" states that +cost more than a jiffie to enter and exit. On x86 systems these are states +like "C3"; note that periodic DMA transfers from a USB host controller will +also prevent entry to a C3 state, much like a periodic timer IRQ. + +That kind of runtime mechanism interaction is common. "System On Chip" (SOC) +processors often have low power idle modes that can't be entered unless +certain medium-speed clocks (often 12 or 48 MHz) are gated off. When the +drivers gate those clocks effectively, then the system idle task may be able +to use the lower power idle modes and thereby increase battery life. + +If the CPU can have a "cpufreq" driver, there also may be opportunities +to shift to lower voltage settings and reduce the power cost of executing +a given number of instructions. (Without voltage adjustment, it's rare +for cpufreq to save much power; the cost-per-instruction must go down.) + + +/sys/devices/.../power/state files +================================== +For now you can also test some of this functionality using sysfs. + + DEPRECATED: USE "power/state" ONLY FOR DRIVER TESTING, AND + AVOID USING dev->power.power_state IN DRIVERS. + + THESE WILL BE REMOVED. 
IF THE "power/state" FILE GETS REPLACED, + IT WILL BECOME SOMETHING COUPLED TO THE BUS OR DRIVER. + +In each device's directory, there is a 'power' directory, which contains +at least a 'state' file. The value of this field is effectively boolean, +PM_EVENT_ON or PM_EVENT_SUSPEND. + + * Reading from this file displays a value corresponding to + the power.power_state.event field. All nonzero values are + displayed as "2", corresponding to a low power state; zero + is displayed as "0", corresponding to normal operation. + + * Writing to this file initiates a transition using the + specified event code number; only '0', '2', and '3' are + accepted (without a newline); '2' and '3' are both + mapped to PM_EVENT_SUSPEND. + +On writes, the PM core relies on that recorded event code and the device/bus +capabilities to determine whether it uses a partial suspend() or resume() +sequence to change things so that the recorded event corresponds to the +numeric parameter. + + - If the bus requires the irqs-disabled suspend_late()/resume_early() + phases, writes fail because those operations are not supported here. + + - If the recorded value is the expected value, nothing is done. + + - If the recorded value is nonzero, the device is partially resumed, + using the bus.resume() and/or class.resume() methods. + + - If the target value is nonzero, the device is partially suspended, + using the class.suspend() and/or bus.suspend() methods and the + PM_EVENT_SUSPEND message. + +Drivers have no way to tell whether their suspend() and resume() calls +have come through the sysfs power/state file or as part of entering a +system sleep state, except that when accessed through sysfs the normal +parent/child sequencing rules are ignored. Drivers (such as bus, bridge, +or hub drivers) which expose child devices may need to enforce those rules +on their own. 
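The write rules listed above for the deprecated power/state file can be
summarized as a small decision function. This is a userspace sketch under
stated assumptions: the PM_EVENT_* values are illustrative stand-ins, the
demo_ names and the specific errno values are hypothetical, and the order in
which the input and bus checks happen is a guess.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins; the numeric values are illustrative assumptions. */
enum { PM_EVENT_ON = 0, PM_EVENT_SUSPEND = 2 };

enum demo_action { DEMO_NOTHING, DEMO_PARTIAL_RESUME, DEMO_PARTIAL_SUSPEND };

/* Decide what a write of one character to power/state would trigger.
 * *recorded models dev->power.power_state.event and is updated on a
 * successful transition. */
int demo_state_store(int *recorded, char written, int bus_needs_irqs_off,
                     enum demo_action *action)
{
    int target;

    assert(recorded != NULL && action != NULL);
    *action = DEMO_NOTHING;

    switch (written) {
    case '0':
        target = PM_EVENT_ON;
        break;
    case '2':
    case '3':
        target = PM_EVENT_SUSPEND;   /* both map to the same event */
        break;
    default:
        return -EINVAL;              /* hypothetical error code */
    }

    /* Busses needing suspend_late()/resume_early() are unsupported here. */
    if (bus_needs_irqs_off)
        return -EOPNOTSUPP;          /* hypothetical error code */

    /* Already in the requested state: nothing is done. */
    if (*recorded == target)
        return 0;

    *action = target ? DEMO_PARTIAL_SUSPEND : DEMO_PARTIAL_RESUME;
    *recorded = target;
    return 0;
}
```

Because '2' and '3' map to the same event, writing one after the other
leaves the device untouched, which matches the effectively boolean behavior
described above.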
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt index 4117802af0f8..a66bec222b16 100644 --- a/Documentation/power/interface.txt +++ b/Documentation/power/interface.txt @@ -52,3 +52,18 @@ suspend image will be as small as possible. Reading from this file will display the current image size limit, which is set to 500 MB by default. + +/sys/power/pm_trace controls the code which saves the last PM event point in +the RTC across reboots, so that you can debug a machine that just hangs +during suspend (or more commonly, during resume). Namely, the RTC is only +used to save the last PM event point if this file contains '1'. Initially it +contains '0' which may be changed to '1' by writing a string representing a +nonzero integer into it. + +To use this debugging feature you should attempt to suspend the machine, then +reboot it and run + + dmesg -s 1000000 | grep 'hash matches' + +CAUTION: Using it will cause your machine's real-time (CMOS) clock to be +set to a random invalid time after a resume. diff --git a/Documentation/rt-mutex-design.txt b/Documentation/rt-mutex-design.txt index c472ffacc2f6..4b736d24da7a 100644 --- a/Documentation/rt-mutex-design.txt +++ b/Documentation/rt-mutex-design.txt @@ -333,11 +333,11 @@ cmpxchg is basically the following function performed atomically: unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C) { - unsigned long T = *A; - if (*A == *B) { - *A = *C; - } - return T; + unsigned long T = *A; + if (*A == *B) { + *A = *C; + } + return T; } #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c) @@ -582,7 +582,7 @@ contention). try_to_take_rt_mutex is used every time the task tries to grab a mutex in the slow path. The first thing that is done here is an atomic setting of the "Has Waiters" flag of the mutex's owner field. 
Yes, this could really -be false, because if the the mutex has no owner, there are no waiters and +be false, because if the mutex has no owner, there are no waiters and the current task also won't have any waiters. But we don't have the lock yet, so we assume we are going to be a waiter. The reason for this is to play nice for those architectures that do have CMPXCHG. By setting this flag @@ -735,7 +735,7 @@ do have CMPXCHG, that check is done in the fast path, but it is still needed in the slow path too. If a waiter of a mutex woke up because of a signal or timeout between the time the owner failed the fast path CMPXCHG check and the grabbing of the wait_lock, the mutex may not have any waiters, thus the -owner still needs to make this check. If there are no waiters than the mutex +owner still needs to make this check. If there are no waiters then the mutex owner field is set to NULL, the wait_lock is released and nothing more is needed. diff --git a/Documentation/scsi/ChangeLog.arcmsr b/Documentation/scsi/ChangeLog.arcmsr new file mode 100644 index 000000000000..162c47fdf45f --- /dev/null +++ b/Documentation/scsi/ChangeLog.arcmsr @@ -0,0 +1,56 @@ +************************************************************************** +** History +** +** REV# DATE NAME DESCRIPTION +** 1.00.00.00 3/31/2004 Erich Chen First release +** 1.10.00.04 7/28/2004 Erich Chen modify for ioctl +** 1.10.00.06 8/28/2004 Erich Chen modify for 2.6.x +** 1.10.00.08 9/28/2004 Erich Chen modify for x86_64 +** 1.10.00.10 10/10/2004 Erich Chen bug fix for SMP & ioctl +** 1.20.00.00 11/29/2004 Erich Chen bug fix with arcmsr_bus_reset when PHY error +** 1.20.00.02 12/09/2004 Erich Chen bug fix with over 2T bytes RAID Volume +** 1.20.00.04 1/09/2005 Erich Chen fits for Debian linux kernel version 2.2.xx +** 1.20.00.05 2/20/2005 Erich Chen cleanly as look like a Linux driver at 2.6.x +** thanks for peoples kindness comment +** Kornel Wieliczek +** Christoph Hellwig +** Adrian Bunk +** Andrew Morton 
+** Christoph Hellwig +** James Bottomley +** Arjan van de Ven +** 1.20.00.06 3/12/2005 Erich Chen fix with arcmsr_pci_unmap_dma "unsigned long" cast, +** modify PCCB POOL allocated by "dma_alloc_coherent" +** (Kornel Wieliczek's comment) +** 1.20.00.07 3/23/2005 Erich Chen bug fix with arcmsr_scsi_host_template_init +** occur segmentation fault, +** if RAID adapter does not on PCI slot +** and modprobe/rmmod this driver twice. +** bug fix enormous stack usage (Adrian Bunk's comment) +** 1.20.00.08 6/23/2005 Erich Chen bug fix with abort command, +** in case of heavy loading when sata cable +** working on low quality connection +** 1.20.00.09 9/12/2005 Erich Chen bug fix with abort command handling, firmware version check +** and firmware update notify for hardware bug fix +** 1.20.00.10 9/23/2005 Erich Chen enhance sysfs function for change driver's max tag Q number. +** add DMA_64BIT_MASK for backward compatible with all 2.6.x +** add some useful message for abort command +** add ioctl code 'ARCMSR_IOCTL_FLUSH_ADAPTER_CACHE' +** customer can send this command for sync raid volume data +** 1.20.00.11 9/29/2005 Erich Chen by comment of Arjan van de Ven fix incorrect msleep redefine +** cast off sizeof(dma_addr_t) condition for 64bit pci_set_dma_mask +** 1.20.00.12 9/30/2005 Erich Chen bug fix with 64bit platform's ccbs using if over 4G system memory +** change 64bit pci_set_consistent_dma_mask into 32bit +** increcct adapter count if adapter initialize fail. +** miss edit at arcmsr_build_ccb.... 
+** psge += sizeof(struct _SG64ENTRY *) => +** psge += sizeof(struct _SG64ENTRY) +** 64 bits sg entry would be incorrectly calculated +** thanks Kornel Wieliczek give me kindly notify +** and detail description +** 1.20.00.13 11/15/2005 Erich Chen scheduling pending ccb with FIFO +** change the architecture of arcmsr command queue list +** for linux standard list +** enable usage of pci message signal interrupt +** follow Randy.Danlup kindness suggestion cleanup this code +**************************************************************************
\ No newline at end of file diff --git a/Documentation/scsi/aacraid.txt b/Documentation/scsi/aacraid.txt index be55670851a4..ee03678c8029 100644 --- a/Documentation/scsi/aacraid.txt +++ b/Documentation/scsi/aacraid.txt @@ -11,38 +11,43 @@ the original). Supported Cards/Chipsets ------------------------- PCI ID (pci.ids) OEM Product - 9005:0285:9005:028a Adaptec 2020ZCR (Skyhawk) - 9005:0285:9005:028e Adaptec 2020SA (Skyhawk) - 9005:0285:9005:028b Adaptec 2025ZCR (Terminator) - 9005:0285:9005:028f Adaptec 2025SA (Terminator) - 9005:0285:9005:0286 Adaptec 2120S (Crusader) - 9005:0286:9005:028d Adaptec 2130S (Lancer) + 9005:0283:9005:0283 Adaptec Catapult (3210S with arc firmware) + 9005:0284:9005:0284 Adaptec Tomcat (3410S with arc firmware) 9005:0285:9005:0285 Adaptec 2200S (Vulcan) + 9005:0285:9005:0286 Adaptec 2120S (Crusader) 9005:0285:9005:0287 Adaptec 2200S (Vulcan-2m) + 9005:0285:9005:0288 Adaptec 3230S (Harrier) + 9005:0285:9005:0289 Adaptec 3240S (Tornado) + 9005:0285:9005:028a Adaptec 2020ZCR (Skyhawk) + 9005:0285:9005:028b Adaptec 2025ZCR (Terminator) 9005:0286:9005:028c Adaptec 2230S (Lancer) 9005:0286:9005:028c Adaptec 2230SLP (Lancer) - 9005:0285:9005:0296 Adaptec 2240S (SabreExpress) + 9005:0286:9005:028d Adaptec 2130S (Lancer) + 9005:0285:9005:028e Adaptec 2020SA (Skyhawk) + 9005:0285:9005:028f Adaptec 2025SA (Terminator) 9005:0285:9005:0290 Adaptec 2410SA (Jaguar) - 9005:0285:9005:0293 Adaptec 21610SA (Corsair-16) 9005:0285:103c:3227 Adaptec 2610SA (Bearcat HP release) + 9005:0285:9005:0293 Adaptec 21610SA (Corsair-16) + 9005:0285:9005:0296 Adaptec 2240S (SabreExpress) 9005:0285:9005:0292 Adaptec 2810SA (Corsair-8) 9005:0285:9005:0294 Adaptec Prowler - 9005:0286:9005:029d Adaptec 2420SA (Intruder HP release) - 9005:0286:9005:029c Adaptec 2620SA (Intruder) - 9005:0286:9005:029b Adaptec 2820SA (Intruder) - 9005:0286:9005:02a7 Adaptec 2830SA (Skyray) - 9005:0286:9005:02a8 Adaptec 2430SA (Skyray) - 9005:0285:9005:0288 Adaptec 3230S (Harrier) - 
9005:0285:9005:0289 Adaptec 3240S (Tornado) - 9005:0285:9005:0298 Adaptec 4000SAS (BlackBird) 9005:0285:9005:0297 Adaptec 4005SAS (AvonPark) + 9005:0285:9005:0298 Adaptec 4000SAS (BlackBird) 9005:0285:9005:0299 Adaptec 4800SAS (Marauder-X) 9005:0285:9005:029a Adaptec 4805SAS (Marauder-E) + 9005:0286:9005:029b Adaptec 2820SA (Intruder) + 9005:0286:9005:029c Adaptec 2620SA (Intruder) + 9005:0286:9005:029d Adaptec 2420SA (Intruder HP release) 9005:0286:9005:02a2 Adaptec 3800SAS (Hurricane44) + 9005:0286:9005:02a7 Adaptec 3805SAS (Hurricane80) + 9005:0286:9005:02a8 Adaptec 3400SAS (Hurricane40) + 9005:0286:9005:02ac Adaptec 1800SAS (Typhoon44) + 9005:0286:9005:02b3 Adaptec 2400SAS (Hurricane40lm) + 9005:0285:9005:02b5 Adaptec ASR5800 (Voodoo44) + 9005:0285:9005:02b6 Adaptec ASR5805 (Voodoo80) + 9005:0285:9005:02b7 Adaptec ASR5808 (Voodoo08) 1011:0046:9005:0364 Adaptec 5400S (Mustang) 1011:0046:9005:0365 Adaptec 5400S (Mustang) - 9005:0283:9005:0283 Adaptec Catapult (3210S with arc firmware) - 9005:0284:9005:0284 Adaptec Tomcat (3410S with arc firmware) 9005:0287:9005:0800 Adaptec Themisto (Jupiter) 9005:0200:9005:0200 Adaptec Themisto (Jupiter) 9005:0286:9005:0800 Adaptec Callisto (Jupiter) @@ -64,18 +69,20 @@ Supported Cards/Chipsets 9005:0285:9005:0290 IBM ServeRAID 7t (Jaguar) 9005:0285:1014:02F2 IBM ServeRAID 8i (AvonPark) 9005:0285:1014:0312 IBM ServeRAID 8i (AvonParkLite) - 9005:0286:1014:9580 IBM ServeRAID 8k/8k-l8 (Aurora) 9005:0286:1014:9540 IBM ServeRAID 8k/8k-l4 (AuroraLite) - 9005:0286:9005:029f ICP ICP9014R0 (Lancer) + 9005:0286:1014:9580 IBM ServeRAID 8k/8k-l8 (Aurora) + 9005:0286:1014:034d IBM ServeRAID 8s (Hurricane) 9005:0286:9005:029e ICP ICP9024R0 (Lancer) + 9005:0286:9005:029f ICP ICP9014R0 (Lancer) 9005:0286:9005:02a0 ICP ICP9047MA (Lancer) 9005:0286:9005:02a1 ICP ICP9087MA (Lancer) + 9005:0286:9005:02a3 ICP ICP5445AU (Hurricane44) 9005:0286:9005:02a4 ICP ICP9085LI (Marauder-X) 9005:0286:9005:02a5 ICP ICP5085BR (Marauder-E) - 9005:0286:9005:02a3 
ICP ICP5445AU (Hurricane44) 9005:0286:9005:02a6 ICP ICP9067MA (Intruder-6) - 9005:0286:9005:02a9 ICP ICP5087AU (Skyray) - 9005:0286:9005:02aa ICP ICP5047AU (Skyray) + 9005:0286:9005:02a9 ICP ICP5085AU (Hurricane80) + 9005:0286:9005:02aa ICP ICP5045AU (Hurricane40) + 9005:0286:9005:02b4 ICP ICP5045AL (Hurricane40lm) People ------------------------- diff --git a/Documentation/scsi/arcmsr_spec.txt b/Documentation/scsi/arcmsr_spec.txt new file mode 100644 index 000000000000..5e0042340fd3 --- /dev/null +++ b/Documentation/scsi/arcmsr_spec.txt @@ -0,0 +1,574 @@ +******************************************************************************* +** ARECA FIRMWARE SPEC +******************************************************************************* +** Usage of IOP331 adapter +** (All In/Out is in IOP331's view) +** 1. Message 0 --> InitThread message and retrun code +** 2. Doorbell is used for RS-232 emulation +** inDoorBell : bit0 -- data in ready +** (DRIVER DATA WRITE OK) +** bit1 -- data out has been read +** (DRIVER DATA READ OK) +** outDooeBell: bit0 -- data out ready +** (IOP331 DATA WRITE OK) +** bit1 -- data in has been read +** (IOP331 DATA READ OK) +** 3. Index Memory Usage +** offset 0xf00 : for RS232 out (request buffer) +** offset 0xe00 : for RS232 in (scratch buffer) +** offset 0xa00 : for inbound message code message_rwbuffer +** (driver send to IOP331) +** offset 0xa00 : for outbound message code message_rwbuffer +** (IOP331 send to driver) +** 4. RS-232 emulation +** Currently 128 byte buffer is used +** 1st uint32_t : Data length (1--124) +** Byte 4--127 : Max 124 bytes of data +** 5. 
PostQ +** All SCSI commands must be sent through postQ: +** (inbound queue port) Request frame must be 32-byte aligned +** #bit27--bit31 => flag for post ccb +** #bit0--bit26 => real address (bit27--bit31) of post arcmsr_cdb +** bit31 : +** 0 : 256-byte frame +** 1 : 512-byte frame +** bit30 : +** 0 : normal request +** 1 : BIOS request +** bit29 : reserved +** bit28 : reserved +** bit27 : reserved +** --------------------------------------------------------------------------- +** (outbound queue port) Request reply +** #bit27--bit31 +** => flag for reply +** #bit0--bit26 +** => real address (bit27--bit31) of reply arcmsr_cdb +** bit31 : must be 0 (for this type of reply) +** bit30 : reserved for BIOS handshake +** bit29 : reserved +** bit28 : +** 0 : no error, ignore AdapStatus/DevStatus/SenseData +** 1 : Error, error code in AdapStatus/DevStatus/SenseData +** bit27 : reserved +** 6. BIOS request +** All BIOS requests are the same as requests from PostQ +** Except : +** Request frame is sent from configuration space +** offset: 0x78 : Request Frame (bit30 == 1) +** offset: 0x18 : write-only to generate +** IRQ to IOP331 +** Completion of request: +** (bit30 == 0, bit28==err flag) +** 7. Definition of SGL entry (structure) +** 8. Message1 Out - Diag Status Code (????) +** 9.
Message0 message code : +** 0x00 : NOP +** 0x01 : Get Config +** ->offset 0xa00 :for outbound message code message_rwbuffer +** (IOP331 send to driver) +** Signature 0x87974060(4) +** Request len 0x00000200(4) +** numbers of queue 0x00000100(4) +** SDRAM Size 0x00000100(4)-->256 MB +** IDE Channels 0x00000008(4) +** vendor 40 bytes char +** model 8 bytes char +** FirmVer 16 bytes char +** Device Map 16 bytes char +** FirmwareVersion DWORD <== Added for checking of +** new firmware capability +** 0x02 : Set Config +** ->offset 0xa00 :for inbound message code message_rwbuffer +** (driver send to IOP331) +** Signature 0x87974063(4) +** UPPER32 of Request Frame (4)-->Driver Only +** 0x03 : Reset (Abort all queued Command) +** 0x04 : Stop Background Activity +** 0x05 : Flush Cache +** 0x06 : Start Background Activity +** (re-start if background is halted) +** 0x07 : Check If Host Command Pending +** (Novell May Need This Function) +** 0x08 : Set controller time +** ->offset 0xa00 : for inbound message code message_rwbuffer +** (driver to IOP331) +** byte 0 : 0xaa <-- signature +** byte 1 : 0x55 <-- signature +** byte 2 : year (04) +** byte 3 : month (1..12) +** byte 4 : date (1..31) +** byte 5 : hour (0..23) +** byte 6 : minute (0..59) +** byte 7 : second (0..59) +******************************************************************************* +******************************************************************************* +** RS-232 Interface for Areca Raid Controller +** The low level command interface is exclusive with VT100 terminal +** -------------------------------------------------------------------- +** 1. 
Sequence of command execution +** -------------------------------------------------------------------- +** (A) Header : 3 bytes sequence (0x5E, 0x01, 0x61) +** (B) Command block : variable length of data including length, +** command code, data and checksum byte +** (C) Return data : variable length of data +** -------------------------------------------------------------------- +** 2. Command block +** -------------------------------------------------------------------- +** (A) 1st byte : command block length (low byte) +** (B) 2nd byte : command block length (high byte) +** note ..command block length shouldn't > 2040 bytes, +** length excludes these two bytes +** (C) 3rd byte : command code +** (D) 4th and following bytes : variable length data bytes +** depends on command code +** (E) last byte : checksum byte (sum of 1st byte until last data byte) +** -------------------------------------------------------------------- +** 3. Command code and associated data +** -------------------------------------------------------------------- +** The following are command code defined in raid controller Command +** code 0x10--0x1? are used for system level management, +** no password checking is needed and should be implemented in separate +** well controlled utility and not for end user access. +** Command code 0x20--0x?? always check the password, +** password must be entered to enable these command. 
+** enum +** { +** GUI_SET_SERIAL=0x10, +** GUI_SET_VENDOR, +** GUI_SET_MODEL, +** GUI_IDENTIFY, +** GUI_CHECK_PASSWORD, +** GUI_LOGOUT, +** GUI_HTTP, +** GUI_SET_ETHERNET_ADDR, +** GUI_SET_LOGO, +** GUI_POLL_EVENT, +** GUI_GET_EVENT, +** GUI_GET_HW_MONITOR, +** // GUI_QUICK_CREATE=0x20, (function removed) +** GUI_GET_INFO_R=0x20, +** GUI_GET_INFO_V, +** GUI_GET_INFO_P, +** GUI_GET_INFO_S, +** GUI_CLEAR_EVENT, +** GUI_MUTE_BEEPER=0x30, +** GUI_BEEPER_SETTING, +** GUI_SET_PASSWORD, +** GUI_HOST_INTERFACE_MODE, +** GUI_REBUILD_PRIORITY, +** GUI_MAX_ATA_MODE, +** GUI_RESET_CONTROLLER, +** GUI_COM_PORT_SETTING, +** GUI_NO_OPERATION, +** GUI_DHCP_IP, +** GUI_CREATE_PASS_THROUGH=0x40, +** GUI_MODIFY_PASS_THROUGH, +** GUI_DELETE_PASS_THROUGH, +** GUI_IDENTIFY_DEVICE, +** GUI_CREATE_RAIDSET=0x50, +** GUI_DELETE_RAIDSET, +** GUI_EXPAND_RAIDSET, +** GUI_ACTIVATE_RAIDSET, +** GUI_CREATE_HOT_SPARE, +** GUI_DELETE_HOT_SPARE, +** GUI_CREATE_VOLUME=0x60, +** GUI_MODIFY_VOLUME, +** GUI_DELETE_VOLUME, +** GUI_START_CHECK_VOLUME, +** GUI_STOP_CHECK_VOLUME +** }; +** Command description : +** GUI_SET_SERIAL : Set the controller serial# +** byte 0,1 : length +** byte 2 : command code 0x10 +** byte 3 : password length (should be 0x0f) +** byte 4-0x13 : should be "ArEcATecHnoLogY" +** byte 0x14--0x23 : Serial number string (must be 16 bytes) +** GUI_SET_VENDOR : Set vendor string for the controller +** byte 0,1 : length +** byte 2 : command code 0x11 +** byte 3 : password length (should be 0x08) +** byte 4-0x13 : should be "ArEcAvAr" +** byte 0x14--0x3B : vendor string (must be 40 bytes) +** GUI_SET_MODEL : Set the model name of the controller +** byte 0,1 : length +** byte 2 : command code 0x12 +** byte 3 : password length (should be 0x08) +** byte 4-0x13 : should be "ArEcAvAr" +** byte 0x14--0x1B : model string (must be 8 bytes) +** GUI_IDENTIFY : Identify device +** byte 0,1 : length +** byte 2 : command code 0x13 +** return "Areca RAID Subsystem " +** GUI_CHECK_PASSWORD : Verify 
password +** byte 0,1 : length +** byte 2 : command code 0x14 +** byte 3 : password length +** byte 4-0x?? : user password to be checked +** GUI_LOGOUT : Logout GUI (force password checking on next command) +** byte 0,1 : length +** byte 2 : command code 0x15 +** GUI_HTTP : HTTP interface (reserved for Http proxy service)(0x16) +** +** GUI_SET_ETHERNET_ADDR : Set the ethernet MAC address +** byte 0,1 : length +** byte 2 : command code 0x17 +** byte 3 : password length (should be 0x08) +** byte 4-0x13 : should be "ArEcAvAr" +** byte 0x14--0x19 : Ethernet MAC address (must be 6 bytes) +** GUI_SET_LOGO : Set logo in HTTP +** byte 0,1 : length +** byte 2 : command code 0x18 +** byte 3 : Page# (0/1/2/3) (0xff --> clear OEM logo) +** byte 4/5/6/7 : 0x55/0xaa/0xa5/0x5a +** byte 8 : TITLE.JPG data (each page must be 2000 bytes) +** note page0 1st 2 byte must be +** actual length of the JPG file +** GUI_POLL_EVENT : Poll If Event Log Changed +** byte 0,1 : length +** byte 2 : command code 0x19 +** GUI_GET_EVENT : Read Event +** byte 0,1 : length +** byte 2 : command code 0x1a +** byte 3 : Event Page (0:1st page/1/2/3:last page) +** GUI_GET_HW_MONITOR : Get HW monitor data +** byte 0,1 : length +** byte 2 : command code 0x1b +** byte 3 : # of FANs(example 2) +** byte 4 : # of Voltage sensor(example 3) +** byte 5 : # of temperature sensor(example 2) +** byte 6 : # of power +** byte 7/8 : Fan#0 (RPM) +** byte 9/10 : Fan#1 +** byte 11/12 : Voltage#0 original value in *1000 +** byte 13/14 : Voltage#0 value +** byte 15/16 : Voltage#1 org +** byte 17/18 : Voltage#1 +** byte 19/20 : Voltage#2 org +** byte 21/22 : Voltage#2 +** byte 23 : Temp#0 +** byte 24 : Temp#1 +** byte 25 : Power indicator (bit0 : power#0, +** bit1 : power#1) +** byte 26 : UPS indicator +** GUI_QUICK_CREATE : Quick create raid/volume set +** byte 0,1 : length +** byte 2 : command code 0x20 +** byte 3/4/5/6 : raw capacity +** byte 7 : raid level +** byte 8 : stripe size +** byte 9 : spare +** byte 10/11/12/13: 
device mask (the devices to create raid/volume) +** This function is removed, application like +** to implement quick create function +** need to use GUI_CREATE_RAIDSET and GUI_CREATE_VOLUMESET function. +** GUI_GET_INFO_R : Get Raid Set Information +** byte 0,1 : length +** byte 2 : command code 0x20 +** byte 3 : raidset# +** typedef struct sGUI_RAIDSET +** { +** BYTE grsRaidSetName[16]; +** DWORD grsCapacity; +** DWORD grsCapacityX; +** DWORD grsFailMask; +** BYTE grsDevArray[32]; +** BYTE grsMemberDevices; +** BYTE grsNewMemberDevices; +** BYTE grsRaidState; +** BYTE grsVolumes; +** BYTE grsVolumeList[16]; +** BYTE grsRes1; +** BYTE grsRes2; +** BYTE grsRes3; +** BYTE grsFreeSegments; +** DWORD grsRawStripes[8]; +** DWORD grsRes4; +** DWORD grsRes5; // Total to 128 bytes +** DWORD grsRes6; // Total to 128 bytes +** } sGUI_RAIDSET, *pGUI_RAIDSET; +** GUI_GET_INFO_V : Get Volume Set Information +** byte 0,1 : length +** byte 2 : command code 0x21 +** byte 3 : volumeset# +** typedef struct sGUI_VOLUMESET +** { +** BYTE gvsVolumeName[16]; // 16 +** DWORD gvsCapacity; +** DWORD gvsCapacityX; +** DWORD gvsFailMask; +** DWORD gvsStripeSize; +** DWORD gvsNewFailMask; +** DWORD gvsNewStripeSize; +** DWORD gvsVolumeStatus; +** DWORD gvsProgress; // 32 +** sSCSI_ATTR gvsScsi; +** BYTE gvsMemberDisks; +** BYTE gvsRaidLevel; // 8 +** BYTE gvsNewMemberDisks; +** BYTE gvsNewRaidLevel; +** BYTE gvsRaidSetNumber; +** BYTE gvsRes0; // 4 +** BYTE gvsRes1[4]; // 64 bytes +** } sGUI_VOLUMESET, *pGUI_VOLUMESET; +** GUI_GET_INFO_P : Get Physical Drive Information +** byte 0,1 : length +** byte 2 : command code 0x22 +** byte 3 : drive # (from 0 to max-channels - 1) +** typedef struct sGUI_PHY_DRV +** { +** BYTE gpdModelName[40]; +** BYTE gpdSerialNumber[20]; +** BYTE gpdFirmRev[8]; +** DWORD gpdCapacity; +** DWORD gpdCapacityX; // Reserved for expansion +** BYTE gpdDeviceState; +** BYTE gpdPioMode; +** BYTE gpdCurrentUdmaMode; +** BYTE gpdUdmaMode; +** BYTE gpdDriveSelect; +** BYTE 
gpdRaidNumber; // 0xff if not belongs to a raid set +** sSCSI_ATTR gpdScsi; +** BYTE gpdReserved[40]; // Total to 128 bytes +** } sGUI_PHY_DRV, *pGUI_PHY_DRV; +** GUI_GET_INFO_S : Get System Information +** byte 0,1 : length +** byte 2 : command code 0x23 +** typedef struct sCOM_ATTR +** { +** BYTE comBaudRate; +** BYTE comDataBits; +** BYTE comStopBits; +** BYTE comParity; +** BYTE comFlowControl; +** } sCOM_ATTR, *pCOM_ATTR; +** typedef struct sSYSTEM_INFO +** { +** BYTE gsiVendorName[40]; +** BYTE gsiSerialNumber[16]; +** BYTE gsiFirmVersion[16]; +** BYTE gsiBootVersion[16]; +** BYTE gsiMbVersion[16]; +** BYTE gsiModelName[8]; +** BYTE gsiLocalIp[4]; +** BYTE gsiCurrentIp[4]; +** DWORD gsiTimeTick; +** DWORD gsiCpuSpeed; +** DWORD gsiICache; +** DWORD gsiDCache; +** DWORD gsiScache; +** DWORD gsiMemorySize; +** DWORD gsiMemorySpeed; +** DWORD gsiEvents; +** BYTE gsiMacAddress[6]; +** BYTE gsiDhcp; +** BYTE gsiBeeper; +** BYTE gsiChannelUsage; +** BYTE gsiMaxAtaMode; +** BYTE gsiSdramEcc; // 1:if ECC enabled +** BYTE gsiRebuildPriority; +** sCOM_ATTR gsiComA; // 5 bytes +** sCOM_ATTR gsiComB; // 5 bytes +** BYTE gsiIdeChannels; +** BYTE gsiScsiHostChannels; +** BYTE gsiIdeHostChannels; +** BYTE gsiMaxVolumeSet; +** BYTE gsiMaxRaidSet; +** BYTE gsiEtherPort; // 1:if ether net port supported +** BYTE gsiRaid6Engine; // 1:Raid6 engine supported +** BYTE gsiRes[75]; +** } sSYSTEM_INFO, *pSYSTEM_INFO; +** GUI_CLEAR_EVENT : Clear System Event +** byte 0,1 : length +** byte 2 : command code 0x24 +** GUI_MUTE_BEEPER : Mute current beeper +** byte 0,1 : length +** byte 2 : command code 0x30 +** GUI_BEEPER_SETTING : Disable beeper +** byte 0,1 : length +** byte 2 : command code 0x31 +** byte 3 : 0->disable, 1->enable +** GUI_SET_PASSWORD : Change password +** byte 0,1 : length +** byte 2 : command code 0x32 +** byte 3 : pass word length ( must <= 15 ) +** byte 4 : password (must be alpha-numerical) +** GUI_HOST_INTERFACE_MODE : Set host interface mode +** byte 0,1 : length 
+** byte 2 : command code 0x33 +** byte 3 : 0->Independent, 1->cluster +** GUI_REBUILD_PRIORITY : Set rebuild priority +** byte 0,1 : length +** byte 2 : command code 0x34 +** byte 3 : 0/1/2/3 (low->high) +** GUI_MAX_ATA_MODE : Set maximum ATA mode to be used +** byte 0,1 : length +** byte 2 : command code 0x35 +** byte 3 : 0/1/2/3 (133/100/66/33) +** GUI_RESET_CONTROLLER : Reset Controller +** byte 0,1 : length +** byte 2 : command code 0x36 +** *Response with VT100 screen (discard it) +** GUI_COM_PORT_SETTING : COM port setting +** byte 0,1 : length +** byte 2 : command code 0x37 +** byte 3 : 0->COMA (term port), +** 1->COMB (debug port) +** byte 4 : 0/1/2/3/4/5/6/7 +** (1200/2400/4800/9600/19200/38400/57600/115200) +** byte 5 : data bit +** (0:7 bit, 1:8 bit : must be 8 bit) +** byte 6 : stop bit (0:1, 1:2 stop bits) +** byte 7 : parity (0:none, 1:off, 2:even) +** byte 8 : flow control +** (0:none, 1:xon/xoff, 2:hardware => must use none) +** GUI_NO_OPERATION : No operation +** byte 0,1 : length +** byte 2 : command code 0x38 +** GUI_DHCP_IP : Set DHCP option and local IP address +** byte 0,1 : length +** byte 2 : command code 0x39 +** byte 3 : 0:dhcp disabled, 1:dhcp enabled +** byte 4/5/6/7 : IP address +** GUI_CREATE_PASS_THROUGH : Create pass through disk +** byte 0,1 : length +** byte 2 : command code 0x40 +** byte 3 : device # +** byte 4 : scsi channel (0/1) +** byte 5 : scsi id (0-->15) +** byte 6 : scsi lun (0-->7) +** byte 7 : tagged queue (1 : enabled) +** byte 8 : cache mode (1 : enabled) +** byte 9 : max speed (0/1/2/3/4, +** async/20/40/80/160 for scsi) +** (0/1/2/3/4, 33/66/100/133/150 for ide ) +** GUI_MODIFY_PASS_THROUGH : Modify pass through disk +** byte 0,1 : length +** byte 2 : command code 0x41 +** byte 3 : device # +** byte 4 : scsi channel (0/1) +** byte 5 : scsi id (0-->15) +** byte 6 : scsi lun (0-->7) +** byte 7 : tagged queue (1 : enabled) +** byte 8 : cache mode (1 : enabled) +** byte 9 : max speed (0/1/2/3/4, +** async/20/40/80/160 
for scsi) +** (0/1/2/3/4, 33/66/100/133/150 for ide ) +** GUI_DELETE_PASS_THROUGH : Delete pass through disk +** byte 0,1 : length +** byte 2 : command code 0x42 +** byte 3 : device# to be deleted +** GUI_IDENTIFY_DEVICE : Identify Device +** byte 0,1 : length +** byte 2 : command code 0x43 +** byte 3 : Flash Method +** (0:flash selected, 1:flash not selected) +** byte 4/5/6/7 : IDE device mask to be flashed +** note .... no response data available +** GUI_CREATE_RAIDSET : Create Raid Set +** byte 0,1 : length +** byte 2 : command code 0x50 +** byte 3/4/5/6 : device mask +** byte 7-22 : raidset name (if byte 7 == 0:use default) +** GUI_DELETE_RAIDSET : Delete Raid Set +** byte 0,1 : length +** byte 2 : command code 0x51 +** byte 3 : raidset# +** GUI_EXPAND_RAIDSET : Expand Raid Set +** byte 0,1 : length +** byte 2 : command code 0x52 +** byte 3 : raidset# +** byte 4/5/6/7 : device mask for expansion +** byte 8/9/10 : (8:0 no change, 1 change, 0xff:terminate, +** 9:new raid level, +** 10:new stripe size +** 0/1/2/3/4/5->4/8/16/32/64/128K ) +** byte 11/12/13 : repeat for each volume in the raidset +** GUI_ACTIVATE_RAIDSET : Activate incomplete raid set +** byte 0,1 : length +** byte 2 : command code 0x53 +** byte 3 : raidset# +** GUI_CREATE_HOT_SPARE : Create hot spare disk +** byte 0,1 : length +** byte 2 : command code 0x54 +** byte 3/4/5/6 : device mask for hot spare creation +** GUI_DELETE_HOT_SPARE : Delete hot spare disk +** byte 0,1 : length +** byte 2 : command code 0x55 +** byte 3/4/5/6 : device mask for hot spare deletion +** GUI_CREATE_VOLUME : Create volume set +** byte 0,1 : length +** byte 2 : command code 0x60 +** byte 3 : raidset# +** byte 4-19 : volume set name +** (if byte4 == 0, use default) +** byte 20-27 : volume capacity (blocks) +** byte 28 : raid level +** byte 29 : stripe size +** (0/1/2/3/4/5->4/8/16/32/64/128K) +** byte 30 : channel +** byte 31 : ID +** byte 32 : LUN +** byte 33 : 1 enable tag +** byte 34 : 1 enable cache +** byte 35 : 
speed +** (0/1/2/3/4->async/20/40/80/160 for scsi) +** (0/1/2/3/4->33/66/100/133/150 for IDE ) +** byte 36 : 1 to select quick init +** +** GUI_MODIFY_VOLUME : Modify volume Set +** byte 0,1 : length +** byte 2 : command code 0x61 +** byte 3 : volumeset# +** byte 4-19 : new volume set name +** (if byte4 == 0, not change) +** byte 20-27 : new volume capacity (reserved) +** byte 28 : new raid level +** byte 29 : new stripe size +** (0/1/2/3/4/5->4/8/16/32/64/128K) +** byte 30 : new channel +** byte 31 : new ID +** byte 32 : new LUN +** byte 33 : 1 enable tag +** byte 34 : 1 enable cache +** byte 35 : speed +** (0/1/2/3/4->async/20/40/80/160 for scsi) +** (0/1/2/3/4->33/66/100/133/150 for IDE ) +** GUI_DELETE_VOLUME : Delete volume set +** byte 0,1 : length +** byte 2 : command code 0x62 +** byte 3 : volumeset# +** GUI_START_CHECK_VOLUME : Start volume consistency check +** byte 0,1 : length +** byte 2 : command code 0x63 +** byte 3 : volumeset# +** GUI_STOP_CHECK_VOLUME : Stop volume consistency check +** byte 0,1 : length +** byte 2 : command code 0x64 +** --------------------------------------------------------------------- +** 4. 
Returned data +** --------------------------------------------------------------------- +** (A) Header : 3 bytes sequence (0x5E, 0x01, 0x61) +** (B) Length : 2 bytes +** (low byte 1st, excludes length and checksum byte) +** (C) status or data : +** <1> If length == 1 ==> 1 byte status code +** #define GUI_OK 0x41 +** #define GUI_RAIDSET_NOT_NORMAL 0x42 +** #define GUI_VOLUMESET_NOT_NORMAL 0x43 +** #define GUI_NO_RAIDSET 0x44 +** #define GUI_NO_VOLUMESET 0x45 +** #define GUI_NO_PHYSICAL_DRIVE 0x46 +** #define GUI_PARAMETER_ERROR 0x47 +** #define GUI_UNSUPPORTED_COMMAND 0x48 +** #define GUI_DISK_CONFIG_CHANGED 0x49 +** #define GUI_INVALID_PASSWORD 0x4a +** #define GUI_NO_DISK_SPACE 0x4b +** #define GUI_CHECKSUM_ERROR 0x4c +** #define GUI_PASSWORD_REQUIRED 0x4d +** <2> If length > 1 ==> +** data block returned from controller +** and the contents depends on the command code +** (E) Checksum : checksum of length and status or data byte +************************************************************************** diff --git a/Documentation/scsi/libsas.txt b/Documentation/scsi/libsas.txt new file mode 100644 index 000000000000..9e2078b2a615 --- /dev/null +++ b/Documentation/scsi/libsas.txt @@ -0,0 +1,484 @@ +SAS Layer +--------- + +The SAS Layer is a management infrastructure which manages +SAS LLDDs. It sits between SCSI Core and SAS LLDDs. The +layout is as follows: while SCSI Core is concerned with +SAM/SPC issues, and a SAS LLDD+sequencer is concerned with +phy/OOB/link management, the SAS layer is concerned with: + + * SAS Phy/Port/HA event management (LLDD generates, + SAS Layer processes), + * SAS Port management (creation/destruction), + * SAS Domain discovery and revalidation, + * SAS Domain device management, + * SCSI Host registration/unregistration, + * Device registration with SCSI Core (SAS) or libata + (SATA), and + * Expander management and exporting expander control + to user space. + +A SAS LLDD is a PCI device driver. 
It is concerned with +phy/OOB management, and vendor specific tasks and generates +events to the SAS layer. + +The SAS Layer does most SAS tasks as outlined in the SAS 1.1 +spec. + +The sas_ha_struct describes the SAS LLDD to the SAS layer. +Most of it is used by the SAS Layer but a few fields need to +be initialized by the LLDDs. + +After initializing your hardware, from the probe() function +you call sas_register_ha(). It will register your LLDD with +the SCSI subsystem, creating a SCSI host and it will +register your SAS driver with the sysfs SAS tree it creates. +It will then return. Then you enable your phys to actually +start OOB (at which point your driver will start calling the +notify_* event callbacks). + +Structure descriptions: + +struct sas_phy -------------------- +Normally this is statically embedded to your driver's +phy structure: + struct my_phy { + blah; + struct sas_phy sas_phy; + bleh; + }; +And then all the phys are an array of my_phy in your HA +struct (shown below). + +Then as you go along and initialize your phys you also +initialize the sas_phy struct, along with your own +phy structure. + +In general, the phys are managed by the LLDD and the ports +are managed by the SAS layer. So the phys are initialized +and updated by the LLDD and the ports are initialized and +updated by the SAS layer. + +There is a scheme where the LLDD can RW certain fields, +and the SAS layer can only read such ones, and vice versa. +The idea is to avoid unnecessary locking. + +enabled -- must be set (0/1) +id -- must be set [0,MAX_PHYS) +class, proto, type, role, oob_mode, linkrate -- must be set +oob_mode -- you set this when OOB has finished and then notify +the SAS Layer. + +sas_addr -- this normally points to an array holding the sas +address of the phy, possibly somewhere in your my_phy +struct. + +attached_sas_addr -- set this when you (LLDD) receive an +IDENTIFY frame or a FIS frame, _before_ notifying the SAS +layer. 
The idea is that sometimes the LLDD may want to fake +or provide a different SAS address on that phy/port and this +allows it to do this. At best you should copy the sas +address from the IDENTIFY frame or maybe generate a SAS +address for SATA directly attached devices. The Discover +process may later change this. + +frame_rcvd -- this is where you copy the IDENTIFY/FIS frame +when you get it; you lock, copy, set frame_rcvd_size and +unlock the lock, and then call the event. It is a pointer +since there's no way to know your hw frame size _exactly_, +so you define the actual array in your phy struct and let +this pointer point to it. You copy the frame from your +DMAable memory to that area holding the lock. + +sas_prim -- this is where primitives go when they're +received. See sas.h. Grab the lock, set the primitive, +release the lock, notify. + +port -- this points to the sas_port if the phy belongs +to a port -- the LLDD only reads this. It points to the +sas_port this phy is part of. Set by the SAS Layer. + +ha -- may be set; the SAS layer sets it anyway. + +lldd_phy -- you should set this to point to your phy so you +can find your way around faster when the SAS layer calls one +of your callbacks and passes you a phy. If the sas_phy is +embedded you can also use container_of -- whatever you +prefer. + + +struct sas_port -------------------- +The LLDD doesn't set any fields of this struct -- it only +reads them. They should be self explanatory. + +phy_mask is 32 bit, this should be enough for now, as I +haven't heard of a HA having more than 8 phys. + +lldd_port -- I haven't found use for that -- maybe other +LLDD who wish to have internal port representation can make +use of this. 
+
+
+struct sas_ha_struct --------------------
+It normally is statically declared in your own LLDD
+structure describing your adapter:
+struct my_sas_ha {
+	blah;
+	struct sas_ha_struct sas_ha;
+	struct my_phy phys[MAX_PHYS];
+	struct sas_port sas_ports[MAX_PHYS]; /* (1) */
+	bleh;
+};
+
+(1) If your LLDD doesn't have its own port representation.
+
+What needs to be initialized (sample function given below).
+
+pcidev
+sas_addr -- since the SAS layer doesn't want to mess with
+	    memory allocation, etc, this points to a statically
+	    allocated array somewhere (say in your host adapter
+	    structure) and holds the SAS address of the host
+	    adapter as given by you or the manufacturer, etc.
+sas_port
+sas_phy -- an array of pointers to structures. (see
+	   note above on sas_addr).
+	   These must be set.  See more notes below.
+num_phys -- the number of phys present in the sas_phy array,
+	    and the number of ports present in the sas_port
+	    array.  There can be a maximum of num_phys ports (one
+	    per phy) so we drop the num_ports, and only use
+	    num_phys.
+
+The event interface:
+
+	/* LLDD calls these to notify the class of an event. */
+	void (*notify_ha_event)(struct sas_ha_struct *, enum ha_event);
+	void (*notify_port_event)(struct sas_phy *, enum port_event);
+	void (*notify_phy_event)(struct sas_phy *, enum phy_event);
+
+When sas_register_ha() returns, those are set and can be
+called by the LLDD to notify the SAS layer of such
+events.
+
+The port notification:
+
+	/* The class calls these to notify the LLDD of an event. */
+	void (*lldd_port_formed)(struct sas_phy *);
+	void (*lldd_port_deformed)(struct sas_phy *);
+
+If the LLDD wants notification when a port has been formed
+or deformed it sets those to a function satisfying the type.
+
+A SAS LLDD should also implement at least one of the Task
+Management Functions (TMFs) described in SAM:
+
+	/* Task Management Functions. Must be called from process context.
*/ + int (*lldd_abort_task)(struct sas_task *); + int (*lldd_abort_task_set)(struct domain_device *, u8 *lun); + int (*lldd_clear_aca)(struct domain_device *, u8 *lun); + int (*lldd_clear_task_set)(struct domain_device *, u8 *lun); + int (*lldd_I_T_nexus_reset)(struct domain_device *); + int (*lldd_lu_reset)(struct domain_device *, u8 *lun); + int (*lldd_query_task)(struct sas_task *); + +For more information please read SAM from T10.org. + +Port and Adapter management: + + /* Port and Adapter management */ + int (*lldd_clear_nexus_port)(struct sas_port *); + int (*lldd_clear_nexus_ha)(struct sas_ha_struct *); + +A SAS LLDD should implement at least one of those. + +Phy management: + + /* Phy management */ + int (*lldd_control_phy)(struct sas_phy *, enum phy_func); + +lldd_ha -- set this to point to your HA struct. You can also +use container_of if you embedded it as shown above. + +A sample initialization and registration function +can look like this (called last thing from probe()) +*but* before you enable the phys to do OOB: + +static int register_sas_ha(struct my_sas_ha *my_ha) +{ + int i; + static struct sas_phy *sas_phys[MAX_PHYS]; + static struct sas_port *sas_ports[MAX_PHYS]; + + my_ha->sas_ha.sas_addr = &my_ha->sas_addr[0]; + + for (i = 0; i < MAX_PHYS; i++) { + sas_phys[i] = &my_ha->phys[i].sas_phy; + sas_ports[i] = &my_ha->sas_ports[i]; + } + + my_ha->sas_ha.sas_phy = sas_phys; + my_ha->sas_ha.sas_port = sas_ports; + my_ha->sas_ha.num_phys = MAX_PHYS; + + my_ha->sas_ha.lldd_port_formed = my_port_formed; + + my_ha->sas_ha.lldd_dev_found = my_dev_found; + my_ha->sas_ha.lldd_dev_gone = my_dev_gone; + + my_ha->sas_ha.lldd_max_execute_num = lldd_max_execute_num; (1) + + my_ha->sas_ha.lldd_queue_size = ha_can_queue; + my_ha->sas_ha.lldd_execute_task = my_execute_task; + + my_ha->sas_ha.lldd_abort_task = my_abort_task; + my_ha->sas_ha.lldd_abort_task_set = my_abort_task_set; + my_ha->sas_ha.lldd_clear_aca = my_clear_aca; + my_ha->sas_ha.lldd_clear_task_set = 
my_clear_task_set;
+	my_ha->sas_ha.lldd_I_T_nexus_reset= NULL; (2)
+	my_ha->sas_ha.lldd_lu_reset = my_lu_reset;
+	my_ha->sas_ha.lldd_query_task = my_query_task;
+
+	my_ha->sas_ha.lldd_clear_nexus_port = my_clear_nexus_port;
+	my_ha->sas_ha.lldd_clear_nexus_ha = my_clear_nexus_ha;
+
+	my_ha->sas_ha.lldd_control_phy = my_control_phy;
+
+	return sas_register_ha(&my_ha->sas_ha);
+}
+
+(1) This is normally a LLDD parameter, something along the
+lines of a task collector.  What it tells the SAS Layer is
+whether the SAS layer should run in Direct Mode (default:
+value 0 or 1) or Task Collector Mode (value greater than 1).
+
+In Direct Mode, the SAS Layer calls Execute Task as soon as
+it has a command to send to the SDS, _and_ this is a single
+command, i.e. not linked.
+
+Some hardware (e.g. aic94xx) has the capability to DMA more
+than one task at a time (interrupt) from host memory.  Task
+Collector Mode is an optional feature for HAs which support
+this in their hardware.  (Again, it is completely optional
+even if your hardware supports it.)
+
+In Task Collector Mode, the SAS Layer would do _natural_
+coalescing of tasks and at the appropriate moment it would
+call your driver to DMA more than one task in a single HA
+interrupt.  DBMS setups may want to use this by setting
+lldd_max_execute_num to something greater than 1 at
+insmod/modprobe time.
+
+(2) SAS 1.1 does not define I_T Nexus Reset TMF.
+
+Events
+------
+
+Events are _the only way_ a SAS LLDD notifies the SAS layer
+of anything.  There is no other method or way for a LLDD to
+tell the SAS layer of anything happening internally or in
+the SAS domain.
+
+Phy events:
+	PHYE_LOSS_OF_SIGNAL, (C)
+	PHYE_OOB_DONE,
+	PHYE_OOB_ERROR, (C)
+	PHYE_SPINUP_HOLD.
+
+Port events, passed on a _phy_:
+	PORTE_BYTES_DMAED, (M)
+	PORTE_BROADCAST_RCVD, (E)
+	PORTE_LINK_RESET_ERR, (C)
+	PORTE_TIMER_EVENT, (C)
+	PORTE_HARD_RESET.
+
+Host Adapter event:
+	HAE_RESET
+
+A SAS LLDD should be able to generate
+	- at least one event from group C (choice),
+	- events marked M (mandatory) are mandatory (only one),
+	- events marked E (expander) if it wants the SAS layer
+	  to handle domain revalidation (only one such).
+	- Unmarked events are optional.
+
+Meaning:
+
+HAE_RESET -- when your HA got an internal error and was reset.
+
+PORTE_BYTES_DMAED -- on receiving an IDENTIFY/FIS frame
+PORTE_BROADCAST_RCVD -- on receiving a primitive
+PORTE_LINK_RESET_ERR -- timer expired, loss of signal, loss
+of DWS, etc. (*)
+PORTE_TIMER_EVENT -- DWS reset timeout timer expired (*)
+PORTE_HARD_RESET -- Hard Reset primitive received.
+
+PHYE_LOSS_OF_SIGNAL -- the device is gone (*)
+PHYE_OOB_DONE -- OOB went fine and oob_mode is valid
+PHYE_OOB_ERROR -- Error while doing OOB, the device probably
+got disconnected. (*)
+PHYE_SPINUP_HOLD -- SATA is present, COMWAKE not sent.
+
+(*) should set/clear the appropriate fields in the phy,
+    or alternatively call the inlined sas_phy_disconnected()
+    which is just a helper, from their tasklet.
+
+The Execute Command SCSI RPC:
+
+	int (*lldd_execute_task)(struct sas_task *, int num,
+				 unsigned long gfp_flags);
+
+Used to queue a task to the SAS LLDD.  @task is the task(s)
+to be executed.  @num should be the number of tasks being
+queued at this function call (they are linked via
+task::list), @gfp_flags should be the gfp mask defining the
+context of the caller.
+
+This function should implement the Execute Command SCSI RPC,
+or if you're sending a SCSI Task as linked commands, you
+should also use this function.
+
+That is, when lldd_execute_task() is called, the command(s)
+go out on the transport *immediately*.  There is *no*
+queuing of any sort and at any level in a SAS LLDD.
+
+The use of task::list is two-fold, one for linked commands,
+the other discussed below.
+
+It is possible to queue up more than one task at a time, by
+initializing the list element of struct sas_task, and
+passing the number of tasks enlisted in this manner in num.
+
+Returns: -SAS_QUEUE_FULL, -ENOMEM, nothing was queued;
+	 0, the task(s) were queued.
+
+If you want to pass num > 1, then either
+A) you're the only caller of this function and keep track
+   of what you've queued to the LLDD, or
+B) you know what you're doing and have a strategy of
+   retrying.
+
+As opposed to queuing one task at a time (function call),
+batch queuing of tasks, by having num > 1, greatly
+simplifies LLDD code, sequencer code, and _hardware design_,
+and has some performance advantages in certain situations
+(DBMS).
+
+The LLDD advertises if it can take more than one command at
+a time at lldd_execute_task(), by setting the
+lldd_max_execute_num parameter (controlled by the "collector"
+module parameter in the aic94xx SAS LLDD).
+
+You should leave this at the default of 1, unless you know
+what you're doing.
+
+This is a function of the LLDD, which the SAS layer can
+cater to.
+
+int lldd_queue_size
+	The host adapter's queue size.  This is the maximum
+number of commands the LLDD can have pending to domain
+devices on behalf of all upper layers submitting through
+lldd_execute_task().
+
+You really want to set this to something (much) larger than
+1.
+
+This _really_ has absolutely nothing to do with queuing.
+There is no queuing in SAS LLDDs.
+
+struct sas_task {
+	dev -- the device this task is destined to
+	list -- must be initialized (INIT_LIST_HEAD)
+	task_proto -- _one_ of enum sas_proto
+	scatter -- pointer to scatter gather list array
+	num_scatter -- number of elements in scatter
+	total_xfer_len -- total number of bytes expected to be transferred
+	data_dir -- PCI_DMA_...
+ task_done -- callback when the task has finished execution +}; + +When an external entity, entity other than the LLDD or the +SAS Layer, wants to work with a struct domain_device, it +_must_ call kobject_get() when getting a handle on the +device and kobject_put() when it is done with the device. + +This does two things: + A) implements proper kfree() for the device; + B) increments/decrements the kref for all players: + domain_device + all domain_device's ... (if past an expander) + port + host adapter + pci device + and up the ladder, etc. + +DISCOVERY +--------- + +The sysfs tree has the following purposes: + a) It shows you the physical layout of the SAS domain at + the current time, i.e. how the domain looks in the + physical world right now. + b) Shows some device parameters _at_discovery_time_. + +This is a link to the tree(1) program, very useful in +viewing the SAS domain: +ftp://mama.indstate.edu/linux/tree/ +I expect user space applications to actually create a +graphical interface of this. + +That is, the sysfs domain tree doesn't show or keep state if +you e.g., change the meaning of the READY LED MEANING +setting, but it does show you the current connection status +of the domain device. + +Keeping internal device state changes is responsibility of +upper layers (Command set drivers) and user space. + +When a device or devices are unplugged from the domain, this +is reflected in the sysfs tree immediately, and the device(s) +removed from the system. + +The structure domain_device describes any device in the SAS +domain. It is completely managed by the SAS layer. A task +points to a domain device, this is how the SAS LLDD knows +where to send the task(s) to. A SAS LLDD only reads the +contents of the domain_device structure, but it never creates +or destroys one. + +Expander management from User Space +----------------------------------- + +In each expander directory in sysfs, there is a file called +"smp_portal". 
It is a binary sysfs attribute file, which
+implements an SMP portal (Note: this is *NOT* an SMP port),
+to which user space applications can send SMP requests and
+receive SMP responses.
+
+Functionality is deceptively simple:
+
+1. Build the SMP frame you want to send. The format and layout
+   are described in the SAS spec.  Leave the CRC field equal to 0.
+open(2)
+2. Open the expander's SMP portal sysfs file in RW mode.
+write(2)
+3. Write the frame you built in 1.
+read(2)
+4. Read the amount of data you expect to receive for the frame you built.
+   If you receive a different amount of data than you expected,
+   then there was some kind of error.
+close(2)
+This whole process is shown in detail in the function do_smp_func()
+and its callers, in the file "expander_conf.c".
+
+The kernel functionality is implemented in the file
+"sas_expander.c".
+
+The program "expander_conf.c" implements this. It takes one
+argument, the sysfs file name of the SMP portal to the
+expander, and gives expander information, including routing
+tables.
+
+The SMP portal gives you complete control of the expander,
+so please be careful.
diff --git a/Documentation/seclvl.txt b/Documentation/seclvl.txt
deleted file mode 100644
index 97274d122d0e..000000000000
--- a/Documentation/seclvl.txt
+++ /dev/null
@@ -1,97 +0,0 @@
-BSD Secure Levels Linux Security Module
-Michael A. Halcrow <mike@halcrow.us>
-
-
-Introduction
-
-Under the BSD Secure Levels security model, sets of policies are
-associated with levels. Levels range from -1 to 2, with -1 being the
-weakest and 2 being the strongest. These security policies are
-enforced at the kernel level, so not even the superuser is able to
-disable or circumvent them. This hardens the machine against attackers
-who gain root access to the system.
- - -Levels and Policies - -Level -1 (Permanently Insecure): - - Cannot increase the secure level - -Level 0 (Insecure): - - Cannot ptrace the init process - -Level 1 (Default): - - /dev/mem and /dev/kmem are read-only - - IMMUTABLE and APPEND extended attributes, if set, may not be unset - - Cannot load or unload kernel modules - - Cannot write directly to a mounted block device - - Cannot perform raw I/O operations - - Cannot perform network administrative tasks - - Cannot setuid any file - -Level 2 (Secure): - - Cannot decrement the system time - - Cannot write to any block device, whether mounted or not - - Cannot unmount any mounted filesystems - - -Compilation - -To compile the BSD Secure Levels LSM, seclvl.ko, enable the -SECURITY_SECLVL configuration option. This is found under Security -options -> BSD Secure Levels in the kernel configuration menu. - - -Basic Usage - -Once the machine is in a running state, with all the necessary modules -loaded and all the filesystems mounted, you can load the seclvl.ko -module: - -# insmod seclvl.ko - -The module defaults to secure level 1, except when compiled directly -into the kernel, in which case it defaults to secure level 0. To raise -the secure level to 2, the administrator writes ``2'' to the -seclvl/seclvl file under the sysfs mount point (assumed to be /sys in -these examples): - -# echo -n "2" > /sys/seclvl/seclvl - -Alternatively, you can initialize the module at secure level 2 with -the initlvl module parameter: - -# insmod seclvl.ko initlvl=2 - -At this point, it is impossible to remove the module or reduce the -secure level. If the administrator wishes to have the option of doing -so, he must provide a module parameter, sha1_passwd, that specifies -the SHA1 hash of the password that can be used to reduce the secure -level to 0. 
- -To generate this SHA1 hash, the administrator can use OpenSSL: - -# echo -n "boogabooga" | openssl sha1 -abeda4e0f33defa51741217592bf595efb8d289c - -In order to use password-instigated secure level reduction, the SHA1 -crypto module must be loaded or compiled into the kernel: - -# insmod sha1.ko - -The administrator can then insmod the seclvl module, including the -SHA1 hash of the password: - -# insmod seclvl.ko - sha1_passwd=abeda4e0f33defa51741217592bf595efb8d289c - -To reduce the secure level, write the password to seclvl/passwd under -your sysfs mount point: - -# echo -n "boogabooga" > /sys/seclvl/passwd - -The September 2004 edition of Sys Admin Magazine has an article about -the BSD Secure Levels LSM. I encourage you to refer to that article -for a more in-depth treatment of this security module: - -http://www.samag.com/documents/s=9304/sam0409a/0409a.htm diff --git a/Documentation/sh/new-machine.txt b/Documentation/sh/new-machine.txt index eb2dd2e6993b..73988e0d112b 100644 --- a/Documentation/sh/new-machine.txt +++ b/Documentation/sh/new-machine.txt @@ -41,11 +41,6 @@ Board-specific code: | .. more boards here ... -It should also be noted that each board is required to have some certain -headers. At the time of this writing, io.h is the only thing that needs -to be provided for each board, and can generally just reference generic -functions (with the exception of isa_port2addr). - Next, for companion chips: . `-- arch @@ -104,12 +99,13 @@ and then populate that with sub-directories for each member of the family. Both the Solution Engine and the hp6xx boards are an example of this. After you have setup your new arch/sh/boards/ directory, remember that you -also must add a directory in include/asm-sh for headers localized to this -board. 
In order to interoperate seamlessly with the build system, it's best -to have this directory the same as the arch/sh/boards/ directory name, -though if your board is again part of a family, the build system has ways -of dealing with this, and you can feel free to name the directory after -the family member itself. +should also add a directory in include/asm-sh for headers localized to this +board (if there are going to be more than one). In order to interoperate +seamlessly with the build system, it's best to have this directory the same +as the arch/sh/boards/ directory name, though if your board is again part of +a family, the build system has ways of dealing with this (via incdir-y +overloading), and you can feel free to name the directory after the family +member itself. There are a few things that each board is required to have, both in the arch/sh/boards and the include/asm-sh/ heirarchy. In order to better @@ -122,6 +118,7 @@ might look something like: * arch/sh/boards/vapor/setup.c - Setup code for imaginary board */ #include <linux/init.h> +#include <asm/rtc.h> /* for board_time_init() */ const char *get_system_type(void) { @@ -152,79 +149,57 @@ int __init platform_setup(void) } Our new imaginary board will also have to tie into the machvec in order for it -to be of any use. Currently the machvec is slowly on its way out, but is still -required for the time being. As such, let us take a look at what needs to be -done for the machvec assignment. +to be of any use. machvec functions fall into a number of categories: - I/O functions to IO memory (inb etc) and PCI/main memory (readb etc). - - I/O remapping functions (ioremap etc) - - some initialisation functions - - a 'heartbeat' function - - some miscellaneous flags - -The tree can be built in two ways: - - as a fully generic build. All drivers are linked in, and all functions - go through the machvec - - as a machine specific build. 
In this case only the required drivers - will be linked in, and some macros may be redefined to not go through - the machvec where performance is important (in particular IO functions). - -There are three ways in which IO can be performed: - - none at all. This is really only useful for the 'unknown' machine type, - which us designed to run on a machine about which we know nothing, and - so all all IO instructions do nothing. - - fully custom. In this case all IO functions go to a machine specific - set of functions which can do what they like - - a generic set of functions. These will cope with most situations, - and rely on a single function, mv_port2addr, which is called through the - machine vector, and converts an IO address into a memory address, which - can be read from/written to directly. - -Thus adding a new machine involves the following steps (I will assume I am -adding a machine called vapor): - - - add a new file include/asm-sh/vapor/io.h which contains prototypes for + - I/O mapping functions (ioport_map, ioport_unmap, etc). + - a 'heartbeat' function. + - PCI and IRQ initialization routines. + - Consistent allocators (for boards that need special allocators, + particularly for allocating out of some board-specific SRAM for DMA + handles). + +There are machvec functions added and removed over time, so always be sure to +consult include/asm-sh/machvec.h for the current state of the machvec. + +The kernel will automatically wrap in generic routines for undefined function +pointers in the machvec at boot time, as machvec functions are referenced +unconditionally throughout most of the tree. Some boards have incredibly +sparse machvecs (such as the dreamcast and sh03), whereas others must define +virtually everything (rts7751r2d). 
+ +Adding a new machine is relatively trivial (using vapor as an example): + +If the board-specific definitions are quite minimalistic, as is the case for +the vast majority of boards, simply having a single board-specific header is +sufficient. + + - add a new file include/asm-sh/vapor.h which contains prototypes for any machine specific IO functions prefixed with the machine name, for example vapor_inb. These will be needed when filling out the machine vector. - This is the minimum that is required, however there are ample - opportunities to optimise this. In particular, by making the prototypes - inline function definitions, it is possible to inline the function when - building machine specific versions. Note that the machine vector - functions will still be needed, so that a module built for a generic - setup can be loaded. - - - add a new file arch/sh/boards/vapor/mach.c. This contains the definition - of the machine vector. When building the machine specific version, this - will be the real machine vector (via an alias), while in the generic - version is used to initialise the machine vector, and then freed, by - making it initdata. This should be defined as: - - struct sh_machine_vector mv_vapor __initmv = { - .mv_name = "vapor", - } - ALIAS_MV(vapor) - - - finally add a file arch/sh/boards/vapor/io.c, which contains - definitions of the machine specific io functions. - -A note about initialisation functions. Three initialisation functions are -provided in the machine vector: - - mv_arch_init - called very early on from setup_arch - - mv_init_irq - called from init_IRQ, after the generic SH interrupt - initialisation - - mv_init_pci - currently not used - -Any other remaining functions which need to be called at start up can be -added to the list using the __initcalls macro (or module_init if the code -can be built as a module). 
Many generic drivers probe to see if the device -they are targeting is present, however this may not always be appropriate, -so a flag can be added to the machine vector which will be set on those -machines which have the hardware in question, reducing the probe to a -single conditional. + Note that these prototypes are generated automatically by setting + __IO_PREFIX to something sensible. A typical example would be: + + #define __IO_PREFIX vapor + #include <asm/io_generic.h> + + somewhere in the board-specific header. Any boards being ported that still + have a legacy io.h should remove it entirely and switch to the new model. + + - Add machine vector definitions to the board's setup.c. At a bare minimum, + this must be defined as something like: + + struct sh_machine_vector mv_vapor __initmv = { + .mv_name = "vapor", + }; + ALIAS_MV(vapor) + + - finally add a file arch/sh/boards/vapor/io.c, which contains definitions of + the machine specific io functions (if there are enough to warrant it). 3. Hooking into the Build System ================================ @@ -303,4 +278,3 @@ which will in turn copy the defconfig for this board, run it through oldconfig (prompting you for any new options since the time of creation), and start you on your way to having a functional kernel for your new board. - diff --git a/Documentation/sh/register-banks.txt b/Documentation/sh/register-banks.txt new file mode 100644 index 000000000000..a6719f2f6594 --- /dev/null +++ b/Documentation/sh/register-banks.txt @@ -0,0 +1,33 @@ + Notes on register bank usage in the kernel + ========================================== + +Introduction +------------ + +The SH-3 and SH-4 CPU families traditionally include a single partial register +bank (selected by SR.RB, only r0 ... r7 are banked), whereas other families +may have more full-featured banking or simply no such capabilities at all. 
+ +SR.RB banking +------------- + +In the case of this type of banking, banked registers are mapped directly to +r0 ... r7 if SR.RB is set to the bank we are interested in, otherwise ldc/stc +can still be used to reference the banked registers (as r0_bank ... r7_bank) +when in the context of another bank. The developer must keep the SR.RB value +in mind when writing code that utilizes these banked registers, for obvious +reasons. Userspace is also not able to poke at the bank1 values, so these can +be used rather effectively as scratch registers by the kernel. + +Presently the kernel uses several of these registers. + + - r0_bank, r1_bank (referenced as k0 and k1, used for scratch + registers when doing exception handling). + - r2_bank (used to track the EXPEVT/INTEVT code) + - Used by do_IRQ() and friends for doing irq mapping based off + of the interrupt exception vector jump table offset + - r6_bank (global interrupt mask) + - The SR.IMASK interrupt handler makes use of this to set the + interrupt priority level (used by local_irq_enable()) + - r7_bank (current) + diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index f61af23dd85d..e6b57dd46a4f 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -758,6 +758,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. position_fix - Fix DMA pointer (0 = auto, 1 = none, 2 = POSBUF, 3 = FIFO size) single_cmd - Use single immediate commands to communicate with codecs (for debugging only) + disable_msi - Disable Message Signaled Interrupt (MSI) This module supports one card and autoprobe. @@ -778,11 +779,16 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. 
6stack-digout 6-jack with a SPDIF out w810 3-jack z71v 3-jack (HP shared SPDIF) - asus 3-jack + asus 3-jack (ASUS Mobo) + asus-w1v ASUS W1V + asus-dig ASUS with SPDIF out + asus-dig2 ASUS with SPDIF out (using GPIO2) uniwill 3-jack F1734 2-jack lg LG laptop (m1 express dual) - lg-lw LG LW20 laptop + lg-lw LG LW20/LW25 laptop + tcl TCL S700 + clevo Clevo laptops (m520G, m665n) test for testing/debugging purpose, almost all controls can be adjusted. Appearing only when compiled with $CONFIG_SND_DEBUG=y @@ -790,6 +796,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ALC260 hp HP machines + hp-3013 HP machines (3013-variant) fujitsu Fujitsu S7020 acer Acer TravelMate basic fixed pin assignment (old default model) @@ -797,24 +804,32 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ALC262 fujitsu Fujitsu Laptop + hp-bpc HP xw4400/6400/8400/9400 laptops + benq Benq ED8 basic fixed pin assignment w/o SPDIF auto auto-config reading BIOS (default) ALC882/885 3stack-dig 3-jack with SPDIF I/O 6stck-dig 6-jack digital with SPDIF I/O + arima Arima W820Di1 auto auto-config reading BIOS (default) ALC883/888 3stack-dig 3-jack with SPDIF I/O 6stack-dig 6-jack digital with SPDIF I/O - 6stack-dig-demo 6-stack digital for Intel demo board + 3stack-6ch 3-jack 6-channel + 3stack-6ch-dig 3-jack 6-channel with SPDIF I/O + 6stack-dig-demo 6-jack digital for Intel demo board + acer Acer laptops (Travelmate 3012WTMi, Aspire 5600, etc) auto auto-config reading BIOS (default) ALC861/660 3stack 3-jack 3stack-dig 3-jack with SPDIF I/O 6stack-dig 6-jack with SPDIF I/O + 3stack-660 3-jack (for ALC660) + uniwill-m31 Uniwill M31 laptop auto auto-config reading BIOS (default) CMI9880 @@ -843,10 +858,21 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. 
3stack-dig ditto with SPDIF laptop 3-jack with hp-jack automute laptop-dig ditto with SPDIF - auto auto-confgi reading BIOS (default) + auto auto-config reading BIOS (default) + + STAC9200/9205/9220/9221/9254 + ref Reference board + 3stack D945 3stack + 5stack D945 5stack + SPDIF - STAC7661(?) + STAC9227/9228/9229/927x + ref Reference board + 3stack D965 3stack + 5stack D965 5stack + SPDIF + + STAC9872 vaio Setup for VAIO FE550G/SZ110 + vaio-ar Setup for VAIO AR If the default configuration doesn't work and one of the above matches with your device, report it together with the PCI @@ -1213,6 +1239,14 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module supports only 1 card. This module has no enable option. + Module snd-mts64 + ---------------- + + Module for Ego Systems (ESI) Miditerminal 4140 + + This module supports multiple devices. + Requires parport (CONFIG_PARPORT). + Module snd-nm256 ---------------- diff --git a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl index b8dc51ca776c..4807ef79a94d 100644 --- a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl +++ b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl @@ -1054,9 +1054,8 @@ <para> For a device which allows hotplugging, you can use - <function>snd_card_free_in_thread</function>. This one will - postpone the destruction and wait in a kernel-thread until all - devices are closed. + <function>snd_card_free_when_closed</function>. This one will + postpone the destruction until all devices are closed. </para> </section> diff --git a/Documentation/sparse.txt b/Documentation/sparse.txt index 5a311c38dd1a..f9c99c9a54f9 100644 --- a/Documentation/sparse.txt +++ b/Documentation/sparse.txt @@ -69,10 +69,10 @@ recompiled, or use "make C=2" to run sparse on the files whether they need to be recompiled or not. The latter is a fast way to check the whole tree if you have already built it. 
-The optional make variable CF can be used to pass arguments to sparse. The -build system passes -Wbitwise to sparse automatically. To perform endianness -checks, you may define __CHECK_ENDIAN__: +The optional make variable CHECKFLAGS can be used to pass arguments to sparse. +The build system passes -Wbitwise to sparse automatically. To perform +endianness checks, you may define __CHECK_ENDIAN__: - make C=2 CF="-D__CHECK_ENDIAN__" + make C=2 CHECKFLAGS="-D__CHECK_ENDIAN__" These checks are disabled by default as they generate a host of warnings. diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 7cee90223d3a..20d0d797f539 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/vm: - drop-caches - zone_reclaim_mode - min_unmapped_ratio +- min_slab_ratio - panic_on_oom ============================================================== @@ -138,7 +139,6 @@ This is value ORed together of 1 = Zone reclaim on 2 = Zone reclaim writes dirty pages out 4 = Zone reclaim swaps pages -8 = Also do a global slab reclaim pass zone_reclaim_mode is set during bootup to 1 if it is determined that pages from remote zones will cause a measurable performance reduction. The @@ -162,18 +162,13 @@ Allowing regular swap effectively restricts allocations to the local node unless explicitly overridden by memory policies or cpuset configurations. -It may be advisable to allow slab reclaim if the system makes heavy -use of files and builds up large slab caches. However, the slab -shrink operation is global, may take a long time and free slabs -in all nodes of the system. - ============================================================= min_unmapped_ratio: This is available only on NUMA kernels. -A percentage of the file backed pages in each zone. Zone reclaim will only +A percentage of the total pages in each zone. 
Zone reclaim will only occur if more than this percentage of pages are file backed and unmapped. This is to insure that a minimal amount of local pages is still available for file I/O even if the node is overallocated. @@ -182,6 +177,24 @@ The default is 1 percent. ============================================================= +min_slab_ratio: + +This is available only on NUMA kernels. + +A percentage of the total pages in each zone. On Zone reclaim +(fallback from the local zone occurs) slabs will be reclaimed if more +than this percentage of pages in a zone are reclaimable slab pages. +This insures that the slab growth stays under control even in NUMA +systems that rarely perform global reclaim. + +The default is 5 percent. + +Note that slab reclaim is triggered in a per zone / node fashion. +The process of reclaiming slab memory is currently not node specific +and may not be fast. + +============================================================= + panic_on_oom This enables or disables panic on out-of-memory feature. If this is set to 1, diff --git a/Documentation/usb/error-codes.txt b/Documentation/usb/error-codes.txt index 867f4c38f356..39c68f8c4e6c 100644 --- a/Documentation/usb/error-codes.txt +++ b/Documentation/usb/error-codes.txt @@ -98,13 +98,13 @@ one or more packets could finish before an error stops further endpoint I/O. error, a failure to respond (often caused by device disconnect), or some other fault. --ETIMEDOUT (**) No response packet received within the prescribed +-ETIME (**) No response packet received within the prescribed bus turn-around time. This error may instead be reported as -EPROTO or -EILSEQ. - Note that the synchronous USB message functions - also use this code to indicate timeout expired - before the transfer completed. +-ETIMEDOUT Synchronous USB message functions use this code + to indicate timeout expired before the transfer + completed, and no other error was reported by HC. -EPIPE (**) Endpoint stalled. 
For non-control endpoints, reset this status with usb_clear_halt(). @@ -163,6 +163,3 @@ usb_get_*/usb_set_*(): usb_control_msg(): usb_bulk_msg(): -ETIMEDOUT Timeout expired before the transfer completed. - In the future this code may change to -ETIME, - whose definition is a closer match to this sort - of error. diff --git a/Documentation/usb/usb-serial.txt b/Documentation/usb/usb-serial.txt index 02b0f7beb6d1..a2dee6e6190d 100644 --- a/Documentation/usb/usb-serial.txt +++ b/Documentation/usb/usb-serial.txt @@ -433,6 +433,11 @@ Options supported: See http://www.uuhaus.de/linux/palmconnect.html for up-to-date information on this driver. +AIRcable USB Dongle Bluetooth driver + If there is the cdc_acm driver loaded in the system, you will find that the + cdc_acm claims the device before AIRcable can. This is simply corrected + by unloading both modules and then loading the aircable module before + cdc_acm module Generic Serial driver diff --git a/Documentation/video4linux/CARDLIST.cx88 b/Documentation/video4linux/CARDLIST.cx88 index 00d9a1f2a54c..669a09aa5bb4 100644 --- a/Documentation/video4linux/CARDLIST.cx88 +++ b/Documentation/video4linux/CARDLIST.cx88 @@ -7,10 +7,10 @@ 6 -> AverTV Studio 303 (M126) [1461:000b] 7 -> MSI TV-@nywhere Master [1462:8606] 8 -> Leadtek Winfast DV2000 [107d:6620] - 9 -> Leadtek PVR 2000 [107d:663b,107d:663C] + 9 -> Leadtek PVR 2000 [107d:663b,107d:663c,107d:6632] 10 -> IODATA GV-VCP3/PCI [10fc:d003] 11 -> Prolink PlayTV PVR - 12 -> ASUS PVR-416 [1043:4823] + 12 -> ASUS PVR-416 [1043:4823,1461:c111] 13 -> MSI TV-@nywhere 14 -> KWorld/VStream XPert DVB-T [17de:08a6] 15 -> DViCO FusionHDTV DVB-T1 [18ac:db00] @@ -51,3 +51,7 @@ 50 -> NPG Tech Real TV FM Top 10 [14f1:0842] 51 -> WinFast DTV2000 H [107d:665e] 52 -> Geniatech DVB-S [14f1:0084] + 53 -> Hauppauge WinTV-HVR3000 TriMode Analog/DVB-S/DVB-T [0070:1404] + 54 -> Norwood Micro TV Tuner + 55 -> Shenzhen Tungsten Ages Tech TE-DTV-250 / Swann OEM [c180:c980] + 56 -> Hauppauge WinTV-HVR1300 
DVB-T/Hybrid MPEG Encoder [0070:9600,0070:9601,0070:9602] diff --git a/Documentation/video4linux/CARDLIST.saa7134 b/Documentation/video4linux/CARDLIST.saa7134 index 9068b669f5ee..94cf695b1378 100644 --- a/Documentation/video4linux/CARDLIST.saa7134 +++ b/Documentation/video4linux/CARDLIST.saa7134 @@ -58,7 +58,7 @@ 57 -> Avermedia AVerTV GO 007 FM [1461:f31f] 58 -> ADS Tech Instant TV (saa7135) [1421:0350,1421:0351,1421:0370,1421:1370] 59 -> Kworld/Tevion V-Stream Xpert TV PVR7134 - 60 -> LifeView/Typhoon FlyDVB-T Duo Cardbus [5168:0502,4e42:0502] + 60 -> LifeView/Typhoon/Genius FlyDVB-T Duo Cardbus [5168:0502,4e42:0502,1489:0502] 61 -> Philips TOUGH DVB-T reference design [1131:2004] 62 -> Compro VideoMate TV Gold+II 63 -> Kworld Xpert TV PVR7134 @@ -83,7 +83,7 @@ 82 -> MSI TV@Anywhere plus [1462:6231] 83 -> Terratec Cinergy 250 PCI TV [153b:1160] 84 -> LifeView FlyDVB Trio [5168:0319] - 85 -> AverTV DVB-T 777 [1461:2c05] + 85 -> AverTV DVB-T 777 [1461:2c05,1461:2c05] 86 -> LifeView FlyDVB-T / Genius VideoWonder DVB-T [5168:0301,1489:0301] 87 -> ADS Instant TV Duo Cardbus PTV331 [0331:1421] 88 -> Tevion/KWorld DVB-T 220RF [17de:7201] @@ -94,3 +94,6 @@ 93 -> Medion 7134 Bridge #2 [16be:0005] 94 -> LifeView FlyDVB-T Hybrid Cardbus [5168:3306,5168:3502] 95 -> LifeView FlyVIDEO3000 (NTSC) [5169:0138] + 96 -> Medion Md8800 Quadro [16be:0007,16be:0008] + 97 -> LifeView FlyDVB-S /Acorp TV134DS [5168:0300,4e42:0300] + 98 -> Proteus Pro 2309 [0919:2003] diff --git a/Documentation/video4linux/bttv/Insmod-options b/Documentation/video4linux/bttv/Insmod-options index fc94ff235ffa..bb7c2cac7917 100644 --- a/Documentation/video4linux/bttv/Insmod-options +++ b/Documentation/video4linux/bttv/Insmod-options @@ -54,6 +54,12 @@ bttv.o dropouts. chroma_agc=0/1 AGC of chroma signal, off by default. adc_crush=0/1 Luminance ADC crush, on by default. + i2c_udelay= Allow reduce I2C speed. Default is 5 usecs + (meaning 66,67 Kbps). 
The default is the + maximum supported speed by kernel bitbang + algoritm. You may use lower numbers, if I2C + messages are lost (16 is known to work on + all supported cards). bttv_gpio=0/1 gpiomask= diff --git a/Documentation/video4linux/cx2341x/README.hm12 b/Documentation/video4linux/cx2341x/README.hm12 new file mode 100644 index 000000000000..0e213ed095e6 --- /dev/null +++ b/Documentation/video4linux/cx2341x/README.hm12 @@ -0,0 +1,116 @@ +The cx23416 can produce (and the cx23415 can also read) raw YUV output. The +format of a YUV frame is specific to this chip and is called HM12. 'HM' stands +for 'Hauppauge Macroblock', which is a misnomer as 'Conexant Macroblock' would +be more accurate. + +The format is YUV 4:2:0 which uses 1 Y byte per pixel and 1 U and V byte per +four pixels. + +The data is encoded as two macroblock planes, the first containing the Y +values, the second containing UV macroblocks. + +The Y plane is divided into blocks of 16x16 pixels from left to right +and from top to bottom. Each block is transmitted in turn, line-by-line. + +So the first 16 bytes are the first line of the top-left block, the +second 16 bytes are the second line of the top-left block, etc. After +transmitting this block the first line of the block on the right to the +first block is transmitted, etc. + +The UV plane is divided into blocks of 16x8 UV values going from left +to right, top to bottom. Each block is transmitted in turn, line-by-line. + +So the first 16 bytes are the first line of the top-left block and +contain 8 UV value pairs (16 bytes in total). The second 16 bytes are the +second line of 8 UV pairs of the top-left block, etc. After transmitting +this block the first line of the block on the right to the first block is +transmitted, etc. + +The code below is given as an example on how to convert HM12 to separate +Y, U and V planes. This code assumes frames of 720x576 (PAL) pixels. 
+ +The width of a frame is always 720 pixels, regardless of the actual specified +width. + +-------------------------------------------------------------------------- + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> + +static unsigned char frame[576*720*3/2]; +static unsigned char framey[576*720]; +static unsigned char frameu[576*720 / 4]; +static unsigned char framev[576*720 / 4]; + +static void de_macro_y(unsigned char* dst, unsigned char *src, int dstride, int w, int h) +{ + unsigned int y, x, i; + + // descramble Y plane + // dstride = 720 = w + // The Y plane is divided into blocks of 16x16 pixels + // Each block in transmitted in turn, line-by-line. + for (y = 0; y < h; y += 16) { + for (x = 0; x < w; x += 16) { + for (i = 0; i < 16; i++) { + memcpy(dst + x + (y + i) * dstride, src, 16); + src += 16; + } + } + } +} + +static void de_macro_uv(unsigned char *dstu, unsigned char *dstv, unsigned char *src, int dstride, int w, int h) +{ + unsigned int y, x, i; + + // descramble U/V plane + // dstride = 720 / 2 = w + // The U/V values are interlaced (UVUV...). + // Again, the UV plane is divided into blocks of 16x16 UV values. + // Each block in transmitted in turn, line-by-line. 
+ for (y = 0; y < h; y += 16) { + for (x = 0; x < w; x += 8) { + for (i = 0; i < 16; i++) { + int idx = x + (y + i) * dstride; + + dstu[idx+0] = src[0]; dstv[idx+0] = src[1]; + dstu[idx+1] = src[2]; dstv[idx+1] = src[3]; + dstu[idx+2] = src[4]; dstv[idx+2] = src[5]; + dstu[idx+3] = src[6]; dstv[idx+3] = src[7]; + dstu[idx+4] = src[8]; dstv[idx+4] = src[9]; + dstu[idx+5] = src[10]; dstv[idx+5] = src[11]; + dstu[idx+6] = src[12]; dstv[idx+6] = src[13]; + dstu[idx+7] = src[14]; dstv[idx+7] = src[15]; + src += 16; + } + } + } +} + +/*************************************************************************/ +int main(int argc, char **argv) +{ + FILE *fin; + int i; + + if (argc == 1) fin = stdin; + else fin = fopen(argv[1], "r"); + + if (fin == NULL) { + fprintf(stderr, "cannot open input\n"); + exit(-1); + } + while (fread(frame, sizeof(frame), 1, fin) == 1) { + de_macro_y(framey, frame, 720, 720, 576); + de_macro_uv(frameu, framev, frame + 720 * 576, 720 / 2, 720 / 2, 576 / 2); + fwrite(framey, sizeof(framey), 1, stdout); + fwrite(framev, sizeof(framev), 1, stdout); + fwrite(frameu, sizeof(frameu), 1, stdout); + } + fclose(fin); + return 0; +} + +-------------------------------------------------------------------------- diff --git a/Documentation/video4linux/cx2341x/README.vbi b/Documentation/video4linux/cx2341x/README.vbi new file mode 100644 index 000000000000..5807cf156173 --- /dev/null +++ b/Documentation/video4linux/cx2341x/README.vbi @@ -0,0 +1,45 @@ + +Format of embedded V4L2_MPEG_STREAM_VBI_FMT_IVTV VBI data +========================================================= + +This document describes the V4L2_MPEG_STREAM_VBI_FMT_IVTV format of the VBI data +embedded in an MPEG-2 program stream. This format is in part dictated by some +hardware limitations of the ivtv driver (the driver for the Conexant cx23415/6 +chips), in particular a maximum size for the VBI data. Anything longer is cut +off when the MPEG stream is played back through the cx23415. 
+ +The advantage of this format is it is very compact and that all VBI data for +all lines can be stored while still fitting within the maximum allowed size. + +The stream ID of the VBI data is 0xBD. The maximum size of the embedded data is +4 + 43 * 36, which is 4 bytes for a header and 2 * 18 VBI lines with a 1 byte +header and a 42 bytes payload each. Anything beyond this limit is cut off by +the cx23415/6 firmware. Besides the data for the VBI lines we also need 36 bits +for a bitmask determining which lines are captured and 4 bytes for a magic cookie, +signifying that this data package contains V4L2_MPEG_STREAM_VBI_FMT_IVTV VBI data. +If all lines are used, then there is no longer room for the bitmask. To solve this +two different magic numbers were introduced: + +'itv0': After this magic number two unsigned longs follow. Bits 0-17 of the first +unsigned long denote which lines of the first field are captured. Bits 18-31 of +the first unsigned long and bits 0-3 of the second unsigned long are used for the +second field. + +'ITV0': This magic number assumes all VBI lines are captured, i.e. it implicitly +implies that the bitmasks are 0xffffffff and 0xf. + +After these magic cookies (and the 8 byte bitmask in case of cookie 'itv0') the +captured VBI lines start: + +For each line the least significant 4 bits of the first byte contain the data type. +Possible values are shown in the table below. The payload is in the following 42 +bytes. 
+ +Here is the list of possible data types: + +#define IVTV_SLICED_TYPE_TELETEXT 0x1 // Teletext (uses lines 6-22 for PAL) +#define IVTV_SLICED_TYPE_CC 0x4 // Closed Captions (line 21 NTSC) +#define IVTV_SLICED_TYPE_WSS 0x5 // Wide Screen Signal (line 23 PAL) +#define IVTV_SLICED_TYPE_VPS 0x7 // Video Programming System (PAL) (line 16) + +Hans Verkuil <hverkuil@xs4all.nl> diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt index 6da24e7a56cb..74b77f9e91bc 100644 --- a/Documentation/x86_64/boot-options.txt +++ b/Documentation/x86_64/boot-options.txt @@ -199,6 +199,11 @@ IOMMU allowed overwrite iommu off workarounds for specific chipsets. soft Use software bounce buffering (default for Intel machines) noaperture Don't touch the aperture for AGP. + allowdac Allow DMA >4GB + When off all DMA over >4GB is forced through an IOMMU or bounce + buffering. + nodac Forbid DMA >4GB + panic Always panic when IOMMU overflows swiotlb=pages[,force] @@ -245,6 +250,13 @@ Debugging newfallback: use new unwinder but fall back to old if it gets stuck (default) + call_trace=[old|both|newfallback|new] + old: use old inexact backtracer + new: use new exact dwarf2 unwinder + both: print entries from both + newfallback: use new unwinder but fall back to old if it gets + stuck (default) + Misc noreplacement Don't replace instructions with more appropriate ones diff --git a/Documentation/x86_64/kernel-stacks b/Documentation/x86_64/kernel-stacks new file mode 100644 index 000000000000..bddfddd466ab --- /dev/null +++ b/Documentation/x86_64/kernel-stacks @@ -0,0 +1,99 @@ +Most of the text from Keith Owens, hacked by AK + +x86_64 page size (PAGE_SIZE) is 4K. + +Like all other architectures, x86_64 has a kernel stack for every +active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. +These stacks contain useful data as long as a thread is alive or a +zombie. 
While the thread is in user space the kernel stack is empty +except for the thread_info structure at the bottom. + +In addition to the per thread stacks, there are specialized stacks +associated with each cpu. These stacks are only used while the kernel +is in control on that cpu, when a cpu returns to user space the +specialized stacks contain no useful data. The main cpu stacks is + +* Interrupt stack. IRQSTACKSIZE + + Used for external hardware interrupts. If this is the first external + hardware interrupt (i.e. not a nested hardware interrupt) then the + kernel switches from the current task to the interrupt stack. Like + the split thread and interrupt stacks on i386 (with CONFIG_4KSTACKS), + this gives more room for kernel interrupt processing without having + to increase the size of every per thread stack. + + The interrupt stack is also used when processing a softirq. + +Switching to the kernel interrupt stack is done by software based on a +per CPU interrupt nest counter. This is needed because x86-64 "IST" +hardware stacks cannot nest without races. + +x86_64 also has a feature which is not available on i386, the ability +to automatically switch to a new stack for designated events such as +double fault or NMI, which makes it easier to handle these unusual +events on x86_64. This feature is called the Interrupt Stack Table +(IST). There can be up to 7 IST entries per cpu. The IST code is an +index into the Task State Segment (TSS), the IST entries in the TSS +point to dedicated stacks, each stack can be a different size. + +An IST is selected by an non-zero value in the IST field of an +interrupt-gate descriptor. When an interrupt occurs and the hardware +loads such a descriptor, the hardware automatically sets the new stack +pointer based on the IST value, then invokes the interrupt handler. If +software wants to allow nested IST interrupts then the handler must +adjust the IST values on entry to and exit from the interrupt handler. 
+(this is occasionally done, e.g. for debug exceptions) + +Events with different IST codes (i.e. with different stacks) can be +nested. For example, a debug interrupt can safely be interrupted by an +NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack +pointers on entry to and exit from all IST events, in theory allowing +IST events with the same code to be nested. However in most cases, the +stack size allocated to an IST assumes no nesting for the same code. +If that assumption is ever broken then the stacks will become corrupt. + +The currently assigned IST stacks are :- + +* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 12 - Stack Fault Exception (#SS). + + This allows to recover from invalid stack segments. Rarely + happens. + +* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 8 - Double Fault Exception (#DF). + + Invoked when handling a exception causes another exception. Happens + when the kernel is very confused (e.g. kernel stack pointer corrupt) + Using a separate stack allows to recover from it well enough in many + cases to still output an oops. + +* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for non-maskable interrupts (NMI). + + NMI can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for NMI events avoids making + assumptions about the previous state of the kernel stack. + +* DEBUG_STACK. DEBUG_STKSZ + + Used for hardware debug interrupts (interrupt 1) and for software + debug interrupts (INT3). + + When debugging a kernel, debug interrupts (both hardware and + software) can occur at any time. Using IST for these interrupts + avoids making assumptions about the previous state of the kernel + stack. + +* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 18 - Machine Check Exception (#MC). + + MCE can be delivered at any time, including when the kernel is in the + middle of switching stacks. 
Using IST for MCE events avoids making + assumptions about the previous state of the kernel stack. + +For more details see the Intel IA32 or AMD AMD64 architecture manuals.