diff options
Diffstat (limited to 'Documentation')
172 files changed, 12649 insertions, 6775 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 5f7f7d7f77d2..02457ec9c94f 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -184,6 +184,8 @@ mtrr.txt - how to use PPro Memory Type Range Registers to increase performance. nbd.txt - info on a TCP implementation of a network block device. +netlabel/ + - directory with information on the NetLabel subsystem. networking/ - directory with info on various aspects of networking with Linux. nfsroot.txt diff --git a/Documentation/ABI/obsolete/devfs b/Documentation/ABI/removed/devfs index b8b87399bc8f..8195c4e0d0a1 100644 --- a/Documentation/ABI/obsolete/devfs +++ b/Documentation/ABI/removed/devfs @@ -1,13 +1,12 @@ What: devfs -Date: July 2005 +Date: July 2005 (scheduled), finally removed in kernel v2.6.18 Contact: Greg Kroah-Hartman <gregkh@suse.de> Description: devfs has been unmaintained for a number of years, has unfixable races, contains a naming policy within the kernel that is against the LSB, and can be replaced by using udev. - The files fs/devfs/*, include/linux/devfs_fs*.h will be removed, + The files fs/devfs/*, include/linux/devfs_fs*.h were removed, along with the the assorted devfs function calls throughout the kernel tree. Users: - diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power new file mode 100644 index 000000000000..d882f8093871 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-power @@ -0,0 +1,88 @@ +What: /sys/power/ +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power directory will contain files that will + provide a unified interface to the power management + subsystem. + +What: /sys/power/state +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power/state file controls the system power state. + Reading from this file returns what states are supported, + which is hard-coded to 'standby' (Power-On Suspend), 'mem' + (Suspend-to-RAM), and 'disk' (Suspend-to-Disk). + + Writing to this file one of these strings causes the system to + transition into that state. Please see the file + Documentation/power/states.txt for a description of each of + these states. + +What: /sys/power/disk +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power/disk file controls the operating mode of the + suspend-to-disk mechanism. Reading from this file returns + the name of the method by which the system will be put to + sleep on the next suspend. There are four methods supported: + 'firmware' - means that the memory image will be saved to disk + by some firmware, in which case we also assume that the + firmware will handle the system suspend. + 'platform' - the memory image will be saved by the kernel and + the system will be put to sleep by the platform driver (e.g. + ACPI or other PM registers). + 'shutdown' - the memory image will be saved by the kernel and + the system will be powered off. + 'reboot' - the memory image will be saved by the kernel and + the system will be rebooted. + + The suspend-to-disk method may be chosen by writing to this + file one of the accepted strings: + + 'firmware' + 'platform' + 'shutdown' + 'reboot' + + It will only change to 'firmware' or 'platform' if the system + supports that. + +What: /sys/power/image_size +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power/image_size file controls the size of the image + created by the suspend-to-disk mechanism. It can be written a + string representing a non-negative integer that will be used + as an upper limit of the image size, in bytes. The kernel's + suspend-to-disk code will do its best to ensure the image size + will not exceed this number. However, if it turns out to be + impossible, the kernel will try to suspend anyway using the + smallest image possible. In particular, if "0" is written to + this file, the suspend image will be as small as possible. + + Reading from this file will display the current image size + limit, which is set to 500 MB by default. + +What: /sys/power/pm_trace +Date: August 2006 +Contact: Rafael J. Wysocki <rjw@sisk.pl> +Description: + The /sys/power/pm_trace file controls the code which saves the + last PM event point in the RTC across reboots, so that you can + debug a machine that just hangs during suspend (or more + commonly, during resume). Namely, the RTC is only used to save + the last PM event point if this file contains '1'. Initially + it contains '0' which may be changed to '1' by writing a + string representing a nonzero integer into it. + + To use this debugging feature you should attempt to suspend + the machine, then reboot it and run + + dmesg -s 1000000 | grep 'hash matches' + + CAUTION: Using it will cause your machine's real-time (CMOS) + clock to be set to a random invalid time after a resume. diff --git a/Documentation/Changes b/Documentation/Changes index b02f476c2973..abee7f58c1ed 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -37,15 +37,14 @@ o e2fsprogs 1.29 # tune2fs o jfsutils 1.1.3 # fsck.jfs -V o reiserfsprogs 3.6.3 # reiserfsck -V 2>&1|grep reiserfsprogs o xfsprogs 2.6.0 # xfs_db -V -o pcmciautils 004 -o pcmcia-cs 3.1.21 # cardmgr -V +o pcmciautils 004 # pccardctl -V o quota-tools 3.09 # quota -V o PPP 2.4.0 # pppd --version o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version o nfs-utils 1.0.5 # showmount --version o procps 3.2.0 # ps --version o oprofile 0.9 # oprofiled --version -o udev 071 # udevinfo -V +o udev 081 # udevinfo -V Kernel compilation ================== @@ -181,8 +180,8 @@ Intel IA32 microcode -------------------- A driver has been added to allow updating of Intel IA32 microcode, -accessible as both a devfs regular file and as a normal (misc) -character device. If you are not using devfs you may need to: +accessible as a normal (misc) character device. If you are not using +udev you may need to: mkdir /dev/cpu mknod /dev/cpu/microcode c 10 184 @@ -201,7 +200,9 @@ with programs using shared memory. udev ---- udev is a userspace application for populating /dev dynamically with -only entries for devices actually present. udev replaces devfs. +only entries for devices actually present. udev replaces the basic +functionality of devfs, while allowing persistant device naming for +devices. FUSE ---- @@ -231,18 +232,13 @@ The PPP driver has been restructured to support multilink and to enable it to operate over diverse media layers. If you use PPP, upgrade pppd to at least 2.4.0. -If you are not using devfs, you must have the device file /dev/ppp +If you are not using udev, you must have the device file /dev/ppp which can be made by: mknod /dev/ppp c 108 0 as root. -If you use devfsd and build ppp support as modules, you will need -the following in your /etc/devfsd.conf file: - -LOOKUP PPP MODLOAD - Isdn4k-utils ------------ @@ -271,7 +267,7 @@ active clients. To enable this new functionality, you need to: - mount -t nfsd nfsd /proc/fs/nfs + mount -t nfsd nfsd /proc/fs/nfsd before running exportfs or mountd. It is recommended that all NFS services be protected from the internet-at-large by a firewall where diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index 6d2412ec91ed..29c18966b050 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -532,6 +532,40 @@ appears outweighs the potential value of the hint that tells gcc to do something it would have done anyway. + Chapter 16: Function return values and names + +Functions can return values of many different kinds, and one of the +most common is a value indicating whether the function succeeded or +failed. Such a value can be represented as an error-code integer +(-Exxx = failure, 0 = success) or a "succeeded" boolean (0 = failure, +non-zero = success). + +Mixing up these two sorts of representations is a fertile source of +difficult-to-find bugs. If the C language included a strong distinction +between integers and booleans then the compiler would find these mistakes +for us... but it doesn't. To help prevent such bugs, always follow this +convention: + + If the name of a function is an action or an imperative command, + the function should return an error-code integer. If the name + is a predicate, the function should return a "succeeded" boolean. + +For example, "add work" is a command, and the add_work() function returns 0 +for success or -EBUSY for failure. In the same way, "PCI device present" is +a predicate, and the pci_dev_present() function returns 1 if it succeeds in +finding a matching device or 0 if it doesn't. + +All EXPORTed functions must respect this convention, and so should all +public functions. Private (static) functions need not, but it is +recommended that they do. + +Functions whose return value is the actual result of a computation, rather +than an indication of whether the computation succeeded, are not subject to +this rule. Generally they indicate failure by returning some out-of-range +result. Typical examples would be functions that return pointers; they use +NULL or the ERR_PTR mechanism to report failure. + + Appendix I: References diff --git a/Documentation/DMA-mapping.txt b/Documentation/DMA-mapping.txt index 7c717699032c..63392c9132b4 100644 --- a/Documentation/DMA-mapping.txt +++ b/Documentation/DMA-mapping.txt @@ -698,12 +698,12 @@ these interfaces. Remember that, as defined, consistent mappings are always going to be SAC addressable. The first thing your driver needs to do is query the PCI platform -layer with your devices DAC addressing capabilities: +layer if it is capable of handling your devices DAC addressing +capabilities: - int pci_dac_set_dma_mask(struct pci_dev *pdev, u64 mask); + int pci_dac_dma_supported(struct pci_dev *hwdev, u64 mask); -This routine behaves identically to pci_set_dma_mask. You may not -use the following interfaces if this routine fails. +You may not use the following interfaces if this routine fails. Next, DMA addresses using this API are kept track of using the dma64_addr_t type. It is guaranteed to be big enough to hold any diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index 5a2882d275ba..66e1cf733571 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile @@ -10,7 +10,8 @@ DOCBOOKS := wanbook.xml z8530book.xml mcabook.xml videobook.xml \ kernel-hacking.xml kernel-locking.xml deviceiobook.xml \ procfs-guide.xml writing_usb_driver.xml \ kernel-api.xml journal-api.xml lsm.xml usb.xml \ - gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml + gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml \ + genericirq.xml ### # The build process is as follows (targets): diff --git a/Documentation/DocBook/genericirq.tmpl b/Documentation/DocBook/genericirq.tmpl new file mode 100644 index 000000000000..0f4a4b6321e4 --- /dev/null +++ b/Documentation/DocBook/genericirq.tmpl @@ -0,0 +1,474 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="Generic-IRQ-Guide"> + <bookinfo> + <title>Linux generic IRQ handling</title> + + <authorgroup> + <author> + <firstname>Thomas</firstname> + <surname>Gleixner</surname> + <affiliation> + <address> + <email>tglx@linutronix.de</email> + </address> + </affiliation> + </author> + <author> + <firstname>Ingo</firstname> + <surname>Molnar</surname> + <affiliation> + <address> + <email>mingo@elte.hu</email> + </address> + </affiliation> + </author> + </authorgroup> + + <copyright> + <year>2005-2006</year> + <holder>Thomas Gleixner</holder> + </copyright> + <copyright> + <year>2005-2006</year> + <holder>Ingo Molnar</holder> + </copyright> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License version 2 as published by the Free Software Foundation. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="intro"> + <title>Introduction</title> + <para> + The generic interrupt handling layer is designed to provide a + complete abstraction of interrupt handling for device drivers. + It is able to handle all the different types of interrupt controller + hardware. Device drivers use generic API functions to request, enable, + disable and free interrupts. The drivers do not have to know anything + about interrupt hardware details, so they can be used on different + platforms without code changes. + </para> + <para> + This documentation is provided to developers who want to implement + an interrupt subsystem based for their architecture, with the help + of the generic IRQ handling layer. + </para> + </chapter> + + <chapter id="rationale"> + <title>Rationale</title> + <para> + The original implementation of interrupt handling in Linux is using + the __do_IRQ() super-handler, which is able to deal with every + type of interrupt logic. + </para> + <para> + Originally, Russell King identified different types of handlers to + build a quite universal set for the ARM interrupt handler + implementation in Linux 2.5/2.6. He distinguished between: + <itemizedlist> + <listitem><para>Level type</para></listitem> + <listitem><para>Edge type</para></listitem> + <listitem><para>Simple type</para></listitem> + </itemizedlist> + In the SMP world of the __do_IRQ() super-handler another type + was identified: + <itemizedlist> + <listitem><para>Per CPU type</para></listitem> + </itemizedlist> + </para> + <para> + This split implementation of highlevel IRQ handlers allows us to + optimize the flow of the interrupt handling for each specific + interrupt type. This reduces complexity in that particular codepath + and allows the optimized handling of a given type. + </para> + <para> + The original general IRQ implementation used hw_interrupt_type + structures and their ->ack(), ->end() [etc.] callbacks to + differentiate the flow control in the super-handler. This leads to + a mix of flow logic and lowlevel hardware logic, and it also leads + to unnecessary code duplication: for example in i386, there is a + ioapic_level_irq and a ioapic_edge_irq irq-type which share many + of the lowlevel details but have different flow handling. + </para> + <para> + A more natural abstraction is the clean separation of the + 'irq flow' and the 'chip details'. + </para> + <para> + Analysing a couple of architecture's IRQ subsystem implementations + reveals that most of them can use a generic set of 'irq flow' + methods and only need to add the chip level specific code. + The separation is also valuable for (sub)architectures + which need specific quirks in the irq flow itself but not in the + chip-details - and thus provides a more transparent IRQ subsystem + design. + </para> + <para> + Each interrupt descriptor is assigned its own highlevel flow + handler, which is normally one of the generic + implementations. (This highlevel flow handler implementation also + makes it simple to provide demultiplexing handlers which can be + found in embedded platforms on various architectures.) + </para> + <para> + The separation makes the generic interrupt handling layer more + flexible and extensible. For example, an (sub)architecture can + use a generic irq-flow implementation for 'level type' interrupts + and add a (sub)architecture specific 'edge type' implementation. + </para> + <para> + To make the transition to the new model easier and prevent the + breakage of existing implementations, the __do_IRQ() super-handler + is still available. This leads to a kind of duality for the time + being. Over time the new model should be used in more and more + architectures, as it enables smaller and cleaner IRQ subsystems. + </para> + </chapter> + <chapter id="bugs"> + <title>Known Bugs And Assumptions</title> + <para> + None (knock on wood). + </para> + </chapter> + + <chapter id="Abstraction"> + <title>Abstraction layers</title> + <para> + There are three main levels of abstraction in the interrupt code: + <orderedlist> + <listitem><para>Highlevel driver API</para></listitem> + <listitem><para>Highlevel IRQ flow handlers</para></listitem> + <listitem><para>Chiplevel hardware encapsulation</para></listitem> + </orderedlist> + </para> + <sect1> + <title>Interrupt control flow</title> + <para> + Each interrupt is described by an interrupt descriptor structure + irq_desc. The interrupt is referenced by an 'unsigned int' numeric + value which selects the corresponding interrupt decription structure + in the descriptor structures array. + The descriptor structure contains status information and pointers + to the interrupt flow method and the interrupt chip structure + which are assigned to this interrupt. + </para> + <para> + Whenever an interrupt triggers, the lowlevel arch code calls into + the generic interrupt code by calling desc->handle_irq(). + This highlevel IRQ handling function only uses desc->chip primitives + referenced by the assigned chip descriptor structure. + </para> + </sect1> + <sect1> + <title>Highlevel Driver API</title> + <para> + The highlevel Driver API consists of following functions: + <itemizedlist> + <listitem><para>request_irq()</para></listitem> + <listitem><para>free_irq()</para></listitem> + <listitem><para>disable_irq()</para></listitem> + <listitem><para>enable_irq()</para></listitem> + <listitem><para>disable_irq_nosync() (SMP only)</para></listitem> + <listitem><para>synchronize_irq() (SMP only)</para></listitem> + <listitem><para>set_irq_type()</para></listitem> + <listitem><para>set_irq_wake()</para></listitem> + <listitem><para>set_irq_data()</para></listitem> + <listitem><para>set_irq_chip()</para></listitem> + <listitem><para>set_irq_chip_data()</para></listitem> + </itemizedlist> + See the autogenerated function documentation for details. + </para> + </sect1> + <sect1> + <title>Highlevel IRQ flow handlers</title> + <para> + The generic layer provides a set of pre-defined irq-flow methods: + <itemizedlist> + <listitem><para>handle_level_irq</para></listitem> + <listitem><para>handle_edge_irq</para></listitem> + <listitem><para>handle_simple_irq</para></listitem> + <listitem><para>handle_percpu_irq</para></listitem> + </itemizedlist> + The interrupt flow handlers (either predefined or architecture + specific) are assigned to specific interrupts by the architecture + either during bootup or during device initialization. + </para> + <sect2> + <title>Default flow implementations</title> + <sect3> + <title>Helper functions</title> + <para> + The helper functions call the chip primitives and + are used by the default flow implementations. + The following helper functions are implemented (simplified excerpt): + <programlisting> +default_enable(irq) +{ + desc->chip->unmask(irq); +} + +default_disable(irq) +{ + if (!delay_disable(irq)) + desc->chip->mask(irq); +} + +default_ack(irq) +{ + chip->ack(irq); +} + +default_mask_ack(irq) +{ + if (chip->mask_ack) { + chip->mask_ack(irq); + } else { + chip->mask(irq); + chip->ack(irq); + } +} + +noop(irq) +{ +} + + </programlisting> + </para> + </sect3> + </sect2> + <sect2> + <title>Default flow handler implementations</title> + <sect3> + <title>Default Level IRQ flow handler</title> + <para> + handle_level_irq provides a generic implementation + for level-triggered interrupts. + </para> + <para> + The following control flow is implemented (simplified excerpt): + <programlisting> +desc->chip->start(); +handle_IRQ_event(desc->action); +desc->chip->end(); + </programlisting> + </para> + </sect3> + <sect3> + <title>Default Edge IRQ flow handler</title> + <para> + handle_edge_irq provides a generic implementation + for edge-triggered interrupts. + </para> + <para> + The following control flow is implemented (simplified excerpt): + <programlisting> +if (desc->status & running) { + desc->chip->hold(); + desc->status |= pending | masked; + return; +} +desc->chip->start(); +desc->status |= running; +do { + if (desc->status & masked) + desc->chip->enable(); + desc-status &= ~pending; + handle_IRQ_event(desc->action); +} while (status & pending); +desc-status &= ~running; +desc->chip->end(); + </programlisting> + </para> + </sect3> + <sect3> + <title>Default simple IRQ flow handler</title> + <para> + handle_simple_irq provides a generic implementation + for simple interrupts. + </para> + <para> + Note: The simple flow handler does not call any + handler/chip primitives. + </para> + <para> + The following control flow is implemented (simplified excerpt): + <programlisting> +handle_IRQ_event(desc->action); + </programlisting> + </para> + </sect3> + <sect3> + <title>Default per CPU flow handler</title> + <para> + handle_percpu_irq provides a generic implementation + for per CPU interrupts. + </para> + <para> + Per CPU interrupts are only available on SMP and + the handler provides a simplified version without + locking. + </para> + <para> + The following control flow is implemented (simplified excerpt): + <programlisting> +desc->chip->start(); +handle_IRQ_event(desc->action); +desc->chip->end(); + </programlisting> + </para> + </sect3> + </sect2> + <sect2> + <title>Quirks and optimizations</title> + <para> + The generic functions are intended for 'clean' architectures and chips, + which have no platform-specific IRQ handling quirks. If an architecture + needs to implement quirks on the 'flow' level then it can do so by + overriding the highlevel irq-flow handler. + </para> + </sect2> + <sect2> + <title>Delayed interrupt disable</title> + <para> + This per interrupt selectable feature, which was introduced by Russell + King in the ARM interrupt implementation, does not mask an interrupt + at the hardware level when disable_irq() is called. The interrupt is + kept enabled and is masked in the flow handler when an interrupt event + happens. This prevents losing edge interrupts on hardware which does + not store an edge interrupt event while the interrupt is disabled at + the hardware level. When an interrupt arrives while the IRQ_DISABLED + flag is set, then the interrupt is masked at the hardware level and + the IRQ_PENDING bit is set. When the interrupt is re-enabled by + enable_irq() the pending bit is checked and if it is set, the + interrupt is resent either via hardware or by a software resend + mechanism. (It's necessary to enable CONFIG_HARDIRQS_SW_RESEND when + you want to use the delayed interrupt disable feature and your + hardware is not capable of retriggering an interrupt.) + The delayed interrupt disable can be runtime enabled, per interrupt, + by setting the IRQ_DELAYED_DISABLE flag in the irq_desc status field. + </para> + </sect2> + </sect1> + <sect1> + <title>Chiplevel hardware encapsulation</title> + <para> + The chip level hardware descriptor structure irq_chip + contains all the direct chip relevant functions, which + can be utilized by the irq flow implementations. + <itemizedlist> + <listitem><para>ack()</para></listitem> + <listitem><para>mask_ack() - Optional, recommended for performance</para></listitem> + <listitem><para>mask()</para></listitem> + <listitem><para>unmask()</para></listitem> + <listitem><para>retrigger() - Optional</para></listitem> + <listitem><para>set_type() - Optional</para></listitem> + <listitem><para>set_wake() - Optional</para></listitem> + </itemizedlist> + These primitives are strictly intended to mean what they say: ack means + ACK, masking means masking of an IRQ line, etc. It is up to the flow + handler(s) to use these basic units of lowlevel functionality. + </para> + </sect1> + </chapter> + + <chapter id="doirq"> + <title>__do_IRQ entry point</title> + <para> + The original implementation __do_IRQ() is an alternative entry + point for all types of interrupts. + </para> + <para> + This handler turned out to be not suitable for all + interrupt hardware and was therefore reimplemented with split + functionality for egde/level/simple/percpu interrupts. This is not + only a functional optimization. It also shortens code paths for + interrupts. + </para> + <para> + To make use of the split implementation, replace the call to + __do_IRQ by a call to desc->chip->handle_irq() and associate + the appropriate handler function to desc->chip->handle_irq(). + In most cases the generic handler implementations should + be sufficient. + </para> + </chapter> + + <chapter id="locking"> + <title>Locking on SMP</title> + <para> + The locking of chip registers is up to the architecture that + defines the chip primitives. There is a chip->lock field that can be used + for serialization, but the generic layer does not touch it. The per-irq + structure is protected via desc->lock, by the generic layer. + </para> + </chapter> + <chapter id="structs"> + <title>Structures</title> + <para> + This chapter contains the autogenerated documentation of the structures which are + used in the generic IRQ layer. + </para> +!Iinclude/linux/irq.h + </chapter> + + <chapter id="pubfunctions"> + <title>Public Functions Provided</title> + <para> + This chapter contains the autogenerated documentation of the kernel API functions + which are exported. + </para> +!Ekernel/irq/manage.c +!Ekernel/irq/chip.c + </chapter> + + <chapter id="intfunctions"> + <title>Internal Functions Provided</title> + <para> + This chapter contains the autogenerated documentation of the internal functions. + </para> +!Ikernel/irq/handle.c +!Ikernel/irq/chip.c + </chapter> + + <chapter id="credits"> + <title>Credits</title> + <para> + The following people have contributed to this document: + <orderedlist> + <listitem><para>Thomas Gleixner<email>tglx@linutronix.de</email></para></listitem> + <listitem><para>Ingo Molnar<email>mingo@elte.hu</email></para></listitem> + </orderedlist> + </para> + </chapter> +</book> diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index 31b727ceb127..6d4b1ef5b6f1 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -59,9 +59,14 @@ !Iinclude/linux/hrtimer.h !Ekernel/hrtimer.c </sect1> + <sect1><title>Workqueues and Kevents</title> +!Ekernel/workqueue.c + </sect1> <sect1><title>Internal Functions</title> !Ikernel/exit.c !Ikernel/signal.c +!Iinclude/linux/kthread.h +!Ekernel/kthread.c </sect1> <sect1><title>Kernel objects manipulation</title> @@ -114,6 +119,29 @@ X!Ilib/string.c </sect1> </chapter> + <chapter id="kernel-lib"> + <title>Basic Kernel Library Functions</title> + + <para> + The Linux kernel provides more basic utility functions. + </para> + + <sect1><title>Bitmap Operations</title> +!Elib/bitmap.c +!Ilib/bitmap.c + </sect1> + + <sect1><title>Command-line Parsing</title> +!Elib/cmdline.c + </sect1> + + <sect1><title>CRC Functions</title> +!Elib/crc16.c +!Elib/crc32.c +!Elib/crc-ccitt.c + </sect1> + </chapter> + <chapter id="mm"> <title>Memory Management in Linux</title> <sect1><title>The Slab Cache</title> @@ -153,27 +181,6 @@ X!Ilib/string.c </sect1> </chapter> - <chapter id="proc"> - <title>The proc filesystem</title> - - <sect1><title>sysctl interface</title> -!Ekernel/sysctl.c - </sect1> - - <sect1><title>proc filesystem interface</title> -!Ifs/proc/base.c - </sect1> - </chapter> - - <chapter id="debugfs"> - <title>The debugfs filesystem</title> - - <sect1><title>debugfs interface</title> -!Efs/debugfs/inode.c -!Efs/debugfs/file.c - </sect1> - </chapter> - <chapter id="vfs"> <title>The Linux VFS</title> <sect1><title>The Filesystem types</title> @@ -206,6 +213,50 @@ X!Ilib/string.c </sect1> </chapter> + <chapter id="proc"> + <title>The proc filesystem</title> + + <sect1><title>sysctl interface</title> +!Ekernel/sysctl.c + </sect1> + + <sect1><title>proc filesystem interface</title> +!Ifs/proc/base.c + </sect1> + </chapter> + + <chapter id="sysfs"> + <title>The Filesystem for Exporting Kernel Objects</title> +!Efs/sysfs/file.c +!Efs/sysfs/symlink.c +!Efs/sysfs/bin.c + </chapter> + + <chapter id="debugfs"> + <title>The debugfs filesystem</title> + + <sect1><title>debugfs interface</title> +!Efs/debugfs/inode.c +!Efs/debugfs/file.c + </sect1> + </chapter> + + <chapter id="relayfs"> + <title>relay interface support</title> + + <para> + Relay interface support + is designed to provide an efficient mechanism for tools and + facilities to relay large amounts of data from kernel space to + user space. + </para> + + <sect1><title>relay interface</title> +!Ekernel/relay.c +!Ikernel/relay.c + </sect1> + </chapter> + <chapter id="netcore"> <title>Linux Networking</title> <sect1><title>Networking Base Types</title> @@ -275,20 +326,19 @@ X!Ekernel/module.c </sect1> <sect1><title>Resources Management</title> -!Ekernel/resource.c +!Ikernel/resource.c </sect1> <sect1><title>MTRR Handling</title> !Earch/i386/kernel/cpu/mtrr/main.c </sect1> + <sect1><title>PCI Support Library</title> !Edrivers/pci/pci.c !Edrivers/pci/pci-driver.c !Edrivers/pci/remove.c !Edrivers/pci/pci-acpi.c -<!-- kerneldoc does not understand to __devinit -X!Edrivers/pci/search.c - --> +!Edrivers/pci/search.c !Edrivers/pci/msi.c !Edrivers/pci/bus.c <!-- FIXME: Removed for now since no structured comments in source @@ -315,16 +365,11 @@ X!Earch/i386/kernel/mca.c </sect1> </chapter> - <chapter id="devfs"> - <title>The Device File System</title> -!Efs/devfs/base.c - </chapter> - - <chapter id="sysfs"> - <title>The Filesystem for Exporting Kernel Objects</title> -!Efs/sysfs/file.c -!Efs/sysfs/symlink.c -!Efs/sysfs/bin.c + <chapter id="firmware"> + <title>Firmware Interfaces</title> + <sect1><title>DMI Interfaces</title> +!Edrivers/firmware/dmi_scan.c + </sect1> </chapter> <chapter id="security"> @@ -357,6 +402,7 @@ X!Iinclude/linux/device.h --> !Edrivers/base/driver.c !Edrivers/base/core.c +!Edrivers/base/class.c !Edrivers/base/firmware_class.c !Edrivers/base/transport_class.c !Edrivers/base/dmapool.c @@ -403,17 +449,29 @@ X!Edrivers/pnp/system.c </sect1> </chapter> - <chapter id="blkdev"> <title>Block Devices</title> !Eblock/ll_rw_blk.c </chapter> + <chapter id="chrdev"> + <title>Char devices</title> +!Efs/char_dev.c + </chapter> + <chapter id="miscdev"> <title>Miscellaneous Devices</title> !Edrivers/char/misc.c </chapter> + <chapter id="parportdev"> + <title>Parallel Port Devices</title> +!Iinclude/linux/parport.h +!Edrivers/parport/ieee1284.c +!Edrivers/parport/share.c +!Idrivers/parport/daisy.c + </chapter> + <chapter id="viddev"> <title>Video4Linux</title> !Edrivers/media/video/videodev.c diff --git a/Documentation/DocBook/kernel-locking.tmpl b/Documentation/DocBook/kernel-locking.tmpl index 158ffe9bfade..644c3884fab9 100644 --- a/Documentation/DocBook/kernel-locking.tmpl +++ b/Documentation/DocBook/kernel-locking.tmpl @@ -1590,7 +1590,7 @@ the amount of locking which needs to be done. <para> Our final dilemma is this: when can we actually destroy the removed element? Remember, a reader might be stepping through - this element in the list right now: it we free this element and + this element in the list right now: if we free this element and the <symbol>next</symbol> pointer changes, the reader will jump off into garbage and crash. We need to wait until we know that all the readers who were traversing the list when we deleted the diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl index e97c32314541..065e8dc23e3a 100644 --- a/Documentation/DocBook/libata.tmpl +++ b/Documentation/DocBook/libata.tmpl @@ -868,18 +868,18 @@ and other resources, etc. <chapter id="libataExt"> <title>libata Library</title> -!Edrivers/scsi/libata-core.c +!Edrivers/ata/libata-core.c </chapter> <chapter id="libataInt"> <title>libata Core Internals</title> -!Idrivers/scsi/libata-core.c +!Idrivers/ata/libata-core.c </chapter> <chapter id="libataScsiInt"> <title>libata SCSI translation/emulation</title> -!Edrivers/scsi/libata-scsi.c -!Idrivers/scsi/libata-scsi.c +!Edrivers/ata/libata-scsi.c +!Idrivers/ata/libata-scsi.c </chapter> <chapter id="ataExceptions"> @@ -1600,12 +1600,12 @@ and other resources, etc. <chapter id="PiixInt"> <title>ata_piix Internals</title> -!Idrivers/scsi/ata_piix.c +!Idrivers/ata/ata_piix.c </chapter> <chapter id="SILInt"> <title>sata_sil Internals</title> -!Idrivers/scsi/sata_sil.c +!Idrivers/ata/sata_sil.c </chapter> <chapter id="libataThanks"> diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl index 6e463d0db266..a8c8cce50633 100644 --- a/Documentation/DocBook/mtdnand.tmpl +++ b/Documentation/DocBook/mtdnand.tmpl @@ -109,7 +109,7 @@ for most of the implementations. These functions can be replaced by the board driver if neccecary. Those functions are called via pointers in the NAND chip description structure. The board driver can set the functions which - should be replaced by board dependend functions before calling nand_scan(). + should be replaced by board dependent functions before calling nand_scan(). If the function pointer is NULL on entry to nand_scan() then the pointer is set to the default function which is suitable for the detected chip type. </para></listitem> @@ -133,7 +133,7 @@ [REPLACEABLE]</para><para> Replaceable members hold hardware related functions which can be provided by the board driver. The board driver can set the functions which - should be replaced by board dependend functions before calling nand_scan(). + should be replaced by board dependent functions before calling nand_scan(). If the function pointer is NULL on entry to nand_scan() then the pointer is set to the default function which is suitable for the detected chip type. </para></listitem> @@ -156,9 +156,8 @@ <title>Basic board driver</title> <para> For most boards it will be sufficient to provide just the - basic functions and fill out some really board dependend + basic functions and fill out some really board dependent members in the nand chip description structure. - See drivers/mtd/nand/skeleton for reference. </para> <sect1> <title>Basic defines</title> @@ -189,9 +188,9 @@ static unsigned long baseaddr; <sect1> <title>Partition defines</title> <para> - If you want to divide your device into parititions, then - enable the configuration switch CONFIG_MTD_PARITIONS and define - a paritioning scheme suitable to your board. + If you want to divide your device into partitions, then + enable the configuration switch CONFIG_MTD_PARTITIONS and define + a partitioning scheme suitable to your board. </para> <programlisting> #define NUM_PARTITIONS 2 @@ -1295,7 +1294,9 @@ in this page</entry> </para> !Idrivers/mtd/nand/nand_base.c !Idrivers/mtd/nand/nand_bbt.c -!Idrivers/mtd/nand/nand_ecc.c +<!-- No internal functions for kernel-doc: +X!Idrivers/mtd/nand/nand_ecc.c +--> </chapter> <chapter id="credits"> diff --git a/Documentation/DocBook/usb.tmpl b/Documentation/DocBook/usb.tmpl index 320af25de3a2..3608472d7b74 100644 --- a/Documentation/DocBook/usb.tmpl +++ b/Documentation/DocBook/usb.tmpl @@ -43,59 +43,52 @@ <para>A Universal Serial Bus (USB) is used to connect a host, such as a PC or workstation, to a number of peripheral - devices. USB uses a tree structure, with the host at the + devices. USB uses a tree structure, with the host as the root (the system's master), hubs as interior nodes, and - peripheral devices as leaves (and slaves). + peripherals as leaves (and slaves). Modern PCs support several such trees of USB devices, usually one USB 2.0 tree (480 Mbit/sec each) with a few USB 1.1 trees (12 Mbit/sec each) that are used when you connect a USB 1.1 device directly to the machine's "root hub". </para> - <para>That master/slave asymmetry was designed in part for - ease of use. It is not physically possible to assemble - (legal) USB cables incorrectly: all upstream "to-the-host" - connectors are the rectangular type, matching the sockets on - root hubs, and the downstream type are the squarish type - (or they are built in to the peripheral). - Software doesn't need to deal with distributed autoconfiguration - since the pre-designated master node manages all that. - At the electrical level, bus protocol overhead is reduced by - eliminating arbitration and moving scheduling into host software. + <para>That master/slave asymmetry was designed-in for a number of + reasons, one being ease of use. It is not physically possible to + assemble (legal) USB cables incorrectly: all upstream "to the host" + connectors are the rectangular type (matching the sockets on + root hubs), and all downstream connectors are the squarish type + (or they are built into the peripheral). + Also, the host software doesn't need to deal with distributed + auto-configuration since the pre-designated master node manages all that. + And finally, at the electrical level, bus protocol overhead is reduced by + eliminating arbitration and moving scheduling into the host software. </para> - <para>USB 1.0 was announced in January 1996, and was revised + <para>USB 1.0 was announced in January 1996 and was revised as USB 1.1 (with improvements in hub specification and support for interrupt-out transfers) in September 1998. - USB 2.0 was released in April 2000, including high speed - transfers and transaction translating hubs (used for USB 1.1 + USB 2.0 was released in April 2000, adding high-speed + transfers and transaction-translating hubs (used for USB 1.1 and 1.0 backward compatibility). </para> - <para>USB support was added to Linux early in the 2.2 kernel series - shortly before the 2.3 development forked off. Updates - from 2.3 were regularly folded back into 2.2 releases, bringing - new features such as <filename>/sbin/hotplug</filename> support, - more drivers, and more robustness. - The 2.5 kernel series continued such improvements, and also - worked on USB 2.0 support, - higher performance, - better consistency between host controller drivers, - API simplification (to make bugs less likely), - and providing internal "kerneldoc" documentation. + <para>Kernel developers added USB support to Linux early in the 2.2 kernel + series, shortly before 2.3 development forked. Updates from 2.3 were + regularly folded back into 2.2 releases, which improved reliability and + brought <filename>/sbin/hotplug</filename> support as well more drivers. + Such improvements were continued in the 2.5 kernel series, where they added + USB 2.0 support, improved performance, and made the host controller drivers + (HCDs) more consistent. They also simplified the API (to make bugs less + likely) and added internal "kerneldoc" documentation. </para> <para>Linux can run inside USB devices as well as on the hosts that control the devices. - Because the Linux 2.x USB support evolved to support mass market - platforms such as Apple Macintosh or PC-compatible systems, - it didn't address design concerns for those types of USB systems. - So it can't be used inside mass-market PDAs, or other peripherals. - USB device drivers running inside those Linux peripherals + But USB device drivers running inside those peripherals don't do the same things as the ones running inside hosts, - and so they've been given a different name: - they're called <emphasis>gadget drivers</emphasis>. - This document does not present gadget drivers. + so they've been given a different name: + <emphasis>gadget drivers</emphasis>. + This document does not cover gadget drivers. </para> </chapter> @@ -103,17 +96,14 @@ <chapter id="host"> <title>USB Host-Side API Model</title> - <para>Within the kernel, - host-side drivers for USB devices talk to the "usbcore" APIs. - There are two types of public "usbcore" APIs, targetted at two different - layers of USB driver. Those are - <emphasis>general purpose</emphasis> drivers, exposed through - driver frameworks such as block, character, or network devices; - and drivers that are <emphasis>part of the core</emphasis>, - which are involved in managing a USB bus. - Such core drivers include the <emphasis>hub</emphasis> driver, - which manages trees of USB devices, and several different kinds - of <emphasis>host controller driver (HCD)</emphasis>, + <para>Host-side drivers for USB devices talk to the "usbcore" APIs. + There are two. One is intended for + <emphasis>general-purpose</emphasis> drivers (exposed through + driver frameworks), and the other is for drivers that are + <emphasis>part of the core</emphasis>. + Such core drivers include the <emphasis>hub</emphasis> driver + (which manages trees of USB devices) and several different kinds + of <emphasis>host controller drivers</emphasis>, which control individual busses. </para> @@ -122,21 +112,21 @@ <itemizedlist> - <listitem><para>USB supports four kinds of data transfer - (control, bulk, interrupt, and isochronous). Two transfer - types use bandwidth as it's available (control and bulk), - while the other two types of transfer (interrupt and isochronous) + <listitem><para>USB supports four kinds of data transfers + (control, bulk, interrupt, and isochronous). Two of them (control + and bulk) use bandwidth as it's available, + while the other two (interrupt and isochronous) are scheduled to provide guaranteed bandwidth. </para></listitem> <listitem><para>The device description model includes one or more "configurations" per device, only one of which is active at a time. - Devices that are capable of high speed operation must also support - full speed configurations, along with a way to ask about the - "other speed" configurations that might be used. + Devices that are capable of high-speed operation must also support + full-speed configurations, along with a way to ask about the + "other speed" configurations which might be used. </para></listitem> - <listitem><para>Configurations have one or more "interface", each + <listitem><para>Configurations have one or more "interfaces", each of which may have "alternate settings". Interfaces may be standardized by USB "Class" specifications, or may be specific to a vendor or device.</para> @@ -162,7 +152,7 @@ </para></listitem> <listitem><para>The Linux USB API supports synchronous calls for - control and bulk messaging. + control and bulk messages. It also supports asynchnous calls for all kinds of data transfer, using request structures called "URBs" (USB Request Blocks). </para></listitem> @@ -463,14 +453,25 @@ file in your Linux kernel sources. </para> - <para>Otherwise the main use for this file from programs - is to poll() it to get notifications of usb devices - as they're plugged or unplugged. - To see what changed, you'd need to read the file and - compare "before" and "after" contents, scan the filesystem, - or see its hotplug event. + <para>This file, in combination with the poll() system call, can + also be used to detect when devices are added or removed: +<programlisting>int fd; +struct pollfd pfd; + +fd = open("/proc/bus/usb/devices", O_RDONLY); +pfd = { fd, POLLIN, 0 }; +for (;;) { + /* The first time through, this call will return immediately. */ + poll(&pfd, 1, -1); + + /* To see what's changed, compare the file's previous and current + contents or scan the filesystem. (Scanning is more precise.) */ +}</programlisting> + Note that this behavior is intended to be used for informational + and debug purposes. It would be more appropriate to use programs + such as udev or HAL to initialize a device or start a user-mode + helper program, for instance. </para> - </sect1> <sect1> diff --git a/Documentation/DocBook/videobook.tmpl b/Documentation/DocBook/videobook.tmpl index fdff984a5161..b629da33951d 100644 --- a/Documentation/DocBook/videobook.tmpl +++ b/Documentation/DocBook/videobook.tmpl @@ -976,7 +976,7 @@ static int camera_close(struct video_device *dev) <title>Interrupt Handling</title> <para> Our example handler is for an ISA bus device. If it was PCI you would be - able to share the interrupt and would have set SA_SHIRQ to indicate a + able to share the interrupt and would have set IRQF_SHARED to indicate a shared IRQ. We pass the device pointer as the interrupt routine argument. We don't need to since we only support one card but doing this will make it easier to upgrade the driver for multiple devices in the future. diff --git a/Documentation/HOWTO b/Documentation/HOWTO index 915ae8c986c6..1d6560413cc5 100644 --- a/Documentation/HOWTO +++ b/Documentation/HOWTO @@ -358,7 +358,8 @@ Here is a list of some of the different kernel trees available: quilt trees: - USB, PCI, Driver Core, and I2C, Greg Kroah-Hartman <gregkh@suse.de> kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/ - + - x86-64, partly i386, Andi Kleen <ak@suse.de> + ftp.firstfloor.org:/pub/ak/x86_64/quilt/ Bug Reporting ------------- diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt index bf1cf98d2a27..0256805b548f 100644 --- a/Documentation/IPMI.txt +++ b/Documentation/IPMI.txt @@ -10,7 +10,7 @@ standard for controlling intelligent devices that monitor a system. It provides for dynamic discovery of sensors in the system and the ability to monitor the sensors and be informed when the sensor's values change or go outside certain boundaries. It also has a -standardized database for field-replacable units (FRUs) and a watchdog +standardized database for field-replaceable units (FRUs) and a watchdog timer. To use this, you need an interface to an IPMI controller in your @@ -64,7 +64,7 @@ situation, you need to read the section below named 'The SI Driver' or IPMI defines a standard watchdog timer. You can enable this with the 'IPMI Watchdog Timer' config option. If you compile the driver into the kernel, then via a kernel command-line option you can have the -watchdog timer start as soon as it intitializes. It also have a lot +watchdog timer start as soon as it initializes. It also have a lot of other options, see the 'Watchdog' section below for more details. Note that you can also have the watchdog continue to run if it is closed (by default it is disabled on close). Go into the 'Watchdog diff --git a/Documentation/IRQ.txt b/Documentation/IRQ.txt new file mode 100644 index 000000000000..1011e7175021 --- /dev/null +++ b/Documentation/IRQ.txt @@ -0,0 +1,22 @@ +What is an IRQ? + +An IRQ is an interrupt request from a device. +Currently they can come in over a pin, or over a packet. +Several devices may be connected to the same pin thus +sharing an IRQ. + +An IRQ number is a kernel identifier used to talk about a hardware +interrupt source. Typically this is an index into the global irq_desc +array, but except for what linux/interrupt.h implements the details +are architecture specific. + +An IRQ number is an enumeration of the possible interrupt sources on a +machine. Typically what is enumerated is the number of input pins on +all of the interrupt controller in the system. In the case of ISA +what is enumerated are the 16 input pins on the two i8259 interrupt +controllers. + +Architectures can assign additional meaning to the IRQ numbers, and +are encouraged to in the case where there is any manual configuration +of the hardware involved. The ISA IRQs are a classic example of +assigning this kind of additional meaning. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 49e27cc19385..1d50cf0c905e 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -144,9 +144,47 @@ over a rather long period of time, but improvements are always welcome! whether the increased speed is worth it. 8. Although synchronize_rcu() is a bit slower than is call_rcu(), - it usually results in simpler code. So, unless update performance - is important or the updaters cannot block, synchronize_rcu() - should be used in preference to call_rcu(). + it usually results in simpler code. So, unless update + performance is critically important or the updaters cannot block, + synchronize_rcu() should be used in preference to call_rcu(). + + An especially important property of the synchronize_rcu() + primitive is that it automatically self-limits: if grace periods + are delayed for whatever reason, then the synchronize_rcu() + primitive will correspondingly delay updates. In contrast, + code using call_rcu() should explicitly limit update rate in + cases where grace periods are delayed, as failing to do so can + result in excessive realtime latencies or even OOM conditions. + + Ways of gaining this self-limiting property when using call_rcu() + include: + + a. Keeping a count of the number of data-structure elements + used by the RCU-protected data structure, including those + waiting for a grace period to elapse. Enforce a limit + on this number, stalling updates as needed to allow + previously deferred frees to complete. + + Alternatively, limit only the number awaiting deferred + free rather than the total number of elements. + + b. Limiting update rate. For example, if updates occur only + once per hour, then no explicit rate limiting is required, + unless your system is already badly broken. The dcache + subsystem takes this approach -- updates are guarded + by a global lock, limiting their rate. + + c. Trusted update -- if updates can only be done manually by + superuser or some other trusted user, then it might not + be necessary to automatically limit them. The theory + here is that superuser already has lots of ways to crash + the machine. + + d. Use call_rcu_bh() rather than call_rcu(), in order to take + advantage of call_rcu_bh()'s faster grace periods. + + e. Periodically invoke synchronize_rcu(), permitting a limited + number of updates per grace period. 9. All RCU list-traversal primitives, which include list_for_each_rcu(), list_for_each_entry_rcu(), diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt index e4c38152f7f7..a4948591607d 100644 --- a/Documentation/RCU/torture.txt +++ b/Documentation/RCU/torture.txt @@ -7,7 +7,7 @@ The CONFIG_RCU_TORTURE_TEST config option is available for all RCU implementations. It creates an rcutorture kernel module that can be loaded to run a torture test. The test periodically outputs status messages via printk(), which can be examined via the dmesg -command (perhaps grepping for "rcutorture"). The test is started +command (perhaps grepping for "torture"). The test is started when the module is loaded, and stops when the module is unloaded. However, actually setting this config option to "y" results in the system @@ -35,6 +35,19 @@ stat_interval The number of seconds between output of torture be printed -only- when the module is unloaded, and this is the default. +shuffle_interval + The number of seconds to keep the test threads affinitied + to a particular subset of the CPUs. Used in conjunction + with test_no_idle_hz. + +test_no_idle_hz Whether or not to test the ability of RCU to operate in + a kernel that disables the scheduling-clock interrupt to + idle CPUs. Boolean parameter, "1" to test, "0" otherwise. + +torture_type The type of RCU to test: "rcu" for the rcu_read_lock() + API, "rcu_bh" for the rcu_read_lock_bh() API, and "srcu" + for the "srcu_read_lock()" API. + verbose Enable debug printk()s. Default is disabled. @@ -42,14 +55,14 @@ OUTPUT The statistics output is as follows: - rcutorture: --- Start of test: nreaders=16 stat_interval=0 verbose=0 - rcutorture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915 - rcutorture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0 - rcutorture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0 - rcutorture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0 - rcutorture: --- End of test + rcu-torture: --- Start of test: nreaders=16 stat_interval=0 verbose=0 + rcu-torture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915 + rcu-torture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0 + rcu-torture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0 + rcu-torture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0 + rcu-torture: --- End of test -The command "dmesg | grep rcutorture:" will extract this information on +The command "dmesg | grep torture:" will extract this information on most systems. On more esoteric configurations, it may be necessary to use other commands to access the output of the printk()s used by the RCU torture test. The printk()s use KERN_ALERT, so they should @@ -115,8 +128,9 @@ The following script may be used to torture RCU: modprobe rcutorture sleep 100 rmmod rcutorture - dmesg | grep rcutorture: + dmesg | grep torture: The output can be manually inspected for the error flag of "!!!". One could of course create a more elaborate script that automatically -checked for such errors. +checked for such errors. The "rmmod" command forces a "SUCCESS" or +"FAILURE" indication to be printk()ed. diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 6e459420ee9f..318df44259b3 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -184,7 +184,17 @@ synchronize_rcu() blocking, it registers a function and argument which are invoked after all ongoing RCU read-side critical sections have completed. This callback variant is particularly useful in situations where - it is illegal to block. + it is illegal to block or where update-side performance is + critically important. + + However, the call_rcu() API should not be used lightly, as use + of the synchronize_rcu() API generally results in simpler code. + In addition, the synchronize_rcu() API has the nice property + of automatically limiting update rate should grace periods + be delayed. This property results in system resilience in face + of denial-of-service attacks. Code using call_rcu() should limit + update rate in order to gain this same sort of resilience. See + checklist.txt for some approaches to limiting the update rate. rcu_assign_pointer() @@ -677,8 +687,9 @@ diff shows how closely related RCU and reader-writer locking can be. + spin_lock(&listmutex); list_for_each_entry(p, head, lp) { if (p->key == key) { - list_del(&p->list); + - list_del(&p->list); - write_unlock(&listmutex); + + list_del_rcu(&p->list); + spin_unlock(&listmutex); + synchronize_rcu(); kfree(p); @@ -726,7 +737,7 @@ Or, for those who prefer a side-by-side listing: 5 write_lock(&listmutex); 5 spin_lock(&listmutex); 6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) { 7 if (p->key == key) { 7 if (p->key == key) { - 8 list_del(&p->list); 8 list_del(&p->list); + 8 list_del(&p->list); 8 list_del_rcu(&p->list); 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex); 10 synchronize_rcu(); 10 kfree(p); 11 kfree(p); diff --git a/Documentation/README.DAC960 b/Documentation/README.DAC960 index 98ea617a0dd6..0e8f618ab534 100644 --- a/Documentation/README.DAC960 +++ b/Documentation/README.DAC960 @@ -78,9 +78,9 @@ also known as "System Drives", and Drive Groups are also called "Packs". Both terms are in use in the Mylex documentation; I have chosen to standardize on the more generic "Logical Drive" and "Drive Group". -DAC960 RAID disk devices are named in the style of the Device File System -(DEVFS). The device corresponding to Logical Drive D on Controller C is -referred to as /dev/rd/cCdD, and the partitions are called /dev/rd/cCdDp1 +DAC960 RAID disk devices are named in the style of the obsolete Device File +System (DEVFS). The device corresponding to Logical Drive D on Controller C +is referred to as /dev/rd/cCdD, and the partitions are called /dev/rd/cCdDp1 through /dev/rd/cCdDp7. For example, partition 3 of Logical Drive 5 on Controller 2 is referred to as /dev/rd/c2d5p3. Note that unlike with SCSI disks the device names will not change in the event of a disk drive failure. diff --git a/Documentation/SubmitChecklist b/Documentation/SubmitChecklist index 8230098da529..a6cb6ffd2933 100644 --- a/Documentation/SubmitChecklist +++ b/Documentation/SubmitChecklist @@ -1,57 +1,66 @@ Linux Kernel patch sumbittal checklist ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Here are some basic things that developers should do if they -want to see their kernel patch submittals accepted quicker. +Here are some basic things that developers should do if they want to see their +kernel patch submissions accepted more quickly. -These are all above and beyond the documentation that is provided -in Documentation/SubmittingPatches and elsewhere about submitting -Linux kernel patches. +These are all above and beyond the documentation that is provided in +Documentation/SubmittingPatches and elsewhere regarding submitting Linux +kernel patches. -- Builds cleanly with applicable or modified CONFIG options =y, =m, and =n. - No gcc warnings/errors, no linker warnings/errors. +1: Builds cleanly with applicable or modified CONFIG options =y, =m, and + =n. No gcc warnings/errors, no linker warnings/errors. -- Passes allnoconfig, allmodconfig +2: Passes allnoconfig, allmodconfig -- Builds on multiple CPU arch-es by using local cross-compile tools - or something like PLM at OSDL. +3: Builds on multiple CPU architectures by using local cross-compile tools + or something like PLM at OSDL. -- ppc64 is a good architecture for cross-compilation checking because it - tends to use `unsigned long' for 64-bit quantities. +4: ppc64 is a good architecture for cross-compilation checking because it + tends to use `unsigned long' for 64-bit quantities. -- Matches kernel coding style(!) +5: Matches kernel coding style(!) -- Any new or modified CONFIG options don't muck up the config menu. +6: Any new or modified CONFIG options don't muck up the config menu. -- All new Kconfig options have help text. +7: All new Kconfig options have help text. -- Has been carefully reviewed with respect to relevant Kconfig - combinations. This is very hard to get right with testing -- - brainpower pays off here. +8: Has been carefully reviewed with respect to relevant Kconfig + combinations. This is very hard to get right with testing -- brainpower + pays off here. -- Check cleanly with sparse. +9: Check cleanly with sparse. -- Use 'make checkstack' and 'make namespacecheck' and fix any - problems that they find. Note: checkstack does not point out - problems explicitly, but any one function that uses more than - 512 bytes on the stack is a candidate for change. +10: Use 'make checkstack' and 'make namespacecheck' and fix any problems + that they find. Note: checkstack does not point out problems explicitly, + but any one function that uses more than 512 bytes on the stack is a + candidate for change. -- Include kernel-doc to document global kernel APIs. (Not required - for static functions, but OK there also.) Use 'make htmldocs' - or 'make mandocs' to check the kernel-doc and fix any issues. +11: Include kernel-doc to document global kernel APIs. (Not required for + static functions, but OK there also.) Use 'make htmldocs' or 'make + mandocs' to check the kernel-doc and fix any issues. -- Has been tested with CONFIG_PREEMPT, CONFIG_DEBUG_PREEMPT, - CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES, - CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_SPINLOCK_SLEEP all simultaneously - enabled. +12: Has been tested with CONFIG_PREEMPT, CONFIG_DEBUG_PREEMPT, + CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES, + CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_SPINLOCK_SLEEP all simultaneously + enabled. -- Has been build- and runtime tested with and without CONFIG_SMP and - CONFIG_PREEMPT. +13: Has been build- and runtime tested with and without CONFIG_SMP and + CONFIG_PREEMPT. -- If the patch affects IO/Disk, etc: has been tested with and without - CONFIG_LBD. +14: If the patch affects IO/Disk, etc: has been tested with and without + CONFIG_LBD. +15: All codepaths have been exercised with all lockdep features enabled. -2006-APR-27 +16: All new /proc entries are documented under Documentation/ + +17: All new kernel boot parameters are documented in + Documentation/kernel-parameters.txt. + +18: All new module parameters are documented with MODULE_PARM_DESC() + +19: All new userspace interfaces are documented in Documentation/ABI/. + See Documentation/ABI/README for more information. diff --git a/Documentation/SubmittingDrivers b/Documentation/SubmittingDrivers index 6bd30fdd0786..58bead05eabb 100644 --- a/Documentation/SubmittingDrivers +++ b/Documentation/SubmittingDrivers @@ -59,11 +59,11 @@ Copyright: The copyright owner must agree to use of GPL. are the same person/entity. If not, the name of the person/entity authorizing use of GPL should be listed in case it's necessary to verify the will of - the copright owner. + the copyright owner. Interfaces: If your driver uses existing interfaces and behaves like other drivers in the same class it will be much more likely - to be accepted than if it invents gratuitous new ones. + to be accepted than if it invents gratuitous new ones. If you need to implement a common API over Linux and NT drivers do it in userspace. @@ -88,7 +88,7 @@ Clarity: It helps if anyone can see how to fix the driver. It helps it will go in the bitbucket. Control: In general if there is active maintainance of a driver by - the author then patches will be redirected to them unless + the author then patches will be redirected to them unless they are totally obvious and without need of checking. If you want to be the contact and update point for the driver it is a good idea to state this in the comments, @@ -100,7 +100,7 @@ What Criteria Do Not Determine Acceptance Vendor: Being the hardware vendor and maintaining the driver is often a good thing. If there is a stable working driver from other people already in the tree don't expect 'we are the - vendor' to get your driver chosen. Ideally work with the + vendor' to get your driver chosen. Ideally work with the existing driver author to build a single perfect driver. Author: It doesn't matter if a large Linux company wrote the driver, @@ -116,17 +116,13 @@ Linux kernel master tree: ftp.??.kernel.org:/pub/linux/kernel/... ?? == your country code, such as "us", "uk", "fr", etc. -Linux kernel mailing list: +Linux kernel mailing list: linux-kernel@vger.kernel.org [mail majordomo@vger.kernel.org to subscribe] Linux Device Drivers, Third Edition (covers 2.6.10): http://lwn.net/Kernel/LDD3/ (free version) -Kernel traffic: - Weekly summary of kernel list activity (much easier to read) - http://www.kerneltraffic.org/kernel-traffic/ - LWN.net: Weekly summary of kernel development activity - http://lwn.net/ 2.6 API changes: @@ -145,11 +141,8 @@ KernelNewbies: Linux USB project: http://www.linux-usb.org/ -How to NOT write kernel driver by arjanv@redhat.com - http://people.redhat.com/arjanv/olspaper.pdf +How to NOT write kernel driver by Arjan van de Ven: + http://www.fenrus.org/how-to-not-write-a-device-driver-paper.pdf Kernel Janitor: http://janitor.kernelnewbies.org/ - --- -Last updated on 17 Nov 2005. diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index c2c85bcb3d43..302d148c2e18 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -10,7 +10,9 @@ kernel, the process can sometimes be daunting if you're not familiar with "the system." This text is a collection of suggestions which can greatly increase the chances of your change being accepted. -If you are submitting a driver, also read Documentation/SubmittingDrivers. +Read Documentation/SubmitChecklist for a list of items to check +before submitting code. If you are submitting a driver, also read +Documentation/SubmittingDrivers. @@ -74,9 +76,6 @@ There are a number of scripts which can aid in this: Quilt: http://savannah.nongnu.org/projects/quilt -Randy Dunlap's patch scripts: -http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz - Andrew Morton's patch scripts: http://www.zip.com.au/~akpm/linux/patches/ Instead of these scripts, quilt is the recommended patch management @@ -174,15 +173,15 @@ For small patches you may want to CC the Trivial Patch Monkey trivial@kernel.org managed by Adrian Bunk; which collects "trivial" patches. Trivial patches must qualify for one of the following rules: Spelling fixes in documentation - Spelling fixes which could break grep(1). + Spelling fixes which could break grep(1) Warning fixes (cluttering with useless warnings is bad) Compilation fixes (only if they are actually correct) Runtime fixes (only if they actually fix things) - Removing use of deprecated functions/macros (eg. check_region). + Removing use of deprecated functions/macros (eg. check_region) Contact detail and documentation fixes Non-portable code replaced by portable code (even in arch-specific, since people copy, as long as it's trivial) - Any fix by the author/maintainer of the file. (ie. patch monkey + Any fix by the author/maintainer of the file (ie. patch monkey in re-transmission mode) URL: <http://www.kernel.org/pub/linux/kernel/people/bunk/trivial/> @@ -210,6 +209,19 @@ Exception: If your mailer is mangling patches then someone may ask you to re-send them using MIME. +WARNING: Some mailers like Mozilla send your messages with +---- message header ---- +Content-Type: text/plain; charset=us-ascii; format=flowed +---- message header ---- +The problem is that "format=flowed" makes some of the mailers +on receiving side to replace TABs with spaces and do similar +changes. Thus the patches from you can look corrupted. + +To fix this just make your mozilla defaults/pref/mailnews.js file to look like: +pref("mailnews.send_plaintext_flowed", false); // RFC 2646======= +pref("mailnews.display.disable_format_flowed_support", true); + + 7) E-mail size. @@ -246,13 +258,13 @@ updated change. It is quite common for Linus to "drop" your patch without comment. That's the nature of the system. If he drops your patch, it could be due to -* Your patch did not apply cleanly to the latest kernel version +* Your patch did not apply cleanly to the latest kernel version. * Your patch was not sufficiently discussed on linux-kernel. -* A style issue (see section 2), -* An e-mail formatting issue (re-read this section) -* A technical problem with your change -* He gets tons of e-mail, and yours got lost in the shuffle -* You are being annoying (See Figure 1) +* A style issue (see section 2). +* An e-mail formatting issue (re-read this section). +* A technical problem with your change. +* He gets tons of e-mail, and yours got lost in the shuffle. +* You are being annoying. When in doubt, solicit comments on linux-kernel mailing list. @@ -309,6 +321,8 @@ then you just add a line saying Signed-off-by: Random J Developer <random@developer.example.org> +using your real name (sorry, no pseudonyms or anonymous contributions.) + Some people also put extra tags at the end. They'll just be ignored for now, but you can do this to mark internal company procedures or just point out some special detail about the sign-off. @@ -475,22 +489,21 @@ SECTION 3 - REFERENCES Andrew Morton, "The perfect patch" (tpp). <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt> -Jeff Garzik, "Linux kernel patch submission format." +Jeff Garzik, "Linux kernel patch submission format". <http://linux.yyz.us/patch-format.html> -Greg Kroah-Hartman "How to piss off a kernel subsystem maintainer". +Greg Kroah-Hartman, "How to piss off a kernel subsystem maintainer". <http://www.kroah.com/log/2005/03/31/> <http://www.kroah.com/log/2005/07/08/> <http://www.kroah.com/log/2005/10/19/> <http://www.kroah.com/log/2006/01/11/> -NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!. +NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people! <http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2> -Kernel Documentation/CodingStyle +Kernel Documentation/CodingStyle: <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle> -Linus Torvald's mail on the canonical patch format: +Linus Torvalds's mail on the canonical patch format: <http://lkml.org/lkml/2005/4/7/183> -- -Last updated on 17 Nov 2005. diff --git a/Documentation/accounting/delay-accounting.txt b/Documentation/accounting/delay-accounting.txt new file mode 100644 index 000000000000..1443cd71d263 --- /dev/null +++ b/Documentation/accounting/delay-accounting.txt @@ -0,0 +1,112 @@ +Delay accounting +---------------- + +Tasks encounter delays in execution when they wait +for some kernel resource to become available e.g. a +runnable task may wait for a free CPU to run on. + +The per-task delay accounting functionality measures +the delays experienced by a task while + +a) waiting for a CPU (while being runnable) +b) completion of synchronous block I/O initiated by the task +c) swapping in pages + +and makes these statistics available to userspace through +the taskstats interface. + +Such delays provide feedback for setting a task's cpu priority, +io priority and rss limit values appropriately. Long delays for +important tasks could be a trigger for raising its corresponding priority. + +The functionality, through its use of the taskstats interface, also provides +delay statistics aggregated for all tasks (or threads) belonging to a +thread group (corresponding to a traditional Unix process). This is a commonly +needed aggregation that is more efficiently done by the kernel. + +Userspace utilities, particularly resource management applications, can also +aggregate delay statistics into arbitrary groups. To enable this, delay +statistics of a task are available both during its lifetime as well as on its +exit, ensuring continuous and complete monitoring can be done. + + +Interface +--------- + +Delay accounting uses the taskstats interface which is described +in detail in a separate document in this directory. Taskstats returns a +generic data structure to userspace corresponding to per-pid and per-tgid +statistics. The delay accounting functionality populates specific fields of +this structure. See + include/linux/taskstats.h +for a description of the fields pertaining to delay accounting. +It will generally be in the form of counters returning the cumulative +delay seen for cpu, sync block I/O, swapin etc. + +Taking the difference of two successive readings of a given +counter (say cpu_delay_total) for a task will give the delay +experienced by the task waiting for the corresponding resource +in that interval. + +When a task exits, records containing the per-task statistics +are sent to userspace without requiring a command. If it is the last exiting +task of a thread group, the per-tgid statistics are also sent. More details +are given in the taskstats interface description. + +The getdelays.c userspace utility in this directory allows simple commands to +be run and the corresponding delay statistics to be displayed. It also serves +as an example of using the taskstats interface. + +Usage +----- + +Compile the kernel with + CONFIG_TASK_DELAY_ACCT=y + CONFIG_TASKSTATS=y + +Delay accounting is enabled by default at boot up. +To disable, add + nodelayacct +to the kernel boot options. The rest of the instructions +below assume this has not been done. + +After the system has booted up, use a utility +similar to getdelays.c to access the delays +seen by a given task or a task group (tgid). +The utility also allows a given command to be +executed and the corresponding delays to be +seen. + +General format of the getdelays command + +getdelays [-t tgid] [-p pid] [-c cmd...] + + +Get delays, since system boot, for pid 10 +# ./getdelays -p 10 +(output similar to next case) + +Get sum of delays, since system boot, for all pids with tgid 5 +# ./getdelays -t 5 + + +CPU count real total virtual total delay total + 7876 92005750 100000000 24001500 +IO count delay total + 0 0 +MEM count delay total + 0 0 + +Get delays seen in executing a given simple command +# ./getdelays -c ls / + +bin data1 data3 data5 dev home media opt root srv sys usr +boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var + + +CPU count real total virtual total delay total + 6 4000250 4000000 0 +IO count delay total + 0 0 +MEM count delay total + 0 0 diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c new file mode 100644 index 000000000000..795ca3911cc5 --- /dev/null +++ b/Documentation/accounting/getdelays.c @@ -0,0 +1,396 @@ +/* getdelays.c + * + * Utility to get per-pid and per-tgid delay accounting statistics + * Also illustrates usage of the taskstats interface + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2005 + * Copyright (C) Balbir Singh, IBM Corp. 2006 + * Copyright (c) Jay Lan, SGI. 2006 + * + */ + +#include <stdio.h> +#include <stdlib.h> +#include <errno.h> +#include <unistd.h> +#include <poll.h> +#include <string.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/socket.h> +#include <sys/types.h> +#include <signal.h> + +#include <linux/genetlink.h> +#include <linux/taskstats.h> + +/* + * Generic macros for dealing with netlink sockets. Might be duplicated + * elsewhere. It is recommended that commercial grade applications use + * libnl or libnetlink and use the interfaces provided by the library + */ +#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN)) +#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN) +#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN)) +#define NLA_PAYLOAD(len) (len - NLA_HDRLEN) + +#define err(code, fmt, arg...) do { printf(fmt, ##arg); exit(code); } while (0) +int done = 0; +int rcvbufsz=0; + + char name[100]; +int dbg=0, print_delays=0; +__u64 stime, utime; +#define PRINTF(fmt, arg...) { \ + if (dbg) { \ + printf(fmt, ##arg); \ + } \ + } + +/* Maximum size of response requested or message sent */ +#define MAX_MSG_SIZE 256 +/* Maximum number of cpus expected to be specified in a cpumask */ +#define MAX_CPUS 32 +/* Maximum length of pathname to log file */ +#define MAX_FILENAME 256 + +struct msgtemplate { + struct nlmsghdr n; + struct genlmsghdr g; + char buf[MAX_MSG_SIZE]; +}; + +char cpumask[100+6*MAX_CPUS]; + +/* + * Create a raw netlink socket and bind + */ +static int create_nl_socket(int protocol) +{ + int fd; + struct sockaddr_nl local; + + fd = socket(AF_NETLINK, SOCK_RAW, protocol); + if (fd < 0) + return -1; + + if (rcvbufsz) + if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, + &rcvbufsz, sizeof(rcvbufsz)) < 0) { + printf("Unable to set socket rcv buf size to %d\n", + rcvbufsz); + return -1; + } + + memset(&local, 0, sizeof(local)); + local.nl_family = AF_NETLINK; + + if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0) + goto error; + + return fd; +error: + close(fd); + return -1; +} + + +int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid, + __u8 genl_cmd, __u16 nla_type, + void *nla_data, int nla_len) +{ + struct nlattr *na; + struct sockaddr_nl nladdr; + int r, buflen; + char *buf; + + struct msgtemplate msg; + + msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN); + msg.n.nlmsg_type = nlmsg_type; + msg.n.nlmsg_flags = NLM_F_REQUEST; + msg.n.nlmsg_seq = 0; + msg.n.nlmsg_pid = nlmsg_pid; + msg.g.cmd = genl_cmd; + msg.g.version = 0x1; + na = (struct nlattr *) GENLMSG_DATA(&msg); + na->nla_type = nla_type; + na->nla_len = nla_len + 1 + NLA_HDRLEN; + memcpy(NLA_DATA(na), nla_data, nla_len); + msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len); + + buf = (char *) &msg; + buflen = msg.n.nlmsg_len ; + memset(&nladdr, 0, sizeof(nladdr)); + nladdr.nl_family = AF_NETLINK; + while ((r = sendto(sd, buf, buflen, 0, (struct sockaddr *) &nladdr, + sizeof(nladdr))) < buflen) { + if (r > 0) { + buf += r; + buflen -= r; + } else if (errno != EAGAIN) + return -1; + } + return 0; +} + + +/* + * Probe the controller in genetlink to find the family id + * for the TASKSTATS family + */ +int get_family_id(int sd) +{ + struct { + struct nlmsghdr n; + struct genlmsghdr g; + char buf[256]; + } ans; + + int id, rc; + struct nlattr *na; + int rep_len; + + strcpy(name, TASKSTATS_GENL_NAME); + rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY, + CTRL_ATTR_FAMILY_NAME, (void *)name, + strlen(TASKSTATS_GENL_NAME)+1); + + rep_len = recv(sd, &ans, sizeof(ans), 0); + if (ans.n.nlmsg_type == NLMSG_ERROR || + (rep_len < 0) || !NLMSG_OK((&ans.n), rep_len)) + return 0; + + na = (struct nlattr *) GENLMSG_DATA(&ans); + na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len)); + if (na->nla_type == CTRL_ATTR_FAMILY_ID) { + id = *(__u16 *) NLA_DATA(na); + } + return id; +} + +void print_delayacct(struct taskstats *t) +{ + printf("\n\nCPU %15s%15s%15s%15s\n" + " %15llu%15llu%15llu%15llu\n" + "IO %15s%15s\n" + " %15llu%15llu\n" + "MEM %15s%15s\n" + " %15llu%15llu\n\n", + "count", "real total", "virtual total", "delay total", + t->cpu_count, t->cpu_run_real_total, t->cpu_run_virtual_total, + t->cpu_delay_total, + "count", "delay total", + t->blkio_count, t->blkio_delay_total, + "count", "delay total", t->swapin_count, t->swapin_delay_total); +} + +int main(int argc, char *argv[]) +{ + int c, rc, rep_len, aggr_len, len2, cmd_type; + __u16 id; + __u32 mypid; + + struct nlattr *na; + int nl_sd = -1; + int len = 0; + pid_t tid = 0; + pid_t rtid = 0; + + int fd = 0; + int count = 0; + int write_file = 0; + int maskset = 0; + char logfile[128]; + int loop = 0; + + struct msgtemplate msg; + + while (1) { + c = getopt(argc, argv, "dw:r:m:t:p:v:l"); + if (c < 0) + break; + + switch (c) { + case 'd': + printf("print delayacct stats ON\n"); + print_delays = 1; + break; + case 'w': + strncpy(logfile, optarg, MAX_FILENAME); + printf("write to file %s\n", logfile); + write_file = 1; + break; + case 'r': + rcvbufsz = atoi(optarg); + printf("receive buf size %d\n", rcvbufsz); + if (rcvbufsz < 0) + err(1, "Invalid rcv buf size\n"); + break; + case 'm': + strncpy(cpumask, optarg, sizeof(cpumask)); + maskset = 1; + printf("cpumask %s maskset %d\n", cpumask, maskset); + break; + case 't': + tid = atoi(optarg); + if (!tid) + err(1, "Invalid tgid\n"); + cmd_type = TASKSTATS_CMD_ATTR_TGID; + print_delays = 1; + break; + case 'p': + tid = atoi(optarg); + if (!tid) + err(1, "Invalid pid\n"); + cmd_type = TASKSTATS_CMD_ATTR_PID; + print_delays = 1; + break; + case 'v': + printf("debug on\n"); + dbg = 1; + break; + case 'l': + printf("listen forever\n"); + loop = 1; + break; + default: + printf("Unknown option %d\n", c); + exit(-1); + } + } + + if (write_file) { + fd = open(logfile, O_WRONLY | O_CREAT | O_TRUNC, + S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); + if (fd == -1) { + perror("Cannot open output file\n"); + exit(1); + } + } + + if ((nl_sd = create_nl_socket(NETLINK_GENERIC)) < 0) + err(1, "error creating Netlink socket\n"); + + + mypid = getpid(); + id = get_family_id(nl_sd); + if (!id) { + printf("Error getting family id, errno %d", errno); + goto err; + } + PRINTF("family id %d\n", id); + + if (maskset) { + rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, + TASKSTATS_CMD_ATTR_REGISTER_CPUMASK, + &cpumask, sizeof(cpumask)); + PRINTF("Sent register cpumask, retval %d\n", rc); + if (rc < 0) { + printf("error sending register cpumask\n"); + goto err; + } + } + + if (tid) { + rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, + cmd_type, &tid, sizeof(__u32)); + PRINTF("Sent pid/tgid, retval %d\n", rc); + if (rc < 0) { + printf("error sending tid/tgid cmd\n"); + goto done; + } + } + + do { + int i; + + rep_len = recv(nl_sd, &msg, sizeof(msg), 0); + PRINTF("received %d bytes\n", rep_len); + + if (rep_len < 0) { + printf("nonfatal reply error: errno %d\n", errno); + continue; + } + if (msg.n.nlmsg_type == NLMSG_ERROR || + !NLMSG_OK((&msg.n), rep_len)) { + printf("fatal reply error, errno %d\n", errno); + goto done; + } + + PRINTF("nlmsghdr size=%d, nlmsg_len=%d, rep_len=%d\n", + sizeof(struct nlmsghdr), msg.n.nlmsg_len, rep_len); + + + rep_len = GENLMSG_PAYLOAD(&msg.n); + + na = (struct nlattr *) GENLMSG_DATA(&msg); + len = 0; + i = 0; + while (len < rep_len) { + len += NLA_ALIGN(na->nla_len); + switch (na->nla_type) { + case TASKSTATS_TYPE_AGGR_TGID: + /* Fall through */ + case TASKSTATS_TYPE_AGGR_PID: + aggr_len = NLA_PAYLOAD(na->nla_len); + len2 = 0; + /* For nested attributes, na follows */ + na = (struct nlattr *) NLA_DATA(na); + done = 0; + while (len2 < aggr_len) { + switch (na->nla_type) { + case TASKSTATS_TYPE_PID: + rtid = *(int *) NLA_DATA(na); + if (print_delays) + printf("PID\t%d\n", rtid); + break; + case TASKSTATS_TYPE_TGID: + rtid = *(int *) NLA_DATA(na); + if (print_delays) + printf("TGID\t%d\n", rtid); + break; + case TASKSTATS_TYPE_STATS: + count++; + if (print_delays) + print_delayacct((struct taskstats *) NLA_DATA(na)); + if (fd) { + if (write(fd, NLA_DATA(na), na->nla_len) < 0) { + err(1,"write error\n"); + } + } + if (!loop) + goto done; + break; + default: + printf("Unknown nested nla_type %d\n", na->nla_type); + break; + } + len2 += NLA_ALIGN(na->nla_len); + na = (struct nlattr *) ((char *) na + len2); + } + break; + + default: + printf("Unknown nla_type %d\n", na->nla_type); + break; + } + na = (struct nlattr *) (GENLMSG_DATA(&msg) + len); + } + } while (loop); +done: + if (maskset) { + rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, + TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK, + &cpumask, sizeof(cpumask)); + printf("Sent deregister mask, retval %d\n", rc); + if (rc < 0) + err(rc, "error sending deregister cpumask\n"); + } +err: + close(nl_sd); + if (fd) + close(fd); + return 0; +} diff --git a/Documentation/accounting/taskstats.txt b/Documentation/accounting/taskstats.txt new file mode 100644 index 000000000000..92ebf29e9041 --- /dev/null +++ b/Documentation/accounting/taskstats.txt @@ -0,0 +1,181 @@ +Per-task statistics interface +----------------------------- + + +Taskstats is a netlink-based interface for sending per-task and +per-process statistics from the kernel to userspace. + +Taskstats was designed for the following benefits: + +- efficiently provide statistics during lifetime of a task and on its exit +- unified interface for multiple accounting subsystems +- extensibility for use by future accounting patches + +Terminology +----------- + +"pid", "tid" and "task" are used interchangeably and refer to the standard +Linux task defined by struct task_struct. per-pid stats are the same as +per-task stats. + +"tgid", "process" and "thread group" are used interchangeably and refer to the +tasks that share an mm_struct i.e. the traditional Unix process. Despite the +use of tgid, there is no special treatment for the task that is thread group +leader - a process is deemed alive as long as it has any task belonging to it. + +Usage +----- + +To get statistics during a task's lifetime, userspace opens a unicast netlink +socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid. +The response contains statistics for a task (if pid is specified) or the sum of +statistics for all tasks of the process (if tgid is specified). + +To obtain statistics for tasks which are exiting, the userspace listener +sends a register command and specifies a cpumask. Whenever a task exits on +one of the cpus in the cpumask, its per-pid statistics are sent to the +registered listener. Using cpumasks allows the data received by one listener +to be limited and assists in flow control over the netlink interface and is +explained in more detail below. + +If the exiting task is the last thread exiting its thread group, +an additional record containing the per-tgid stats is also sent to userspace. +The latter contains the sum of per-pid stats for all threads in the thread +group, both past and present. + +getdelays.c is a simple utility demonstrating usage of the taskstats interface +for reporting delay accounting statistics. Users can register cpumasks, +send commands and process responses, listen for per-tid/tgid exit data, +write the data received to a file and do basic flow control by increasing +receive buffer sizes. + +Interface +--------- + +The user-kernel interface is encapsulated in include/linux/taskstats.h + +To avoid this documentation becoming obsolete as the interface evolves, only +an outline of the current version is given. taskstats.h always overrides the +description here. + +struct taskstats is the common accounting structure for both per-pid and +per-tgid data. It is versioned and can be extended by each accounting subsystem +that is added to the kernel. The fields and their semantics are defined in the +taskstats.h file. + +The data exchanged between user and kernel space is a netlink message belonging +to the NETLINK_GENERIC family and using the netlink attributes interface. +The messages are in the format + + +----------+- - -+-------------+-------------------+ + | nlmsghdr | Pad | genlmsghdr | taskstats payload | + +----------+- - -+-------------+-------------------+ + + +The taskstats payload is one of the following three kinds: + +1. Commands: Sent from user to kernel. Commands to get data on +a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID, +containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes +the task/process for which userspace wants statistics. + +Commands to register/deregister interest in exit data from a set of cpus +consist of one attribute, of type +TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the +attribute payload. The cpumask is specified as an ascii string of +comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8 +the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest +in cpus before closing the listening socket, the kernel cleans up its interest +set over time. However, for the sake of efficiency, an explicit deregistration +is advisable. + +2. Response for a command: sent from the kernel in response to a userspace +command. The payload is a series of three attributes of type: + +a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates +a pid/tgid will be followed by some stats. + +b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats +is being returned. + +c) TASKSTATS_TYPE_STATS: attribute with a struct taskstsats as payload. The +same structure is used for both per-pid and per-tgid stats. + +3. New message sent by kernel whenever a task exits. The payload consists of a + series of attributes of the following type: + +a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats +b) TASKSTATS_TYPE_PID: contains exiting task's pid +c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats +d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats +e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs +f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process + + +per-tgid stats +-------------- + +Taskstats provides per-process stats, in addition to per-task stats, since +resource management is often done at a process granularity and aggregating task +stats in userspace alone is inefficient and potentially inaccurate (due to lack +of atomicity). + +However, maintaining per-process, in addition to per-task stats, within the +kernel has space and time overheads. To address this, the taskstats code +accumalates each exiting task's statistics into a process-wide data structure. +When the last task of a process exits, the process level data accumalated also +gets sent to userspace (along with the per-task data). + +When a user queries to get per-tgid data, the sum of all other live threads in +the group is added up and added to the accumalated total for previously exited +threads of the same thread group. + +Extending taskstats +------------------- + +There are two ways to extend the taskstats interface to export more +per-task/process stats as patches to collect them get added to the kernel +in future: + +1. Adding more fields to the end of the existing struct taskstats. Backward + compatibility is ensured by the version number within the + structure. Userspace will use only the fields of the struct that correspond + to the version its using. + +2. Defining separate statistic structs and using the netlink attributes + interface to return them. Since userspace processes each netlink attribute + independently, it can always ignore attributes whose type it does not + understand (because it is using an older version of the interface). + + +Choosing between 1. and 2. is a matter of trading off flexibility and +overhead. If only a few fields need to be added, then 1. is the preferable +path since the kernel and userspace don't need to incur the overhead of +processing new netlink attributes. But if the new fields expand the existing +struct too much, requiring disparate userspace accounting utilities to +unnecessarily receive large structures whose fields are of no interest, then +extending the attributes structure would be worthwhile. + +Flow control for taskstats +-------------------------- + +When the rate of task exits becomes large, a listener may not be able to keep +up with the kernel's rate of sending per-tid/tgid exit data leading to data +loss. This possibility gets compounded when the taskstats structure gets +extended and the number of cpus grows large. + +To avoid losing statistics, userspace should do one or more of the following: + +- increase the receive buffer sizes for the netlink sockets opened by +listeners to receive exit data. + +- create more listeners and reduce the number of cpus being listened to by +each listener. In the extreme case, there could be one listener for each cpu. +Users may also consider setting the cpu affinity of the listener to the subset +of cpus to which it listens, especially if they are listening to just one cpu. + +Despite these measures, if the userspace receives ENOBUFS error messages +indicated overflow of receive buffers, it should take measures to handle the +loss of data. + +---- diff --git a/Documentation/arm/IXP4xx b/Documentation/arm/IXP4xx index d4c6d3aa0c25..43edb4ecf27d 100644 --- a/Documentation/arm/IXP4xx +++ b/Documentation/arm/IXP4xx @@ -85,7 +85,7 @@ IXP4xx provides two methods of accessing PCI memory space: 2) If > 64MB of memory space is required, the IXP4xx can be configured to use indirect registers to access PCI This allows for up to 128MB (0x48000000 to 0x4fffffff) of memory on the bus. - The disadvantadge of this is that every PCI access requires + The disadvantage of this is that every PCI access requires three local register accesses plus a spinlock, but in some cases the performance hit is acceptable. In addition, you cannot mmap() PCI devices in this case due to the indirect nature diff --git a/Documentation/arm/Samsung-S3C24XX/Overview.txt b/Documentation/arm/Samsung-S3C24XX/Overview.txt index 8c6ee684174c..3e46d2a31158 100644 --- a/Documentation/arm/Samsung-S3C24XX/Overview.txt +++ b/Documentation/arm/Samsung-S3C24XX/Overview.txt @@ -7,11 +7,13 @@ Introduction ------------ The Samsung S3C24XX range of ARM9 System-on-Chip CPUs are supported - by the 's3c2410' architecture of ARM Linux. Currently the S3C2410 and - the S3C2440 are supported CPUs. + by the 's3c2410' architecture of ARM Linux. Currently the S3C2410, + S3C2440 and S3C2442 devices are supported. Support for the S3C2400 series is in progress. + Support for the S3C2412 and S3C2413 CPUs is being merged. + Configuration ------------- @@ -43,9 +45,18 @@ Machines Samsung's own development board, geared for PDA work. + Samsung/Aiji SMDK2412 + + The S3C2412 version of the SMDK2440. + + Samsung/Aiji SMDK2413 + + The S3C2412 version of the SMDK2440. + Samsung/Meritech SMDK2440 - The S3C2440 compatible version of the SMDK2440 + The S3C2440 compatible version of the SMDK2440, which has the + option of an S3C2440 or S3C2442 CPU module. Thorcom VR1000 @@ -211,24 +222,6 @@ Port Contributors Lucas Correia Villa Real (S3C2400 port) -Document Changes ----------------- - - 05 Sep 2004 - BJD - Added Document Changes section - 05 Sep 2004 - BJD - Added Klaus Fetscher to list of contributors - 25 Oct 2004 - BJD - Added Dimitry Andric to list of contributors - 25 Oct 2004 - BJD - Updated the MTD from the 2.6.9 merge - 21 Jan 2005 - BJD - Added rx3715, added Shannon to contributors - 10 Feb 2005 - BJD - Added Guillaume Gourat to contributors - 02 Mar 2005 - BJD - Added SMDK2440 to list of machines - 06 Mar 2005 - BJD - Added Christer Weinigel - 08 Mar 2005 - BJD - Added LCVR to list of people, updated introduction - 08 Mar 2005 - BJD - Added section on adding machines - 09 Sep 2005 - BJD - Added section on platform data - 11 Feb 2006 - BJD - Added I2C, RTC and Watchdog sections - 11 Feb 2006 - BJD - Added Osiris machine, and S3C2400 information - - Document Author --------------- diff --git a/Documentation/arm/Samsung-S3C24XX/S3C2412.txt b/Documentation/arm/Samsung-S3C24XX/S3C2412.txt new file mode 100644 index 000000000000..cb82a7fc7901 --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/S3C2412.txt @@ -0,0 +1,120 @@ + S3C2412 ARM Linux Overview + ========================== + +Introduction +------------ + + The S3C2412 is part of the S3C24XX range of ARM9 System-on-Chip CPUs + from Samsung. This part has an ARM926-EJS core, capable of running up + to 266MHz (see data-sheet for more information) + + +Clock +----- + + The core clock code provides a set of clocks to the drivers, and allows + for source selection and a number of other features. + + +Power +----- + + No support for suspend/resume to RAM in the current system. + + +DMA +--- + + No current support for DMA. + + +GPIO +---- + + There is support for setting the GPIO to input/output/special function + and reading or writing to them. + + +UART +---- + + The UART hardware is similar to the S3C2440, and is supported by the + s3c2410 driver in the drivers/serial directory. + + +NAND +---- + + The NAND hardware is similar to the S3C2440, and is supported by the + s3c2410 driver in the drivers/mtd/nand directory. + + +USB Host +-------- + + The USB hardware is similar to the S3C2410, with extended clock source + control. The OHCI portion is supported by the ohci-s3c2410 driver, and + the clock control selection is supported by the core clock code. + + +USB Device +---------- + + No current support in the kernel + + +IRQs +---- + + All the standard, and external interrupt sources are supported. The + extra sub-sources are not yet supported. + + +RTC +--- + + The RTC hardware is similar to the S3C2410, and is supported by the + s3c2410-rtc driver. + + +Watchdog +-------- + + The watchdog harware is the same as the S3C2410, and is supported by + the s3c2410_wdt driver. + + +MMC/SD/SDIO +----------- + + No current support for the MMC/SD/SDIO block. + +IIC +--- + + The IIC hardware is the same as the S3C2410, and is supported by the + i2c-s3c24xx driver. + + +IIS +--- + + No current support for the IIS interface. + + +SPI +--- + + No current support for the SPI interfaces. + + +ATA +--- + + No current support for the on-board ATA block. + + +Document Author +--------------- + +Ben Dooks, (c) 2006 Simtec Electronics diff --git a/Documentation/arm/Samsung-S3C24XX/S3C2413.txt b/Documentation/arm/Samsung-S3C24XX/S3C2413.txt new file mode 100644 index 000000000000..ab2a88858f12 --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/S3C2413.txt @@ -0,0 +1,21 @@ + S3C2413 ARM Linux Overview + ========================== + +Introduction +------------ + + The S3C2413 is an extended version of the S3C2412, with an camera + interface and mobile DDR memory support. See the S3C2412 support + documentation for more information. + + +Camera Interface +--------------- + + This block is currently not supported. + + +Document Author +--------------- + +Ben Dooks, (c) 2006 Simtec Electronics diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt index 23a1c2402bcc..2a63d5662a93 100644 --- a/Documentation/atomic_ops.txt +++ b/Documentation/atomic_ops.txt @@ -157,13 +157,13 @@ For example, smp_mb__before_atomic_dec() can be used like so: smp_mb__before_atomic_dec(); atomic_dec(&obj->ref_count); -It makes sure that all memory operations preceeding the atomic_dec() +It makes sure that all memory operations preceding the atomic_dec() call are strongly ordered with respect to the atomic counter -operation. In the above example, it guarentees that the assignment of +operation. In the above example, it guarantees that the assignment of "1" to obj->dead will be globally visible to other cpus before the atomic counter decrement. -Without the explicitl smp_mb__before_atomic_dec() call, the +Without the explicit smp_mb__before_atomic_dec() call, the implementation could legally allow the atomic counter update visible to other cpus before the "obj->dead = 1;" assignment. @@ -173,11 +173,11 @@ ordering with respect to memory operations after an atomic_dec() call (smp_mb__{before,after}_atomic_inc()). A missing memory barrier in the cases where they are required by the -atomic_t implementation above can have disasterous results. Here is -an example, which follows a pattern occuring frequently in the Linux +atomic_t implementation above can have disastrous results. Here is +an example, which follows a pattern occurring frequently in the Linux kernel. It is the use of atomic counters to implement reference counting, and it works such that once the counter falls to zero it can -be guarenteed that no other entity can be accessing the object: +be guaranteed that no other entity can be accessing the object: static void obj_list_add(struct obj *obj) { @@ -291,9 +291,9 @@ to the size of an "unsigned long" C data type, and are least of that size. The endianness of the bits within each "unsigned long" are the native endianness of the cpu. - void set_bit(unsigned long nr, volatils unsigned long *addr); - void clear_bit(unsigned long nr, volatils unsigned long *addr); - void change_bit(unsigned long nr, volatils unsigned long *addr); + void set_bit(unsigned long nr, volatile unsigned long *addr); + void clear_bit(unsigned long nr, volatile unsigned long *addr); + void change_bit(unsigned long nr, volatile unsigned long *addr); These routines set, clear, and change, respectively, the bit number indicated by "nr" on the bit mask pointed to by "ADDR". @@ -301,9 +301,9 @@ indicated by "nr" on the bit mask pointed to by "ADDR". They must execute atomically, yet there are no implicit memory barrier semantics required of these interfaces. - int test_and_set_bit(unsigned long nr, volatils unsigned long *addr); - int test_and_clear_bit(unsigned long nr, volatils unsigned long *addr); - int test_and_change_bit(unsigned long nr, volatils unsigned long *addr); + int test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_change_bit(unsigned long nr, volatile unsigned long *addr); Like the above, except that these routines return a boolean which indicates whether the changed bit was set _BEFORE_ the atomic bit @@ -335,7 +335,7 @@ subsequent memory operation is made visible. For example: /* ... */; obj->killed = 1; -The implementation of test_and_set_bit() must guarentee that +The implementation of test_and_set_bit() must guarantee that "obj->dead = 1;" is visible to cpus before the atomic memory operation done by test_and_set_bit() becomes visible. Likewise, the atomic memory operation done by test_and_set_bit() must become visible before @@ -474,7 +474,7 @@ Now, as far as memory barriers go, as long as spin_lock() strictly orders all subsequent memory operations (including the cas()) with respect to itself, things will be fine. -Said another way, _atomic_dec_and_lock() must guarentee that +Said another way, _atomic_dec_and_lock() must guarantee that a counter dropping to zero is never made visible before the spinlock being acquired. diff --git a/Documentation/cciss.txt b/Documentation/cciss.txt index 15378422fc46..9c629ffa0e58 100644 --- a/Documentation/cciss.txt +++ b/Documentation/cciss.txt @@ -20,6 +20,7 @@ This driver is known to work with the following cards: * SA P400i * SA E200 * SA E200i + * SA E500 If nodes are not already created in the /dev/cciss directory, run as root: diff --git a/Documentation/connector/ucon.c b/Documentation/connector/ucon.c new file mode 100644 index 000000000000..d738cde2a8d5 --- /dev/null +++ b/Documentation/connector/ucon.c @@ -0,0 +1,206 @@ +/* + * ucon.c + * + * Copyright (c) 2004+ Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <asm/types.h> + +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/poll.h> + +#include <linux/netlink.h> +#include <linux/rtnetlink.h> + +#include <arpa/inet.h> + +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <string.h> +#include <errno.h> +#include <time.h> + +#include <linux/connector.h> + +#define DEBUG +#define NETLINK_CONNECTOR 11 + +#ifdef DEBUG +#define ulog(f, a...) fprintf(stdout, f, ##a) +#else +#define ulog(f, a...) do {} while (0) +#endif + +static int need_exit; +static __u32 seq; + +static int netlink_send(int s, struct cn_msg *msg) +{ + struct nlmsghdr *nlh; + unsigned int size; + int err; + char buf[128]; + struct cn_msg *m; + + size = NLMSG_SPACE(sizeof(struct cn_msg) + msg->len); + + nlh = (struct nlmsghdr *)buf; + nlh->nlmsg_seq = seq++; + nlh->nlmsg_pid = getpid(); + nlh->nlmsg_type = NLMSG_DONE; + nlh->nlmsg_len = NLMSG_LENGTH(size - sizeof(*nlh)); + nlh->nlmsg_flags = 0; + + m = NLMSG_DATA(nlh); +#if 0 + ulog("%s: [%08x.%08x] len=%u, seq=%u, ack=%u.\n", + __func__, msg->id.idx, msg->id.val, msg->len, msg->seq, msg->ack); +#endif + memcpy(m, msg, sizeof(*m) + msg->len); + + err = send(s, nlh, size, 0); + if (err == -1) + ulog("Failed to send: %s [%d].\n", + strerror(errno), errno); + + return err; +} + +int main(int argc, char *argv[]) +{ + int s; + char buf[1024]; + int len; + struct nlmsghdr *reply; + struct sockaddr_nl l_local; + struct cn_msg *data; + FILE *out; + time_t tm; + struct pollfd pfd; + + if (argc < 2) + out = stdout; + else { + out = fopen(argv[1], "a+"); + if (!out) { + ulog("Unable to open %s for writing: %s\n", + argv[1], strerror(errno)); + out = stdout; + } + } + + memset(buf, 0, sizeof(buf)); + + s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR); + if (s == -1) { + perror("socket"); + return -1; + } + + l_local.nl_family = AF_NETLINK; + l_local.nl_groups = 0x123; /* bitmask of requested groups */ + l_local.nl_pid = 0; + + if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { + perror("bind"); + close(s); + return -1; + } + +#if 0 + { + int on = 0x57; /* Additional group number */ + setsockopt(s, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &on, sizeof(on)); + } +#endif + if (0) { + int i, j; + + memset(buf, 0, sizeof(buf)); + + data = (struct cn_msg *)buf; + + data->id.idx = 0x123; + data->id.val = 0x456; + data->seq = seq++; + data->ack = 0; + data->len = 0; + + for (j=0; j<10; ++j) { + for (i=0; i<1000; ++i) { + len = netlink_send(s, data); + } + + ulog("%d messages have been sent to %08x.%08x.\n", i, data->id.idx, data->id.val); + } + + return 0; + } + + + pfd.fd = s; + + while (!need_exit) { + pfd.events = POLLIN; + pfd.revents = 0; + switch (poll(&pfd, 1, -1)) { + case 0: + need_exit = 1; + break; + case -1: + if (errno != EINTR) { + need_exit = 1; + break; + } + continue; + } + if (need_exit) + break; + + memset(buf, 0, sizeof(buf)); + len = recv(s, buf, sizeof(buf), 0); + if (len == -1) { + perror("recv buf"); + close(s); + return -1; + } + reply = (struct nlmsghdr *)buf; + + switch (reply->nlmsg_type) { + case NLMSG_ERROR: + fprintf(out, "Error message received.\n"); + fflush(out); + break; + case NLMSG_DONE: + data = (struct cn_msg *)NLMSG_DATA(reply); + + time(&tm); + fprintf(out, "%.24s : [%x.%x] [%08u.%08u].\n", + ctime(&tm), data->id.idx, data->id.val, data->seq, data->ack); + fflush(out); + break; + default: + break; + } + } + + close(s); + return 0; +} diff --git a/Documentation/console/console.txt b/Documentation/console/console.txt new file mode 100644 index 000000000000..d3e17447321c --- /dev/null +++ b/Documentation/console/console.txt @@ -0,0 +1,144 @@ +Console Drivers +=============== + +The linux kernel has 2 general types of console drivers. The first type is +assigned by the kernel to all the virtual consoles during the boot process. +This type will be called 'system driver', and only one system driver is allowed +to exist. The system driver is persistent and it can never be unloaded, though +it may become inactive. + +The second type has to be explicitly loaded and unloaded. This will be called +'modular driver' by this document. Multiple modular drivers can coexist at +any time with each driver sharing the console with other drivers including +the system driver. However, modular drivers cannot take over the console +that is currently occupied by another modular driver. (Exception: Drivers that +call take_over_console() will succeed in the takeover regardless of the type +of driver occupying the consoles.) They can only take over the console that is +occupied by the system driver. In the same token, if the modular driver is +released by the console, the system driver will take over. + +Modular drivers, from the programmer's point of view, has to call: + + take_over_console() - load and bind driver to console layer + give_up_console() - unbind and unload driver + +In newer kernels, the following are also available: + + register_con_driver() + unregister_con_driver() + +If sysfs is enabled, the contents of /sys/class/vtconsole can be +examined. This shows the console backends currently registered by the +system which are named vtcon<n> where <n> is an integer fro 0 to 15. Thus: + + ls /sys/class/vtconsole + . .. vtcon0 vtcon1 + +Each directory in /sys/class/vtconsole has 3 files: + + ls /sys/class/vtconsole/vtcon0 + . .. bind name uevent + +What do these files signify? + + 1. bind - this is a read/write file. It shows the status of the driver if + read, or acts to bind or unbind the driver to the virtual consoles + when written to. The possible values are: + + 0 - means the driver is not bound and if echo'ed, commands the driver + to unbind + + 1 - means the driver is bound and if echo'ed, commands the driver to + bind + + 2. name - read-only file. Shows the name of the driver in this format: + + cat /sys/class/vtconsole/vtcon0/name + (S) VGA+ + + '(S)' stands for a (S)ystem driver, ie, it cannot be directly + commanded to bind or unbind + + 'VGA+' is the name of the driver + + cat /sys/class/vtconsole/vtcon1/name + (M) frame buffer device + + In this case, '(M)' stands for a (M)odular driver, one that can be + directly commanded to bind or unbind. + + 3. uevent - ignore this file + +When unbinding, the modular driver is detached first, and then the system +driver takes over the consoles vacated by the driver. Binding, on the other +hand, will bind the driver to the consoles that are currently occupied by a +system driver. + +NOTE1: Binding and binding must be selected in Kconfig. It's under: + +Device Drivers -> Character devices -> Support for binding and unbinding +console drivers + +NOTE2: If any of the virtual consoles are in KD_GRAPHICS mode, then binding or +unbinding will not succeed. An example of an application that sets the console +to KD_GRAPHICS is X. + +How useful is this feature? This is very useful for console driver +developers. By unbinding the driver from the console layer, one can unload the +driver, make changes, recompile, reload and rebind the driver without any need +for rebooting the kernel. For regular users who may want to switch from +framebuffer console to VGA console and vice versa, this feature also makes +this possible. (NOTE NOTE NOTE: Please read fbcon.txt under Documentation/fb +for more details). + +Notes for developers: +===================== + +take_over_console() is now broken up into: + + register_con_driver() + bind_con_driver() - private function + +give_up_console() is a wrapper to unregister_con_driver(), and a driver must +be fully unbound for this call to succeed. con_is_bound() will check if the +driver is bound or not. + +Guidelines for console driver writers: +===================================== + +In order for binding to and unbinding from the console to properly work, +console drivers must follow these guidelines: + +1. All drivers, except system drivers, must call either register_con_driver() + or take_over_console(). register_con_driver() will just add the driver to + the console's internal list. It won't take over the + console. take_over_console(), as it name implies, will also take over (or + bind to) the console. + +2. All resources allocated during con->con_init() must be released in + con->con_deinit(). + +3. All resources allocated in con->con_startup() must be released when the + driver, which was previously bound, becomes unbound. The console layer + does not have a complementary call to con->con_startup() so it's up to the + driver to check when it's legal to release these resources. Calling + con_is_bound() in con->con_deinit() will help. If the call returned + false(), then it's safe to release the resources. This balance has to be + ensured because con->con_startup() can be called again when a request to + rebind the driver to the console arrives. + +4. Upon exit of the driver, ensure that the driver is totally unbound. If the + condition is satisfied, then the driver must call unregister_con_driver() + or give_up_console(). + +5. unregister_con_driver() can also be called on conditions which make it + impossible for the driver to service console requests. This can happen + with the framebuffer console that suddenly lost all of its drivers. + +The current crop of console drivers should still work correctly, but binding +and unbinding them may cause problems. With minimal fixes, these drivers can +be made to work correctly. + +========================== +Antonino Daplas <adaplas@pol.net> + diff --git a/Documentation/cpu-freq/user-guide.txt b/Documentation/cpu-freq/user-guide.txt index 7fedc00c3d30..555c8cf3650a 100644 --- a/Documentation/cpu-freq/user-guide.txt +++ b/Documentation/cpu-freq/user-guide.txt @@ -153,10 +153,13 @@ scaling_governor, and by "echoing" the name of another that some governors won't load - they only work on some specific architectures or processors. -scaling_min_freq and +scaling_min_freq and scaling_max_freq show the current "policy limits" (in kHz). By echoing new values into these files, you can change these limits. + NOTE: when setting a policy you need to + first set scaling_max_freq, then + scaling_min_freq. If you have selected the "userspace" governor which allows you to diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt index 1bcf69996c9d..bc107cb157a8 100644 --- a/Documentation/cpu-hotplug.txt +++ b/Documentation/cpu-hotplug.txt @@ -251,16 +251,24 @@ A: This is what you would need in your kernel code to receive notifications. return NOTIFY_OK; } - static struct notifier_block foobar_cpu_notifer = + static struct notifier_block __cpuinitdata foobar_cpu_notifer = { .notifier_call = foobar_cpu_callback, }; +You need to call register_cpu_notifier() from your init function. +Init functions could be of two types: +1. early init (init function called when only the boot processor is online). +2. late init (init function called _after_ all the CPUs are online). -In your init function, +For the first case, you should add the following to your init function register_cpu_notifier(&foobar_cpu_notifier); +For the second case, you should add the following to your init function + + register_hotcpu_notifier(&foobar_cpu_notifier); + You can fail PREPARE notifiers if something doesn't work to prepare resources. This will stop the activity and send a following CANCELED event back. diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 159e2a0c3e80..842f0d1ab216 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -217,6 +217,12 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs) to represent the cpuset hierarchy provides for a familiar permission and name space for cpusets, with a minimum of additional kernel code. +The cpus and mems files in the root (top_cpuset) cpuset are +read-only. The cpus file automatically tracks the value of +cpu_online_map using a CPU hotplug notifier, and the mems file +automatically tracks the value of node_online_map using the +cpuset_track_online_nodes() hook. + 1.4 What are exclusive cpusets ? -------------------------------- diff --git a/Documentation/crypto/api-intro.txt b/Documentation/crypto/api-intro.txt index 74dffc68ff9f..5a03a2801d67 100644 --- a/Documentation/crypto/api-intro.txt +++ b/Documentation/crypto/api-intro.txt @@ -19,15 +19,14 @@ At the lowest level are algorithms, which register dynamically with the API. 'Transforms' are user-instantiated objects, which maintain state, handle all -of the implementation logic (e.g. manipulating page vectors), provide an -abstraction to the underlying algorithms, and handle common logical -operations (e.g. cipher modes, HMAC for digests). However, at the user +of the implementation logic (e.g. manipulating page vectors) and provide an +abstraction to the underlying algorithms. However, at the user level they are very simple. Conceptually, the API layering looks like this: [transform api] (user interface) - [transform ops] (per-type logic glue e.g. cipher.c, digest.c) + [transform ops] (per-type logic glue e.g. cipher.c, compress.c) [algorithm api] (for registering algorithms) The idea is to make the user interface and algorithm registration API @@ -44,22 +43,27 @@ under development. Here's an example of how to use the API: #include <linux/crypto.h> + #include <linux/err.h> + #include <linux/scatterlist.h> struct scatterlist sg[2]; char result[128]; - struct crypto_tfm *tfm; + struct crypto_hash *tfm; + struct hash_desc desc; - tfm = crypto_alloc_tfm("md5", 0); - if (tfm == NULL) + tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC); + if (IS_ERR(tfm)) fail(); /* ... set up the scatterlists ... */ + + desc.tfm = tfm; + desc.flags = 0; - crypto_digest_init(tfm); - crypto_digest_update(tfm, &sg, 2); - crypto_digest_final(tfm, result); + if (crypto_hash_digest(&desc, &sg, 2, result)) + fail(); - crypto_free_tfm(tfm); + crypto_free_hash(tfm); Many real examples are available in the regression test module (tcrypt.c). @@ -126,7 +130,7 @@ might already be working on. BUGS Send bug reports to: -James Morris <jmorris@redhat.com> +Herbert Xu <herbert@gondor.apana.org.au> Cc: David S. Miller <davem@redhat.com> @@ -134,13 +138,14 @@ FURTHER INFORMATION For further patches and various updates, including the current TODO list, see: -http://samba.org/~jamesm/crypto/ +http://gondor.apana.org.au/~herbert/crypto/ AUTHORS James Morris David S. Miller +Herbert Xu CREDITS @@ -238,8 +243,11 @@ Anubis algorithm contributors: Tiger algorithm contributors: Aaron Grothe +VIA PadLock contributors: + Michal Ludvig + Generic scatterwalk code by Adam J. Richter <adam@yggdrasil.com> Please send any credits updates or corrections to: -James Morris <jmorris@redhat.com> +Herbert Xu <herbert@gondor.apana.org.au> diff --git a/Documentation/devices.txt b/Documentation/devices.txt index b2f593fc76ca..addc67b1d770 100644 --- a/Documentation/devices.txt +++ b/Documentation/devices.txt @@ -3,7 +3,7 @@ Maintained by Torben Mathiasen <device@lanana.org> - Last revised: 01 March 2006 + Last revised: 15 May 2006 This list is the Linux Device List, the official registry of allocated device numbers and /dev directory nodes for the Linux operating @@ -2543,6 +2543,9 @@ Your cooperation is appreciated. 64 = /dev/usb/rio500 Diamond Rio 500 65 = /dev/usb/usblcd USBLCD Interface (info@usblcd.de) 66 = /dev/usb/cpad0 Synaptics cPad (mouse/LCD) + 67 = /dev/usb/adutux0 1st Ontrak ADU device + ... + 76 = /dev/usb/adutux10 10th Ontrak ADU device 96 = /dev/usb/hiddev0 1st USB HID device ... 111 = /dev/usb/hiddev15 16th USB HID device @@ -2565,10 +2568,10 @@ Your cooperation is appreciated. 243 = /dev/usb/dabusb3 Fourth dabusb device 180 block USB block devices - 0 = /dev/uba First USB block device - 8 = /dev/ubb Second USB block device - 16 = /dev/ubc Thrid USB block device - ... + 0 = /dev/uba First USB block device + 8 = /dev/ubb Second USB block device + 16 = /dev/ubc Third USB block device + ... 181 char Conrad Electronic parallel port radio clocks 0 = /dev/pcfclock0 First Conrad radio clock @@ -2791,6 +2794,7 @@ Your cooperation is appreciated. 170 = /dev/ttyNX0 Hilscher netX serial port 0 ... 185 = /dev/ttyNX15 Hilscher netX serial port 15 + 186 = /dev/ttyJ0 JTAG1 DCC protocol based serial port emulation 205 char Low-density serial ports (alternate device) 0 = /dev/culu0 Callout device for ttyLU0 @@ -3108,6 +3112,10 @@ Your cooperation is appreciated. ... 240 = /dev/rfdp 16th RFD FTL layer +257 char Phoenix Technologies Cryptographic Services Driver + 0 = /dev/ptlsec Crypto Services Driver + + **** ADDITIONAL /dev DIRECTORY ENTRIES diff --git a/Documentation/digiepca.txt b/Documentation/digiepca.txt index 88820fe38dad..f2560e22f2c9 100644 --- a/Documentation/digiepca.txt +++ b/Documentation/digiepca.txt @@ -2,7 +2,7 @@ NOTE: This driver is obsolete. Digi provides a 2.6 driver (dgdm) at http://www.digi.com for PCI cards. They no longer maintain this driver, and have no 2.6 driver for ISA cards. -This driver requires a number of user-space tools. They can be aquired from +This driver requires a number of user-space tools. They can be acquired from http://www.digi.com, but only works with 2.4 kernels. diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 24adfe9af3ca..63c2d0c55aa2 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -135,6 +135,7 @@ tags times.h* tkparse trix_boot.h +utsrelease.h* version.h* vmlinux vmlinux-* diff --git a/Documentation/driver-model/overview.txt b/Documentation/driver-model/overview.txt index ac4a7a737e43..2050c9ffc629 100644 --- a/Documentation/driver-model/overview.txt +++ b/Documentation/driver-model/overview.txt @@ -18,7 +18,7 @@ Traditional driver models implemented some sort of tree-like structure (sometimes just a list) for the devices they control. There wasn't any uniformity across the different bus types. -The current driver model provides a comon, uniform data model for describing +The current driver model provides a common, uniform data model for describing a bus and the devices that can appear under the bus. The unified bus model includes a set of common attributes which all busses carry, and a set of common callbacks, such as device discovery during bus probing, bus diff --git a/Documentation/drivers/edac/edac.txt b/Documentation/drivers/edac/edac.txt index 70d96a62e5e1..7b3d969d2964 100644 --- a/Documentation/drivers/edac/edac.txt +++ b/Documentation/drivers/edac/edac.txt @@ -35,15 +35,14 @@ the vendor should tie the parity status bits to 0 if they do not intend to generate parity. Some vendors do not do this, and thus the parity bit can "float" giving false positives. -The PCI Parity EDAC device has the ability to "skip" known flaky -cards during the parity scan. These are set by the parity "blacklist" -interface in the sysfs for PCI Parity. (See the PCI section in the sysfs -section below.) There is also a parity "whitelist" which is used as -an explicit list of devices to scan, while the blacklist is a list -of devices to skip. +[There are patches in the kernel queue which will allow for storage of +quirks of PCI devices reporting false parity positives. The 2.6.18 +kernel should have those patches included. When that becomes available, +then EDAC will be patched to utilize that information to "skip" such +devices.] -EDAC will have future error detectors that will be added or integrated -into EDAC in the following list: +EDAC will have future error detectors that will be integrated with +EDAC or added to it, in the following list: MCE Machine Check Exception MCA Machine Check Architecture @@ -93,22 +92,24 @@ EDAC lives in the /sys/devices/system/edac directory. Within this directory there currently reside 2 'edac' components: mc memory controller(s) system - pci PCI status system + pci PCI control and status system ============================================================================ Memory Controller (mc) Model First a background on the memory controller's model abstracted in EDAC. -Each mc device controls a set of DIMM memory modules. These modules are +Each 'mc' device controls a set of DIMM memory modules. These modules are laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can -be multiple csrows and two channels. +be multiple csrows and multiple channels. Memory controllers allow for several csrows, with 8 csrows being a typical value. Yet, the actual number of csrows depends on the electrical "loading" of a given motherboard, memory controller and DIMM characteristics. Dual channels allows for 128 bit data transfers to the CPU from memory. +Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs +(FB-DIMMs). The following example will assume 2 channels: Channel 0 Channel 1 @@ -234,23 +235,15 @@ Polling period control file: The time period, in milliseconds, for polling for error information. Too small a value wastes resources. Too large a value might delay necessary handling of errors and might loose valuable information for - locating the error. 1000 milliseconds (once each second) is about - right for most uses. + locating the error. 1000 milliseconds (once each second) is the current + default. Systems which require all the bandwidth they can get, may + increase this. LOAD TIME: module/kernel parameter: poll_msec=[0|1] RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec -Module Version read-only attribute file: - - 'mc_version' - - The EDAC CORE module's version and compile date are shown here to - indicate what EDAC is running. - - - ============================================================================ 'mcX' DIRECTORIES @@ -284,35 +277,6 @@ Seconds since last counter reset control file: -DIMM capability attribute file: - - 'edac_capability' - - The EDAC (Error Detection and Correction) capabilities/modes of - the memory controller hardware. - - -DIMM Current Capability attribute file: - - 'edac_current_capability' - - The EDAC capabilities available with the hardware - configuration. This may not be the same as "EDAC capability" - if the correct memory is not used. If a memory controller is - capable of EDAC, but DIMMs without check bits are in use, then - Parity, SECDED, S4ECD4ED capabilities will not be available - even though the memory controller might be capable of those - modes with the proper memory loaded. - - -Memory Type supported on this controller attribute file: - - 'supported_mem_type' - - This attribute file displays the memory type, usually - buffered and unbuffered DIMMs. - - Memory Controller name attribute file: 'mc_name' @@ -321,16 +285,6 @@ Memory Controller name attribute file: that is being utilized. -Memory Controller Module name attribute file: - - 'module_name' - - This attribute file displays the memory controller module name, - version and date built. The name of the memory controller - hardware - some drivers work with multiple controllers and - this field shows which hardware is present. - - Total memory managed by this memory controller attribute file: 'size_mb' @@ -432,6 +386,9 @@ Memory Type attribute file: This attribute file will display what type of memory is currently on this csrow. Normally, either buffered or unbuffered memory. + Examples: + Registered-DDR + Unbuffered-DDR EDAC Mode of operation attribute file: @@ -446,8 +403,13 @@ Device type attribute file: 'dev_type' - This attribute file will display what type of DIMM device is - being utilized. Example: x4 + This attribute file will display what type of DRAM device is + being utilized on this DIMM. + Examples: + x1 + x2 + x4 + x8 Channel 0 CE Count attribute file: @@ -522,10 +484,10 @@ SYSTEM LOGGING If logging for UEs and CEs are enabled then system logs will have error notices indicating errors that have been detected: -MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, +EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac -MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, +EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac @@ -610,64 +572,4 @@ Parity Count: -PCI Device Whitelist: - - 'pci_parity_whitelist' - - This control file allows for an explicit list of PCI devices to be - scanned for parity errors. Only devices found on this list will - be examined. The list is a line of hexadecimal VENDOR and DEVICE - ID tuples: - - 1022:7450,1434:16a6 - - One or more can be inserted, separated by a comma. - - To write the above list doing the following as one command line: - - echo "1022:7450,1434:16a6" - > /sys/devices/system/edac/pci/pci_parity_whitelist - - - - To display what the whitelist is, simply 'cat' the same file. - - -PCI Device Blacklist: - - 'pci_parity_blacklist' - - This control file allows for a list of PCI devices to be - skipped for scanning. - The list is a line of hexadecimal VENDOR and DEVICE ID tuples: - - 1022:7450,1434:16a6 - - One or more can be inserted, separated by a comma. - - To write the above list doing the following as one command line: - - echo "1022:7450,1434:16a6" - > /sys/devices/system/edac/pci/pci_parity_blacklist - - - To display what the whitelist currently contains, - simply 'cat' the same file. - ======================================================================= - -PCI Vendor and Devices IDs can be obtained with the lspci command. Using -the -n option lspci will display the vendor and device IDs. The system -administrator will have to determine which devices should be scanned or -skipped. - - - -The two lists (white and black) are prioritized. blacklist is the lower -priority and will NOT be utilized when a whitelist has been set. -Turn OFF a whitelist by an empty echo command: - - echo > /sys/devices/system/edac/pci/pci_parity_whitelist - -and any previous blacklist will be utilized. - diff --git a/Documentation/fb/fbcon.txt b/Documentation/fb/fbcon.txt index 08dce0f631bf..f373df12ed4c 100644 --- a/Documentation/fb/fbcon.txt +++ b/Documentation/fb/fbcon.txt @@ -135,10 +135,10 @@ C. Boot options The angle can be changed anytime afterwards by 'echoing' the same numbers to any one of the 2 attributes found in - /sys/class/graphics/fb{x} + /sys/class/graphics/fbcon - con_rotate - rotate the display of the active console - con_rotate_all - rotate the display of all consoles + rotate - rotate the display of the active console + rotate_all - rotate the display of all consoles Console rotation will only become available if Console Rotation Support is compiled in your kernel. @@ -148,5 +148,177 @@ C. Boot options Actually, the underlying fb driver is totally ignorant of console rotation. ---- +C. Attaching, Detaching and Unloading + +Before going on on how to attach, detach and unload the framebuffer console, an +illustration of the dependencies may help. + +The console layer, as with most subsystems, needs a driver that interfaces with +the hardware. Thus, in a VGA console: + +console ---> VGA driver ---> hardware. + +Assuming the VGA driver can be unloaded, one must first unbind the VGA driver +from the console layer before unloading the driver. The VGA driver cannot be +unloaded if it is still bound to the console layer. (See +Documentation/console/console.txt for more information). + +This is more complicated in the case of the the framebuffer console (fbcon), +because fbcon is an intermediate layer between the console and the drivers: + +console ---> fbcon ---> fbdev drivers ---> hardware + +The fbdev drivers cannot be unloaded if it's bound to fbcon, and fbcon cannot +be unloaded if it's bound to the console layer. + +So to unload the fbdev drivers, one must first unbind fbcon from the console, +then unbind the fbdev drivers from fbcon. Fortunately, unbinding fbcon from +the console layer will automatically unbind framebuffer drivers from +fbcon. Thus, there is no need to explicitly unbind the fbdev drivers from +fbcon. + +So, how do we unbind fbcon from the console? Part of the answer is in +Documentation/console/console.txt. To summarize: + +Echo a value to the bind file that represents the framebuffer console +driver. So assuming vtcon1 represents fbcon, then: + +echo 1 > sys/class/vtconsole/vtcon1/bind - attach framebuffer console to + console layer +echo 0 > sys/class/vtconsole/vtcon1/bind - detach framebuffer console from + console layer + +If fbcon is detached from the console layer, your boot console driver (which is +usually VGA text mode) will take over. A few drivers (rivafb and i810fb) will +restore VGA text mode for you. With the rest, before detaching fbcon, you +must take a few additional steps to make sure that your VGA text mode is +restored properly. The following is one of the several methods that you can do: + +1. Download or install vbetool. This utility is included with most + distributions nowadays, and is usually part of the suspend/resume tool. + +2. In your kernel configuration, ensure that CONFIG_FRAMEBUFFER_CONSOLE is set + to 'y' or 'm'. Enable one or more of your favorite framebuffer drivers. + +3. Boot into text mode and as root run: + + vbetool vbestate save > <vga state file> + + The above command saves the register contents of your graphics + hardware to <vga state file>. You need to do this step only once as + the state file can be reused. + +4. If fbcon is compiled as a module, load fbcon by doing: + + modprobe fbcon + +5. Now to detach fbcon: + + vbetool vbestate restore < <vga state file> && \ + echo 0 > /sys/class/vtconsole/vtcon1/bind + +6. That's it, you're back to VGA mode. And if you compiled fbcon as a module, + you can unload it by 'rmmod fbcon' + +7. To reattach fbcon: + + echo 1 > /sys/class/vtconsole/vtcon1/bind + +8. Once fbcon is unbound, all drivers registered to the system will also +become unbound. This means that fbcon and individual framebuffer drivers +can be unloaded or reloaded at will. Reloading the drivers or fbcon will +automatically bind the console, fbcon and the drivers together. Unloading +all the drivers without unloading fbcon will make it impossible for the +console to bind fbcon. + +Notes for vesafb users: +======================= + +Unfortunately, if your bootline includes a vga=xxx parameter that sets the +hardware in graphics mode, such as when loading vesafb, vgacon will not load. +Instead, vgacon will replace the default boot console with dummycon, and you +won't get any display after detaching fbcon. Your machine is still alive, so +you can reattach vesafb. However, to reattach vesafb, you need to do one of +the following: + +Variation 1: + + a. Before detaching fbcon, do + + vbetool vbemode save > <vesa state file> # do once for each vesafb mode, + # the file can be reused + + b. Detach fbcon as in step 5. + + c. Attach fbcon + + vbetool vbestate restore < <vesa state file> && \ + echo 1 > /sys/class/vtconsole/vtcon1/bind + +Variation 2: + + a. Before detaching fbcon, do: + echo <ID> > /sys/class/tty/console/bind + + + vbetool vbemode get + + b. Take note of the mode number + + b. Detach fbcon as in step 5. + + c. Attach fbcon: + + vbetool vbemode set <mode number> && \ + echo 1 > /sys/class/vtconsole/vtcon1/bind + +Samples: +======== + +Here are 2 sample bash scripts that you can use to bind or unbind the +framebuffer console driver if you are in an X86 box: + +--------------------------------------------------------------------------- +#!/bin/bash +# Unbind fbcon + +# Change this to where your actual vgastate file is located +# Or Use VGASTATE=$1 to indicate the state file at runtime +VGASTATE=/tmp/vgastate + +# path to vbetool +VBETOOL=/usr/local/bin + + +for (( i = 0; i < 16; i++)) +do + if test -x /sys/class/vtconsole/vtcon$i; then + if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \ + = 1 ]; then + if test -x $VBETOOL/vbetool; then + echo Unbinding vtcon$i + $VBETOOL/vbetool vbestate restore < $VGASTATE + echo 0 > /sys/class/vtconsole/vtcon$i/bind + fi + fi + fi +done + +--------------------------------------------------------------------------- +#!/bin/bash +# Bind fbcon + +for (( i = 0; i < 16; i++)) +do + if test -x /sys/class/vtconsole/vtcon$i; then + if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \ + = 1 ]; then + echo Unbinding vtcon$i + echo 1 > /sys/class/vtconsole/vtcon$i/bind + fi + fi +done +--------------------------------------------------------------------------- + +-- Antonino Daplas <adaplas@pol.net> diff --git a/Documentation/fb/imacfb.txt b/Documentation/fb/imacfb.txt new file mode 100644 index 000000000000..759028545a7e --- /dev/null +++ b/Documentation/fb/imacfb.txt @@ -0,0 +1,31 @@ + +What is imacfb? +=============== + +This is a generic EFI platform driver for Intel based Apple computers. +Imacfb is only for EFI booted Intel Macs. + +Supported Hardware +================== + +iMac 17"/20" +Macbook +Macbook Pro 15"/17" +MacMini + +How to use it? +============== + +Imacfb does not have any kind of autodetection of your machine. +You have to add the fillowing kernel parameters in your elilo.conf: + Macbook : + video=imacfb:macbook + MacMini : + video=imacfb:mini + Macbook Pro 15", iMac 17" : + video=imacfb:i17 + Macbook Pro 17", iMac 20" : + video=imacfb:i20 + +-- +Edgar Hucek <gimli@dark-green.com> diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 027285d0c26c..436697cb9388 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -6,14 +6,18 @@ be removed from this file. --------------------------- -What: devfs -When: July 2005 -Files: fs/devfs/*, include/linux/devfs_fs*.h and assorted devfs - function calls throughout the kernel tree -Why: It has been unmaintained for a number of years, has unfixable - races, contains a naming policy within the kernel that is - against the LSB, and can be replaced by using udev. -Who: Greg Kroah-Hartman <greg@kroah.com> +What: /sys/devices/.../power/state + dev->power.power_state + dpm_runtime_{suspend,resume)() +When: July 2007 +Why: Broken design for runtime control over driver power states, confusing + driver-internal runtime power management with: mechanisms to support + system-wide sleep state transitions; event codes that distinguish + different phases of swsusp "sleep" transitions; and userspace policy + inputs. This framework was never widely used, and most attempts to + use it were broken. Drivers should instead be exposing domain-specific + interfaces either to kernel or to userspace. +Who: Pavel Machek <pavel@suse.cz> --------------------------- @@ -66,11 +70,15 @@ Who: Mauro Carvalho Chehab <mchehab@brturbo.com.br> --------------------------- -What: remove EXPORT_SYMBOL(insert_resource) -When: April 2006 -Files: kernel/resource.c -Why: No modular usage in the kernel. -Who: Adrian Bunk <bunk@stusta.de> +What: sys_sysctl +When: January 2007 +Why: The same information is available through /proc/sys and that is the + interface user space prefers to use. And there do not appear to be + any existing user in user space of sys_sysctl. The additional + maintenance overhead of keeping a set of binary names gets + in the way of doing a good job of maintaining this interface. + +Who: Eric Biederman <ebiederm@xmission.com> --------------------------- @@ -132,16 +140,6 @@ Who: NeilBrown <neilb@suse.de> --------------------------- -What: au1x00_uart driver -When: January 2006 -Why: The 8250 serial driver now has the ability to deal with the differences - between the standard 8250 family of UARTs and their slightly strange - brother on Alchemy SOCs. The loss of features is not considered an - issue. -Who: Ralf Baechle <ralf@linux-mips.org> - ---------------------------- - What: eepro100 network driver When: January 2007 Why: replaced by the e100 driver @@ -149,6 +147,13 @@ Who: Adrian Bunk <bunk@stusta.de> --------------------------- +What: drivers depending on OSS_OBSOLETE_DRIVER +When: options in 2.6.20, code in 2.6.22 +Why: OSS drivers with ALSA replacements +Who: Adrian Bunk <bunk@stusta.de> + +--------------------------- + What: pci_module_init(driver) When: January 2007 Why: Is replaced by pci_register_driver(pci_driver). @@ -177,14 +182,13 @@ Who: Jean Delvare <khali@linux-fr.org> --------------------------- -What: remove EXPORT_SYMBOL(tasklist_lock) -When: August 2006 -Files: kernel/fork.c -Why: tasklist_lock protects the kernel internal task list. Modules have - no business looking at it, and all instances in drivers have been due - to use of too-lowlevel APIs. Having this symbol exported prevents - moving to more scalable locking schemes for the task list. -Who: Christoph Hellwig <hch@lst.de> +What: Unused EXPORT_SYMBOL/EXPORT_SYMBOL_GPL exports + (temporary transition config option provided until then) + The transition config option will also be removed at the same time. +When: before 2.6.19 +Why: Unused symbols are both increasing the size of the kernel binary + and are often a sign of "wrong API" +Who: Arjan van de Ven <arjan@linux.intel.com> --------------------------- @@ -224,3 +228,109 @@ Why: The interface no longer has any callers left in the kernel. It Who: Nick Piggin <npiggin@suse.de> --------------------------- + +What: Support for the Momentum / PMC-Sierra Jaguar ATX evaluation board +When: September 2006 +Why: Does no longer build since quite some time, and was never popular, + due to the platform being replaced by successor models. Apparently + no user base left. It also is one of the last users of + WANT_PAGE_VIRTUAL. +Who: Ralf Baechle <ralf@linux-mips.org> + +--------------------------- + +What: Support for the Momentum Ocelot, Ocelot 3, Ocelot C and Ocelot G +When: September 2006 +Why: Some do no longer build and apparently there is no user base left + for these platforms. +Who: Ralf Baechle <ralf@linux-mips.org> + +--------------------------- + +What: Support for MIPS Technologies' Altas and SEAD evaluation board +When: September 2006 +Why: Some do no longer build and apparently there is no user base left + for these platforms. Hardware out of production since several years. +Who: Ralf Baechle <ralf@linux-mips.org> + +--------------------------- + +What: Support for the IT8172-based platforms, ITE 8172G and Globespan IVR +When: September 2006 +Why: Code does no longer build since at least 2.6.0, apparently there is + no user base left for these platforms. Hardware out of production + since several years and hardly a trace of the manufacturer left on + the net. +Who: Ralf Baechle <ralf@linux-mips.org> + +--------------------------- + +What: Interrupt only SA_* flags +When: Januar 2007 +Why: The interrupt related SA_* flags are replaced by IRQF_* to move them + out of the signal namespace. + +Who: Thomas Gleixner <tglx@linutronix.de> + +--------------------------- + +What: i2c-ite and i2c-algo-ite drivers +When: September 2006 +Why: These drivers never compiled since they were added to the kernel + tree 5 years ago. This feature removal can be reevaluated if + someone shows interest in the drivers, fixes them and takes over + maintenance. + http://marc.theaimsgroup.com/?l=linux-mips&m=115040510817448 +Who: Jean Delvare <khali@linux-fr.org> + +--------------------------- + +What: Bridge netfilter deferred IPv4/IPv6 output hook calling +When: January 2007 +Why: The deferred output hooks are a layering violation causing unusual + and broken behaviour on bridge devices. Examples of things they + break include QoS classifation using the MARK or CLASSIFY targets, + the IPsec policy match and connection tracking with VLANs on a + bridge. Their only use is to enable bridge output port filtering + within iptables with the physdev match, which can also be done by + combining iptables and ebtables using netfilter marks. Until it + will get removed the hook deferral is disabled by default and is + only enabled when needed. + +Who: Patrick McHardy <kaber@trash.net> + +--------------------------- + +What: frame diverter +When: November 2006 +Why: The frame diverter is included in most distribution kernels, but is + broken. It does not correctly handle many things: + - IPV6 + - non-linear skb's + - network device RCU on removal + - input frames not correctly checked for protocol errors + It also adds allocation overhead even if not enabled. + It is not clear if anyone is still using it. +Who: Stephen Hemminger <shemminger@osdl.org> + +--------------------------- + + +What: PHYSDEVPATH, PHYSDEVBUS, PHYSDEVDRIVER in the uevent environment +When: Oktober 2008 +Why: The stacking of class devices makes these values misleading and + inconsistent. + Class devices should not carry any of these properties, and bus + devices have SUBSYTEM and DRIVER as a replacement. +Who: Kay Sievers <kay.sievers@suse.de> + +--------------------------- + +What: i2c-isa +When: December 2006 +Why: i2c-isa is a non-sense and doesn't fit in the device driver + model. Drivers relying on it are better implemented as platform + drivers. +Who: Jean Delvare <khali@linux-fr.org> + +--------------------------- diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index 66fdc0744fe0..16dec61d7671 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -62,8 +62,8 @@ ramfs-rootfs-initramfs.txt - info on the 'in memory' filesystems ramfs, rootfs and initramfs. reiser4.txt - info on the Reiser4 filesystem based on dancing tree algorithms. -relayfs.txt - - info on relayfs, for efficient streaming from kernel to user space. +relay.txt + - info on relay, for efficient streaming from kernel to user space. romfs.txt - description of the ROMFS filesystem. smbfs.txt diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index d31efbbdfe50..247d7f619aa2 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -142,8 +142,8 @@ see also dquot_operations section. --------------------------- file_system_type --------------------------- prototypes: - struct int (*get_sb) (struct file_system_type *, int, - const char *, void *, struct vfsmount *); + int (*get_sb) (struct file_system_type *, int, + const char *, void *, struct vfsmount *); void (*kill_sb) (struct super_block *); locking rules: may block BKL diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt index 58c65a1713e5..7cac200e2a85 100644 --- a/Documentation/filesystems/automount-support.txt +++ b/Documentation/filesystems/automount-support.txt @@ -19,7 +19,7 @@ following procedure: (2) Have the follow_link() op do the following steps: - (a) Call do_kern_mount() to call the appropriate filesystem to set up a + (a) Call vfs_kern_mount() to call the appropriate filesystem to set up a superblock and gain a vfsmount structure representing it. (b) Copy the nameidata provided as an argument and substitute the dentry diff --git a/Documentation/filesystems/configfs/configfs_example.c b/Documentation/filesystems/configfs/configfs_example.c index 3d4713a6c207..2d6a14a463e0 100644 --- a/Documentation/filesystems/configfs/configfs_example.c +++ b/Documentation/filesystems/configfs/configfs_example.c @@ -264,6 +264,15 @@ static struct config_item_type simple_child_type = { }; +struct simple_children { + struct config_group group; +}; + +static inline struct simple_children *to_simple_children(struct config_item *item) +{ + return item ? container_of(to_config_group(item), struct simple_children, group) : NULL; +} + static struct config_item *simple_children_make_item(struct config_group *group, const char *name) { struct simple_child *simple_child; @@ -304,7 +313,13 @@ static ssize_t simple_children_attr_show(struct config_item *item, "items have only one attribute that is readable and writeable.\n"); } +static void simple_children_release(struct config_item *item) +{ + kfree(to_simple_children(item)); +} + static struct configfs_item_operations simple_children_item_ops = { + .release = simple_children_release, .show_attribute = simple_children_attr_show, }; @@ -345,10 +360,6 @@ static struct configfs_subsystem simple_children_subsys = { * children of its own. */ -struct simple_children { - struct config_group group; -}; - static struct config_group *group_children_make_group(struct config_group *group, const char *name) { struct simple_children *simple_children; diff --git a/Documentation/filesystems/devfs/ChangeLog b/Documentation/filesystems/devfs/ChangeLog deleted file mode 100644 index e5aba5246d7c..000000000000 --- a/Documentation/filesystems/devfs/ChangeLog +++ /dev/null @@ -1,1977 +0,0 @@ -/* -*- auto-fill -*- */ -=============================================================================== -Changes for patch v1 - -- creation of devfs - -- modified miscellaneous character devices to support devfs -=============================================================================== -Changes for patch v2 - -- bug fix with manual inode creation -=============================================================================== -Changes for patch v3 - -- bugfixes - -- documentation improvements - -- created a couple of scripts (one to save&restore a devfs and the - other to set up compatibility symlinks) - -- devfs support for SCSI discs. New name format is: sd_hHcCiIlL -=============================================================================== -Changes for patch v4 - -- bugfix for the directory reading code - -- bugfix for compilation with kerneld - -- devfs support for generic hard discs - -- rationalisation of the various watchdog drivers -=============================================================================== -Changes for patch v5 - -- support for mounting directly from entries in the devfs (it doesn't - need to be mounted to do this), including the root filesystem. - Mounting of swap partitions also works. Hence, now if you set - CONFIG_DEVFS_ONLY to 'Y' then you won't be able to access your discs - via ordinary device nodes. Naturally, the default is 'N' so that you - can still use your old device nodes. If you want to mount from devfs - entries, make sure you use: append = "root=/dev/sd_..." in your - lilo.conf. It seems LILO looks for the device number (major&minor) - and writes that into the kernel image :-( - -- support for character memory devices (/dev/null, /dev/zero, /dev/full - and so on). Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> -=============================================================================== -Changes for patch v6 - -- support for subdirectories - -- support for symbolic links (created by devfs_mk_symlink(), no - support yet for creation via symlink(2)) - -- SCSI disc naming now cast in stone, with the format: - /dev/sd/c0b1t2u3 controller=0, bus=1, ID=2, LUN=3, whole disc - /dev/sd/c0b1t2u3p4 controller=0, bus=1, ID=2, LUN=3, 4th partition - -- loop devices now appear in devfs - -- tty devices, console, serial ports, etc. now appear in devfs - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- bugs with mounting devfs-only devices now fixed -=============================================================================== -Changes for patch v7 - -- SCSI CD-ROMS, tapes and generic devices now appear in devfs -=============================================================================== -Changes for patch v8 - -- bugfix with no-rewind SCSI tapes - -- RAMDISCs now appear in devfs - -- better cleaning up of devfs entries created by various modules - -- interface change to <devfs_register> -=============================================================================== -Changes for patch v9 - -- the v8 patch was corrupted somehow, which would affect the patch for - linux/fs/filesystems.c - I've also fixed the v8 patch file on the WWW - -- MetaDevices (/dev/md*) should now appear in devfs -=============================================================================== -Changes for patch v10 - -- bugfix in meta device support for devfs - -- created this ChangeLog file - -- added devfs support to the floppy driver - -- added support for creating sockets in a devfs -=============================================================================== -Changes for patch v11 - -- added DEVFS_FL_HIDE_UNREG flag - -- incorporated better patch for ttyname() in libc 5.4.43 from H.J. Lu. - -- interface change to <devfs_mk_symlink> - -- support for creating symlinks with symlink(2) - -- parallel port printer (/dev/lp*) now appears in devfs -=============================================================================== -Changes for patch v12 - -- added inode check to <devfs_fill_file> function - -- improved devfs support when mounting from devfs - -- added call to <<release>> operation when removing swap areas on - devfs devices - -- increased NR_SUPER to 128 to support large numbers of devfs mounts - (for chroot(2) gaols) - -- fixed bug in SCSI disc support: was generating incorrect minors if - SCSI ID's did not start at 0 and increase by 1 - -- support symlink traversal when mounting root -=============================================================================== -Changes for patch v13 - -- added devfs support to soundcard driver - Thanks to Eric Dumas <dumas@linux.eu.org> and - C. Scott Ananian <cananian@alumni.princeton.edu> - -- added devfs support to the joystick driver - -- loop driver now has it's own subdirectory "/dev/loop/" - -- created <devfs_get_flags> and <devfs_set_flags> functions - -- fix problem with SCSI disc compatibility names (sd{a,b,c,d,e,f}) - which assumes ID's start at 0 and increase by 1. Also only create - devfs entries for SCSI disc partitions which actually exist - Show new names in partition check - Thanks to Jakub Jelinek <jj@sunsite.ms.mff.cuni.cz> -=============================================================================== -Changes for patch v14 - -- bug fix in floppy driver: would not compile without - CONFIG_DEVFS_FS='Y' - Thanks to Jurgen Botz <jbotz@nova.botz.org> - -- bug fix in loop driver - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- do not create devfs entries for printers not configured - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- do not create devfs entries for serial ports not present - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- ensure <tty_register_devfs> is exported from tty_io.c - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- allow unregistering of devfs symlink entries - -- fixed bug in SCSI disc naming introduced in last patch version -=============================================================================== -Changes for patch v15 - -- ported to kernel 2.1.81 -=============================================================================== -Changes for patch v16 - -- created <devfs_set_symlink_destination> function - -- moved DEVFS_SUPER_MAGIC into header file - -- added DEVFS_FL_HIDE flag - -- created <devfs_get_maj_min> - -- created <devfs_get_handle_from_inode> - -- fixed bugs in searching by major&minor - -- changed interface to <devfs_unregister>, <devfs_fill_file> and - <devfs_find_handle> - -- fixed inode times when symlink created with symlink(2) - -- change tty driver to do auto-creation of devfs entries - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to - devfs - -- updated libc 5.4.43 patch for ttyname() -=============================================================================== -Changes for patch v17 - -- added CONFIG_DEVFS_TTY_COMPAT - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- bugfix in devfs support for drivers/char/lp.c - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- clean up serial driver so that PCMCIA devices unregister correctly - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to - devfs [was missing in patch v16] - -- updated libc 5.4.43 patch for ttyname() [was missing in patch v16] - -- all SCSI devices now registered in /dev/sg - -- support removal of devfs entries via unlink(2) -=============================================================================== -Changes for patch v18 - -- added floppy/?u720 floppy entry - -- fixed kerneld support for entries in devfs subdirectories - -- incorporated latest patch for ttyname() in libc 5.4.43 from H.J. Lu. -=============================================================================== -Changes for patch v19 - -- bug fix when looking up unregistered entries: kerneld was not called - -- fixes for kernel 2.1.86 (now requires 2.1.86) -=============================================================================== -Changes for patch v20 - -- only create available floppy entries - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- new IDE naming scheme following SCSI format (i.e. /dev/id/c0b0t0u0p1 - instead of /dev/hda1) - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- new XT disc naming scheme following SCSI format (i.e. /dev/xd/c0t0p1 - instead of /dev/xda1) - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- new non-standard CD-ROM names (i.e. /dev/sbp/c#t#) - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- allow symlink traversal when mounting the root filesystem - -- Create entries for MD devices at MD init - Thanks to Christophe Leroy <christophe.leroy5@capway.com> -=============================================================================== -Changes for patch v21 - -- ported to kernel 2.1.91 -=============================================================================== -Changes for patch v22 - -- SCSI host number patch ("scsihosts=" kernel option) - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> -=============================================================================== -Changes for patch v23 - -- Fixed persistence bug with device numbers for manually created - device files - -- Fixed problem with recreating symlinks with different content - -- Added CONFIG_DEVFS_MOUNT (mount devfs on /dev at boot time) -=============================================================================== -Changes for patch v24 - -- Switched from CONFIG_KERNELD to CONFIG_KMOD: module autoloading - should now work again - -- Hide entries which are manually unlinked - -- Always invalidate devfs dentry cache when registering entries - -- Support removal of devfs directories via rmdir(2) - -- Ensure directories created by <devfs_mk_dir> are visible - -- Default no access for "other" for floppy device -=============================================================================== -Changes for patch v25 - -- Updates to CREDITS file and minor IDE numbering change - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- Invalidate devfs dentry cache when making directories - -- Invalidate devfs dentry cache when removing entries - -- More informative message if root FS mount fails when devfs - configured - -- Fixed persistence bug with fifos -=============================================================================== -Changes for patch v26 - -- ported to kernel 2.1.97 - -- Changed serial directory from "/dev/serial" to "/dev/tts" and - "/dev/consoles" to "/dev/vc" to be more friendly to new procps -=============================================================================== -Changes for patch v27 - -- Added support for IDE4 and IDE5 - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- Documented "scsihosts=" boot parameter - -- Print process command when debugging kerneld/kmod - -- Added debugging for register/unregister/change operations - -- Added "devfs=" boot options - -- Hide unregistered entries by default -=============================================================================== -Changes for patch v28 - -- No longer lock/unlock superblock in <devfs_put_super> (cope with - recent VFS interface change) - -- Do not automatically change ownership/protection of /dev/tty - -- Drop negative dentries when they are released - -- Manage dcache more efficiently -=============================================================================== -Changes for patch v29 - -- Added DEVFS_FL_AUTO_DEVNUM flag -=============================================================================== -Changes for patch v30 - -- No longer set unnecessary methods - -- Ported to kernel 2.1.99-pre3 -=============================================================================== -Changes for patch v31 - -- Added PID display to <call_kerneld> debugging message - -- Added "diread" and "diwrite" options - -- Ported to kernel 2.1.102 - -- Fixed persistence problem with permissions -=============================================================================== -Changes for patch v32 - -- Fixed devfs support in drivers/block/md.c -=============================================================================== -Changes for patch v33 - -- Support legacy device nodes - -- Fixed bug where recreated inodes were hidden - -- New IDE naming scheme: everything is under /dev/ide -=============================================================================== -Changes for patch v34 - -- Improved debugging in <get_vfs_inode> - -- Prevent duplicate calls to <devfs_mk_dir> in SCSI layer - -- No longer free old dentries in <devfs_mk_dir> - -- Free all dentries for a given entry when deleting inodes -=============================================================================== -Changes for patch v35 - -- Ported to kernel 2.1.105 (sound driver changes) -=============================================================================== -Changes for patch v36 - -- Fixed sound driver port -=============================================================================== -Changes for patch v37 - -- Minor documentation tweaks -=============================================================================== -Changes for patch v38 - -- More documentation tweaks - -- Fix for sound driver port - -- Removed ttyname-patch (grab libc 5.4.44 instead) - -- Ported to kernel 2.1.107-pre2 (loop driver fix) -=============================================================================== -Changes for patch v39 - -- Ported to kernel 2.1.107 (hd.c hunk broke due to spelling "fixes"). Sigh - -- Removed many #ifdef's, replaced with trickery in include/devfs_fs.h -=============================================================================== -Changes for patch v40 - -- Fix for sound driver port - -- Limit auto-device numbering to majors 128 to 239 -=============================================================================== -Changes for patch v41 - -- Fixed inode times persistence problem -=============================================================================== -Changes for patch v42 - -- Ported to kernel 2.1.108 (drivers/scsi/hosts.c hunk broke) -=============================================================================== -Changes for patch v43 - -- Fixed spelling in <devfs_readlink> debug - -- Fixed bug in <devfs_setup> parsing "dilookup" - -- More #ifdef's removed - -- Supported Sparc keyboard (/dev/kbd) - -- Supported DSP56001 digital signal processor (/dev/dsp56k) - -- Supported Apple Desktop Bus (/dev/adb) - -- Supported Coda network file system (/dev/cfs*) -=============================================================================== -Changes for patch v44 - -- Fixed devfs inode leak when manually recreating inodes - -- Fixed permission persistence problem when recreating inodes -=============================================================================== -Changes for patch v45 - -- Ported to kernel 2.1.110 -=============================================================================== -Changes for patch v46 - -- Ported to kernel 2.1.112-pre1 - -- Removed harmless "unused variable" compiler warning - -- Fixed modes for manually recreated device nodes -=============================================================================== -Changes for patch v47 - -- Added NULL devfs inode warning in <devfs_read_inode> - -- Force all inode nlink values to 1 -=============================================================================== -Changes for patch v48 - -- Added "dimknod" option - -- Set inode nlink to 0 when freeing dentries - -- Added support for virtual console capture devices (/dev/vcs*) - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Fixed modes for manually recreated symlinks -=============================================================================== -Changes for patch v49 - -- Ported to kernel 2.1.113 -=============================================================================== -Changes for patch v50 - -- Fixed bugs in recreated directories and symlinks -=============================================================================== -Changes for patch v51 - -- Improved robustness of rc.devfs script - Thanks to Roderich Schupp <rsch@experteam.de> - -- Fixed bugs in recreated device nodes - -- Fixed bug in currently unused <devfs_get_handle_from_inode> - -- Defined new <devfs_handle_t> type - -- Improved debugging when getting entries - -- Fixed bug where directories could be emptied - -- Ported to kernel 2.1.115 -=============================================================================== -Changes for patch v52 - -- Replaced dummy .epoch inode with .devfsd character device - -- Modified rc.devfs to take account of above change - -- Removed spurious driver warning messages when CONFIG_DEVFS_FS=n - -- Implemented devfsd protocol revision 0 -=============================================================================== -Changes for patch v53 - -- Ported to kernel 2.1.116 (kmod change broke hunk) - -- Updated Documentation/Configure.help - -- Test and tty pattern patch for rc.devfs script - Thanks to Roderich Schupp <rsch@experteam.de> - -- Added soothing message to warning in <devfs_d_iput> -=============================================================================== -Changes for patch v54 - -- Ported to kernel 2.1.117 - -- Fixed default permissions in sound driver - -- Added support for frame buffer devices (/dev/fb*) -=============================================================================== -Changes for patch v55 - -- Ported to kernel 2.1.119 - -- Use GCC extensions for structure initialisations - -- Implemented async open notification - -- Incremented devfsd protocol revision to 1 -=============================================================================== -Changes for patch v56 - -- Ported to kernel 2.1.120-pre3 - -- Moved async open notification to end of <devfs_open> -=============================================================================== -Changes for patch v57 - -- Ported to kernel 2.1.121 - -- Prepended "/dev/" to module load request - -- Renamed <call_kerneld> to <call_kmod> - -- Created sample modules.conf file -=============================================================================== -Changes for patch v58 - -- Fixed typo "AYSNC" -> "ASYNC" -=============================================================================== -Changes for patch v59 - -- Added open flag for files -=============================================================================== -Changes for patch v60 - -- Ported to kernel 2.1.123-pre2 -=============================================================================== -Changes for patch v61 - -- Set i_blocks=0 and i_blksize=1024 in <devfs_read_inode> -=============================================================================== -Changes for patch v62 - -- Ported to kernel 2.1.123 -=============================================================================== -Changes for patch v63 - -- Ported to kernel 2.1.124-pre2 -=============================================================================== -Changes for patch v64 - -- Fixed Unix98 pty support - -- Increased buffer size in <get_partition_list> to avoid crash and - burn -=============================================================================== -Changes for patch v65 - -- More Unix98 pty support fixes - -- Added test for empty <<name>> in <devfs_find_handle> - -- Renamed <generate_path> to <devfs_generate_path> and published - -- Created /dev/root symlink - Thanks to Roderich Schupp <rsch@ExperTeam.de> - with further modifications by me -=============================================================================== -Changes for patch v66 - -- Yet more Unix98 pty support fixes (now tested) - -- Created <devfs_get_fops> - -- Support media change checks when CONFIG_DEVFS_ONLY=y - -- Abolished Unix98-style PTY names for old PTY devices -=============================================================================== -Changes for patch v67 - -- Added inline declaration for dummy <devfs_generate_path> - -- Removed spurious "unable to register... in devfs" messages when - CONFIG_DEVFS_FS=n - -- Fixed misc. devices when CONFIG_DEVFS_FS=n - -- Limit auto-device numbering to majors 144 to 239 -=============================================================================== -Changes for patch v68 - -- Hide unopened virtual consoles from directory listings - -- Added support for video capture devices - -- Ported to kernel 2.1.125 -=============================================================================== -Changes for patch v69 - -- Fix for CONFIG_VT=n -=============================================================================== -Changes for patch v70 - -- Added support for non-OSS/Free sound cards -=============================================================================== -Changes for patch v71 - -- Ported to kernel 2.1.126-pre2 -=============================================================================== -Changes for patch v72 - -- #ifdef's for CONFIG_DEVFS_DISABLE_OLD_NAMES removed -=============================================================================== -Changes for patch v73 - -- CONFIG_DEVFS_DISABLE_OLD_NAMES replaced with "nocompat" boot option - -- CONFIG_DEVFS_BOOT_OPTIONS removed: boot options always available -=============================================================================== -Changes for patch v74 - -- Removed CONFIG_DEVFS_MOUNT and "mount" boot option and replaced with - "nomount" boot option - -- Documentation updates - -- Updated sample modules.conf -=============================================================================== -Changes for patch v75 - -- Updated sample modules.conf - -- Remount devfs after initrd finishes - -- Ported to kernel 2.1.127 - -- Added support for ISDN - Thanks to Christophe Leroy <christophe.leroy5@capway.com> -=============================================================================== -Changes for patch v76 - -- Updated an email address in ChangeLog - -- CONFIG_DEVFS_ONLY replaced with "only" boot option -=============================================================================== -Changes for patch v77 - -- Added DEVFS_FL_REMOVABLE flag - -- Check for disc change when listing directories with removable media - devices - -- Use DEVFS_FL_REMOVABLE in sd.c - -- Ported to kernel 2.1.128 -=============================================================================== -Changes for patch v78 - -- Only call <scan_dir_for_removable> on first call to <devfs_readdir> - -- Ported to kernel 2.1.129-pre5 - -- ISDN support improvements - Thanks to Christophe Leroy <christophe.leroy5@capway.com> -=============================================================================== -Changes for patch v79 - -- Ported to kernel 2.1.130 - -- Renamed miscdevice "apm" to "apm_bios" to be consistent with - devices.txt -=============================================================================== -Changes for patch v80 - -- Ported to kernel 2.1.131 - -- Updated <devfs_rmdir> for VFS change in 2.1.131 -=============================================================================== -Changes for patch v81 - -- Fixed permissions on /dev/ptmx -=============================================================================== -Changes for patch v82 - -- Ported to kernel 2.1.132-pre4 - -- Changed initial permissions on /dev/pts/* - -- Created <devfs_mk_compat> - -- Added "symlinks" boot option - -- Changed devfs_register_blkdev() back to register_blkdev() for IDE - -- Check for partitions on removable media in <devfs_lookup> -=============================================================================== -Changes for patch v83 - -- Fixed support for ramdisc when using string-based root FS name - -- Ported to kernel 2.2.0-pre1 -=============================================================================== -Changes for patch v84 - -- Ported to kernel 2.2.0-pre7 -=============================================================================== -Changes for patch v85 - -- Compile fixes for driver/sound/sound_common.c (non-module) and - drivers/isdn/isdn_common.c - Thanks to Christophe Leroy <christophe.leroy5@capway.com> - -- Added support for registering regular files - -- Created <devfs_set_file_size> - -- Added /dev/cpu/mtrr as an alternative interface to /proc/mtrr - -- Update devfs inodes from entries if not changed through FS -=============================================================================== -Changes for patch v86 - -- Ported to kernel 2.2.0-pre9 -=============================================================================== -Changes for patch v87 - -- Fixed bug when mounting non-devfs devices in a devfs -=============================================================================== -Changes for patch v88 - -- Fixed <devfs_fill_file> to only initialise temporary inodes - -- Trap for NULL fops in <devfs_register> - -- Return -ENODEV in <devfs_fill_file> for non-driver inodes - -- Fixed bug when unswapping non-devfs devices in a devfs -=============================================================================== -Changes for patch v89 - -- Switched to C data types in include/linux/devfs_fs.h - -- Switched from PATH_MAX to DEVFS_PATHLEN - -- Updated Documentation/filesystems/devfs/modules.conf to take account - of reverse scanning (!) by modprobe - -- Ported to kernel 2.2.0 -=============================================================================== -Changes for patch v90 - -- CONFIG_DEVFS_DISABLE_OLD_TTY_NAMES replaced with "nottycompat" boot - option - -- CONFIG_DEVFS_TTY_COMPAT removed: existing "symlinks" boot option now - controls this. This means you must have libc 5.4.44 or later, or a - recent version of libc 6 if you use the "symlinks" option -=============================================================================== -Changes for patch v91 - -- Switch from <devfs_mk_symlink> to <devfs_mk_compat> in - drivers/char/vc_screen.c to fix problems with Midnight Commander -=============================================================================== -Changes for patch v92 - -- Ported to kernel 2.2.2-pre5 -=============================================================================== -Changes for patch v93 - -- Modified <sd_name> in drivers/scsi/sd.c to cope with devices that - don't exist (which happens with new RAID autostart code printk()s) -=============================================================================== -Changes for patch v94 - -- Fixed bug in joystick driver: only first joystick was registered -=============================================================================== -Changes for patch v95 - -- Fixed another bug in joystick driver - -- Fixed <devfsd_read> to not overrun event buffer -=============================================================================== -Changes for patch v96 - -- Ported to kernel 2.2.5-2 - -- Created <devfs_auto_unregister> - -- Fixed bugs: compatibility entries were not unregistered for: - loop driver - floppy driver - RAMDISC driver - IDE tape driver - SCSI CD-ROM driver - SCSI HDD driver -=============================================================================== -Changes for patch v97 - -- Fixed bugs: compatibility entries were not unregistered for: - ALSA sound driver - partitions in generic disc driver - -- Don't return unregistred entries in <devfs_find_handle> - -- Panic in <devfs_unregister> if entry unregistered - -- Don't panic in <devfs_auto_unregister> for duplicates -=============================================================================== -Changes for patch v98 - -- Don't unregister already unregistered entries in <unregister> - -- Register entry in <sd_detect> - -- Unregister entry in <sd_detach> - -- Changed to <devfs_*register_chrdev> in drivers/char/tty_io.c - -- Ported to kernel 2.2.7 -=============================================================================== -Changes for patch v99 - -- Ported to kernel 2.2.8 - -- Fixed bug in drivers/scsi/sd.c when >16 SCSI discs - -- Disable warning messages when unable to read partition table for - removable media -=============================================================================== -Changes for patch v100 - -- Ported to kernel 2.3.1-pre5 - -- Added "oops-on-panic" boot option - -- Improved debugging in <devfs_register> and <devfs_unregister> - -- Register entry in <sr_detect> - -- Unregister entry in <sr_detach> - -- Register entry in <sg_detect> - -- Unregister entry in <sg_detach> - -- Added support for ALSA drivers -=============================================================================== -Changes for patch v101 - -- Ported to kernel 2.3.2 -=============================================================================== -Changes for patch v102 - -- Update serial driver to register PCMCIA entries - Thanks to Roch-Alexandre Nomine-Beguin <roch@samarkand.infini.fr> - -- Updated an email address in ChangeLog - -- Hide virtual console capture entries from directory listings when - corresponding console device is not open -=============================================================================== -Changes for patch v103 - -- Ported to kernel 2.3.3 -=============================================================================== -Changes for patch v104 - -- Added documentation for some functions - -- Added "doc" target to fs/devfs/Makefile - -- Added "v4l" directory for video4linux devices - -- Replaced call to <devfs_unregister> in <sd_detach> with call to - <devfs_register_partitions> - -- Moved registration for sr and sg drivers from detect() to attach() - methods - -- Register entries in <st_attach> and unregister in <st_detach> - -- Work around IDE driver treating CD-ROM as gendisk - -- Use <sed> instead of <tr> in rc.devfs - -- Updated ToDo list - -- Removed "oops-on-panic" boot option: now always Oops -=============================================================================== -Changes for patch v105 - -- Unregister SCSI host from <scsi_host_no_list> in <scsi_unregister> - Thanks to Zoltán Böszörményi <zboszor@mail.externet.hu> - -- Don't save /dev/log in rc.devfs - -- Ported to kernel 2.3.4-pre1 -=============================================================================== -Changes for patch v106 - -- Fixed silly typo in drivers/scsi/st.c - -- Improved debugging in <devfs_register> -=============================================================================== -Changes for patch v107 - -- Added "diunlink" and "nokmod" boot options - -- Removed superfluous warning message in <devfs_d_iput> -=============================================================================== -Changes for patch v108 - -- Remove entries when unloading sound module -=============================================================================== -Changes for patch v109 - -- Ported to kernel 2.3.6-pre2 -=============================================================================== -Changes for patch v110 - -- Took account of change to <d_alloc_root> -=============================================================================== -Changes for patch v111 - -- Created separate event queue for each mounted devfs - -- Removed <devfs_invalidate_dcache> - -- Created new ioctl()s for devfsd - -- Incremented devfsd protocol revision to 3 - -- Fixed bug when re-creating directories: contents were lost - -- Block access to inodes until devfsd updates permissions -=============================================================================== -Changes for patch v112 - -- Modified patch so it applies against 2.3.5 and 2.3.6 - -- Updated an email address in ChangeLog - -- Do not automatically change ownership/protection of /dev/tty<n> - -- Updated sample modules.conf - -- Switched to sending process uid/gid to devfsd - -- Renamed <call_kmod> to <try_modload> - -- Added DEVFSD_NOTIFY_LOOKUP event - -- Added DEVFSD_NOTIFY_CHANGE event - -- Added DEVFSD_NOTIFY_CREATE event - -- Incremented devfsd protocol revision to 4 - -- Moved kernel-specific stuff to include/linux/devfs_fs_kernel.h -=============================================================================== -Changes for patch v113 - -- Ported to kernel 2.3.9 - -- Restricted permissions on some block devices -=============================================================================== -Changes for patch v114 - -- Added support for /dev/netlink - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Return EISDIR rather than EINVAL for read(2) on directories - -- Ported to kernel 2.3.10 -=============================================================================== -Changes for patch v115 - -- Added support for all remaining character devices - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Cleaned up netlink support -=============================================================================== -Changes for patch v116 - -- Added support for /dev/parport%d - Thanks to Tim Waugh <tim@cyberelk.demon.co.uk> - -- Fixed parallel port ATAPI tape driver - -- Fixed Atari SLM laser printer driver -=============================================================================== -Changes for patch v117 - -- Added support for COSA card - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Fixed drivers/char/ppdev.c: missing #include <linux/init.h> - -- Fixed drivers/char/ftape/zftape/zftape-init.c - Thanks to Vladimir Popov <mashgrad@usa.net> -=============================================================================== -Changes for patch v118 - -- Ported to kernel 2.3.15-pre3 - -- Fixed bug in loop driver - -- Unregister /dev/lp%d entries in drivers/char/lp.c - Thanks to Maciej W. Rozycki <macro@ds2.pg.gda.pl> -=============================================================================== -Changes for patch v119 - -- Ported to kernel 2.3.16 -=============================================================================== -Changes for patch v120 - -- Fixed bug in drivers/scsi/scsi.c - -- Added /dev/ppp - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Ported to kernel 2.3.17 -=============================================================================== -Changes for patch v121 - -- Fixed bug in drivers/block/loop.c - -- Ported to kernel 2.3.18 -=============================================================================== -Changes for patch v122 - -- Ported to kernel 2.3.19 -=============================================================================== -Changes for patch v123 - -- Ported to kernel 2.3.20 -=============================================================================== -Changes for patch v124 - -- Ported to kernel 2.3.21 -=============================================================================== -Changes for patch v125 - -- Created <devfs_get_info>, <devfs_set_info>, - <devfs_get_first_child> and <devfs_get_next_sibling> - Added <<dir>> parameter to <devfs_register>, <devfs_mk_compat>, - <devfs_mk_dir> and <devfs_find_handle> - Work sponsored by SGI - -- Fixed apparent bug in COSA driver - -- Re-instated "scsihosts=" boot option -=============================================================================== -Changes for patch v126 - -- Always create /dev/pts if CONFIG_UNIX98_PTYS=y - -- Fixed call to <devfs_mk_dir> in drivers/block/ide-disk.c - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Allow multiple unregistrations - -- Created /dev/scsi hierarchy - Work sponsored by SGI -=============================================================================== -Changes for patch v127 - -Work sponsored by SGI - -- No longer disable devpts if devfs enabled (caveat emptor) - -- Added flags array to struct gendisk and removed code from - drivers/scsi/sd.c - -- Created /dev/discs hierarchy -=============================================================================== -Changes for patch v128 - -Work sponsored by SGI - -- Created /dev/cdroms hierarchy -=============================================================================== -Changes for patch v129 - -Work sponsored by SGI - -- Removed compatibility entries for sound devices - -- Removed compatibility entries for printer devices - -- Removed compatibility entries for video4linux devices - -- Removed compatibility entries for parallel port devices - -- Removed compatibility entries for frame buffer devices -=============================================================================== -Changes for patch v130 - -Work sponsored by SGI - -- Added major and minor number to devfsd protocol - -- Incremented devfsd protocol revision to 5 - -- Removed compatibility entries for SoundBlaster CD-ROMs - -- Removed compatibility entries for netlink devices - -- Removed compatibility entries for SCSI generic devices - -- Removed compatibility entries for SCSI tape devices -=============================================================================== -Changes for patch v131 - -Work sponsored by SGI - -- Support info pointer for all devfs entry types - -- Added <<info>> parameter to <devfs_mk_dir> and <devfs_mk_symlink> - -- Removed /dev/st hierarchy - -- Removed /dev/sg hierarchy - -- Removed compatibility entries for loop devices - -- Removed compatibility entries for IDE tape devices - -- Removed compatibility entries for SCSI CD-ROMs - -- Removed /dev/sr hierarchy -=============================================================================== -Changes for patch v132 - -Work sponsored by SGI - -- Removed compatibility entries for floppy devices - -- Removed compatibility entries for RAMDISCs - -- Removed compatibility entries for meta-devices - -- Removed compatibility entries for SCSI discs - -- Created <devfs_make_root> - -- Removed /dev/sd hierarchy - -- Support "../" when searching devfs namespace - -- Created /dev/ide/host* hierarchy - -- Supported IDE hard discs in /dev/ide/host* hierarchy - -- Removed compatibility entries for IDE discs - -- Removed /dev/ide/hd hierarchy - -- Supported IDE CD-ROMs in /dev/ide/host* hierarchy - -- Removed compatibility entries for IDE CD-ROMs - -- Removed /dev/ide/cd hierarchy -=============================================================================== -Changes for patch v133 - -Work sponsored by SGI - -- Created <devfs_get_unregister_slave> - -- Fixed bug in fs/partitions/check.c when rescanning -=============================================================================== -Changes for patch v134 - -Work sponsored by SGI - -- Removed /dev/sd, /dev/sr, /dev/st and /dev/sg directories - -- Removed /dev/ide/hd directory - -- Exported <devfs_get_parent> - -- Created <devfs_register_tape> and /dev/tapes hierarchy - -- Removed /dev/ide/mt hierarchy - -- Removed /dev/ide/fd hierarchy - -- Ported to kernel 2.3.25 -=============================================================================== -Changes for patch v135 - -Work sponsored by SGI - -- Removed compatibility entries for virtual console capture devices - -- Removed unused <devfs_set_symlink_destination> - -- Removed compatibility entries for serial devices - -- Removed compatibility entries for console devices - -- Do not hide entries from devfsd or children - -- Removed DEVFS_FL_TTY_COMPAT flag - -- Removed "nottycompat" boot option - -- Removed <devfs_mk_compat> -=============================================================================== -Changes for patch v136 - -Work sponsored by SGI - -- Moved BSD pty devices to /dev/pty - -- Added DEVFS_FL_WAIT flag -=============================================================================== -Changes for patch v137 - -Work sponsored by SGI - -- Really fixed bug in fs/partitions/check.c when rescanning - -- Support new "disc" naming scheme in <get_removable_partition> - -- Allow NULL fops in <devfs_register> - -- Removed redundant name functions in SCSI disc and IDE drivers -=============================================================================== -Changes for patch v138 - -Work sponsored by SGI - -- Fixed old bugs in drivers/block/paride/pt.c, drivers/char/tpqic02.c, - drivers/net/wan/cosa.c and drivers/scsi/scsi.c - Thanks to Sergey Kubushin <ksi@ksi-linux.com> - -- Fall back to major table if NULL fops given to <devfs_register> -=============================================================================== -Changes for patch v139 - -Work sponsored by SGI - -- Corrected and moved <get_blkfops> and <get_chrfops> declarations - from arch/alpha/kernel/osf_sys.c to include/linux/fs.h - -- Removed name function from struct gendisk - -- Updated devfs FAQ -=============================================================================== -Changes for patch v140 - -Work sponsored by SGI - -- Ported to kernel 2.3.27 -=============================================================================== -Changes for patch v141 - -Work sponsored by SGI - -- Bug fix in arch/m68k/atari/joystick.c - -- Moved ISDN and capi devices to /dev/isdn -=============================================================================== -Changes for patch v142 - -Work sponsored by SGI - -- Bug fix in drivers/block/ide-probe.c (patch confusion) -=============================================================================== -Changes for patch v143 - -Work sponsored by SGI - -- Bug fix in drivers/block/blkpg.c:partition_name() -=============================================================================== -Changes for patch v144 - -Work sponsored by SGI - -- Ported to kernel 2.3.29 - -- Removed calls to <devfs_register> from cdu31a, cm206, mcd and mcdx - CD-ROM drivers: generic driver handles this now - -- Moved joystick devices to /dev/joysticks -=============================================================================== -Changes for patch v145 - -Work sponsored by SGI - -- Ported to kernel 2.3.30-pre3 - -- Register whole-disc entry even for invalid partition tables - -- Fixed bug in mounting root FS when initrd enabled - -- Fixed device entry leak with IDE CD-ROMs - -- Fixed compile problem with drivers/isdn/isdn_common.c - -- Moved COSA devices to /dev/cosa - -- Support fifos when unregistering - -- Created <devfs_register_series> and used in many drivers - -- Moved Coda devices to /dev/coda - -- Moved parallel port IDE tapes to /dev/pt - -- Moved parallel port IDE generic devices to /dev/pg -=============================================================================== -Changes for patch v146 - -Work sponsored by SGI - -- Removed obsolete DEVFS_FL_COMPAT and DEVFS_FL_TOLERANT flags - -- Fixed compile problem with fs/coda/psdev.c - -- Reinstate change to <devfs_register_blkdev> in - drivers/block/ide-probe.c now that fs/isofs/inode.c is fixed - -- Switched to <devfs_register_blkdev> in drivers/block/floppy.c, - drivers/scsi/sr.c and drivers/block/md.c - -- Moved DAC960 devices to /dev/dac960 -=============================================================================== -Changes for patch v147 - -Work sponsored by SGI - -- Ported to kernel 2.3.32-pre4 -=============================================================================== -Changes for patch v148 - -Work sponsored by SGI - -- Removed kmod support: use devfsd instead - -- Moved miscellaneous character devices to /dev/misc -=============================================================================== -Changes for patch v149 - -Work sponsored by SGI - -- Ensure include/linux/joystick.h is OK for user-space - -- Improved debugging in <get_vfs_inode> - -- Ensure dentries created by devfsd will be cleaned up -=============================================================================== -Changes for patch v150 - -Work sponsored by SGI - -- Ported to kernel 2.3.34 -=============================================================================== -Changes for patch v151 - -Work sponsored by SGI - -- Ported to kernel 2.3.35-pre1 - -- Created <devfs_get_name> -=============================================================================== -Changes for patch v152 - -Work sponsored by SGI - -- Updated sample modules.conf - -- Ported to kernel 2.3.36-pre1 -=============================================================================== -Changes for patch v153 - -Work sponsored by SGI - -- Ported to kernel 2.3.42 - -- Removed <devfs_fill_file> -=============================================================================== -Changes for patch v154 - -Work sponsored by SGI - -- Took account of device number changes for /dev/fb* -=============================================================================== -Changes for patch v155 - -Work sponsored by SGI - -- Ported to kernel 2.3.43-pre8 - -- Moved /dev/tty0 to /dev/vc/0 - -- Moved sequence number formatting from <_tty_make_name> to drivers -=============================================================================== -Changes for patch v156 - -Work sponsored by SGI - -- Fixed breakage in drivers/scsi/sd.c due to recent SCSI changes -=============================================================================== -Changes for patch v157 - -Work sponsored by SGI - -- Ported to kernel 2.3.45 -=============================================================================== -Changes for patch v158 - -Work sponsored by SGI - -- Ported to kernel 2.3.46-pre2 -=============================================================================== -Changes for patch v159 - -Work sponsored by SGI - -- Fixed drivers/block/md.c - Thanks to Mike Galbraith <mikeg@weiden.de> - -- Documentation fixes - -- Moved device registration from <lp_init> to <lp_register> - Thanks to Tim Waugh <twaugh@redhat.com> -=============================================================================== -Changes for patch v160 - -Work sponsored by SGI - -- Fixed drivers/char/joystick/joystick.c - Thanks to Vojtech Pavlik <vojtech@suse.cz> - -- Documentation updates - -- Fixed arch/i386/kernel/mtrr.c if procfs and devfs not enabled - -- Fixed drivers/char/stallion.c -=============================================================================== -Changes for patch v161 - -Work sponsored by SGI - -- Remove /dev/ide when ide-mod is unloaded - -- Fixed bug in drivers/block/ide-probe.c when secondary but no primary - -- Added DEVFS_FL_NO_PERSISTENCE flag - -- Used new DEVFS_FL_NO_PERSISTENCE flag for Unix98 pty slaves - -- Removed unnecessary call to <update_devfs_inode_from_entry> in - <devfs_readdir> - -- Only set auto-ownership for /dev/pty/s* -=============================================================================== -Changes for patch v162 - -Work sponsored by SGI - -- Set inode->i_size to correct size for symlinks - Thanks to Jeremy Fitzhardinge <jeremy@goop.org> - -- Only give lookup() method to directories to comply with new VFS - assumptions - -- Remove unnecessary tests in symlink methods - -- Don't kill existing block ops in <devfs_read_inode> - -- Restore auto-ownership for /dev/pty/m* -=============================================================================== -Changes for patch v163 - -Work sponsored by SGI - -- Don't create missing directories in <devfs_find_handle> - -- Removed Documentation/filesystems/devfs/mk-devlinks - -- Updated Documentation/filesystems/devfs/README -=============================================================================== -Changes for patch v164 - -Work sponsored by SGI - -- Fixed CONFIG_DEVFS breakage in drivers/char/serial.c introduced in - linux-2.3.99-pre6-7 -=============================================================================== -Changes for patch v165 - -Work sponsored by SGI - -- Ported to kernel 2.3.99-pre6 -=============================================================================== -Changes for patch v166 - -Work sponsored by SGI - -- Added CONFIG_DEVFS_MOUNT -=============================================================================== -Changes for patch v167 - -Work sponsored by SGI - -- Updated Documentation/filesystems/devfs/README - -- Updated sample modules.conf -=============================================================================== -Changes for patch v168 - -Work sponsored by SGI - -- Disabled multi-mount capability (use VFS bindings instead) - -- Updated README from master HTML file -=============================================================================== -Changes for patch v169 - -Work sponsored by SGI - -- Removed multi-mount code - -- Removed compatibility macros: VFS has changed too much -=============================================================================== -Changes for patch v170 - -Work sponsored by SGI - -- Updated README from master HTML file - -- Merged devfs inode into devfs entry -=============================================================================== -Changes for patch v171 - -Work sponsored by SGI - -- Updated sample modules.conf - -- Removed dead code in <devfs_register> which used to call - <free_dentries> - -- Ported to kernel 2.4.0-test2-pre3 -=============================================================================== -Changes for patch v172 - -Work sponsored by SGI - -- Changed interface to <devfs_register> - -- Changed interface to <devfs_register_series> -=============================================================================== -Changes for patch v173 - -Work sponsored by SGI - -- Simplified interface to <devfs_mk_symlink> - -- Simplified interface to <devfs_mk_dir> - -- Simplified interface to <devfs_find_handle> -=============================================================================== -Changes for patch v174 - -Work sponsored by SGI - -- Updated README from master HTML file -=============================================================================== -Changes for patch v175 - -Work sponsored by SGI - -- DocBook update for fs/devfs/base.c - Thanks to Tim Waugh <twaugh@redhat.com> - -- Removed stale fs/tunnel.c (was never used or completed) -=============================================================================== -Changes for patch v176 - -Work sponsored by SGI - -- Updated ToDo list - -- Removed sample modules.conf: now distributed with devfsd - -- Updated README from master HTML file - -- Ported to kernel 2.4.0-test3-pre4 (which had devfs-patch-v174) -=============================================================================== -Changes for patch v177 - -- Updated README from master HTML file - -- Documentation cleanups - -- Ensure <devfs_generate_path> terminates string for root entry - Thanks to Tim Jansen <tim@tjansen.de> - -- Exported <devfs_get_name> to modules - -- Make <devfs_mk_symlink> send events to devfsd - -- Cleaned up option processing in <devfs_setup> - -- Fixed bugs in handling symlinks: could leak or cause Oops - -- Cleaned up directory handling by separating fops - Thanks to Alexander Viro <viro@parcelfarce.linux.theplanet.co.uk> -=============================================================================== -Changes for patch v178 - -- Fixed handling of inverted options in <devfs_setup> -=============================================================================== -Changes for patch v179 - -- Adjusted <try_modload> to account for <devfs_generate_path> fix -=============================================================================== -Changes for patch v180 - -- Fixed !CONFIG_DEVFS_FS stub declaration of <devfs_get_info> -=============================================================================== -Changes for patch v181 - -- Answered question posed by Al Viro and removed his comments from <devfs_open> - -- Moved setting of registered flag after other fields are changed - -- Fixed race between <devfsd_close> and <devfsd_notify_one> - -- Global VFS changes added bogus BKL to devfsd_close(): removed - -- Widened locking in <devfs_readlink> and <devfs_follow_link> - -- Replaced <devfsd_read> stack usage with <devfsd_ioctl> kmalloc - -- Simplified locking in <devfsd_ioctl> and fixed memory leak -=============================================================================== -Changes for patch v182 - -- Created <devfs_*alloc_major> and <devfs_*alloc_devnum> - -- Removed broken devnum allocation and use <devfs_alloc_devnum> - -- Fixed old devnum leak by calling new <devfs_dealloc_devnum> - -- Created <devfs_*alloc_unique_number> - -- Fixed number leak for /dev/cdroms/cdrom%d - -- Fixed number leak for /dev/discs/disc%d -=============================================================================== -Changes for patch v183 - -- Fixed bug in <devfs_setup> which could hang boot process -=============================================================================== -Changes for patch v184 - -- Documentation typo fix for fs/devfs/util.c - -- Fixed drivers/char/stallion.c for devfs - -- Added DEVFSD_NOTIFY_DELETE event - -- Updated README from master HTML file - -- Removed #include <asm/segment.h> from fs/devfs/base.c -=============================================================================== -Changes for patch v185 - -- Made <block_semaphore> and <char_semaphore> in fs/devfs/util.c - private - -- Fixed inode table races by removing it and using inode->u.generic_ip - instead - -- Moved <devfs_read_inode> into <get_vfs_inode> - -- Moved <devfs_write_inode> into <devfs_notify_change> -=============================================================================== -Changes for patch v186 - -- Fixed race in <devfs_do_symlink> for uni-processor - -- Updated README from master HTML file -=============================================================================== -Changes for patch v187 - -- Fixed drivers/char/stallion.c for devfs - -- Fixed drivers/char/rocket.c for devfs - -- Fixed bug in <devfs_alloc_unique_number>: limited to 128 numbers -=============================================================================== -Changes for patch v188 - -- Updated major masks in fs/devfs/util.c up to Linus' "no new majors" - proclamation. Block: were 126 now 122 free, char: were 26 now 19 free - -- Updated README from master HTML file - -- Removed remnant of multi-mount support in <devfs_mknod> - -- Removed unused DEVFS_FL_SHOW_UNREG flag -=============================================================================== -Changes for patch v189 - -- Removed nlink field from struct devfs_inode - -- Removed auto-ownership for /dev/pty/* (BSD ptys) and used - DEVFS_FL_CURRENT_OWNER|DEVFS_FL_NO_PERSISTENCE for /dev/pty/s* (just - like Unix98 pty slaves) and made /dev/pty/m* rw-rw-rw- access -=============================================================================== -Changes for patch v190 - -- Updated README from master HTML file - -- Replaced BKL with global rwsem to protect symlink data (quick and - dirty hack) -=============================================================================== -Changes for patch v191 - -- Replaced global rwsem for symlink with per-link refcount -=============================================================================== -Changes for patch v192 - -- Removed unnecessary #ifdef CONFIG_DEVFS_FS from arch/i386/kernel/mtrr.c - -- Ported to kernel 2.4.10-pre11 - -- Set inode->i_mapping->a_ops for block nodes in <get_vfs_inode> -=============================================================================== -Changes for patch v193 - -- Went back to global rwsem for symlinks (refcount scheme no good) -=============================================================================== -Changes for patch v194 - -- Fixed overrun in <devfs_link> by removing function (not needed) - -- Updated README from master HTML file -=============================================================================== -Changes for patch v195 - -- Fixed buffer underrun in <try_modload> - -- Moved down_read() from <search_for_entry_in_dir> to <find_entry> -=============================================================================== -Changes for patch v196 - -- Fixed race in <devfsd_ioctl> when setting event mask - Thanks to Kari Hurtta <hurtta@leija.mh.fmi.fi> - -- Avoid deadlock in <devfs_follow_link> by using temporary buffer -=============================================================================== -Changes for patch v197 - -- First release of new locking code for devfs core (v1.0) - -- Fixed bug in drivers/cdrom/cdrom.c -=============================================================================== -Changes for patch v198 - -- Discard temporary buffer, now use "%s" for dentry names - -- Don't generate path in <try_modload>: use fake entry instead - -- Use "existing" directory in <_devfs_make_parent_for_leaf> - -- Use slab cache rather than fixed buffer for devfsd events -=============================================================================== -Changes for patch v199 - -- Removed obsolete usage of DEVFS_FL_NO_PERSISTENCE - -- Send DEVFSD_NOTIFY_REGISTERED events in <devfs_mk_dir> - -- Fixed locking bug in <devfs_d_revalidate_wait> due to typo - -- Do not send CREATE, CHANGE, ASYNC_OPEN or DELETE events from devfsd - or children -=============================================================================== -Changes for patch v200 - -- Ported to kernel 2.5.1-pre2 -=============================================================================== -Changes for patch v201 - -- Fixed bug in <devfsd_read>: was dereferencing freed pointer -=============================================================================== -Changes for patch v202 - -- Fixed bug in <devfsd_close>: was dereferencing freed pointer - -- Added process group check for devfsd privileges -=============================================================================== -Changes for patch v203 - -- Use SLAB_ATOMIC in <devfsd_notify_de> from <devfs_d_delete> -=============================================================================== -Changes for patch v204 - -- Removed long obsolete rc.devfs - -- Return old entry in <devfs_mk_dir> for 2.4.x kernels - -- Updated README from master HTML file - -- Increment refcount on module in <check_disc_changed> - -- Created <devfs_get_handle> and exported <devfs_put> - -- Increment refcount on module in <devfs_get_ops> - -- Created <devfs_put_ops> and used where needed to fix races - -- Added clarifying comments in response to preliminary EMC code review - -- Added poisoning to <devfs_put> - -- Improved debugging messages - -- Fixed unregister bugs in drivers/md/lvm-fs.c -=============================================================================== -Changes for patch v205 - -- Corrected (made useful) debugging message in <unregister> - -- Moved <kmem_cache_create> in <mount_devfs_fs> to <init_devfs_fs> - -- Fixed drivers/md/lvm-fs.c to create "lvm" entry - -- Added magic number to guard against scribbling drivers - -- Only return old entry in <devfs_mk_dir> if a directory - -- Defined macros for error and debug messages - -- Updated README from master HTML file -=============================================================================== -Changes for patch v206 - -- Added support for multiple Compaq cpqarray controllers - -- Fixed (rare, old) race in <devfs_lookup> -=============================================================================== -Changes for patch v207 - -- Fixed deadlock bug in <devfs_d_revalidate_wait> - -- Tag VFS deletable in <devfs_mk_symlink> if handle ignored - -- Updated README from master HTML file -=============================================================================== -Changes for patch v208 - -- Added KERN_* to remaining messages - -- Cleaned up declaration of <stat_read> - -- Updated README from master HTML file -=============================================================================== -Changes for patch v209 - -- Updated README from master HTML file - -- Removed silently introduced calls to lock_kernel() and - unlock_kernel() due to recent VFS locking changes. BKL isn't - required in devfs - -- Changed <devfs_rmdir> to allow later additions if not yet empty - -- Added calls to <devfs_register_partitions> in drivers/block/blkpc.c - <add_partition> and <del_partition> - -- Fixed bug in <devfs_alloc_unique_number>: was clearing beyond - bitfield - -- Fixed bitfield data type for <devfs_*alloc_devnum> - -- Made major bitfield type and initialiser 64 bit safe -=============================================================================== -Changes for patch v210 - -- Updated fs/devfs/util.c to fix shift warning on 64 bit machines - Thanks to Anton Blanchard <anton@samba.org> - -- Updated README from master HTML file -=============================================================================== -Changes for patch v211 - -- Do not put miscellaneous character devices in /dev/misc if they - specify their own directory (i.e. contain a '/' character) - -- Copied macro for error messages from fs/devfs/base.c to - fs/devfs/util.c and made use of this macro - -- Removed 2.4.x compatibility code from fs/devfs/base.c -=============================================================================== -Changes for patch v212 - -- Added BKL to <devfs_open> because drivers still need it -=============================================================================== -Changes for patch v213 - -- Protected <scan_dir_for_removable> and <get_removable_partition> - from changing directory contents -=============================================================================== -Changes for patch v214 - -- Switched to ISO C structure field initialisers - -- Switch to set_current_state() and move before add_wait_queue() - -- Updated README from master HTML file - -- Fixed devfs entry leak in <devfs_readdir> when *readdir fails -=============================================================================== -Changes for patch v215 - -- Created <devfs_find_and_unregister> - -- Switched many functions from <devfs_find_handle> to - <devfs_find_and_unregister> - -- Switched many functions from <devfs_find_handle> to <devfs_get_handle> -=============================================================================== -Changes for patch v216 - -- Switched arch/ia64/sn/io/hcl.c from <devfs_find_handle> to - <devfs_get_handle> - -- Removed deprecated <devfs_find_handle> -=============================================================================== -Changes for patch v217 - -- Exported <devfs_find_and_unregister> and <devfs_only> to modules - -- Updated README from master HTML file - -- Fixed module unload race in <devfs_open> -=============================================================================== -Changes for patch v218 - -- Removed DEVFS_FL_AUTO_OWNER flag - -- Switched lingering structure field initialiser to ISO C - -- Added locking when setting/clearing flags - -- Documentation fix in fs/devfs/util.c diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README deleted file mode 100644 index aabfba24bc2e..000000000000 --- a/Documentation/filesystems/devfs/README +++ /dev/null @@ -1,1959 +0,0 @@ -Devfs (Device File System) FAQ - - -Linux Devfs (Device File System) FAQ -Richard Gooch -20-AUG-2002 - - -Document languages: - - - - - - - ------------------------------------------------------------------------------ - -NOTE: the master copy of this document is available online at: - -http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html -and looks much better than the text version distributed with the -kernel sources. A mirror site is available at: - -http://www.ras.ucalgary.ca/~rgooch/linux/docs/devfs.html - -There is also an optional daemon that may be used with devfs. You can -find out more about it at: - -http://www.atnf.csiro.au/~rgooch/linux/ - -A mailing list is available which you may subscribe to. Send -email -to majordomo@oss.sgi.com with the following line in the -body of the message: -subscribe devfs -To unsubscribe, send the message body: -unsubscribe devfs -instead. The list is archived at - -http://oss.sgi.com/projects/devfs/archive/. - ------------------------------------------------------------------------------ - -Contents - - -What is it? - -Why do it? - -Who else does it? - -How it works - -Operational issues (essential reading) - -Instructions for the impatient -Permissions persistence across reboots -Dealing with drivers without devfs support -All the way with Devfs -Other Issues -Kernel Naming Scheme -Devfsd Naming Scheme -Old Compatibility Names -SCSI Host Probing Issues - - - -Device drivers currently ported - -Allocation of Device Numbers - -Questions and Answers - -Making things work -Alternatives to devfs -What I don't like about devfs -How to report bugs -Strange kernel messages -Compilation problems with devfsd - - -Other resources - -Translations of this document - - ------------------------------------------------------------------------------ - - -What is it? - -Devfs is an alternative to "real" character and block special devices -on your root filesystem. Kernel device drivers can register devices by -name rather than major and minor numbers. These devices will appear in -devfs automatically, with whatever default ownership and -protection the driver specified. A daemon (devfsd) can be used to -override these defaults. Devfs has been in the kernel since 2.3.46. - -NOTE that devfs is entirely optional. If you prefer the old -disc-based device nodes, then simply leave CONFIG_DEVFS_FS=n (the -default). In this case, nothing will change. ALSO NOTE that if you do -enable devfs, the defaults are such that full compatibility is -maintained with the old devices names. - -There are two aspects to devfs: one is the underlying device -namespace, which is a namespace just like any mounted filesystem. The -other aspect is the filesystem code which provides a view of the -device namespace. The reason I make a distinction is because devfs -can be mounted many times, with each mount showing the same device -namespace. Changes made are global to all mounted devfs filesystems. -Also, because the devfs namespace exists without any devfs mounts, you -can easily mount the root filesystem by referring to an entry in the -devfs namespace. - - -The cost of devfs is a small increase in kernel code size and memory -usage. About 7 pages of code (some of that in __init sections) and 72 -bytes for each entry in the namespace. A modest system has only a -couple of hundred device entries, so this costs a few more -pages. Compare this with the suggestion to put /dev on a <a -href="#why-faq-ramdisc">ramdisc. - -On a typical machine, the cost is under 0.2 percent. On a modest -system with 64 MBytes of RAM, the cost is under 0.1 percent. The -accusations of "bloatware" levelled at devfs are not justified. - ------------------------------------------------------------------------------ - - -Why do it? - -There are several problems that devfs addresses. Some of these -problems are more serious than others (depending on your point of -view), and some can be solved without devfs. However, the totality of -these problems really calls out for devfs. - -The choice is a patchwork of inefficient user space solutions, which -are complex and likely to be fragile, or to use a simple and efficient -devfs which is robust. - -There have been many counter-proposals to devfs, all seeking to -provide some of the benefits without actually implementing devfs. So -far there has been an absence of code and no proposed alternative has -been able to provide all the features that devfs does. Further, -alternative proposals require far more complexity in user-space (and -still deliver less functionality than devfs). Some people have the -mantra of reducing "kernel bloat", but don't consider the effects on -user-space. - -A good solution limits the total complexity of kernel-space and -user-space. - - -Major&minor allocation - -The existing scheme requires the allocation of major and minor device -numbers for each and every device. This means that a central -co-ordinating authority is required to issue these device numbers -(unless you're developing a "private" device driver), in order to -preserve uniqueness. Devfs shifts the burden to a namespace. This may -not seem like a huge benefit, but actually it is. Since driver authors -will naturally choose a device name which reflects the functionality -of the device, there is far less potential for namespace conflict. -Solving this requires a kernel change. - -/dev management - -Because you currently access devices through device nodes, these must -be created by the system administrator. For standard devices you can -usually find a MAKEDEV programme which creates all these (hundreds!) -of nodes. This means that changes in the kernel must be reflected by -changes in the MAKEDEV programme, or else the system administrator -creates device nodes by hand. - -The basic problem is that there are two separate databases of -major and minor numbers. One is in the kernel and one is in /dev (or -in a MAKEDEV programme, if you want to look at it that way). This is -duplication of information, which is not good practice. -Solving this requires a kernel change. - -/dev growth - -A typical /dev has over 1200 nodes! Most of these devices simply don't -exist because the hardware is not available. A huge /dev increases the -time to access devices (I'm just referring to the dentry lookup times -and the time taken to read inodes off disc: the next subsection shows -some more horrors). - -An example of how big /dev can grow is if we consider SCSI devices: - -host 6 bits (say up to 64 hosts on a really big machine) -channel 4 bits (say up to 16 SCSI buses per host) -id 4 bits -lun 3 bits -partition 6 bits -TOTAL 23 bits - - -This requires 8 Mega (1024*1024) inodes if we want to store all -possible device nodes. Even if we scrap everything but id,partition -and assume a single host adapter with a single SCSI bus and only one -logical unit per SCSI target (id), that's still 10 bits or 1024 -inodes. Each VFS inode takes around 256 bytes (kernel 2.1.78), so -that's 256 kBytes of inode storage on disc (assuming real inodes take -a similar amount of space as VFS inodes). This is actually not so bad, -because disc is cheap these days. Embedded systems would care about -256 kBytes of /dev inodes, but you could argue that embedded systems -would have hand-tuned /dev directories. I've had to do just that on my -embedded systems, but I would rather just leave it to devfs. - -Another issue is the time taken to lookup an inode when first -referenced. Not only does this take time in scanning through a list in -memory, but also the seek times to read the inodes off disc. -This could be solved in user-space using a clever programme which -scanned the kernel logs and deleted /dev entries which are not -available and created them when they were available. This programme -would need to be run every time a new module was loaded, which would -slow things down a lot. - -There is an existing programme called scsidev which will automatically -create device nodes for SCSI devices. It can do this by scanning files -in /proc/scsi. Unfortunately, to extend this idea to other device -nodes would require significant modifications to existing drivers (so -they too would provide information in /proc). This is a non-trivial -change (I should know: devfs has had to do something similar). Once -you go to this much effort, you may as well use devfs itself (which -also provides this information). Furthermore, such a system would -likely be implemented in an ad-hoc fashion, as different drivers will -provide their information in different ways. - -Devfs is much cleaner, because it (naturally) has a uniform mechanism -to provide this information: the device nodes themselves! - - -Node to driver file_operations translation - -There is an important difference between the way disc-based character -and block nodes and devfs entries make the connection between an entry -in /dev and the actual device driver. - -With the current 8 bit major and minor numbers the connection between -disc-based c&b nodes and per-major drivers is done through a -fixed-length table of 128 entries. The various filesystem types set -the inode operations for c&b nodes to {chr,blk}dev_inode_operations, -so when a device is opened a few quick levels of indirection bring us -to the driver file_operations. - -For miscellaneous character devices a second step is required: there -is a scan for the driver entry with the same minor number as the file -that was opened, and the appropriate minor open method is called. This -scanning is done *every time* you open a device node. Potentially, you -may be searching through dozens of misc. entries before you find your -open method. While not an enormous performance overhead, this does -seem pointless. - -Linux *must* move beyond the 8 bit major and minor barrier, -somehow. If we simply increase each to 16 bits, then the indexing -scheme used for major driver lookup becomes untenable, because the -major tables (one each for character and block devices) would need to -be 64 k entries long (512 kBytes on x86, 1 MByte for 64 bit -systems). So we would have to use a scheme like that used for -miscellaneous character devices, which means the search time goes up -linearly with the average number of major device drivers on your -system. Not all "devices" are hardware, some are higher-level drivers -like KGI, so you can get more "devices" without adding hardware -You can improve this by creating an ordered (balanced:-) -binary tree, in which case your search time becomes log(N). -Alternatively, you can use hashing to speed up the search. -But why do that search at all if you don't have to? Once again, it -seems pointless. - -Note that devfs doesn't use the major&minor system. For devfs -entries, the connection is done when you lookup the /dev entry. When -devfs_register() is called, an internal table is appended which has -the entry name and the file_operations. If the dentry cache doesn't -have the /dev entry already, this internal table is scanned to get the -file_operations, and an inode is created. If the dentry cache already -has the entry, there is *no lookup time* (other than the dentry scan -itself, but we can't avoid that anyway, and besides Linux dentries -cream other OS's which don't have them:-). Furthermore, the number of -node entries in a devfs is only the number of available device -entries, not the number of *conceivable* entries. Even if you remove -unnecessary entries in a disc-based /dev, the number of conceivable -entries remains the same: you just limit yourself in order to save -space. - -Devfs provides a fast connection between a VFS node and the device -driver, in a scalable way. - -/dev as a system administration tool - -Right now /dev contains a list of conceivable devices, most of which I -don't have. Devfs only shows those devices available on my -system. This means that listing /dev is a handy way of checking what -devices are available. - -Major&minor size - -Existing major and minor numbers are limited to 8 bits each. This is -now a limiting factor for some drivers, particularly the SCSI disc -driver, which consumes a single major number. Only 16 discs are -supported, and each disc may have only 15 partitions. Maybe this isn't -a problem for you, but some of us are building huge Linux systems with -disc arrays. With devfs an arbitrary pointer can be associated with -each device entry, which can be used to give an effective 32 bit -device identifier (i.e. that's like having a 32 bit minor -number). Since this is private to the kernel, there are no C library -compatibility issues which you would have with increasing major and -minor number sizes. See the section on "Allocation of Device Numbers" -for details on maintaining compatibility with userspace. - -Solving this requires a kernel change. - -Since writing this, the kernel has been modified so that the SCSI disc -driver has more major numbers allocated to it and now supports up to -128 discs. Since these major numbers are non-contiguous (a result of -unplanned expansion), the implementation is a little more cumbersome -than originally. - -Just like the changes to IPv4 to fix impending limitations in the -address space, people find ways around the limitations. In the long -run, however, solutions like IPv6 or devfs can't be put off forever. - -Read-only root filesystem - -Having your device nodes on the root filesystem means that you can't -operate properly with a read-only root filesystem. This is because you -want to change ownerships and protections of tty devices. Existing -practice prevents you using a CD-ROM as your root filesystem for a -*real* system. Sure, you can boot off a CD-ROM, but you can't change -tty ownerships, so it's only good for installing. - -Also, you can't use a shared NFS root filesystem for a cluster of -discless Linux machines (having tty ownerships changed on a common -/dev is not good). Nor can you embed your root filesystem in a -ROM-FS. - -You can get around this by creating a RAMDISC at boot time, making -an ext2 filesystem in it, mounting it somewhere and copying the -contents of /dev into it, then unmounting it and mounting it over -/dev. - -A devfs is a cleaner way of solving this. - -Non-Unix root filesystem - -Non-Unix filesystems (such as NTFS) can't be used for a root -filesystem because they variously don't support character and block -special files or symbolic links. You can't have a separate disc-based -or RAMDISC-based filesystem mounted on /dev because you need device -nodes before you can mount these. Devfs can be mounted without any -device nodes. Devlinks won't work because symlinks aren't supported. -An alternative solution is to use initrd to mount a RAMDISC initial -root filesystem (which is populated with a minimal set of device -nodes), and then construct a new /dev in another RAMDISC, and finally -switch to your non-Unix root filesystem. This requires clever boot -scripts and a fragile and conceptually complex boot procedure. - -Devfs solves this in a robust and conceptually simple way. - -PTY security - -Current pseudo-tty (pty) devices are owned by root and read-writable -by everyone. The user of a pty-pair cannot change -ownership/protections without being suid-root. - -This could be solved with a secure user-space daemon which runs as -root and does the actual creation of pty-pairs. Such a daemon would -require modification to *every* programme that wants to use this new -mechanism. It also slows down creation of pty-pairs. - -An alternative is to create a new open_pty() syscall which does much -the same thing as the user-space daemon. Once again, this requires -modifications to pty-handling programmes. - -The devfs solution allows a device driver to "tag" certain device -files so that when an unopened device is opened, the ownerships are -changed to the current euid and egid of the opening process, and the -protections are changed to the default registered by the driver. When -the device is closed ownership is set back to root and protections are -set back to read-write for everybody. No programme need be changed. -The devpts filesystem provides this auto-ownership feature for Unix98 -ptys. It doesn't support old-style pty devices, nor does it have all -the other features of devfs. - -Intelligent device management - -Devfs implements a simple yet powerful protocol for communication with -a device management daemon (devfsd) which runs in user space. It is -possible to send a message (either synchronously or asynchronously) to -devfsd on any event, such as registration/unregistration of device -entries, opening and closing devices, looking up inodes, scanning -directories and more. This has many possibilities. Some of these are -already implemented. See: - - -http://www.atnf.csiro.au/~rgooch/linux/ - -Device entry registration events can be used by devfsd to change -permissions of newly-created device nodes. This is one mechanism to -control device permissions. - -Device entry registration/unregistration events can be used to run -programmes or scripts. This can be used to provide automatic mounting -of filesystems when a new block device media is inserted into the -drive. - -Asynchronous device open and close events can be used to implement -clever permissions management. For example, the default permissions on -/dev/dsp do not allow everybody to read from the device. This is -sensible, as you don't want some remote user recording what you say at -your console. However, the console user is also prevented from -recording. This behaviour is not desirable. With asynchronous device -open and close events, you can have devfsd run a programme or script -when console devices are opened to change the ownerships for *other* -device nodes (such as /dev/dsp). On closure, you can run a different -script to restore permissions. An advantage of this scheme over -modifying the C library tty handling is that this works even if your -programme crashes (how many times have you seen the utmp database with -lingering entries for non-existent logins?). - -Synchronous device open events can be used to perform intelligent -device access protections. Before the device driver open() method is -called, the daemon must first validate the open attempt, by running an -external programme or script. This is far more flexible than access -control lists, as access can be determined on the basis of other -system conditions instead of just the UID and GID. - -Inode lookup events can be used to authenticate module autoload -requests. Instead of using kmod directly, the event is sent to -devfsd which can implement an arbitrary authentication before loading -the module itself. - -Inode lookup events can also be used to construct arbitrary -namespaces, without having to resort to populating devfs with symlinks -to devices that don't exist. - -Speculative Device Scanning - -Consider an application (like cdparanoia) that wants to find all -CD-ROM devices on the system (SCSI, IDE and other types), whether or -not their respective modules are loaded. The application must -speculatively open certain device nodes (such as /dev/sr0 for the SCSI -CD-ROMs) in order to make sure the module is loaded. This requires -that all Linux distributions follow the standard device naming scheme -(last time I looked RedHat did things differently). Devfs solves the -naming problem. - -The same application also wants to see which devices are actually -available on the system. With the existing system it needs to read the -/dev directory and speculatively open each /dev/sr* device to -determine if the device exists or not. With a large /dev this is an -inefficient operation, especially if there are many /dev/sr* nodes. A -solution like scsidev could reduce the number of /dev/sr* entries (but -of course that also requires all that inefficient directory scanning). - -With devfs, the application can open the /dev/sr directory -(which triggers the module autoloading if required), and proceed to -read /dev/sr. Since only the available devices will have -entries, there are no inefficencies in directory scanning or device -openings. - ------------------------------------------------------------------------------ - -Who else does it? - -FreeBSD has a devfs implementation. Solaris and AIX each have a -pseudo-devfs (something akin to scsidev but for all devices, with some -unspecified kernel support). BeOS, Plan9 and QNX also have it. SGI's -IRIX 6.4 and above also have a device filesystem. - -While we shouldn't just automatically do something because others do -it, we should not ignore the work of others either. FreeBSD has a lot -of competent people working on it, so their opinion should not be -blithely ignored. - ------------------------------------------------------------------------------ - - -How it works - -Registering device entries - -For every entry (device node) in a devfs-based /dev a driver must call -devfs_register(). This adds the name of the device entry, the -file_operations structure pointer and a few other things to an -internal table. Device entries may be added and removed at any -time. When a device entry is registered, it automagically appears in -any mounted devfs'. - -Inode lookup - -When a lookup operation on an entry is performed and if there is no -driver information for that entry devfs will attempt to call -devfsd. If still no driver information can be found then a negative -dentry is yielded and the next stage operation will be called by the -VFS (such as create() or mknod() inode methods). If driver information -can be found, an inode is created (if one does not exist already) and -all is well. - -Manually creating device nodes - -The mknod() method allows you to create an ordinary named pipe in the -devfs, or you can create a character or block special inode if one -does not already exist. You may wish to create a character or block -special inode so that you can set permissions and ownership. Later, if -a device driver registers an entry with the same name, the -permissions, ownership and times are retained. This is how you can set -the protections on a device even before the driver is loaded. Once you -create an inode it appears in the directory listing. - -Unregistering device entries - -A device driver calls devfs_unregister() to unregister an entry. - -Chroot() gaols - -2.2.x kernels - -The semantics of inode creation are different when devfs is mounted -with the "explicit" option. Now, when a device entry is registered, it -will not appear until you use mknod() to create the device. It doesn't -matter if you mknod() before or after the device is registered with -devfs_register(). The purpose of this behaviour is to support -chroot(2) gaols, where you want to mount a minimal devfs inside the -gaol. Only the devices you specifically want to be available (through -your mknod() setup) will be accessible. - -2.4.x kernels - -As of kernel 2.3.99, the VFS has had the ability to rebind parts of -the global filesystem namespace into another part of the namespace. -This now works even at the leaf-node level, which means that -individual files and device nodes may be bound into other parts of the -namespace. This is like making links, but better, because it works -across filesystems (unlike hard links) and works through chroot() -gaols (unlike symbolic links). - -Because of these improvements to the VFS, the multi-mount capability -in devfs is no longer needed. The administrator may create a minimal -device tree inside a chroot(2) gaol by using VFS bindings. As this -provides most of the features of the devfs multi-mount capability, I -removed the multi-mount support code (after issuing an RFC). This -yielded code size reductions and simplifications. - -If you want to construct a minimal chroot() gaol, the following -command should suffice: - -mount --bind /dev/null /gaol/dev/null - - -Repeat for other device nodes you want to expose. Simple! - ------------------------------------------------------------------------------ - - -Operational issues - - -Instructions for the impatient - -Nobody likes reading documentation. People just want to get in there -and play. So this section tells you quickly the steps you need to take -to run with devfs mounted over /dev. Skip these steps and you will end -up with a nearly unbootable system. Subsequent sections describe the -issues in more detail, and discuss non-essential configuration -options. - -Devfsd -OK, if you're reading this, I assume you want to play with -devfs. First you should ensure that /usr/src/linux contains a -recent kernel source tree. Then you need to compile devfsd, the device -management daemon, available at - -http://www.atnf.csiro.au/~rgooch/linux/. -Because the kernel has a naming scheme -which is quite different from the old naming scheme, you need to -install devfsd so that software and configuration files that use the -old naming scheme will not break. - -Compile and install devfsd. You will be provided with a default -configuration file /etc/devfsd.conf which will provide -compatibility symlinks for the old naming scheme. Don't change this -config file unless you know what you're doing. Even if you think you -do know what you're doing, don't change it until you've followed all -the steps below and booted a devfs-enabled system and verified that it -works. - -Now edit your main system boot script so that devfsd is started at the -very beginning (before any filesystem -checks). /etc/rc.d/rc.sysinit is often the main boot script -on systems with SysV-style boot scripts. On systems with BSD-style -boot scripts it is often /etc/rc. Also check -/sbin/rc. - -NOTE that the line you put into the boot -script should be exactly: - -/sbin/devfsd /dev - -DO NOT use some special daemon-launching -programme, otherwise the boot script may not wait for devfsd to finish -initialising. - -System Libraries -There may still be some problems because of broken software making -assumptions about device names. In particular, some software does not -handle devices which are symbolic links. If you are running a libc 5 -based system, install libc 5.4.44 (if you have libc 5.4.46, go back to -libc 5.4.44, which is actually correct). If you are running a glibc -based system, make sure you have glibc 2.1.3 or later. - -/etc/securetty -PAM (Pluggable Authentication Modules) is supposed to be a flexible -mechanism for providing better user authentication and access to -services. Unfortunately, it's also fragile, complex and undocumented -(check out RedHat 6.1, and probably other distributions as well). PAM -has problems with symbolic links. Append the following lines to your -/etc/securetty file: - -vc/1 -vc/2 -vc/3 -vc/4 -vc/5 -vc/6 -vc/7 -vc/8 - -This will not weaken security. If you have a version of util-linux -earlier than 2.10.h, please upgrade to 2.10.h or later. If you -absolutely cannot upgrade, then also append the following lines to -your /etc/securetty file: - -1 -2 -3 -4 -5 -6 -7 -8 - -This may potentially weaken security by allowing root logins over the -network (a password is still required, though). However, since there -are problems with dealing with symlinks, I'm suspicious of the level -of security offered in any case. - -XFree86 -While not essential, it's probably a good idea to upgrade to XFree86 -4.0, as patches went in to make it more devfs-friendly. If you don't, -you'll probably need to apply the following patch to -/etc/security/console.perms so that ordinary users can run -startx. Note that not all distributions have this file (e.g. Debian), -so if it's not present, don't worry about it. - ---- /etc/security/console.perms.orig Sat Apr 17 16:26:47 1999 -+++ /etc/security/console.perms Fri Feb 25 23:53:55 2000 -@@ -14,7 +14,7 @@ - # man 5 console.perms - - # file classes -- these are regular expressions --<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9] -+<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9] - - # device classes -- these are shell-style globs - <floppy>=/dev/fd[0-1]* - -If the patch does not apply, then change the line: - -<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9] - -with: - -<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9] - - -Disable devpts -I've had a report of devpts mounted on /dev/pts not working -correctly. Since devfs will also manage /dev/pts, there is no -need to mount devpts as well. You should either edit your -/etc/fstab so devpts is not mounted, or disable devpts from -your kernel configuration. - -Unsupported drivers -Not all drivers have devfs support. If you depend on one of these -drivers, you will need to create a script or tarfile that you can use -at boot time to create device nodes as appropriate. There is a -section which describes this. Another -section lists the drivers which have -devfs support. - -/dev/mouse - -Many disributions configure /dev/mouse to be the mouse device -for XFree86 and GPM. I actually think this is a bad idea, because it -adds another level of indirection. When looking at a config file, if -you see /dev/mouse you're left wondering which mouse -is being referred to. Hence I recommend putting the actual mouse -device (for example /dev/psaux) into your -/etc/X11/XF86Config file (and similarly for the GPM -configuration file). - -Alternatively, use the same technique used for unsupported drivers -described above. - -The Kernel -Finally, you need to make sure devfs is compiled into your kernel. Set -CONFIG_EXPERIMENTAL=y, CONFIG_DEVFS_FS=y and CONFIG_DEVFS_MOUNT=y by -using favourite configuration tool (i.e. make config or -make xconfig) and then make clean and then recompile your kernel and -modules. At boot, devfs will be mounted onto /dev. - -If you encounter problems booting (for example if you forgot a -configuration step), you can pass devfs=nomount at the kernel -boot command line. This will prevent the kernel from mounting devfs at -boot time onto /dev. - -In general, a kernel built with CONFIG_DEVFS_FS=y but without mounting -devfs onto /dev is completely safe, and requires no -configuration changes. One exception to take note of is when -LABEL= directives are used in /etc/fstab. In this -case you will be unable to boot properly. This is because the -mount(8) programme uses /proc/partitions as part of -the volume label search process, and the device names it finds are not -available, because setting CONFIG_DEVFS_FS=y changes the names in -/proc/partitions, irrespective of whether devfs is mounted. - -Now you've finished all the steps required. You're now ready to boot -your shiny new kernel. Enjoy. - -Changing the configuration - -OK, you've now booted a devfs-enabled system, and everything works. -Now you may feel like changing the configuration (common targets are -/etc/fstab and /etc/devfsd.conf). Since you have a -system that works, if you make any changes and it doesn't work, you -now know that you only have to restore your configuration files to the -default and it will work again. - - -Permissions persistence across reboots - -If you don't use mknod(2) to create a device file, nor use chmod(2) or -chown(2) to change the ownerships/permissions, the inode ctime will -remain at 0 (the epoch, 12 am, 1-JAN-1970, GMT). Anything with a ctime -later than this has had it's ownership/permissions changed. Hence, a -simple script or programme may be used to tar up all changed inodes, -prior to shutdown. Although effective, many consider this approach a -kludge. - -A much better approach is to use devfsd to save and restore -permissions. It may be configured to record changes in permissions and -will save them in a database (in fact a directory tree), and restore -these upon boot. This is an efficient method and results in immediate -saving of current permissions (unlike the tar approach, which saves -permissions at some unspecified future time). - -The default configuration file supplied with devfsd has config entries -which you may uncomment to enable persistence management. - -If you decide to use the tar approach anyway, be aware that tar will -first unlink(2) an inode before creating a new device node. The -unlink(2) has the effect of breaking the connection between a devfs -entry and the device driver. If you use the "devfs=only" boot option, -you lose access to the device driver, requiring you to reload the -module. I consider this a bug in tar (there is no real need to -unlink(2) the inode first). - -Alternatively, you can use devfsd to provide more sophisticated -management of device permissions. You can use devfsd to store -permissions for whole groups of devices with a single configuration -entry, rather than the conventional single entry per device entry. - -Permissions database stored in mounted-over /dev - -If you wish to save and restore your device permissions into the -disc-based /dev while still mounting devfs onto /dev -you may do so. This requires a 2.4.x kernel (in fact, 2.3.99 or -later), which has the VFS binding facility. You need to do the -following to set this up: - - - -make sure the kernel does not mount devfs at boot time - - -make sure you have a correct /dev/console entry in your -root file-system (where your disc-based /dev lives) - -create the /dev-state directory - - -add the following lines near the very beginning of your boot -scripts: - -mount --bind /dev /dev-state -mount -t devfs none /dev -devfsd /dev - - - - -add the following lines to your /etc/devfsd.conf file: - -REGISTER ^pt[sy] IGNORE -CREATE ^pt[sy] IGNORE -CHANGE ^pt[sy] IGNORE -DELETE ^pt[sy] IGNORE -REGISTER .* COPY /dev-state/$devname $devpath -CREATE .* COPY $devpath /dev-state/$devname -CHANGE .* COPY $devpath /dev-state/$devname -DELETE .* CFUNCTION GLOBAL unlink /dev-state/$devname -RESTORE /dev-state - -Note that the sample devfsd.conf file contains these lines, -as well as other sample configurations you may find useful. See the -devfsd distribution - - -reboot. - - - - -Permissions database stored in normal directory - -If you are using an older kernel which doesn't support VFS binding, -then you won't be able to have the permissions database in a -mounted-over /dev. However, you can still use a regular -directory to store the database. The sample /etc/devfsd.conf -file above may still be used. You will need to create the -/dev-state directory prior to installing devfsd. If you have -old permissions in /dev, then just copy (or move) the device -nodes over to the new directory. - -Which method is better? - -The best method is to have the permissions database stored in the -mounted-over /dev. This is because you will not need to copy -device nodes over to /dev-state, and because it allows you to -switch between devfs and non-devfs kernels, without requiring you to -copy permissions between /dev-state (for devfs) and -/dev (for non-devfs). - - -Dealing with drivers without devfs support - -Currently, not all device drivers in the kernel have been modified to -use devfs. Device drivers which do not yet have devfs support will not -automagically appear in devfs. The simplest way to create device nodes -for these drivers is to unpack a tarfile containing the required -device nodes. You can do this in your boot scripts. All your drivers -will now work as before. - -Hopefully for most people devfs will have enough support so that they -can mount devfs directly over /dev without losing most functionality -(i.e. losing access to various devices). As of 22-JAN-1998 (devfs -patch version 10) I am now running this way. All the devices I have -are available in devfs, so I don't lose anything. - -WARNING: if your configuration requires the old-style device names -(i.e. /dev/hda1 or /dev/sda1), you must install devfsd and configure -it to maintain compatibility entries. It is almost certain that you -will require this. Note that the kernel creates a compatibility entry -for the root device, so you don't need initrd. - -Note that you no longer need to mount devpts if you use Unix98 PTYs, -as devfs can manage /dev/pts itself. This saves you some RAM, as you -don't need to compile and install devpts. Note that some versions of -glibc have a bug with Unix98 pty handling on devfs systems. Contact -the glibc maintainers for a fix. Glibc 2.1.3 has the fix. - -Note also that apart from editing /etc/fstab, other things will need -to be changed if you *don't* install devfsd. Some software (like the X -server) hard-wire device names in their source. It really is much -easier to install devfsd so that compatibility entries are created. -You can then slowly migrate your system to using the new device names -(for example, by starting with /etc/fstab), and then limiting the -compatibility entries that devfsd creates. - -IF YOU CONFIGURE TO MOUNT DEVFS AT BOOT, MAKE SURE YOU INSTALL DEVFSD -BEFORE YOU BOOT A DEVFS-ENABLED KERNEL! - -Now that devfs has gone into the 2.3.46 kernel, I'm getting a lot of -reports back. Many of these are because people are trying to run -without devfsd, and hence some things break. Please just run devfsd if -things break. I want to concentrate on real bugs rather than -misconfiguration problems at the moment. If people are willing to fix -bugs/false assumptions in other code (i.e. glibc, X server) and submit -that to the respective maintainers, that would be great. - - -All the way with Devfs - -The devfs kernel patch creates a rationalised device tree. As stated -above, if you want to keep using the old /dev naming scheme, -you just need to configure devfsd appopriately (see the man -page). People who prefer the old names can ignore this section. For -those of us who like the rationalised names and an uncluttered -/dev, read on. - -If you don't run devfsd, or don't enable compatibility entry -management, then you will have to configure your system to use the new -names. For example, you will then need to edit your -/etc/fstab to use the new disc naming scheme. If you want to -be able to boot non-devfs kernels, you will need compatibility -symlinks in the underlying disc-based /dev pointing back to -the old-style names for when you boot a kernel without devfs. - -You can selectively decide which devices you want compatibility -entries for. For example, you may only want compatibility entries for -BSD pseudo-terminal devices (otherwise you'll have to patch you C -library or use Unix98 ptys instead). It's just a matter of putting in -the correct regular expression into /dev/devfsd.conf. - -There are other choices of naming schemes that you may prefer. For -example, I don't use the kernel-supplied -names, because they are too verbose. A common misconception is -that the kernel-supplied names are meant to be used directly in -configuration files. This is not the case. They are designed to -reflect the layout of the devices attached and to provide easy -classification. - -If you like the kernel-supplied names, that's fine. If you don't then -you should be using devfsd to construct a namespace more to your -liking. Devfsd has built-in code to construct a -namespace that is both logical and easy to -manage. In essence, it creates a convenient abbreviation of the -kernel-supplied namespace. - -You are of course free to build your own namespace. Devfsd has all the -infrastructure required to make this easy for you. All you need do is -write a script. You can even write some C code and devfsd can load the -shared object as a callable extension. - - -Other Issues - -The init programme -Another thing to take note of is whether your init programme -creates a Unix socket /dev/telinit. Some versions of init -create /dev/telinit so that the telinit programme can -communicate with the init process. If you have such a system you need -to make sure that devfs is mounted over /dev *before* init -starts. In other words, you can't leave the mounting of devfs to -/etc/rc, since this is executed after init. Other -versions of init require a named pipe /dev/initctl -which must exist *before* init starts. Once again, you need to -mount devfs and then create the named pipe *before* init -starts. - -The default behaviour now is not to mount devfs onto /dev at -boot time for 2.3.x and later kernels. You can correct this with the -"devfs=mount" boot option. This solves any problems with init, -and also prevents the dreaded: - -Cannot open initial console - -message. For 2.2.x kernels where you need to apply the devfs patch, -the default is to mount. - -If you have automatic mounting of devfs onto /dev then you -may need to create /dev/initctl in your boot scripts. The -following lines should suffice: - -mknod /dev/initctl p -kill -SIGUSR1 1 # tell init that /dev/initctl now exists - -Alternatively, if you don't want the kernel to mount devfs onto -/dev then you could use the following procedure is a -guideline for how to get around /dev/initctl problems: - -# cd /sbin -# mv init init.real -# cat > init -#! /bin/sh -mount -n -t devfs none /dev -mknod /dev/initctl p -exec /sbin/init.real $* -[control-D] -# chmod a+x init - -Note that newer versions of init create /dev/initctl -automatically, so you don't have to worry about this. - -Module autoloading -You will need to configure devfsd to enable module -autoloading. The following lines should be placed in your -/etc/devfsd.conf file: - -LOOKUP .* MODLOAD - - -As of devfsd-v1.3.10, a generic /etc/modules.devfs -configuration file is installed, which is used by the MODLOAD -action. This should be sufficient for most configurations. If you -require further configuration, edit your /etc/modules.conf -file. The way module autoloading work with devfs is: - - -a process attempts to lookup a device node (e.g. /dev/fred) - - -if that device node does not exist, the full pathname is passed to -devfsd as a string - - -devfsd will pass the string to the modprobe programme (provided the -configuration line shown above is present), and specifies that -/etc/modules.devfs is the configuration file - - -/etc/modules.devfs includes /etc/modules.conf to -access local configurations - -modprobe will search it's configuration files, looking for an alias -that translates the pathname into a module name - - -the translated pathname is then used to load the module. - - -If you wanted a lookup of /dev/fred to load the -mymod module, you would require the following configuration -line in /etc/modules.conf: - -alias /dev/fred mymod - -The /etc/modules.devfs configuration file provides many such -aliases for standard device names. If you look closely at this file, -you will note that some modules require multiple alias configuration -lines. This is required to support module autoloading for old and new -device names. - -Mounting root off a devfs device -If you wish to mount root off a devfs device when you pass the -"devfs=only" boot option, then you need to pass in the -"root=<device>" option to the kernel when booting. If you use -LILO, then you must have this in lilo.conf: - -append = "root=<device>" - -Surprised? Yep, so was I. It turns out if you have (as most people -do): - -root = <device> - - -then LILO will determine the device number of <device> and will -write that device number into a special place in the kernel image -before starting the kernel, and the kernel will use that device number -to mount the root filesystem. So, using the "append" variety ensures -that LILO passes the root filesystem device as a string, which devfs -can then use. - -Note that this isn't an issue if you don't pass "devfs=only". - -TTY issues -The ttyname(3) function in some versions of the C library makes -false assumptions about device entries which are symbolic links. The -tty(1) programme is one that depends on this function. I've -written a patch to libc 5.4.43 which fixes this. This has been -included in libc 5.4.44 and a similar fix is in glibc 2.1.3. - - -Kernel Naming Scheme - -The kernel provides a default naming scheme. This scheme is designed -to make it easy to search for specific devices or device types, and to -view the available devices. Some device types (such as hard discs), -have a directory of entries, making it easy to see what devices of -that class are available. Often, the entries are symbolic links into a -directory tree that reflects the topology of available devices. The -topological tree is useful for finding how your devices are arranged. - -Below is a list of the naming schemes for the most common drivers. A -list of reserved device names is -available for reference. Please send email to -rgooch@atnf.csiro.au to obtain an allocation. Please be -patient (the maintainer is busy). An alternative name may be allocated -instead of the requested name, at the discretion of the maintainer. - -Disc Devices - -All discs, whether SCSI, IDE or whatever, are placed under the -/dev/discs hierarchy: - - /dev/discs/disc0 first disc - /dev/discs/disc1 second disc - - -Each of these entries is a symbolic link to the directory for that -device. The device directory contains: - - disc for the whole disc - part* for individual partitions - - -CD-ROM Devices - -All CD-ROMs, whether SCSI, IDE or whatever, are placed under the -/dev/cdroms hierarchy: - - /dev/cdroms/cdrom0 first CD-ROM - /dev/cdroms/cdrom1 second CD-ROM - - -Each of these entries is a symbolic link to the real device entry for -that device. - -Tape Devices - -All tapes, whether SCSI, IDE or whatever, are placed under the -/dev/tapes hierarchy: - - /dev/tapes/tape0 first tape - /dev/tapes/tape1 second tape - - -Each of these entries is a symbolic link to the directory for that -device. The device directory contains: - - mt for mode 0 - mtl for mode 1 - mtm for mode 2 - mta for mode 3 - mtn for mode 0, no rewind - mtln for mode 1, no rewind - mtmn for mode 2, no rewind - mtan for mode 3, no rewind - - -SCSI Devices - -To uniquely identify any SCSI device requires the following -information: - - controller (host adapter) - bus (SCSI channel) - target (SCSI ID) - unit (Logical Unit Number) - - -All SCSI devices are placed under /dev/scsi (assuming devfs -is mounted on /dev). Hence, a SCSI device with the following -parameters: c=1,b=2,t=3,u=4 would appear as: - - /dev/scsi/host1/bus2/target3/lun4 device directory - - -Inside this directory, a number of device entries may be created, -depending on which SCSI device-type drivers were installed. - -See the section on the disc naming scheme to see what entries the SCSI -disc driver creates. - -See the section on the tape naming scheme to see what entries the SCSI -tape driver creates. - -The SCSI CD-ROM driver creates: - - cd - - -The SCSI generic driver creates: - - generic - - -IDE Devices - -To uniquely identify any IDE device requires the following -information: - - controller - bus (aka. primary/secondary) - target (aka. master/slave) - unit - - -All IDE devices are placed under /dev/ide, and uses a similar -naming scheme to the SCSI subsystem. - -XT Hard Discs - -All XT discs are placed under /dev/xd. The first XT disc has -the directory /dev/xd/disc0. - -TTY devices - -The tty devices now appear as: - - New name Old-name Device Type - -------- -------- ----------- - /dev/tts/{0,1,...} /dev/ttyS{0,1,...} Serial ports - /dev/cua/{0,1,...} /dev/cua{0,1,...} Call out devices - /dev/vc/0 /dev/tty Current virtual console - /dev/vc/{1,2,...} /dev/tty{1...63} Virtual consoles - /dev/vcc/{0,1,...} /dev/vcs{1...63} Virtual consoles - /dev/pty/m{0,1,...} /dev/ptyp?? PTY masters - /dev/pty/s{0,1,...} /dev/ttyp?? PTY slaves - - -RAMDISCS - -The RAMDISCS are placed in their own directory, and are named thus: - - /dev/rd/{0,1,2,...} - - -Meta Devices - -The meta devices are placed in their own directory, and are named -thus: - - /dev/md/{0,1,2,...} - - -Floppy discs - -Floppy discs are placed in the /dev/floppy directory. - -Loop devices - -Loop devices are placed in the /dev/loop directory. - -Sound devices - -Sound devices are placed in the /dev/sound directory -(audio, sequencer, ...). - - -Devfsd Naming Scheme - -Devfsd provides a naming scheme which is a convenient abbreviation of -the kernel-supplied namespace. In some -cases, the kernel-supplied naming scheme is quite convenient, so -devfsd does not provide another naming scheme. The convenience names -that devfsd creates are in fact the same names as the original devfs -kernel patch created (before Linus mandated the Big Name -Change). These are referred to as "new compatibility entries". - -In order to configure devfsd to create these convenience names, the -following lines should be placed in your /etc/devfsd.conf: - -REGISTER .* MKNEWCOMPAT -UNREGISTER .* RMNEWCOMPAT - -This will cause devfsd to create (and destroy) symbolic links which -point to the kernel-supplied names. - -SCSI Hard Discs - -All SCSI discs are placed under /dev/sd (assuming devfs is -mounted on /dev). Hence, a SCSI disc with the following -parameters: c=1,b=2,t=3,u=4 would appear as: - - /dev/sd/c1b2t3u4 for the whole disc - /dev/sd/c1b2t3u4p5 for the 5th partition - /dev/sd/c1b2t3u4p5s6 for the 6th slice in the 5th partition - - -SCSI Tapes - -All SCSI tapes are placed under /dev/st. A similar naming -scheme is used as for SCSI discs. A SCSI tape with the -parameters:c=1,b=2,t=3,u=4 would appear as: - - /dev/st/c1b2t3u4m0 for mode 0 - /dev/st/c1b2t3u4m1 for mode 1 - /dev/st/c1b2t3u4m2 for mode 2 - /dev/st/c1b2t3u4m3 for mode 3 - /dev/st/c1b2t3u4m0n for mode 0, no rewind - /dev/st/c1b2t3u4m1n for mode 1, no rewind - /dev/st/c1b2t3u4m2n for mode 2, no rewind - /dev/st/c1b2t3u4m3n for mode 3, no rewind - - -SCSI CD-ROMs - -All SCSI CD-ROMs are placed under /dev/sr. A similar naming -scheme is used as for SCSI discs. A SCSI CD-ROM with the -parameters:c=1,b=2,t=3,u=4 would appear as: - - /dev/sr/c1b2t3u4 - - -SCSI Generic Devices - -The generic (aka. raw) interface for all SCSI devices are placed under -/dev/sg. A similar naming scheme is used as for SCSI discs. A -SCSI generic device with the parameters:c=1,b=2,t=3,u=4 would appear -as: - - /dev/sg/c1b2t3u4 - - -IDE Hard Discs - -All IDE discs are placed under /dev/ide/hd, using a similar -convention to SCSI discs. The following mappings exist between the new -and the old names: - - /dev/hda /dev/ide/hd/c0b0t0u0 - /dev/hdb /dev/ide/hd/c0b0t1u0 - /dev/hdc /dev/ide/hd/c0b1t0u0 - /dev/hdd /dev/ide/hd/c0b1t1u0 - - -IDE Tapes - -A similar naming scheme is used as for IDE discs. The entries will -appear in the /dev/ide/mt directory. - -IDE CD-ROM - -A similar naming scheme is used as for IDE discs. The entries will -appear in the /dev/ide/cd directory. - -IDE Floppies - -A similar naming scheme is used as for IDE discs. The entries will -appear in the /dev/ide/fd directory. - -XT Hard Discs - -All XT discs are placed under /dev/xd. The first XT disc -would appear as /dev/xd/c0t0. - - -Old Compatibility Names - -The old compatibility names are the legacy device names, such as -/dev/hda, /dev/sda, /dev/rtc and so on. -Devfsd can be configured to create compatibility symlinks so that you -may continue to use the old names in your configuration files and so -that old applications will continue to function correctly. - -In order to configure devfsd to create these legacy names, the -following lines should be placed in your /etc/devfsd.conf: - -REGISTER .* MKOLDCOMPAT -UNREGISTER .* RMOLDCOMPAT - -This will cause devfsd to create (and destroy) symbolic links which -point to the kernel-supplied names. - - ------------------------------------------------------------------------------ - - -Device drivers currently ported - -- All miscellaneous character devices support devfs (this is done - transparently through misc_register()) - -- SCSI discs and generic hard discs - -- Character memory devices (null, zero, full and so on) - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- Loop devices (/dev/loop?) - -- TTY devices (console, serial ports, terminals and pseudo-terminals) - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- SCSI tapes (/dev/scsi and /dev/tapes) - -- SCSI CD-ROMs (/dev/scsi and /dev/cdroms) - -- SCSI generic devices (/dev/scsi) - -- RAMDISCS (/dev/ram?) - -- Meta Devices (/dev/md*) - -- Floppy discs (/dev/floppy) - -- Parallel port printers (/dev/printers) - -- Sound devices (/dev/sound) - Thanks to Eric Dumas <dumas@linux.eu.org> and - C. Scott Ananian <cananian@alumni.princeton.edu> - -- Joysticks (/dev/joysticks) - -- Sparc keyboard (/dev/kbd) - -- DSP56001 digital signal processor (/dev/dsp56k) - -- Apple Desktop Bus (/dev/adb) - -- Coda network file system (/dev/cfs*) - -- Virtual console capture devices (/dev/vcc) - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Frame buffer devices (/dev/fb) - -- Video capture devices (/dev/v4l) - - ------------------------------------------------------------------------------ - - -Allocation of Device Numbers - -Devfs allows you to write a driver which doesn't need to allocate a -device number (major&minor numbers) for the internal operation of the -kernel. However, there are a number of userspace programmes that use -the device number as a unique handle for a device. An example is the -find programme, which uses device numbers to determine whether -an inode is on a different filesystem than another inode. The device -number used is the one for the block device which a filesystem is -using. To preserve compatibility with userspace programmes, block -devices using devfs need to have unique device numbers allocated to -them. Furthermore, POSIX specifies device numbers, so some kind of -device number needs to be presented to userspace. - -The simplest option (especially when porting drivers to devfs) is to -keep using the old major and minor numbers. Devfs will take whatever -values are given for major&minor and pass them onto userspace. - -This device number is a 16 bit number, so this leaves plenty of space -for large numbers of discs and partitions. This scheme can also be -used for character devices, in particular the tty devices, which are -currently limited to 256 pseudo-ttys (this limits the total number of -simultaneous xterms and remote logins). Note that the device number -is limited to the range 36864-61439 (majors 144-239), in order to -avoid any possible conflicts with existing official allocations. - -Please note that using dynamically allocated block device numbers may -break the NFS daemons (both user and kernel mode), which expect dev_t -for a given device to be constant over the lifetime of remote mounts. - -A final note on this scheme: since it doesn't increase the size of -device numbers, there are no compatibility issues with userspace. - ------------------------------------------------------------------------------ - - -Questions and Answers - - -Making things work -Alternatives to devfs -What I don't like about devfs -How to report bugs -Strange kernel messages -Compilation problems with devfsd - - - -Making things work - -Here are some common questions and answers. - - - -Devfsd doesn't start - -Make sure you have compiled and installed devfsd -Make sure devfsd is being started from your boot -scripts -Make sure you have configured your kernel to enable devfs (see -below) -Make sure devfs is mounted (see below) - - -Devfsd is not managing all my permissions - -Make sure you are capturing the appropriate events. For example, -device entries created by the kernel generate REGISTER events, -but those created by devfsd generate CREATE events. - - -Devfsd is not capturing all REGISTER events - -See the previous entry: you may need to capture CREATE events. - - -X will not start - -Make sure you followed the steps -outlined above. - - -Why don't my network devices appear in devfs? - -This is not a bug. Network devices have their own, completely separate -namespace. They are accessed via socket(2) and -setsockopt(2) calls, and thus require no device nodes. I have -raised the possibilty of moving network devices into the device -namespace, but have had no response. - - -How can I test if I have devfs compiled into my kernel? - -All filesystems built-in or currently loaded are listed in -/proc/filesystems. If you see a devfs entry, then -you know that devfs was compiled into your kernel. If you have -correctly configured and rebuilt your kernel, then devfs will be -built-in. If you think you've configured it in, but -/proc/filesystems doesn't show it, you've made a mistake. -Common mistakes include: - -Using a 2.2.x kernel without applying the devfs patch (if you -don't know how to patch your kernel, use 2.4.x instead, don't bother -asking me how to patch) -Forgetting to set CONFIG_EXPERIMENTAL=y -Forgetting to set CONFIG_DEVFS_FS=y -Forgetting to set CONFIG_DEVFS_MOUNT=y (if you want devfs -to be automatically mounted at boot) -Editing your .config manually, instead of using make -config or make xconfig -Forgetting to run make dep; make clean after changing the -configuration and before compiling -Forgetting to compile your kernel and modules -Forgetting to install your kernel -Forgetting to install your modules - -Please check twice that you've done all these steps before sending in -a bug report. - - - -How can I test if devfs is mounted on /dev? - -The device filesystem will always create an entry called -".devfsd", which is used to communicate with the daemon. Even -if the daemon is not running, this entry will exist. Testing for the -existence of this entry is the approved method of determining if devfs -is mounted or not. Note that the type of entry (i.e. regular file, -character device, named pipe, etc.) may change without notice. Only -the existence of the entry should be relied upon. - - -When I start devfsd, I see the error: -Error opening file: ".devfsd" No such file or directory? - -This means that devfs is not mounted. Make sure you have devfs mounted. - - -How do I mount devfs? - -First make sure you have devfs compiled into your kernel (see -above). Then you will either need to: - -set CONFIG_DEVFS_MOUNT=y in your kernel config -pass devfs=mount to your boot loader -mount devfs manually in your boot scripts with: -mount -t none devfs /dev - - - -Mount by volume LABEL=<label> doesn't work with -devfs - -Most probably you are not mounting devfs onto /dev. What -happens is that if your kernel config has CONFIG_DEVFS_FS=y -then the contents of /proc/partitions will have the devfs -names (such as scsi/host0/bus0/target0/lun0/part1). The -contents of /proc/partitions are used by mount(8) when -mounting by volume label. If devfs is not mounted on /dev, -then mount(8) will fail to find devices. The solution is to -make sure that devfs is mounted on /dev. See above for how to -do that. - - -I have extra or incorrect entries in /dev - -You may have stale entries in your dev-state area. Check for a -RESTORE configuration line in your devfsd configuration -(typically /etc/devfsd.conf). If you have this line, check -the contents of the specified directory for stale entries. Remove -any entries which are incorrect, then reboot. - - -I get "Unable to open initial console" messages at boot - -This usually happens when you don't have devfs automounted onto -/dev at boot time, and there is no valid -/dev/console entry on your root file-system. Create a valid -/dev/console device node. - - - - - -Alternatives to devfs - -I've attempted to collate all the anti-devfs proposals and explain -their limitations. Under construction. - - -Why not just pass device create/remove events to a daemon? - -Here the suggestion is to develop an API in the kernel so that devices -can register create and remove events, and a daemon listens for those -events. The daemon would then populate/depopulate /dev (which -resides on disc). - -This has several limitations: - - -it only works for modules loaded and unloaded (or devices inserted -and removed) after the kernel has finished booting. Without a database -of events, there is no way the daemon could fully populate -/dev - - -if you add a database to this scheme, the question is then how to -present that database to user-space. If you make it a list of strings -with embedded event codes which are passed through a pipe to the -daemon, then this is only of use to the daemon. I would argue that the -natural way to present this data is via a filesystem (since many of -the events will be of a hierarchical nature), such as devfs. -Presenting the data as a filesystem makes it easy for the user to see -what is available and also makes it easy to write scripts to scan the -"database" - - -the tight binding between device nodes and drivers is no longer -possible (requiring the otherwise perfectly avoidable -table lookups) - - -you cannot catch inode lookup events on /dev which means -that module autoloading requires device nodes to be created. This is a -problem, particularly for drivers where only a few inodes are created -from a potentially large set - - -this technique can't be used when the root FS is mounted -read-only - - - - -Just implement a better scsidev - -This suggestion involves taking the scsidev programme and -extending it to scan for all devices, not just SCSI devices. The -scsidev programme works by scanning /proc/scsi - -Problems: - - -the kernel does not currently provide a list of all devices -available. Not all drivers register entries in /proc or -generate kernel messages - - -there is no uniform mechanism to register devices other than the -devfs API - - -implementing such an API is then the same as the -proposal above - - - - -Put /dev on a ramdisc - -This suggestion involves creating a ramdisc and populating it with -device nodes and then mounting it over /dev. - -Problems: - - - -this doesn't help when mounting the root filesystem, since you -still need a device node to do that - - -if you want to use this technique for the root device node as -well, you need to use initrd. This complicates the booting sequence -and makes it significantly harder to administer and configure. The -initrd is essentially opaque, robbing the system administrator of easy -configuration - - -insufficient information is available to correctly populate the -ramdisc. So we come back to the -proposal above to "solve" this - - -a ramdisc-based solution would take more kernel memory, since the -backing store would be (at best) normal VFS inodes and dentries, which -take 284 bytes and 112 bytes, respectively, for each entry. Compare -that to 72 bytes for devfs - - - - -Do nothing: there's no problem - -Sometimes people can be heard to claim that the existing scheme is -fine. This is what they're ignoring: - - -device number size (8 bits each for major and minor) is a real -limitation, and must be fixed somehow. Systems with large numbers of -SCSI devices, for example, will continue to consume the remaining -unallocated major numbers. USB will also need to push beyond the 8 bit -minor limitation - - -simply increasing the device number size is insufficient. Apart -from causing a lot of pain, it doesn't solve the management issues -of a /dev with thousands or more device nodes - - -ignoring the problem of a huge /dev will not make it go -away, and dismisses the legitimacy of a large number of people who -want a dynamic /dev - - -the standard response then becomes: "write a device management -daemon", which brings us back to the -proposal above - - - - -What I don't like about devfs - -Here are some common complaints about devfs, and some suggestions and -solutions that may make it more palatable for you. I can't please -everybody, but I do try :-) - -I hate the naming scheme - -First, remember that no naming scheme will please everybody. You hate -the scheme, others love it. Who's to say who's right and who's wrong? -Ultimately, the person who writes the code gets to choose, and what -exists now is a combination of the choices made by the -devfs author and the -kernel maintainer (Linus). - -However, not all is lost. If you want to create your own naming -scheme, it is a simple matter to write a standalone script, hack -devfsd, or write a script called by devfsd. You can create whatever -naming scheme you like. - -Further, if you want to remove all traces of the devfs naming scheme -from /dev, you can mount devfs elsewhere (say -/devfs) and populate /dev with links into -/devfs. This population can be automated using devfsd if you -wish. - -You can even use the VFS binding facility to make the links, rather -than using symbolic links. This way, you don't even have to see the -"destination" of these symbolic links. - -Devfs puts policy into the kernel - -There's already policy in the kernel. Device numbers are in fact -policy (why should the kernel dictate what device numbers I use?). -Face it, some policy has to be in the kernel. The real difference -between device names as policy and device numbers as policy is that -no one will use device numbers directly, because device -numbers are devoid of meaning to humans and are ugly. At least with -the devfs device names, (even though you can add your own naming -scheme) some people will use the devfs-supplied names directly. This -offends some people :-) - -Devfs is bloatware - -This is not even remotely true. As shown above, -both code and data size are quite modest. - - -How to report bugs - -If you have (or think you have) a bug with devfs, please follow the -steps below: - - - -make sure you have enabled debugging output when configuring your -kernel. You will need to set (at least) the following config options: - -CONFIG_DEVFS_DEBUG=y -CONFIG_DEBUG_KERNEL=y -CONFIG_DEBUG_SLAB=y - - - -please make sure you have the latest devfs patches applied. The -latest kernel version might not have the latest devfs patches applied -yet (Linus is very busy) - - -save a copy of your complete kernel logs (preferably by -using the dmesg programme) for later inclusion in your bug -report. You may need to use the -s switch to increase the -internal buffer size so you can capture all the boot messages. -Don't edit or trim the dmesg output - - - - -try booting with devfs=dall passed to the kernel boot -command line (read the documentation on your bootloader on how to do -this), and save the result to a file. This may be quite verbose, and -it may overflow the messages buffer, but try to get as much of it as -you can - - -send a copy of your devfsd configuration file(s) - -send the bug report to me first. -Don't expect that I will see it if you post it to the linux-kernel -mailing list. Include all the information listed above, plus -anything else that you think might be relevant. Put the string -devfs somewhere in the subject line, so my mail filters mark -it as urgent - - - - -Here is a general guide on how to ask questions in a way that greatly -improves your chances of getting a reply: - -http://www.tuxedo.org/~esr/faqs/smart-questions.html. If you have -a bug to report, you should also read - -http://www.chiark.greenend.org.uk/~sgtatham/bugs.html. - - -Strange kernel messages - -You may see devfs-related messages in your kernel logs. Below are some -messages and what they mean (and what you should do about them, if -anything). - - - -devfs_register(fred): could not append to parent, err: -17 - -You need to check what the error code means, but usually 17 means -EEXIST. This means that a driver attempted to create an entry -fred in a directory, but there already was an entry with that -name. This is often caused by flawed boot scripts which untar a bunch -of inodes into /dev, as a way to restore permissions. This -message is harmless, as the device nodes will still -provide access to the driver (unless you use the devfs=only -boot option, which is only for dedicated souls:-). If you want to get -rid of these annoying messages, upgrade to devfsd-v1.3.20 and use the -recommended RESTORE directive to restore permissions. - - -devfs_mk_dir(bill): using old entry in dir: c1808724 "" - -This is similar to the message above, except that a driver attempted -to create a directory named bill, and the parent directory -has an entry with the same name. In this case, to ensure that drivers -continue to work properly, the old entry is re-used and given to the -driver. In 2.5 kernels, the driver is given a NULL entry, and thus, -under rare circumstances, may not create the require device nodes. -The solution is the same as above. - - - - - -Compilation problems with devfsd - -Usually, you can compile devfsd just by typing in -make in the source directory, followed by a make -install (as root). Sometimes, you may have problems, particularly -on broken configurations. - - - -error messages relating to DEVFSD_NOTIFY_DELETE - -This happened because you have an ancient set of kernel headers -installed in /usr/include/linux or /usr/src/linux. -Install kernel 2.4.10 or later. You may need to pass the -KERNEL_DIR variable to make (if you did not install -the new kernel sources as /usr/src/linux), or you may copy -the devfs_fs.h file in the kernel source tree into -/usr/include/linux. - - - - ------------------------------------------------------------------------------ - - -Other resources - - - -Douglas Gilbert has written a useful document at - -http://www.torque.net/sg/devfs_scsi.html which -explores the SCSI subsystem and how it interacts with devfs - - -Douglas Gilbert has written another useful document at - -http://www.torque.net/scsi/SCSI-2.4-HOWTO/ which -discusses the Linux SCSI subsystem in 2.4. - - -Johannes Erdfelt has started a discussion paper on Linux and -hot-swap devices, describing what the requirements are for a scalable -solution and how and why he's used devfs+devfsd. Note that this is an -early draft only, available in plain text form at: - -http://johannes.erdfelt.com/hotswap.txt. -Johannes has promised a HTML version will follow. - - -I presented an invited -paper -at the - -2nd Annual Storage Management Workshop held in Miamia, Florida, -U.S.A. in October 2000. - - - - ------------------------------------------------------------------------------ - - -Translations of this document - -This document has been translated into other languages. - - - - -The document master (in English) by rgooch@atnf.csiro.au is -available at - -http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html - - - -A Korean translation by viatoris@nownuri.net is available at - -http://your.destiny.pe.kr/devfs/devfs.html - - - - ------------------------------------------------------------------------------ -Most flags courtesy of ITA's -Flags of All Countries -used with permission. diff --git a/Documentation/filesystems/devfs/ToDo b/Documentation/filesystems/devfs/ToDo deleted file mode 100644 index afd5a8f2c19b..000000000000 --- a/Documentation/filesystems/devfs/ToDo +++ /dev/null @@ -1,40 +0,0 @@ - Device File System (devfs) ToDo List - - Richard Gooch <rgooch@atnf.csiro.au> - - 3-JUL-2000 - -This is a list of things to be done for better devfs support in the -Linux kernel. If you'd like to contribute to the devfs, please have a -look at this list for anything that is unallocated. Also, if there are -items missing (surely), please contact me so I can add them to the -list (preferably with your name attached to them:-). - - -- >256 ptys - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- Amiga floppy driver (drivers/block/amiflop.c) - -- Atari floppy driver (drivers/block/ataflop.c) - -- SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c) - -- Amiga ZorroII ramdisc driver (drivers/block/z2ram.c) - -- Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c) - -- Parallel port ATAPI floppy (drivers/block/paride/pf.c) - -- AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c) - -- Archimedes floppy (drivers/acorn/block/fd1772.c) - -- MFM hard drive (drivers/acorn/block/mfmhd.c) - -- I2O block device (drivers/message/i2o/i2o_block.c) - -- ST-RAM device (arch/m68k/atari/stram.c) - -- Raw devices - diff --git a/Documentation/filesystems/devfs/boot-options b/Documentation/filesystems/devfs/boot-options deleted file mode 100644 index df3d33b03e0a..000000000000 --- a/Documentation/filesystems/devfs/boot-options +++ /dev/null @@ -1,65 +0,0 @@ -/* -*- auto-fill -*- */ - - Device File System (devfs) Boot Options - - Richard Gooch <rgooch@atnf.csiro.au> - - 18-AUG-2001 - - -When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options -to the kernel to debug devfs. The boot options are prefixed by -"devfs=", and are separated by commas. Spaces are not allowed. The -syntax looks like this: - -devfs=<option1>,<option2>,<option3> - -and so on. For example, if you wanted to turn on debugging for module -load requests and device registration, you would do: - -devfs=dmod,dreg - -You may prefix "no" to any option. This will invert the option. - - -Debugging Options -================= - -These requires CONFIG_DEVFS_DEBUG to be enabled. -Note that all debugging options have 'd' as the first character. By -default all options are off. All debugging output is sent to the -kernel logs. The debugging options do not take effect until the devfs -version message appears (just prior to the root filesystem being -mounted). - -These are the options: - -dmod print module load requests to <request_module> - -dreg print device register requests to <devfs_register> - -dunreg print device unregister requests to <devfs_unregister> - -dchange print device change requests to <devfs_set_flags> - -dilookup print inode lookup requests - -diget print VFS inode allocations - -diunlink print inode unlinks - -dichange print inode changes - -dimknod print calls to mknod(2) - -dall some debugging turned on - - -Other Options -============= - -These control the default behaviour of devfs. The options are: - -mount mount devfs onto /dev at boot time - -only disable non-devfs device nodes for devfs-capable drivers diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index afb1335c05d6..4aecc9bdb273 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -113,6 +113,14 @@ noquota grpquota usrquota +bh (*) ext3 associates buffer heads to data pages to +nobh (a) cache disk block mapping information + (b) link pages into transaction to provide + ordering guarantees. + "bh" option forces use of buffer heads. + "nobh" option tries to avoid associating buffer + heads (supported only for "writeback" mode). + Specification ============= diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt index 33f74310d161..a584f05403a4 100644 --- a/Documentation/filesystems/fuse.txt +++ b/Documentation/filesystems/fuse.txt @@ -18,6 +18,14 @@ Non-privileged mount (or user mount): user. NOTE: this is not the same as mounts allowed with the "user" option in /etc/fstab, which is not discussed here. +Filesystem connection: + + A connection between the filesystem daemon and the kernel. The + connection exists until either the daemon dies, or the filesystem is + umounted. Note that detaching (or lazy umounting) the filesystem + does _not_ break the connection, in this case it will exist until + the last reference to the filesystem is released. + Mount owner: The user who does the mounting. @@ -86,16 +94,20 @@ Mount options The default is infinite. Note that the size of read requests is limited anyway to 32 pages (which is 128kbyte on i386). -Sysfs -~~~~~ +Control filesystem +~~~~~~~~~~~~~~~~~~ + +There's a control filesystem for FUSE, which can be mounted by: -FUSE sets up the following hierarchy in sysfs: + mount -t fusectl none /sys/fs/fuse/connections - /sys/fs/fuse/connections/N/ +Mounting it under the '/sys/fs/fuse/connections' directory makes it +backwards compatible with earlier versions. -where N is an increasing number allocated to each new connection. +Under the fuse control filesystem each connection has a directory +named by a unique number. -For each connection the following attributes are defined: +For each connection the following files exist within this directory: 'waiting' @@ -110,7 +122,47 @@ For each connection the following attributes are defined: connection. This means that all waiting requests will be aborted an error returned for all aborted and new requests. -Only a privileged user may read or write these attributes. +Only the owner of the mount may read or write these files. + +Interrupting filesystem operations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a process issuing a FUSE filesystem request is interrupted, the +following will happen: + + 1) If the request is not yet sent to userspace AND the signal is + fatal (SIGKILL or unhandled fatal signal), then the request is + dequeued and returns immediately. + + 2) If the request is not yet sent to userspace AND the signal is not + fatal, then an 'interrupted' flag is set for the request. When + the request has been successfully transfered to userspace and + this flag is set, an INTERRUPT request is queued. + + 3) If the request is already sent to userspace, then an INTERRUPT + request is queued. + +INTERRUPT requests take precedence over other requests, so the +userspace filesystem will receive queued INTERRUPTs before any others. + +The userspace filesystem may ignore the INTERRUPT requests entirely, +or may honor them by sending a reply to the _original_ request, with +the error set to EINTR. + +It is also possible that there's a race between processing the +original request and it's INTERRUPT request. There are two possibilities: + + 1) The INTERRUPT request is processed before the original request is + processed + + 2) The INTERRUPT request is processed after the original request has + been answered + +If the filesystem cannot find the original request, it should wait for +some timeout and/or a number of new requests to arrive, after which it +should reply to the INTERRUPT request with an EAGAIN error. In case +1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT +reply will be ignored. Aborting a filesystem connection ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -139,8 +191,8 @@ the filesystem. There are several ways to do this: - Use forced umount (umount -f). Works in all cases but only if filesystem is still attached (it hasn't been lazy unmounted) - - Abort filesystem through the sysfs interface. Most powerful - method, always works. + - Abort filesystem through the FUSE control filesystem. Most + powerful method, always works. How do non-privileged mounts work? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -304,25 +356,7 @@ Scenario 1 - Simple deadlock | | for "file"] | | *DEADLOCK* -The solution for this is to allow requests to be interrupted while -they are in userspace: - - | [interrupted by signal] | - | <fuse_unlink() | - | [release semaphore] | [semaphore acquired] - | <sys_unlink() | - | | >fuse_unlink() - | | [queue req on fc->pending] - | | [wake up fc->waitq] - | | [sleep on req->waitq] - -If the filesystem daemon was single threaded, this will stop here, -since there's no other thread to dequeue and execute the request. -In this case the solution is to kill the FUSE daemon as well. If -there are multiple serving threads, you just have to kill them as -long as any remain. - -Moral: a filesystem which deadlocks, can soon find itself dead. +The solution for this is to allow the filesystem to be aborted. Scenario 2 - Tricky deadlock ---------------------------- @@ -355,24 +389,14 @@ but is caused by a pagefault. | | [lock page] | | * DEADLOCK * -Solution is again to let the the request be interrupted (not -elaborated further). - -An additional problem is that while the write buffer is being -copied to the request, the request must not be interrupted. This -is because the destination address of the copy may not be valid -after the request is interrupted. - -This is solved with doing the copy atomically, and allowing -interruption while the page(s) belonging to the write buffer are -faulted with get_user_pages(). The 'req->locked' flag indicates -when the copy is taking place, and interruption is delayed until -this flag is unset. +Solution is basically the same as above. -Scenario 3 - Tricky deadlock with asynchronous read ---------------------------------------------------- +An additional problem is that while the write buffer is being copied +to the request, the request must not be interrupted/aborted. This is +because the destination address of the copy may not be valid after the +request has returned. -The same situation as above, except thread-1 will wait on page lock -and hence it will be uninterruptible as well. The solution is to -abort the connection with forced umount (if mount is attached) or -through the abort attribute in sysfs. +This is solved with doing the copy atomically, and allowing abort +while the page(s) belonging to the write buffer are faulted with +get_user_pages(). The 'req->locked' flag indicates when the copy is +taking place, and abort is delayed until this flag is unset. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 99902ae6804e..7240ee7515de 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -39,6 +39,8 @@ Table of Contents 2.9 Appletalk 2.10 IPX 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem + 2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score + 2.13 /proc/<pid>/oom_score - Display current oom-killer score ------------------------------------------------------------------------------ Preface @@ -1124,11 +1126,15 @@ debugging information is displayed on console. NMI switch that most IA32 servers have fires unknown NMI up, for example. If a system hangs up, try pressing the NMI switch. -[NOTE] - This function and oprofile share a NMI callback. Therefore this function - cannot be enabled when oprofile is activated. - And NMI watchdog will be disabled when the value in this file is set to - non-zero. +nmi_watchdog +------------ + +Enables/Disables the NMI watchdog on x86 systems. When the value is non-zero +the NMI watchdog is enabled and will continuously test all online cpus to +determine whether or not they are still functioning properly. + +Because the NMI watchdog shares registers with oprofile, by disabling the NMI +watchdog, oprofile may have more registers to utilize. 2.4 /proc/sys/vm - The virtual memory subsystem @@ -1958,6 +1964,22 @@ a queue must be less or equal then msg_max. maximum message size value (it is every message queue's attribute set during its creation). +2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score +------------------------------------------------------ + +This file can be used to adjust the score used to select which processes +should be killed in an out-of-memory situation. Giving it a high score will +increase the likelihood of this process being killed by the oom-killer. Valid +values are in the range -16 to +15, plus the special value -17, which disables +oom-killing altogether for this process. + +2.13 /proc/<pid>/oom_score - Display current oom-killer score +------------------------------------------------------------- + +------------------------------------------------------------------------------ +This file can be used to check the current score used by the oom-killer is for +any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which +process should be killed in an out-of-memory situation. ------------------------------------------------------------------------------ Summary diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt index 60ab61e54e8a..25981e2e51be 100644 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt @@ -70,11 +70,13 @@ tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. What is rootfs? --------------- -Rootfs is a special instance of ramfs, which is always present in 2.6 systems. -(It's used internally as the starting and stopping point for searches of the -kernel's doubly-linked list of mount points.) +Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is +always present in 2.6 systems. You can't unmount rootfs for approximately the +same reason you can't kill the init process; rather than having special code +to check for and handle an empty list, it's smaller and simpler for the kernel +to just make sure certain lists can't become empty. -Most systems just mount another filesystem over it and ignore it. The +Most systems just mount another filesystem over rootfs and ignore it. The amount of space an empty instance of ramfs takes up is tiny. What is initramfs? @@ -92,14 +94,16 @@ out of that. All this differs from the old initrd in several ways: - - The old initrd was a separate file, while the initramfs archive is linked - into the linux kernel image. (The directory linux-*/usr is devoted to - generating this archive during the build.) + - The old initrd was always a separate file, while the initramfs archive is + linked into the linux kernel image. (The directory linux-*/usr is devoted + to generating this archive during the build.) - The old initrd file was a gzipped filesystem image (in some file format, - such as ext2, that had to be built into the kernel), while the new + such as ext2, that needed a driver built into the kernel), while the new initramfs archive is a gzipped cpio archive (like tar only simpler, - see cpio(1) and Documentation/early-userspace/buffer-format.txt). + see cpio(1) and Documentation/early-userspace/buffer-format.txt). The + kernel's cpio extraction code is not only extremely small, it's also + __init data that can be discarded during the boot process. - The program run by the old initrd (which was called /initrd, not /init) did some setup and then returned to the kernel, while the init program from @@ -124,13 +128,14 @@ Populating initramfs: The 2.6 kernel build process always creates a gzipped cpio format initramfs archive and links it into the resulting kernel binary. By default, this -archive is empty (consuming 134 bytes on x86). The config option -CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices -in menuconfig, and living in usr/Kconfig) can be used to specify a source for -the initramfs archive, which will automatically be incorporated into the -resulting binary. This option can point to an existing gzipped cpio archive, a -directory containing files to be archived, or a text file specification such -as the following example: +archive is empty (consuming 134 bytes on x86). + +The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under +devices->block devices in menuconfig, and living in usr/Kconfig) can be used +to specify a source for the initramfs archive, which will automatically be +incorporated into the resulting binary. This option can point to an existing +gzipped cpio archive, a directory containing files to be archived, or a text +file specification such as the following example: dir /dev 755 0 0 nod /dev/console 644 0 0 c 5 1 @@ -146,23 +151,84 @@ as the following example: Run "usr/gen_init_cpio" (after the kernel build) to get a usage message documenting the above file format. -One advantage of the text file is that root access is not required to +One advantage of the configuration file is that root access is not required to set permissions or create device nodes in the new archive. (Note that those two example "file" entries expect to find files named "init.sh" and "busybox" in a directory called "initramfs", under the linux-2.6.* directory. See Documentation/early-userspace/README for more details.) -The kernel does not depend on external cpio tools, gen_init_cpio is created -from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's -boot-time extractor is also (obviously) self-contained. However, if you _do_ -happen to have cpio installed, the following command line can extract the -generated cpio image back into its component files: +The kernel does not depend on external cpio tools. If you specify a +directory instead of a configuration file, the kernel's build infrastructure +creates a configuration file from that directory (usr/Makefile calls +scripts/gen_initramfs_list.sh), and proceeds to package up that directory +using the config file (by feeding it to usr/gen_init_cpio, which is created +from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is +entirely self-contained, and the kernel's boot-time extractor is also +(obviously) self-contained. + +The one thing you might need external cpio utilities installed for is creating +or extracting your own preprepared cpio files to feed to the kernel build +(instead of a config file or directory). + +The following command line can extract a cpio image (either by the above script +or by the kernel build) back into its component files: cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames +The following shell script can create a prebuilt cpio archive you can +use in place of the above config file: + + #!/bin/sh + + # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation. + # Licensed under GPL version 2 + + if [ $# -ne 2 ] + then + echo "usage: mkinitramfs directory imagename.cpio.gz" + exit 1 + fi + + if [ -d "$1" ] + then + echo "creating $2 from $1" + (cd "$1"; find . | cpio -o -H newc | gzip) > "$2" + else + echo "First argument must be a directory" + exit 1 + fi + +Note: The cpio man page contains some bad advice that will break your initramfs +archive if you follow it. It says "A typical way to generate the list +of filenames is with the find command; you should give find the -depth option +to minimize problems with permissions on directories that are unwritable or not +searchable." Don't do this when creating initramfs.cpio.gz images, it won't +work. The Linux kernel cpio extractor won't create files in a directory that +doesn't exist, so the directory entries must go before the files that go in +those directories. The above script gets them in the right order. + +External initramfs images: +-------------------------- + +If the kernel has initrd support enabled, an external cpio.gz archive can also +be passed into a 2.6 kernel in place of an initrd. In this case, the kernel +will autodetect the type (initramfs, not initrd) and extract the external cpio +archive into rootfs before trying to run /init. + +This has the memory efficiency advantages of initramfs (no ramdisk block +device) but the separate packaging of initrd (which is nice if you have +non-GPL code you'd like to run from initramfs, without conflating it with +the GPL licensed Linux kernel binary). + +It can also be used to supplement the kernel's built-in initamfs image. The +files in the external archive will overwrite any conflicting files in +the built-in initramfs archive. Some distributors also prefer to customize +a single kernel image with task-specific initramfs images, without recompiling. + Contents of initramfs: ---------------------- +An initramfs archive is a complete self-contained root filesystem for Linux. If you don't already understand what shared libraries, devices, and paths you need to get a minimal root filesystem up and running, here are some references: @@ -176,13 +242,36 @@ code against, along with some related utilities. It is BSD licensed. I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) myself. These are LGPL and GPL, respectively. (A self-contained initramfs -package is planned for the busybox 1.2 release.) +package is planned for the busybox 1.3 release.) In theory you could use glibc, but that's not well suited for small embedded uses like this. (A "hello world" program statically linked against glibc is over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do name lookups, even when otherwise statically linked.) +A good first step is to get initramfs to run a statically linked "hello world" +program as init, and test it under an emulator like qemu (www.qemu.org) or +User Mode Linux, like so: + + cat > hello.c << EOF + #include <stdio.h> + #include <unistd.h> + + int main(int argc, char *argv[]) + { + printf("Hello world!\n"); + sleep(999999999); + } + EOF + gcc -static hello2.c -o init + echo init | cpio -o -H newc | gzip > test.cpio.gz + # Testing external initramfs using the initrd loading mechanism. + qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero + +When debugging a normal root filesystem, it's nice to be able to boot with +"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's +just as useful. + Why cpio rather than tar? ------------------------- @@ -241,7 +330,7 @@ the above threads) is: Future directions: ------------------ -Today (2.6.14), initramfs is always compiled in, but not always used. The +Today (2.6.16), initramfs is always compiled in, but not always used. The kernel falls back to legacy boot code that is reached only if initramfs does not contain an /init program. The fallback is legacy code, there to ensure a smooth transition and allowing early boot functionality to gradually move to @@ -258,8 +347,9 @@ and so on. This kind of complexity (which inevitably includes policy) is rightly handled in userspace. Both klibc and busybox/uClibc are working on simple initramfs -packages to drop into a kernel build, and when standard solutions are ready -and widely deployed, the kernel's legacy early boot code will become obsolete -and a candidate for the feature removal schedule. +packages to drop into a kernel build. -But that's a while off yet. +The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree. +The kernel's current early boot code (partition detection, etc) will probably +be migrated into a default initramfs, automatically created and used by the +kernel build. diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.txt new file mode 100644 index 000000000000..d6788dae0349 --- /dev/null +++ b/Documentation/filesystems/relay.txt @@ -0,0 +1,479 @@ +relay interface (formerly relayfs) +================================== + +The relay interface provides a means for kernel applications to +efficiently log and transfer large quantities of data from the kernel +to userspace via user-defined 'relay channels'. + +A 'relay channel' is a kernel->user data relay mechanism implemented +as a set of per-cpu kernel buffers ('channel buffers'), each +represented as a regular file ('relay file') in user space. Kernel +clients write into the channel buffers using efficient write +functions; these automatically log into the current cpu's channel +buffer. User space applications mmap() or read() from the relay files +and retrieve the data as it becomes available. The relay files +themselves are files created in a host filesystem, e.g. debugfs, and +are associated with the channel buffers using the API described below. + +The format of the data logged into the channel buffers is completely +up to the kernel client; the relay interface does however provide +hooks which allow kernel clients to impose some structure on the +buffer data. The relay interface doesn't implement any form of data +filtering - this also is left to the kernel client. The purpose is to +keep things as simple as possible. + +This document provides an overview of the relay interface API. The +details of the function parameters are documented along with the +functions in the relay interface code - please see that for details. + +Semantics +========= + +Each relay channel has one buffer per CPU, each buffer has one or more +sub-buffers. Messages are written to the first sub-buffer until it is +too full to contain a new message, in which case it it is written to +the next (if available). Messages are never split across sub-buffers. +At this point, userspace can be notified so it empties the first +sub-buffer, while the kernel continues writing to the next. + +When notified that a sub-buffer is full, the kernel knows how many +bytes of it are padding i.e. unused space occurring because a complete +message couldn't fit into a sub-buffer. Userspace can use this +knowledge to copy only valid data. + +After copying it, userspace can notify the kernel that a sub-buffer +has been consumed. + +A relay channel can operate in a mode where it will overwrite data not +yet collected by userspace, and not wait for it to be consumed. + +The relay channel itself does not provide for communication of such +data between userspace and kernel, allowing the kernel side to remain +simple and not impose a single interface on userspace. It does +provide a set of examples and a separate helper though, described +below. + +The read() interface both removes padding and internally consumes the +read sub-buffers; thus in cases where read(2) is being used to drain +the channel buffers, special-purpose communication between kernel and +user isn't necessary for basic operation. + +One of the major goals of the relay interface is to provide a low +overhead mechanism for conveying kernel data to userspace. While the +read() interface is easy to use, it's not as efficient as the mmap() +approach; the example code attempts to make the tradeoff between the +two approaches as small as possible. + +klog and relay-apps example code +================================ + +The relay interface itself is ready to use, but to make things easier, +a couple simple utility functions and a set of examples are provided. + +The relay-apps example tarball, available on the relay sourceforge +site, contains a set of self-contained examples, each consisting of a +pair of .c files containing boilerplate code for each of the user and +kernel sides of a relay application. When combined these two sets of +boilerplate code provide glue to easily stream data to disk, without +having to bother with mundane housekeeping chores. + +The 'klog debugging functions' patch (klog.patch in the relay-apps +tarball) provides a couple of high-level logging functions to the +kernel which allow writing formatted text or raw data to a channel, +regardless of whether a channel to write into exists or not, or even +whether the relay interface is compiled into the kernel or not. These +functions allow you to put unconditional 'trace' statements anywhere +in the kernel or kernel modules; only when there is a 'klog handler' +registered will data actually be logged (see the klog and kleak +examples for details). + +It is of course possible to use the relay interface from scratch, +i.e. without using any of the relay-apps example code or klog, but +you'll have to implement communication between userspace and kernel, +allowing both to convey the state of buffers (full, empty, amount of +padding). The read() interface both removes padding and internally +consumes the read sub-buffers; thus in cases where read(2) is being +used to drain the channel buffers, special-purpose communication +between kernel and user isn't necessary for basic operation. Things +such as buffer-full conditions would still need to be communicated via +some channel though. + +klog and the relay-apps examples can be found in the relay-apps +tarball on http://relayfs.sourceforge.net + +The relay interface user space API +================================== + +The relay interface implements basic file operations for user space +access to relay channel buffer data. Here are the file operations +that are available and some comments regarding their behavior: + +open() enables user to open an _existing_ channel buffer. + +mmap() results in channel buffer being mapped into the caller's + memory space. Note that you can't do a partial mmap - you + must map the entire file, which is NRBUF * SUBBUFSIZE. + +read() read the contents of a channel buffer. The bytes read are + 'consumed' by the reader, i.e. they won't be available + again to subsequent reads. If the channel is being used + in no-overwrite mode (the default), it can be read at any + time even if there's an active kernel writer. If the + channel is being used in overwrite mode and there are + active channel writers, results may be unpredictable - + users should make sure that all logging to the channel has + ended before using read() with overwrite mode. Sub-buffer + padding is automatically removed and will not be seen by + the reader. + +sendfile() transfer data from a channel buffer to an output file + descriptor. Sub-buffer padding is automatically removed + and will not be seen by the reader. + +poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are + notified when sub-buffer boundaries are crossed. + +close() decrements the channel buffer's refcount. When the refcount + reaches 0, i.e. when no process or kernel client has the + buffer open, the channel buffer is freed. + +In order for a user application to make use of relay files, the +host filesystem must be mounted. For example, + + mount -t debugfs debugfs /debug + +NOTE: the host filesystem doesn't need to be mounted for kernel + clients to create or use channels - it only needs to be + mounted when user space applications need access to the buffer + data. + + +The relay interface kernel API +============================== + +Here's a summary of the API the relay interface provides to in-kernel clients: + +TBD(curr. line MT:/API/) + channel management functions: + + relay_open(base_filename, parent, subbuf_size, n_subbufs, + callbacks) + relay_close(chan) + relay_flush(chan) + relay_reset(chan) + + channel management typically called on instigation of userspace: + + relay_subbufs_consumed(chan, cpu, subbufs_consumed) + + write functions: + + relay_write(chan, data, length) + __relay_write(chan, data, length) + relay_reserve(chan, length) + + callbacks: + + subbuf_start(buf, subbuf, prev_subbuf, prev_padding) + buf_mapped(buf, filp) + buf_unmapped(buf, filp) + create_buf_file(filename, parent, mode, buf, is_global) + remove_buf_file(dentry) + + helper functions: + + relay_buf_full(buf) + subbuf_start_reserve(buf, length) + + +Creating a channel +------------------ + +relay_open() is used to create a channel, along with its per-cpu +channel buffers. Each channel buffer will have an associated file +created for it in the host filesystem, which can be and mmapped or +read from in user space. The files are named basename0...basenameN-1 +where N is the number of online cpus, and by default will be created +in the root of the filesystem (if the parent param is NULL). If you +want a directory structure to contain your relay files, you should +create it using the host filesystem's directory creation function, +e.g. debugfs_create_dir(), and pass the parent directory to +relay_open(). Users are responsible for cleaning up any directory +structure they create, when the channel is closed - again the host +filesystem's directory removal functions should be used for that, +e.g. debugfs_remove(). + +In order for a channel to be created and the host filesystem's files +associated with its channel buffers, the user must provide definitions +for two callback functions, create_buf_file() and remove_buf_file(). +create_buf_file() is called once for each per-cpu buffer from +relay_open() and allows the user to create the file which will be used +to represent the corresponding channel buffer. The callback should +return the dentry of the file created to represent the channel buffer. +remove_buf_file() must also be defined; it's responsible for deleting +the file(s) created in create_buf_file() and is called during +relay_close(). + +Here are some typical definitions for these callbacks, in this case +using debugfs: + +/* + * create_buf_file() callback. Creates relay file in debugfs. + */ +static struct dentry *create_buf_file_handler(const char *filename, + struct dentry *parent, + int mode, + struct rchan_buf *buf, + int *is_global) +{ + return debugfs_create_file(filename, mode, parent, buf, + &relay_file_operations); +} + +/* + * remove_buf_file() callback. Removes relay file from debugfs. + */ +static int remove_buf_file_handler(struct dentry *dentry) +{ + debugfs_remove(dentry); + + return 0; +} + +/* + * relay interface callbacks + */ +static struct rchan_callbacks relay_callbacks = +{ + .create_buf_file = create_buf_file_handler, + .remove_buf_file = remove_buf_file_handler, +}; + +And an example relay_open() invocation using them: + + chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks); + +If the create_buf_file() callback fails, or isn't defined, channel +creation and thus relay_open() will fail. + +The total size of each per-cpu buffer is calculated by multiplying the +number of sub-buffers by the sub-buffer size passed into relay_open(). +The idea behind sub-buffers is that they're basically an extension of +double-buffering to N buffers, and they also allow applications to +easily implement random-access-on-buffer-boundary schemes, which can +be important for some high-volume applications. The number and size +of sub-buffers is completely dependent on the application and even for +the same application, different conditions will warrant different +values for these parameters at different times. Typically, the right +values to use are best decided after some experimentation; in general, +though, it's safe to assume that having only 1 sub-buffer is a bad +idea - you're guaranteed to either overwrite data or lose events +depending on the channel mode being used. + +The create_buf_file() implementation can also be defined in such a way +as to allow the creation of a single 'global' buffer instead of the +default per-cpu set. This can be useful for applications interested +mainly in seeing the relative ordering of system-wide events without +the need to bother with saving explicit timestamps for the purpose of +merging/sorting per-cpu files in a postprocessing step. + +To have relay_open() create a global buffer, the create_buf_file() +implementation should set the value of the is_global outparam to a +non-zero value in addition to creating the file that will be used to +represent the single buffer. In the case of a global buffer, +create_buf_file() and remove_buf_file() will be called only once. The +normal channel-writing functions, e.g. relay_write(), can still be +used - writes from any cpu will transparently end up in the global +buffer - but since it is a global buffer, callers should make sure +they use the proper locking for such a buffer, either by wrapping +writes in a spinlock, or by copying a write function from relay.h and +creating a local version that internally does the proper locking. + +Channel 'modes' +--------------- + +relay channels can be used in either of two modes - 'overwrite' or +'no-overwrite'. The mode is entirely determined by the implementation +of the subbuf_start() callback, as described below. The default if no +subbuf_start() callback is defined is 'no-overwrite' mode. If the +default mode suits your needs, and you plan to use the read() +interface to retrieve channel data, you can ignore the details of this +section, as it pertains mainly to mmap() implementations. + +In 'overwrite' mode, also known as 'flight recorder' mode, writes +continuously cycle around the buffer and will never fail, but will +unconditionally overwrite old data regardless of whether it's actually +been consumed. In no-overwrite mode, writes will fail, i.e. data will +be lost, if the number of unconsumed sub-buffers equals the total +number of sub-buffers in the channel. It should be clear that if +there is no consumer or if the consumer can't consume sub-buffers fast +enough, data will be lost in either case; the only difference is +whether data is lost from the beginning or the end of a buffer. + +As explained above, a relay channel is made of up one or more +per-cpu channel buffers, each implemented as a circular buffer +subdivided into one or more sub-buffers. Messages are written into +the current sub-buffer of the channel's current per-cpu buffer via the +write functions described below. Whenever a message can't fit into +the current sub-buffer, because there's no room left for it, the +client is notified via the subbuf_start() callback that a switch to a +new sub-buffer is about to occur. The client uses this callback to 1) +initialize the next sub-buffer if appropriate 2) finalize the previous +sub-buffer if appropriate and 3) return a boolean value indicating +whether or not to actually move on to the next sub-buffer. + +To implement 'no-overwrite' mode, the userspace client would provide +an implementation of the subbuf_start() callback something like the +following: + +static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) +{ + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + if (relay_buf_full(buf)) + return 0; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; +} + +If the current buffer is full, i.e. all sub-buffers remain unconsumed, +the callback returns 0 to indicate that the buffer switch should not +occur yet, i.e. until the consumer has had a chance to read the +current set of ready sub-buffers. For the relay_buf_full() function +to make sense, the consumer is reponsible for notifying the relay +interface when sub-buffers have been consumed via +relay_subbufs_consumed(). Any subsequent attempts to write into the +buffer will again invoke the subbuf_start() callback with the same +parameters; only when the consumer has consumed one or more of the +ready sub-buffers will relay_buf_full() return 0, in which case the +buffer switch can continue. + +The implementation of the subbuf_start() callback for 'overwrite' mode +would be very similar: + +static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) +{ + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; +} + +In this case, the relay_buf_full() check is meaningless and the +callback always returns 1, causing the buffer switch to occur +unconditionally. It's also meaningless for the client to use the +relay_subbufs_consumed() function in this mode, as it's never +consulted. + +The default subbuf_start() implementation, used if the client doesn't +define any callbacks, or doesn't define the subbuf_start() callback, +implements the simplest possible 'no-overwrite' mode, i.e. it does +nothing but return 0. + +Header information can be reserved at the beginning of each sub-buffer +by calling the subbuf_start_reserve() helper function from within the +subbuf_start() callback. This reserved area can be used to store +whatever information the client wants. In the example above, room is +reserved in each sub-buffer to store the padding count for that +sub-buffer. This is filled in for the previous sub-buffer in the +subbuf_start() implementation; the padding value for the previous +sub-buffer is passed into the subbuf_start() callback along with a +pointer to the previous sub-buffer, since the padding value isn't +known until a sub-buffer is filled. The subbuf_start() callback is +also called for the first sub-buffer when the channel is opened, to +give the client a chance to reserve space in it. In this case the +previous sub-buffer pointer passed into the callback will be NULL, so +the client should check the value of the prev_subbuf pointer before +writing into the previous sub-buffer. + +Writing to a channel +-------------------- + +Kernel clients write data into the current cpu's channel buffer using +relay_write() or __relay_write(). relay_write() is the main logging +function - it uses local_irqsave() to protect the buffer and should be +used if you might be logging from interrupt context. If you know +you'll never be logging from interrupt context, you can use +__relay_write(), which only disables preemption. These functions +don't return a value, so you can't determine whether or not they +failed - the assumption is that you wouldn't want to check a return +value in the fast logging path anyway, and that they'll always succeed +unless the buffer is full and no-overwrite mode is being used, in +which case you can detect a failed write in the subbuf_start() +callback by calling the relay_buf_full() helper function. + +relay_reserve() is used to reserve a slot in a channel buffer which +can be written to later. This would typically be used in applications +that need to write directly into a channel buffer without having to +stage data in a temporary buffer beforehand. Because the actual write +may not happen immediately after the slot is reserved, applications +using relay_reserve() can keep a count of the number of bytes actually +written, either in space reserved in the sub-buffers themselves or as +a separate array. See the 'reserve' example in the relay-apps tarball +at http://relayfs.sourceforge.net for an example of how this can be +done. Because the write is under control of the client and is +separated from the reserve, relay_reserve() doesn't protect the buffer +at all - it's up to the client to provide the appropriate +synchronization when using relay_reserve(). + +Closing a channel +----------------- + +The client calls relay_close() when it's finished using the channel. +The channel and its associated buffers are destroyed when there are no +longer any references to any of the channel buffers. relay_flush() +forces a sub-buffer switch on all the channel buffers, and can be used +to finalize and process the last sub-buffers before the channel is +closed. + +Misc +---- + +Some applications may want to keep a channel around and re-use it +rather than open and close a new channel for each use. relay_reset() +can be used for this purpose - it resets a channel to its initial +state without reallocating channel buffer memory or destroying +existing mappings. It should however only be called when it's safe to +do so, i.e. when the channel isn't currently being written to. + +Finally, there are a couple of utility callbacks that can be used for +different purposes. buf_mapped() is called whenever a channel buffer +is mmapped from user space and buf_unmapped() is called when it's +unmapped. The client can use this notification to trigger actions +within the kernel application, such as enabling/disabling logging to +the channel. + + +Resources +========= + +For news, example code, mailing list, etc. see the relay interface homepage: + + http://relayfs.sourceforge.net + + +Credits +======= + +The ideas and specs for the relay interface came about as a result of +discussions on tracing involving the following: + +Michel Dagenais <michel.dagenais@polymtl.ca> +Richard Moore <richardj_moore@uk.ibm.com> +Bob Wisniewski <bob@watson.ibm.com> +Karim Yaghmour <karim@opersys.com> +Tom Zanussi <zanussi@us.ibm.com> + +Also thanks to Hubertus Franke for a lot of useful suggestions and bug +reports. diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt deleted file mode 100644 index 5832377b7340..000000000000 --- a/Documentation/filesystems/relayfs.txt +++ /dev/null @@ -1,442 +0,0 @@ - -relayfs - a high-speed data relay filesystem -============================================ - -relayfs is a filesystem designed to provide an efficient mechanism for -tools and facilities to relay large and potentially sustained streams -of data from kernel space to user space. - -The main abstraction of relayfs is the 'channel'. A channel consists -of a set of per-cpu kernel buffers each represented by a file in the -relayfs filesystem. Kernel clients write into a channel using -efficient write functions which automatically log to the current cpu's -channel buffer. User space applications mmap() the per-cpu files and -retrieve the data as it becomes available. - -The format of the data logged into the channel buffers is completely -up to the relayfs client; relayfs does however provide hooks which -allow clients to impose some structure on the buffer data. Nor does -relayfs implement any form of data filtering - this also is left to -the client. The purpose is to keep relayfs as simple as possible. - -This document provides an overview of the relayfs API. The details of -the function parameters are documented along with the functions in the -filesystem code - please see that for details. - -Semantics -========= - -Each relayfs channel has one buffer per CPU, each buffer has one or -more sub-buffers. Messages are written to the first sub-buffer until -it is too full to contain a new message, in which case it it is -written to the next (if available). Messages are never split across -sub-buffers. At this point, userspace can be notified so it empties -the first sub-buffer, while the kernel continues writing to the next. - -When notified that a sub-buffer is full, the kernel knows how many -bytes of it are padding i.e. unused. Userspace can use this knowledge -to copy only valid data. - -After copying it, userspace can notify the kernel that a sub-buffer -has been consumed. - -relayfs can operate in a mode where it will overwrite data not yet -collected by userspace, and not wait for it to consume it. - -relayfs itself does not provide for communication of such data between -userspace and kernel, allowing the kernel side to remain simple and -not impose a single interface on userspace. It does provide a set of -examples and a separate helper though, described below. - -klog and relay-apps example code -================================ - -relayfs itself is ready to use, but to make things easier, a couple -simple utility functions and a set of examples are provided. - -The relay-apps example tarball, available on the relayfs sourceforge -site, contains a set of self-contained examples, each consisting of a -pair of .c files containing boilerplate code for each of the user and -kernel sides of a relayfs application; combined these two sets of -boilerplate code provide glue to easily stream data to disk, without -having to bother with mundane housekeeping chores. - -The 'klog debugging functions' patch (klog.patch in the relay-apps -tarball) provides a couple of high-level logging functions to the -kernel which allow writing formatted text or raw data to a channel, -regardless of whether a channel to write into exists or not, or -whether relayfs is compiled into the kernel or is configured as a -module. These functions allow you to put unconditional 'trace' -statements anywhere in the kernel or kernel modules; only when there -is a 'klog handler' registered will data actually be logged (see the -klog and kleak examples for details). - -It is of course possible to use relayfs from scratch i.e. without -using any of the relay-apps example code or klog, but you'll have to -implement communication between userspace and kernel, allowing both to -convey the state of buffers (full, empty, amount of padding). - -klog and the relay-apps examples can be found in the relay-apps -tarball on http://relayfs.sourceforge.net - - -The relayfs user space API -========================== - -relayfs implements basic file operations for user space access to -relayfs channel buffer data. Here are the file operations that are -available and some comments regarding their behavior: - -open() enables user to open an _existing_ buffer. - -mmap() results in channel buffer being mapped into the caller's - memory space. Note that you can't do a partial mmap - you must - map the entire file, which is NRBUF * SUBBUFSIZE. - -read() read the contents of a channel buffer. The bytes read are - 'consumed' by the reader i.e. they won't be available again - to subsequent reads. If the channel is being used in - no-overwrite mode (the default), it can be read at any time - even if there's an active kernel writer. If the channel is - being used in overwrite mode and there are active channel - writers, results may be unpredictable - users should make - sure that all logging to the channel has ended before using - read() with overwrite mode. - -poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are - notified when sub-buffer boundaries are crossed. - -close() decrements the channel buffer's refcount. When the refcount - reaches 0 i.e. when no process or kernel client has the buffer - open, the channel buffer is freed. - - -In order for a user application to make use of relayfs files, the -relayfs filesystem must be mounted. For example, - - mount -t relayfs relayfs /mnt/relay - -NOTE: relayfs doesn't need to be mounted for kernel clients to create - or use channels - it only needs to be mounted when user space - applications need access to the buffer data. - - -The relayfs kernel API -====================== - -Here's a summary of the API relayfs provides to in-kernel clients: - - - channel management functions: - - relay_open(base_filename, parent, subbuf_size, n_subbufs, - callbacks) - relay_close(chan) - relay_flush(chan) - relay_reset(chan) - relayfs_create_dir(name, parent) - relayfs_remove_dir(dentry) - relayfs_create_file(name, parent, mode, fops, data) - relayfs_remove_file(dentry) - - channel management typically called on instigation of userspace: - - relay_subbufs_consumed(chan, cpu, subbufs_consumed) - - write functions: - - relay_write(chan, data, length) - __relay_write(chan, data, length) - relay_reserve(chan, length) - - callbacks: - - subbuf_start(buf, subbuf, prev_subbuf, prev_padding) - buf_mapped(buf, filp) - buf_unmapped(buf, filp) - create_buf_file(filename, parent, mode, buf, is_global) - remove_buf_file(dentry) - - helper functions: - - relay_buf_full(buf) - subbuf_start_reserve(buf, length) - - -Creating a channel ------------------- - -relay_open() is used to create a channel, along with its per-cpu -channel buffers. Each channel buffer will have an associated file -created for it in the relayfs filesystem, which can be opened and -mmapped from user space if desired. The files are named -basename0...basenameN-1 where N is the number of online cpus, and by -default will be created in the root of the filesystem. If you want a -directory structure to contain your relayfs files, you can create it -with relayfs_create_dir() and pass the parent directory to -relay_open(). Clients are responsible for cleaning up any directory -structure they create when the channel is closed - use -relayfs_remove_dir() for that. - -The total size of each per-cpu buffer is calculated by multiplying the -number of sub-buffers by the sub-buffer size passed into relay_open(). -The idea behind sub-buffers is that they're basically an extension of -double-buffering to N buffers, and they also allow applications to -easily implement random-access-on-buffer-boundary schemes, which can -be important for some high-volume applications. The number and size -of sub-buffers is completely dependent on the application and even for -the same application, different conditions will warrant different -values for these parameters at different times. Typically, the right -values to use are best decided after some experimentation; in general, -though, it's safe to assume that having only 1 sub-buffer is a bad -idea - you're guaranteed to either overwrite data or lose events -depending on the channel mode being used. - -Channel 'modes' ---------------- - -relayfs channels can be used in either of two modes - 'overwrite' or -'no-overwrite'. The mode is entirely determined by the implementation -of the subbuf_start() callback, as described below. In 'overwrite' -mode, also known as 'flight recorder' mode, writes continuously cycle -around the buffer and will never fail, but will unconditionally -overwrite old data regardless of whether it's actually been consumed. -In no-overwrite mode, writes will fail i.e. data will be lost, if the -number of unconsumed sub-buffers equals the total number of -sub-buffers in the channel. It should be clear that if there is no -consumer or if the consumer can't consume sub-buffers fast enought, -data will be lost in either case; the only difference is whether data -is lost from the beginning or the end of a buffer. - -As explained above, a relayfs channel is made of up one or more -per-cpu channel buffers, each implemented as a circular buffer -subdivided into one or more sub-buffers. Messages are written into -the current sub-buffer of the channel's current per-cpu buffer via the -write functions described below. Whenever a message can't fit into -the current sub-buffer, because there's no room left for it, the -client is notified via the subbuf_start() callback that a switch to a -new sub-buffer is about to occur. The client uses this callback to 1) -initialize the next sub-buffer if appropriate 2) finalize the previous -sub-buffer if appropriate and 3) return a boolean value indicating -whether or not to actually go ahead with the sub-buffer switch. - -To implement 'no-overwrite' mode, the userspace client would provide -an implementation of the subbuf_start() callback something like the -following: - -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - unsigned int prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; - - if (relay_buf_full(buf)) - return 0; - - subbuf_start_reserve(buf, sizeof(unsigned int)); - - return 1; -} - -If the current buffer is full i.e. all sub-buffers remain unconsumed, -the callback returns 0 to indicate that the buffer switch should not -occur yet i.e. until the consumer has had a chance to read the current -set of ready sub-buffers. For the relay_buf_full() function to make -sense, the consumer is reponsible for notifying relayfs when -sub-buffers have been consumed via relay_subbufs_consumed(). Any -subsequent attempts to write into the buffer will again invoke the -subbuf_start() callback with the same parameters; only when the -consumer has consumed one or more of the ready sub-buffers will -relay_buf_full() return 0, in which case the buffer switch can -continue. - -The implementation of the subbuf_start() callback for 'overwrite' mode -would be very similar: - -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - unsigned int prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; - - subbuf_start_reserve(buf, sizeof(unsigned int)); - - return 1; -} - -In this case, the relay_buf_full() check is meaningless and the -callback always returns 1, causing the buffer switch to occur -unconditionally. It's also meaningless for the client to use the -relay_subbufs_consumed() function in this mode, as it's never -consulted. - -The default subbuf_start() implementation, used if the client doesn't -define any callbacks, or doesn't define the subbuf_start() callback, -implements the simplest possible 'no-overwrite' mode i.e. it does -nothing but return 0. - -Header information can be reserved at the beginning of each sub-buffer -by calling the subbuf_start_reserve() helper function from within the -subbuf_start() callback. This reserved area can be used to store -whatever information the client wants. In the example above, room is -reserved in each sub-buffer to store the padding count for that -sub-buffer. This is filled in for the previous sub-buffer in the -subbuf_start() implementation; the padding value for the previous -sub-buffer is passed into the subbuf_start() callback along with a -pointer to the previous sub-buffer, since the padding value isn't -known until a sub-buffer is filled. The subbuf_start() callback is -also called for the first sub-buffer when the channel is opened, to -give the client a chance to reserve space in it. In this case the -previous sub-buffer pointer passed into the callback will be NULL, so -the client should check the value of the prev_subbuf pointer before -writing into the previous sub-buffer. - -Writing to a channel --------------------- - -kernel clients write data into the current cpu's channel buffer using -relay_write() or __relay_write(). relay_write() is the main logging -function - it uses local_irqsave() to protect the buffer and should be -used if you might be logging from interrupt context. If you know -you'll never be logging from interrupt context, you can use -__relay_write(), which only disables preemption. These functions -don't return a value, so you can't determine whether or not they -failed - the assumption is that you wouldn't want to check a return -value in the fast logging path anyway, and that they'll always succeed -unless the buffer is full and no-overwrite mode is being used, in -which case you can detect a failed write in the subbuf_start() -callback by calling the relay_buf_full() helper function. - -relay_reserve() is used to reserve a slot in a channel buffer which -can be written to later. This would typically be used in applications -that need to write directly into a channel buffer without having to -stage data in a temporary buffer beforehand. Because the actual write -may not happen immediately after the slot is reserved, applications -using relay_reserve() can keep a count of the number of bytes actually -written, either in space reserved in the sub-buffers themselves or as -a separate array. See the 'reserve' example in the relay-apps tarball -at http://relayfs.sourceforge.net for an example of how this can be -done. Because the write is under control of the client and is -separated from the reserve, relay_reserve() doesn't protect the buffer -at all - it's up to the client to provide the appropriate -synchronization when using relay_reserve(). - -Closing a channel ------------------ - -The client calls relay_close() when it's finished using the channel. -The channel and its associated buffers are destroyed when there are no -longer any references to any of the channel buffers. relay_flush() -forces a sub-buffer switch on all the channel buffers, and can be used -to finalize and process the last sub-buffers before the channel is -closed. - -Creating non-relay files ------------------------- - -relay_open() automatically creates files in the relayfs filesystem to -represent the per-cpu kernel buffers; it's often useful for -applications to be able to create their own files alongside the relay -files in the relayfs filesystem as well e.g. 'control' files much like -those created in /proc or debugfs for similar purposes, used to -communicate control information between the kernel and user sides of a -relayfs application. For this purpose the relayfs_create_file() and -relayfs_remove_file() API functions exist. For relayfs_create_file(), -the caller passes in a set of user-defined file operations to be used -for the file and an optional void * to a user-specified data item, -which will be accessible via inode->u.generic_ip (see the relay-apps -tarball for examples). The file_operations are a required parameter -to relayfs_create_file() and thus the semantics of these files are -completely defined by the caller. - -See the relay-apps tarball at http://relayfs.sourceforge.net for -examples of how these non-relay files are meant to be used. - -Creating relay files in other filesystems ------------------------------------------ - -By default of course, relay_open() creates relay files in the relayfs -filesystem. Because relay_file_operations is exported, however, it's -also possible to create and use relay files in other pseudo-filesytems -such as debugfs. - -For this purpose, two callback functions are provided, -create_buf_file() and remove_buf_file(). create_buf_file() is called -once for each per-cpu buffer from relay_open() to allow the client to -create a file to be used to represent the corresponding buffer; if -this callback is not defined, the default implementation will create -and return a file in the relayfs filesystem to represent the buffer. -The callback should return the dentry of the file created to represent -the relay buffer. Note that the parent directory passed to -relay_open() (and passed along to the callback), if specified, must -exist in the same filesystem the new relay file is created in. If -create_buf_file() is defined, remove_buf_file() must also be defined; -it's responsible for deleting the file(s) created in create_buf_file() -and is called during relay_close(). - -The create_buf_file() implementation can also be defined in such a way -as to allow the creation of a single 'global' buffer instead of the -default per-cpu set. This can be useful for applications interested -mainly in seeing the relative ordering of system-wide events without -the need to bother with saving explicit timestamps for the purpose of -merging/sorting per-cpu files in a postprocessing step. - -To have relay_open() create a global buffer, the create_buf_file() -implementation should set the value of the is_global outparam to a -non-zero value in addition to creating the file that will be used to -represent the single buffer. In the case of a global buffer, -create_buf_file() and remove_buf_file() will be called only once. The -normal channel-writing functions e.g. relay_write() can still be used -- writes from any cpu will transparently end up in the global buffer - -but since it is a global buffer, callers should make sure they use the -proper locking for such a buffer, either by wrapping writes in a -spinlock, or by copying a write function from relayfs_fs.h and -creating a local version that internally does the proper locking. - -See the 'exported-relayfile' examples in the relay-apps tarball for -examples of creating and using relay files in debugfs. - -Misc ----- - -Some applications may want to keep a channel around and re-use it -rather than open and close a new channel for each use. relay_reset() -can be used for this purpose - it resets a channel to its initial -state without reallocating channel buffer memory or destroying -existing mappings. It should however only be called when it's safe to -do so i.e. when the channel isn't currently being written to. - -Finally, there are a couple of utility callbacks that can be used for -different purposes. buf_mapped() is called whenever a channel buffer -is mmapped from user space and buf_unmapped() is called when it's -unmapped. The client can use this notification to trigger actions -within the kernel application, such as enabling/disabling logging to -the channel. - - -Resources -========= - -For news, example code, mailing list, etc. see the relayfs homepage: - - http://relayfs.sourceforge.net - - -Credits -======= - -The ideas and specs for relayfs came about as a result of discussions -on tracing involving the following: - -Michel Dagenais <michel.dagenais@polymtl.ca> -Richard Moore <richardj_moore@uk.ibm.com> -Bob Wisniewski <bob@watson.ibm.com> -Karim Yaghmour <karim@opersys.com> -Tom Zanussi <zanussi@us.ibm.com> - -Also thanks to Hubertus Franke for a lot of useful suggestions and bug -reports. diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 9d3aed628bc1..1cb7e8be927a 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -113,8 +113,8 @@ members are defined: struct file_system_type { const char *name; int fs_flags; - struct int (*get_sb) (struct file_system_type *, int, - const char *, void *, struct vfsmount *); + int (*get_sb) (struct file_system_type *, int, + const char *, void *, struct vfsmount *); void (*kill_sb) (struct super_block *); struct module *owner; struct file_system_type * next; diff --git a/Documentation/hwmon/abituguru b/Documentation/hwmon/abituguru index 69cdb527d58f..b2c0d61b39a2 100644 --- a/Documentation/hwmon/abituguru +++ b/Documentation/hwmon/abituguru @@ -2,13 +2,36 @@ Kernel driver abituguru ======================= Supported chips: - * Abit uGuru (Hardware Monitor part only) + * Abit uGuru revision 1-3 (Hardware Monitor part only) Prefix: 'abituguru' Addresses scanned: ISA 0x0E0 Datasheet: Not available, this driver is based on reverse engineering. A "Datasheet" has been written based on the reverse engineering it should be available in the same dir as this file under the name abituguru-datasheet. + Note: + The uGuru is a microcontroller with onboard firmware which programs + it to behave as a hwmon IC. There are many different revisions of the + firmware and thus effectivly many different revisions of the uGuru. + Below is an incomplete list with which revisions are used for which + Motherboards: + uGuru 1.00 ~ 1.24 (AI7, KV8-MAX3, AN7) (1) + uGuru 2.0.0.0 ~ 2.0.4.2 (KV8-PRO) + uGuru 2.1.0.0 ~ 2.1.2.8 (AS8, AV8, AA8, AG8, AA8XE, AX8) + uGuru 2.2.0.0 ~ 2.2.0.6 (AA8 Fatal1ty) + uGuru 2.3.0.0 ~ 2.3.0.9 (AN8) + uGuru 3.0.0.0 ~ 3.0.1.2 (AW8, AL8, NI8) + uGuru 4.xxxxx? (AT8 32X) (2) + 1) For revisions 2 and 3 uGuru's the driver can autodetect the + sensortype (Volt or Temp) for bank1 sensors, for revision 1 uGuru's + this doesnot always work. For these uGuru's the autodection can + be overriden with the bank1_types module param. For all 3 known + revison 1 motherboards the correct use of this param is: + bank1_types=1,1,0,0,0,0,0,2,0,0,0,0,2,0,0,1 + You may also need to specify the fan_sensors option for these boards + fan_sensors=5 + 2) The current version of the abituguru driver is known to NOT work + on these Motherboards Authors: Hans de Goede <j.w.r.degoede@hhs.nl>, @@ -22,6 +45,11 @@ Module Parameters * force: bool Force detection. Note this parameter only causes the detection to be skipped, if the uGuru can't be read the module initialization (insmod) will still fail. +* bank1_types: int[] Bank1 sensortype autodetection override: + -1 autodetect (default) + 0 volt sensor + 1 temp sensor + 2 not connected * fan_sensors: int Tell the driver how many fan speed sensors there are on your motherboard. Default: 0 (autodetect). * pwms: int Tell the driver how many fan speed controls (fan @@ -29,7 +57,7 @@ Module Parameters * verbose: int How verbose should the driver be? (0-3): 0 normal output 1 + verbose error reporting - 2 + sensors type probing info\n" + 2 + sensors type probing info (default) 3 + retryable error reporting Default: 2 (the driver is still in the testing phase) diff --git a/Documentation/hwmon/it87 b/Documentation/hwmon/it87 index 9555be1ed999..e783fd62e308 100644 --- a/Documentation/hwmon/it87 +++ b/Documentation/hwmon/it87 @@ -13,12 +13,25 @@ Supported chips: from Super I/O config space (8 I/O ports) Datasheet: Publicly available at the ITE website http://www.ite.com.tw/ + * IT8716F + Prefix: 'it8716' + Addresses scanned: from Super I/O config space (8 I/O ports) + Datasheet: Publicly available at the ITE website + http://www.ite.com.tw/product_info/file/pc/IT8716F_V0.3.ZIP + * IT8718F + Prefix: 'it8718' + Addresses scanned: from Super I/O config space (8 I/O ports) + Datasheet: Publicly available at the ITE website + http://www.ite.com.tw/product_info/file/pc/IT8718F_V0.2.zip + http://www.ite.com.tw/product_info/file/pc/IT8718F_V0%203_(for%20C%20version).zip * SiS950 [clone of IT8705F] Prefix: 'it87' Addresses scanned: from Super I/O config space (8 I/O ports) Datasheet: No longer be available -Author: Christophe Gauthron <chrisg@0-in.com> +Authors: + Christophe Gauthron <chrisg@0-in.com> + Jean Delvare <khali@linux-fr.org> Module Parameters @@ -43,26 +56,46 @@ Module Parameters Description ----------- -This driver implements support for the IT8705F, IT8712F and SiS950 chips. - -This driver also supports IT8712F, which adds SMBus access, and a VID -input, used to report the Vcore voltage of the Pentium processor. -The IT8712F additionally features VID inputs. +This driver implements support for the IT8705F, IT8712F, IT8716F, +IT8718F and SiS950 chips. These chips are 'Super I/O chips', supporting floppy disks, infrared ports, joysticks and other miscellaneous stuff. For hardware monitoring, they include an 'environment controller' with 3 temperature sensors, 3 fan rotation speed sensors, 8 voltage sensors, and associated alarms. +The IT8712F and IT8716F additionally feature VID inputs, used to report +the Vcore voltage of the processor. The early IT8712F have 5 VID pins, +the IT8716F and late IT8712F have 6. They are shared with other functions +though, so the functionality may not be available on a given system. +The driver dumbly assume it is there. + +The IT8718F also features VID inputs (up to 8 pins) but the value is +stored in the Super-I/O configuration space. Due to technical limitations, +this value can currently only be read once at initialization time, so +the driver won't notice and report changes in the VID value. The two +upper VID bits share their pins with voltage inputs (in5 and in6) so you +can't have both on a given board. + +The IT8716F, IT8718F and later IT8712F revisions have support for +2 additional fans. They are not yet supported by the driver. + +The IT8716F and IT8718F, and late IT8712F and IT8705F also have optional +16-bit tachometer counters for fans 1 to 3. This is better (no more fan +clock divider mess) but not compatible with the older chips and +revisions. For now, the driver only uses the 16-bit mode on the +IT8716F and IT8718F. + Temperatures are measured in degrees Celsius. An alarm is triggered once when the Overtemperature Shutdown limit is crossed. Fan rotation speeds are reported in RPM (rotations per minute). An alarm is -triggered if the rotation speed has dropped below a programmable limit. Fan -readings can be divided by a programmable divider (1, 2, 4 or 8) to give the -readings more range or accuracy. Not all RPM values can accurately be -represented, so some rounding is done. With a divider of 2, the lowest -representable value is around 2600 RPM. +triggered if the rotation speed has dropped below a programmable limit. When +16-bit tachometer counters aren't used, fan readings can be divided by +a programmable divider (1, 2, 4 or 8) to give the readings more range or +accuracy. With a divider of 2, the lowest representable value is around +2600 RPM. Not all RPM values can accurately be represented, so some rounding +is done. Voltage sensors (also known as IN sensors) report their values in volts. An alarm is triggered if the voltage has crossed a programmable minimum or @@ -71,9 +104,9 @@ zero'; this is important for negative voltage measurements. All voltage inputs can measure voltages between 0 and 4.08 volts, with a resolution of 0.016 volt. The battery voltage in8 does not have limit registers. -The VID lines (IT8712F only) encode the core voltage value: the voltage -level your processor should work with. This is hardcoded by the mainboard -and/or processor itself. It is a value in volts. +The VID lines (IT8712F/IT8716F/IT8718F) encode the core voltage value: +the voltage level your processor should work with. This is hardcoded by +the mainboard and/or processor itself. It is a value in volts. If an alarm triggers, it will remain triggered until the hardware register is read at least once. This means that the cause for the alarm may already diff --git a/Documentation/hwmon/k8temp b/Documentation/hwmon/k8temp new file mode 100644 index 000000000000..bab445ab0f52 --- /dev/null +++ b/Documentation/hwmon/k8temp @@ -0,0 +1,52 @@ +Kernel driver k8temp +==================== + +Supported chips: + * AMD K8 CPU + Prefix: 'k8temp' + Addresses scanned: PCI space + Datasheet: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/32559.pdf + +Author: Rudolf Marek +Contact: Rudolf Marek <r.marek@sh.cvut.cz> + +Description +----------- + +This driver permits reading temperature sensor(s) embedded inside AMD K8 CPUs. +Official documentation says that it works from revision F of K8 core, but +in fact it seems to be implemented for all revisions of K8 except the first +two revisions (SH-B0 and SH-B3). + +There can be up to four temperature sensors inside single CPU. The driver +will auto-detect the sensors and will display only temperatures from +implemented sensors. + +Mapping of /sys files is as follows: + +temp1_input - temperature of Core 0 and "place" 0 +temp2_input - temperature of Core 0 and "place" 1 +temp3_input - temperature of Core 1 and "place" 0 +temp4_input - temperature of Core 1 and "place" 1 + +Temperatures are measured in degrees Celsius and measurement resolution is +1 degree C. It is expected that future CPU will have better resolution. The +temperature is updated once a second. Valid temperatures are from -49 to +206 degrees C. + +Temperature known as TCaseMax was specified for processors up to revision E. +This temperature is defined as temperature between heat-spreader and CPU +case, so the internal CPU temperature supplied by this driver can be higher. +There is no easy way how to measure the temperature which will correlate +with TCaseMax temperature. + +For newer revisions of CPU (rev F, socket AM2) there is a mathematically +computed temperature called TControl, which must be lower than TControlMax. + +The relationship is following: + +temp1_input - TjOffset*2 < TControlMax, + +TjOffset is not yet exported by the driver, TControlMax is usually +70 degrees C. The rule of the thumb -> CPU temperature should not cross +60 degrees C too much. diff --git a/Documentation/hwmon/vt1211 b/Documentation/hwmon/vt1211 new file mode 100644 index 000000000000..77fa633b97a8 --- /dev/null +++ b/Documentation/hwmon/vt1211 @@ -0,0 +1,206 @@ +Kernel driver vt1211 +==================== + +Supported chips: + * VIA VT1211 + Prefix: 'vt1211' + Addresses scanned: none, address read from Super-I/O config space + Datasheet: Provided by VIA upon request and under NDA + +Authors: Juerg Haefliger <juergh@gmail.com> + +This driver is based on the driver for kernel 2.4 by Mark D. Studebaker and +its port to kernel 2.6 by Lars Ekman. + +Thanks to Joseph Chan and Fiona Gatt from VIA for providing documentation and +technical support. + + +Module Parameters +----------------- + +* uch_config: int Override the BIOS default universal channel (UCH) + configuration for channels 1-5. + Legal values are in the range of 0-31. Bit 0 maps to + UCH1, bit 1 maps to UCH2 and so on. Setting a bit to 1 + enables the thermal input of that particular UCH and + setting a bit to 0 enables the voltage input. + +* int_mode: int Override the BIOS default temperature interrupt mode. + The only possible value is 0 which forces interrupt + mode 0. In this mode, any pending interrupt is cleared + when the status register is read but is regenerated as + long as the temperature stays above the hysteresis + limit. + +Be aware that overriding BIOS defaults might cause some unwanted side effects! + + +Description +----------- + +The VIA VT1211 Super-I/O chip includes complete hardware monitoring +capabilities. It monitors 2 dedicated temperature sensor inputs (temp1 and +temp2), 1 dedicated voltage (in5) and 2 fans. Additionally, the chip +implements 5 universal input channels (UCH1-5) that can be individually +programmed to either monitor a voltage or a temperature. + +This chip also provides manual and automatic control of fan speeds (according +to the datasheet). The driver only supports automatic control since the manual +mode doesn't seem to work as advertised in the datasheet. In fact I couldn't +get manual mode to work at all! Be aware that automatic mode hasn't been +tested very well (due to the fact that my EPIA M10000 doesn't have the fans +connected to the PWM outputs of the VT1211 :-(). + +The following table shows the relationship between the vt1211 inputs and the +sysfs nodes. + +Sensor Voltage Mode Temp Mode Default Use (from the datasheet) +------ ------------ --------- -------------------------------- +Reading 1 temp1 Intel thermal diode +Reading 3 temp2 Internal thermal diode +UCH1/Reading2 in0 temp3 NTC type thermistor +UCH2 in1 temp4 +2.5V +UCH3 in2 temp5 VccP (processor core) +UCH4 in3 temp6 +5V +UCH5 in4 temp7 +12V ++3.3V in5 Internal VCC (+3.3V) + + +Voltage Monitoring +------------------ + +Voltages are sampled by an 8-bit ADC with a LSB of ~10mV. The supported input +range is thus from 0 to 2.60V. Voltage values outside of this range need +external scaling resistors. This external scaling needs to be compensated for +via compute lines in sensors.conf, like: + +compute inx @*(1+R1/R2), @/(1+R1/R2) + +The board level scaling resistors according to VIA's recommendation are as +follows. And this is of course totally dependent on the actual board +implementation :-) You will have to find documentation for your own +motherboard and edit sensors.conf accordingly. + + Expected +Voltage R1 R2 Divider Raw Value +----------------------------------------------- ++2.5V 2K 10K 1.2 2083 mV +VccP --- --- 1.0 1400 mV (1) ++5V 14K 10K 2.4 2083 mV ++12V 47K 10K 5.7 2105 mV ++3.3V (int) 2K 3.4K 1.588 3300 mV (2) ++3.3V (ext) 6.8K 10K 1.68 1964 mV + +(1) Depending on the CPU (1.4V is for a VIA C3 Nehemiah). +(2) R1 and R2 for 3.3V (int) are internal to the VT1211 chip and the driver + performs the scaling and returns the properly scaled voltage value. + +Each measured voltage has an associated low and high limit which triggers an +alarm when crossed. + + +Temperature Monitoring +---------------------- + +Temperatures are reported in millidegree Celsius. Each measured temperature +has a high limit which triggers an alarm if crossed. There is an associated +hysteresis value with each temperature below which the temperature has to drop +before the alarm is cleared (this is only true for interrupt mode 0). The +interrupt mode can be forced to 0 in case the BIOS doesn't do it +automatically. See the 'Module Parameters' section for details. + +All temperature channels except temp2 are external. Temp2 is the VT1211 +internal thermal diode and the driver does all the scaling for temp2 and +returns the temperature in millidegree Celsius. For the external channels +temp1 and temp3-temp7, scaling depends on the board implementation and needs +to be performed in userspace via sensors.conf. + +Temp1 is an Intel-type thermal diode which requires the following formula to +convert between sysfs readings and real temperatures: + +compute temp1 (@-Offset)/Gain, (@*Gain)+Offset + +According to the VIA VT1211 BIOS porting guide, the following gain and offset +values should be used: + +Diode Type Offset Gain +---------- ------ ---- +Intel CPU 88.638 0.9528 + 65.000 0.9686 *) +VIA C3 Ezra 83.869 0.9528 +VIA C3 Ezra-T 73.869 0.9528 + +*) This is the formula from the lm_sensors 2.10.0 sensors.conf file. I don't +know where it comes from or how it was derived, it's just listed here for +completeness. + +Temp3-temp7 support NTC thermistors. For these channels, the driver returns +the voltages as seen at the individual pins of UCH1-UCH5. The voltage at the +pin (Vpin) is formed by a voltage divider made of the thermistor (Rth) and a +scaling resistor (Rs): + +Vpin = 2200 * Rth / (Rs + Rth) (2200 is the ADC max limit of 2200 mV) + +The equation for the thermistor is as follows (google it if you want to know +more about it): + +Rth = Ro * exp(B * (1 / T - 1 / To)) (To is 298.15K (25C) and Ro is the + nominal resistance at 25C) + +Mingling the above two equations and assuming Rs = Ro and B = 3435 yields the +following formula for sensors.conf: + +compute tempx 1 / (1 / 298.15 - (` (2200 / @ - 1)) / 3435) - 273.15, + 2200 / (1 + (^ (3435 / 298.15 - 3435 / (273.15 + @)))) + + +Fan Speed Control +----------------- + +The VT1211 provides 2 programmable PWM outputs to control the speeds of 2 +fans. Writing a 2 to any of the two pwm[1-2]_enable sysfs nodes will put the +PWM controller in automatic mode. There is only a single controller that +controls both PWM outputs but each PWM output can be individually enabled and +disabled. + +Each PWM has 4 associated distinct output duty-cycles: full, high, low and +off. Full and off are internally hard-wired to 255 (100%) and 0 (0%), +respectively. High and low can be programmed via +pwm[1-2]_auto_point[2-3]_pwm. Each PWM output can be associated with a +different thermal input but - and here's the weird part - only one set of +thermal thresholds exist that controls both PWMs output duty-cycles. The +thermal thresholds are accessible via pwm[1-2]_auto_point[1-4]_temp. Note +that even though there are 2 sets of 4 auto points each, they map to the same +registers in the VT1211 and programming one set is sufficient (actually only +the first set pwm1_auto_point[1-4]_temp is writable, the second set is +read-only). + +PWM Auto Point PWM Output Duty-Cycle +------------------------------------------------ +pwm[1-2]_auto_point4_pwm full speed duty-cycle (hard-wired to 255) +pwm[1-2]_auto_point3_pwm high speed duty-cycle +pwm[1-2]_auto_point2_pwm low speed duty-cycle +pwm[1-2]_auto_point1_pwm off duty-cycle (hard-wired to 0) + +Temp Auto Point Thermal Threshold +--------------------------------------------- +pwm[1-2]_auto_point4_temp full speed temp +pwm[1-2]_auto_point3_temp high speed temp +pwm[1-2]_auto_point2_temp low speed temp +pwm[1-2]_auto_point1_temp off temp + +Long story short, the controller implements the following algorithm to set the +PWM output duty-cycle based on the input temperature: + +Thermal Threshold Output Duty-Cycle + (Rising Temp) (Falling Temp) +---------------------------------------------------------- + full speed duty-cycle full speed duty-cycle +full speed temp + high speed duty-cycle full speed duty-cycle +high speed temp + low speed duty-cycle high speed duty-cycle +low speed temp + off duty-cycle low speed duty-cycle +off temp diff --git a/Documentation/hwmon/w83627ehf b/Documentation/hwmon/w83627ehf new file mode 100644 index 000000000000..fae3b781d82d --- /dev/null +++ b/Documentation/hwmon/w83627ehf @@ -0,0 +1,85 @@ +Kernel driver w83627ehf +======================= + +Supported chips: + * Winbond W83627EHF/EHG (ISA access ONLY) + Prefix: 'w83627ehf' + Addresses scanned: ISA address retrieved from Super I/O registers + Datasheet: http://www.winbond-usa.com/products/winbond_products/pdfs/PCIC/W83627EHF_%20W83627EHGb.pdf + +Authors: + Jean Delvare <khali@linux-fr.org> + Yuan Mu (Winbond) + Rudolf Marek <r.marek@sh.cvut.cz> + +Description +----------- + +This driver implements support for the Winbond W83627EHF and W83627EHG +super I/O chips. We will refer to them collectively as Winbond chips. + +The chips implement three temperature sensors, five fan rotation +speed sensors, ten analog voltage sensors, alarms with beep warnings (control +unimplemented), and some automatic fan regulation strategies (plus manual +fan control mode). + +Temperatures are measured in degrees Celsius and measurement resolution is 1 +degC for temp1 and 0.5 degC for temp2 and temp3. An alarm is triggered when +the temperature gets higher than high limit; it stays on until the temperature +falls below the Hysteresis value. + +Fan rotation speeds are reported in RPM (rotations per minute). An alarm is +triggered if the rotation speed has dropped below a programmable limit. Fan +readings can be divided by a programmable divider (1, 2, 4, 8, 16, 32, 64 or +128) to give the readings more range or accuracy. The driver sets the most +suitable fan divisor itself. Some fans might not be present because they +share pins with other functions. + +Voltage sensors (also known as IN sensors) report their values in millivolts. +An alarm is triggered if the voltage has crossed a programmable minimum +or maximum limit. + +The driver supports automatic fan control mode known as Thermal Cruise. +In this mode, the chip attempts to keep the measured temperature in a +predefined temperature range. If the temperature goes out of range, fan +is driven slower/faster to reach the predefined range again. + +The mode works for fan1-fan4. Mapping of temperatures to pwm outputs is as +follows: + +temp1 -> pwm1 +temp2 -> pwm2 +temp3 -> pwm3 +prog -> pwm4 (the programmable setting is not supported by the driver) + +/sys files +---------- + +pwm[1-4] - this file stores PWM duty cycle or DC value (fan speed) in range: + 0 (stop) to 255 (full) + +pwm[1-4]_enable - this file controls mode of fan/temperature control: + * 1 Manual Mode, write to pwm file any value 0-255 (full speed) + * 2 Thermal Cruise + +Thermal Cruise mode +------------------- + +If the temperature is in the range defined by: + +pwm[1-4]_target - set target temperature, unit millidegree Celcius + (range 0 - 127000) +pwm[1-4]_tolerance - tolerance, unit millidegree Celcius (range 0 - 15000) + +there are no changes to fan speed. Once the temperature leaves the interval, +fan speed increases (temp is higher) or decreases if lower than desired. +There are defined steps and times, but not exported by the driver yet. + +pwm[1-4]_min_output - minimum fan speed (range 1 - 255), when the temperature + is below defined range. +pwm[1-4]_stop_time - how many milliseconds [ms] must elapse to switch + corresponding fan off. (when the temperature was below + defined range). + +Note: last two functions are influenced by other control bits, not yet exported + by the driver, so a change might not have any effect. diff --git a/Documentation/hwmon/w83791d b/Documentation/hwmon/w83791d index 83a3836289c2..19b2ed739fa1 100644 --- a/Documentation/hwmon/w83791d +++ b/Documentation/hwmon/w83791d @@ -5,7 +5,7 @@ Supported chips: * Winbond W83791D Prefix: 'w83791d' Addresses scanned: I2C 0x2c - 0x2f - Datasheet: http://www.winbond-usa.com/products/winbond_products/pdfs/PCIC/W83791Da.pdf + Datasheet: http://www.winbond-usa.com/products/winbond_products/pdfs/PCIC/W83791D_W83791Gb.pdf Author: Charles Spirakis <bezaur@gmail.com> @@ -20,6 +20,9 @@ Credits: Chunhao Huang <DZShen@Winbond.com.tw>, Rudolf Marek <r.marek@sh.cvut.cz> +Additional contributors: + Sven Anders <anders@anduras.de> + Module Parameters ----------------- @@ -46,7 +49,8 @@ Module Parameters Description ----------- -This driver implements support for the Winbond W83791D chip. +This driver implements support for the Winbond W83791D chip. The W83791G +chip appears to be the same as the W83791D but is lead free. Detection of the chip can sometimes be foiled because it can be in an internal state that allows no clean access (Bank with ID register is not @@ -71,34 +75,36 @@ Voltage sensors (also known as IN sensors) report their values in millivolts. An alarm is triggered if the voltage has crossed a programmable minimum or maximum limit. -Alarms are provided as output from a "realtime status register". The -following bits are defined: - -bit - alarm on: -0 - Vcore -1 - VINR0 -2 - +3.3VIN -3 - 5VDD -4 - temp1 -5 - temp2 -6 - fan1 -7 - fan2 -8 - +12VIN -9 - -12VIN -10 - -5VIN -11 - fan3 -12 - chassis -13 - temp3 -14 - VINR1 -15 - reserved -16 - tart1 -17 - tart2 -18 - tart3 -19 - VSB -20 - VBAT -21 - fan4 -22 - fan5 -23 - reserved +The bit ordering for the alarm "realtime status register" and the +"beep enable registers" are different. + +in0 (VCORE) : alarms: 0x000001 beep_enable: 0x000001 +in1 (VINR0) : alarms: 0x000002 beep_enable: 0x002000 <== mismatch +in2 (+3.3VIN): alarms: 0x000004 beep_enable: 0x000004 +in3 (5VDD) : alarms: 0x000008 beep_enable: 0x000008 +in4 (+12VIN) : alarms: 0x000100 beep_enable: 0x000100 +in5 (-12VIN) : alarms: 0x000200 beep_enable: 0x000200 +in6 (-5VIN) : alarms: 0x000400 beep_enable: 0x000400 +in7 (VSB) : alarms: 0x080000 beep_enable: 0x010000 <== mismatch +in8 (VBAT) : alarms: 0x100000 beep_enable: 0x020000 <== mismatch +in9 (VINR1) : alarms: 0x004000 beep_enable: 0x004000 +temp1 : alarms: 0x000010 beep_enable: 0x000010 +temp2 : alarms: 0x000020 beep_enable: 0x000020 +temp3 : alarms: 0x002000 beep_enable: 0x000002 <== mismatch +fan1 : alarms: 0x000040 beep_enable: 0x000040 +fan2 : alarms: 0x000080 beep_enable: 0x000080 +fan3 : alarms: 0x000800 beep_enable: 0x000800 +fan4 : alarms: 0x200000 beep_enable: 0x200000 +fan5 : alarms: 0x400000 beep_enable: 0x400000 +tart1 : alarms: 0x010000 beep_enable: 0x040000 <== mismatch +tart2 : alarms: 0x020000 beep_enable: 0x080000 <== mismatch +tart3 : alarms: 0x040000 beep_enable: 0x100000 <== mismatch +case_open : alarms: 0x001000 beep_enable: 0x001000 +user_enable : alarms: -------- beep_enable: 0x800000 + +*** NOTE: It is the responsibility of user-space code to handle the fact +that the beep enable and alarm bits are in different positions when using that +feature of the chip. When an alarm goes off, you can be warned by a beeping signal through your computer speaker. It is possible to enable all beeping globally, or only @@ -109,5 +115,6 @@ often will do no harm, but will return 'old' values. W83791D TODO: --------------- -Provide a patch for per-file alarms as discussed on the mailing list +Provide a patch for per-file alarms and beep enables as defined in the hwmon + documentation (Documentation/hwmon/sysfs-interface) Provide a patch for smart-fan control (still need appropriate motherboard/fans) diff --git a/Documentation/i2c/busses/i2c-sis96x b/Documentation/i2c/busses/i2c-sis96x index 00a009b977e9..08d7b2dac69a 100644 --- a/Documentation/i2c/busses/i2c-sis96x +++ b/Documentation/i2c/busses/i2c-sis96x @@ -42,8 +42,8 @@ I suspect that this driver could be made to work for the following SiS chipsets as well: 635, and 635T. If anyone owns a board with those chips AND is willing to risk crashing & burning an otherwise well-behaved kernel in the name of progress... please contact me at <mhoffman@lightlink.com> or -via the project's mailing list: <lm-sensors@lm-sensors.org>. Please -send bug reports and/or success stories as well. +via the project's mailing list: <i2c@lm-sensors.org>. Please send bug +reports and/or success stories as well. TO DOs diff --git a/Documentation/i2c/busses/i2c-viapro b/Documentation/i2c/busses/i2c-viapro index 16775663b9f5..25680346e0ac 100644 --- a/Documentation/i2c/busses/i2c-viapro +++ b/Documentation/i2c/busses/i2c-viapro @@ -7,9 +7,12 @@ Supported adapters: * VIA Technologies, Inc. VT82C686A/B Datasheet: Sometimes available at the VIA website - * VIA Technologies, Inc. VT8231, VT8233, VT8233A, VT8235, VT8237R + * VIA Technologies, Inc. VT8231, VT8233, VT8233A Datasheet: available on request from VIA + * VIA Technologies, Inc. VT8235, VT8237R, VT8237A, VT8251 + Datasheet: available on request and under NDA from VIA + Authors: Kyösti Mälkki <kmalkki@cc.hut.fi>, Mark D. Studebaker <mdsxyz123@yahoo.com>, @@ -39,6 +42,8 @@ Your lspci -n listing must show one of these : device 1106:8235 (VT8231 function 4) device 1106:3177 (VT8235) device 1106:3227 (VT8237R) + device 1106:3337 (VT8237A) + device 1106:3287 (VT8251) If none of these show up, you should look in the BIOS for settings like enable ACPI / SMBus or even USB. diff --git a/Documentation/i2c/i2c-stub b/Documentation/i2c/i2c-stub index d6dcb138abf5..9cc081e69764 100644 --- a/Documentation/i2c/i2c-stub +++ b/Documentation/i2c/i2c-stub @@ -6,9 +6,12 @@ This module is a very simple fake I2C/SMBus driver. It implements four types of SMBus commands: write quick, (r/w) byte, (r/w) byte data, and (r/w) word data. +You need to provide a chip address as a module parameter when loading +this driver, which will then only react to SMBus commands to this address. + No hardware is needed nor associated with this module. It will accept write -quick commands to all addresses; it will respond to the other commands (also -to all addresses) by reading from or writing to an array in memory. It will +quick commands to one address; it will respond to the other commands (also +to one address) by reading from or writing to an array in memory. It will also spam the kernel logs for every command it handles. A pointer register with auto-increment is implemented for all byte @@ -21,6 +24,11 @@ The typical use-case is like this: 3. load the target sensors chip driver module 4. observe its behavior in the kernel log +PARAMETERS: + +int chip_addr: + The SMBus address to emulate a chip at. + CAVEATS: There are independent arrays for byte/data and word/data commands. Depending @@ -33,6 +41,9 @@ If the hardware for your driver has banked registers (e.g. Winbond sensors chips) this module will not work well - although it could be extended to support that pretty easily. +Only one chip address is supported - although this module could be +extended to support more. + If you spam it hard enough, printk can be lossy. This module really wants something like relayfs. diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt index 10312bebe55d..c51314b1a463 100644 --- a/Documentation/i386/boot.txt +++ b/Documentation/i386/boot.txt @@ -181,6 +181,7 @@ filled out, however: 5 ELILO 7 GRuB 8 U-BOOT + 9 Xen Please contact <hpa@zytor.com> if you need a bootloader ID value assigned. diff --git a/Documentation/i386/zero-page.txt b/Documentation/i386/zero-page.txt index df28c7416781..c04a421f4a7c 100644 --- a/Documentation/i386/zero-page.txt +++ b/Documentation/i386/zero-page.txt @@ -63,6 +63,10 @@ Offset Type Description 2 for bootsect-loader 3 for SYSLINUX 4 for ETHERBOOT + 5 for ELILO + 7 for GRuB + 8 for U-BOOT + 9 for Xen V = version 0x211 char loadflags: bit0 = 1: kernel is loaded high (bzImage) diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt index 187035560d7f..864ff3283780 100644 --- a/Documentation/infiniband/ipoib.txt +++ b/Documentation/infiniband/ipoib.txt @@ -51,8 +51,6 @@ Debugging Information References - IETF IP over InfiniBand (ipoib) Working Group - http://ietf.org/html.charters/ipoib-charter.html Transmission of IP over InfiniBand (IPoIB) (RFC 4391) http://ietf.org/rfc/rfc4391.txt IP over InfiniBand (IPoIB) Architecture (RFC 4392) diff --git a/Documentation/initrd.txt b/Documentation/initrd.txt index 7de1c80cd719..15f1b35deb34 100644 --- a/Documentation/initrd.txt +++ b/Documentation/initrd.txt @@ -67,12 +67,27 @@ initrd adds the following new options: as the last process has closed it, all data is freed and /dev/initrd can't be opened anymore. - root=/dev/ram0 (without devfs) - root=/dev/rd/0 (with devfs) + root=/dev/ram0 initrd is mounted as root, and the normal boot procedure is followed, with the RAM disk still mounted as root. +Compressed cpio images +---------------------- + +Recent kernels have support for populating a ramdisk from a compressed cpio +archive, on such systems, the creation of a ramdisk image doesn't need to +involve special block devices or loopbacks, you merely create a directory on +disk with the desired initrd content, cd to that directory, and run (as an +example): + +find . | cpio --quiet -c -o | gzip -9 -n > /boot/imagefile.img + +Examining the contents of an existing image file is just as simple: + +mkdir /tmp/imagefile +cd /tmp/imagefile +gzip -cd /boot/imagefile.img | cpio -imd --quiet Installation ------------ @@ -90,8 +105,7 @@ you're building an install floppy), the root file system creation procedure should create the /initrd directory. If initrd will not be mounted in some cases, its content is still -accessible if the following device has been created (note that this -does not work if using devfs): +accessible if the following device has been created: # mknod /dev/initrd b 1 250 # chmod 400 /dev/initrd @@ -119,8 +133,7 @@ We'll describe the loopback device method: (if space is critical, you may want to use the Minix FS instead of Ext2) 3) mount the file system, e.g. # mount -t ext2 -o loop initrd /mnt - 4) create the console device (not necessary if using devfs, but it can't - hurt to do it anyway): + 4) create the console device: # mkdir /mnt/dev # mknod /mnt/dev/console c 5 1 5) copy all the files that are needed to properly use the initrd @@ -152,12 +165,7 @@ have to be given: root=/dev/ram0 init=/linuxrc rw -if not using devfs, or - - root=/dev/rd/0 init=/linuxrc rw - -if using devfs. (rw is only necessary if writing to the initrd file -system.) +(rw is only necessary if writing to the initrd file system.) With LOADLIN, you simply execute @@ -217,9 +225,9 @@ following command: # exec chroot . what-follows <dev/console >dev/console 2>&1 Where what-follows is a program under the new root, e.g. /sbin/init -If the new root file system will be used with devfs and has no valid -/dev directory, devfs must be mounted before invoking chroot in order to -provide /dev/console. +If the new root file system will be used with udev and has no valid +/dev directory, udev must be initialized before invoking chroot in order +to provide /dev/console. Note: implementation details of pivot_root may change with time. In order to ensure compatibility, the following points should be observed: @@ -236,7 +244,7 @@ Now, the initrd can be unmounted and the memory allocated by the RAM disk can be freed: # umount /initrd -# blockdev --flushbufs /dev/ram0 # /dev/rd/0 if using devfs +# blockdev --flushbufs /dev/ram0 It is also possible to use initrd with an NFS-mounted root, see the pivot_root(8) man page for details. diff --git a/Documentation/input/joystick.txt b/Documentation/input/joystick.txt index d53b857a3710..841c353297e6 100644 --- a/Documentation/input/joystick.txt +++ b/Documentation/input/joystick.txt @@ -39,7 +39,6 @@ them. Bug reports and success stories are also welcome. The input project website is at: - http://www.suse.cz/development/input/ http://atrey.karlin.mff.cuni.cz/~vojtech/input/ There is also a mailing list for the driver at: diff --git a/Documentation/ioctl-number.txt b/Documentation/ioctl-number.txt index 1543802ef53e..edc04d74ae23 100644 --- a/Documentation/ioctl-number.txt +++ b/Documentation/ioctl-number.txt @@ -119,7 +119,6 @@ Code Seq# Include File Comments 'c' 00-7F linux/comstats.h conflict! 'c' 00-7F linux/coda.h conflict! 'd' 00-FF linux/char/drm/drm/h conflict! -'d' 00-1F linux/devfs_fs.h conflict! 'd' 00-DF linux/video_decoder.h conflict! 'd' F0-FF linux/digi1.h 'e' all linux/digi1.h conflict! diff --git a/Documentation/irqflags-tracing.txt b/Documentation/irqflags-tracing.txt new file mode 100644 index 000000000000..6a444877ee0b --- /dev/null +++ b/Documentation/irqflags-tracing.txt @@ -0,0 +1,57 @@ +IRQ-flags state tracing + +started by Ingo Molnar <mingo@redhat.com> + +the "irq-flags tracing" feature "traces" hardirq and softirq state, in +that it gives interested subsystems an opportunity to be notified of +every hardirqs-off/hardirqs-on, softirqs-off/softirqs-on event that +happens in the kernel. + +CONFIG_TRACE_IRQFLAGS_SUPPORT is needed for CONFIG_PROVE_SPIN_LOCKING +and CONFIG_PROVE_RW_LOCKING to be offered by the generic lock debugging +code. Otherwise only CONFIG_PROVE_MUTEX_LOCKING and +CONFIG_PROVE_RWSEM_LOCKING will be offered on an architecture - these +are locking APIs that are not used in IRQ context. (the one exception +for rwsems is worked around) + +architecture support for this is certainly not in the "trivial" +category, because lots of lowlevel assembly code deal with irq-flags +state changes. But an architecture can be irq-flags-tracing enabled in a +rather straightforward and risk-free manner. + +Architectures that want to support this need to do a couple of +code-organizational changes first: + +- move their irq-flags manipulation code from their asm/system.h header + to asm/irqflags.h + +- rename local_irq_disable()/etc to raw_local_irq_disable()/etc. so that + the linux/irqflags.h code can inject callbacks and can construct the + real local_irq_disable()/etc APIs. + +- add and enable TRACE_IRQFLAGS_SUPPORT in their arch level Kconfig file + +and then a couple of functional changes are needed as well to implement +irq-flags-tracing support: + +- in lowlevel entry code add (build-conditional) calls to the + trace_hardirqs_off()/trace_hardirqs_on() functions. The lock validator + closely guards whether the 'real' irq-flags matches the 'virtual' + irq-flags state, and complains loudly (and turns itself off) if the + two do not match. Usually most of the time for arch support for + irq-flags-tracing is spent in this state: look at the lockdep + complaint, try to figure out the assembly code we did not cover yet, + fix and repeat. Once the system has booted up and works without a + lockdep complaint in the irq-flags-tracing functions arch support is + complete. +- if the architecture has non-maskable interrupts then those need to be + excluded from the irq-tracing [and lock validation] mechanism via + lockdep_off()/lockdep_on(). + +in general there is no risk from having an incomplete irq-flags-tracing +implementation in an architecture: lockdep will detect that and will +turn itself off. I.e. the lock validator will still be reliable. There +should be no crashes due to irq-tracing bugs. (except if the assembly +changes break other code by modifying conditions or registers that +shouldnt be) + diff --git a/Documentation/kbuild/kconfig-language.txt b/Documentation/kbuild/kconfig-language.txt index ca1967f36423..003fccc14d24 100644 --- a/Documentation/kbuild/kconfig-language.txt +++ b/Documentation/kbuild/kconfig-language.txt @@ -67,19 +67,19 @@ applicable everywhere (see syntax). - default value: "default" <expr> ["if" <expr>] A config option can have any number of default values. If multiple default values are visible, only the first defined one is active. - Default values are not limited to the menu entry, where they are - defined, this means the default can be defined somewhere else or be + Default values are not limited to the menu entry where they are + defined. This means the default can be defined somewhere else or be overridden by an earlier definition. The default value is only assigned to the config symbol if no other value was set by the user (via the input prompt above). If an input prompt is visible the default value is presented to the user and can be overridden by him. - Optionally dependencies only for this default value can be added with + Optionally, dependencies only for this default value can be added with "if". - dependencies: "depends on"/"requires" <expr> This defines a dependency for this menu entry. If multiple - dependencies are defined they are connected with '&&'. Dependencies + dependencies are defined, they are connected with '&&'. Dependencies are applied to all other options within this menu entry (which also accept an "if" expression), so these two examples are equivalent: @@ -153,7 +153,7 @@ Nonconstant symbols are the most common ones and are defined with the 'config' statement. Nonconstant symbols consist entirely of alphanumeric characters or underscores. Constant symbols are only part of expressions. Constant symbols are -always surrounded by single or double quotes. Within the quote any +always surrounded by single or double quotes. Within the quote, any other character is allowed and the quotes can be escaped using '\'. Menu structure @@ -237,7 +237,7 @@ choices: <choice block> "endchoice" -This defines a choice group and accepts any of above attributes as +This defines a choice group and accepts any of the above attributes as options. A choice can only be of type bool or tristate, while a boolean choice only allows a single config entry to be selected, a tristate choice also allows any number of config entries to be set to 'm'. This diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index a9c00facdf40..e2cbd59cf2d0 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -22,7 +22,7 @@ This document describes the Linux kernel Makefiles. === 4 Host Program support --- 4.1 Simple Host Program --- 4.2 Composite Host Programs - --- 4.3 Defining shared libraries + --- 4.3 Defining shared libraries --- 4.4 Using C++ for host programs --- 4.5 Controlling compiler options for host programs --- 4.6 When host programs are actually built @@ -69,7 +69,7 @@ architecture-specific information to the top Makefile. Each subdirectory has a kbuild Makefile which carries out the commands passed down from above. The kbuild Makefile uses information from the -.config file to construct various file lists used by kbuild to build +.config file to construct various file lists used by kbuild to build any built-in or modular targets. scripts/Makefile.* contains all the definitions/rules etc. that @@ -86,7 +86,7 @@ any kernel Makefiles (or any other source files). *Normal developers* are people who work on features such as device drivers, file systems, and network protocols. These people need to -maintain the kbuild Makefiles for the subsystem that they are +maintain the kbuild Makefiles for the subsystem they are working on. In order to do this effectively, they need some overall knowledge about the kernel Makefiles, plus detailed knowledge about the public interface for kbuild. @@ -104,10 +104,10 @@ This document is aimed towards normal developers and arch developers. === 3 The kbuild files Most Makefiles within the kernel are kbuild Makefiles that use the -kbuild infrastructure. This chapter introduce the syntax used in the +kbuild infrastructure. This chapter introduces the syntax used in the kbuild makefiles. The preferred name for the kbuild files are 'Makefile' but 'Kbuild' can -be used and if both a 'Makefile' and a 'Kbuild' file exists then the 'Kbuild' +be used and if both a 'Makefile' and a 'Kbuild' file exists, then the 'Kbuild' file will be used. Section 3.1 "Goal definitions" is a quick intro, further chapters provide @@ -124,7 +124,7 @@ more details, with real examples. Example: obj-y += foo.o - This tell kbuild that there is one object in that directory named + This tell kbuild that there is one object in that directory, named foo.o. foo.o will be built from foo.c or foo.S. If foo.o shall be built as a module, the variable obj-m is used. @@ -140,7 +140,7 @@ more details, with real examples. --- 3.2 Built-in object goals - obj-y The kbuild Makefile specifies object files for vmlinux - in the lists $(obj-y). These lists depend on the kernel + in the $(obj-y) lists. These lists depend on the kernel configuration. Kbuild compiles all the $(obj-y) files. It then calls @@ -154,8 +154,8 @@ more details, with real examples. Link order is significant, because certain functions (module_init() / __initcall) will be called during boot in the order they appear. So keep in mind that changing the link - order may e.g. change the order in which your SCSI - controllers are detected, and thus you disks are renumbered. + order may e.g. change the order in which your SCSI + controllers are detected, and thus your disks are renumbered. Example: #drivers/isdn/i4l/Makefile @@ -203,11 +203,11 @@ more details, with real examples. Example: #fs/ext2/Makefile obj-$(CONFIG_EXT2_FS) += ext2.o - ext2-y := balloc.o bitmap.o + ext2-y := balloc.o bitmap.o ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o - - In this example xattr.o is only part of the composite object - ext2.o, if $(CONFIG_EXT2_FS_XATTR) evaluates to 'y'. + + In this example, xattr.o is only part of the composite object + ext2.o if $(CONFIG_EXT2_FS_XATTR) evaluates to 'y'. Note: Of course, when you are building objects into the kernel, the syntax above will also work. So, if you have CONFIG_EXT2_FS=y, @@ -221,16 +221,16 @@ more details, with real examples. --- 3.5 Library file goals - lib-y - Objects listed with obj-* are used for modules or + Objects listed with obj-* are used for modules, or combined in a built-in.o for that specific directory. There is also the possibility to list objects that will be included in a library, lib.a. All objects listed with lib-y are combined in a single library for that directory. - Objects that are listed in obj-y and additional listed in + Objects that are listed in obj-y and additionaly listed in lib-y will not be included in the library, since they will anyway be accessible. - For consistency objects listed in lib-m will be included in lib.a. + For consistency, objects listed in lib-m will be included in lib.a. Note that the same kbuild makefile may list files to be built-in and to be part of a library. Therefore the same directory @@ -241,11 +241,11 @@ more details, with real examples. lib-y := checksum.o delay.o This will create a library lib.a based on checksum.o and delay.o. - For kbuild to actually recognize that there is a lib.a being build + For kbuild to actually recognize that there is a lib.a being built, the directory shall be listed in libs-y. See also "6.3 List directories to visit when descending". - - Usage of lib-y is normally restricted to lib/ and arch/*/lib. + + Use of lib-y is normally restricted to lib/ and arch/*/lib. --- 3.6 Descending down in directories @@ -255,7 +255,7 @@ more details, with real examples. invoke make recursively in subdirectories, provided you let it know of them. - To do so obj-y and obj-m are used. + To do so, obj-y and obj-m are used. ext2 lives in a separate directory, and the Makefile present in fs/ tells kbuild to descend down using the following assignment. @@ -353,8 +353,8 @@ more details, with real examples. Special rules are used when the kbuild infrastructure does not provide the required support. A typical example is header files generated during the build process. - Another example is the architecture specific Makefiles which - needs special rules to prepare boot images etc. + Another example are the architecture specific Makefiles which + need special rules to prepare boot images etc. Special rules are written as normal Make rules. Kbuild is not executing in the directory where the Makefile is @@ -387,28 +387,47 @@ more details, with real examples. --- 3.11 $(CC) support functions - The kernel may be build with several different versions of + The kernel may be built with several different versions of $(CC), each supporting a unique set of features and options. kbuild provide basic support to check for valid options for $(CC). $(CC) is useally the gcc compiler, but other alternatives are available. as-option - as-option is used to check if $(CC) when used to compile - assembler (*.S) files supports the given option. An optional - second option may be specified if first option are not supported. + as-option is used to check if $(CC) -- when used to compile + assembler (*.S) files -- supports the given option. An optional + second option may be specified if the first option is not supported. Example: #arch/sh/Makefile cflags-y += $(call as-option,-Wa$(comma)-isa=$(isa-y),) - In the above example cflags-y will be assinged the the option + In the above example, cflags-y will be assigned the option -Wa$(comma)-isa=$(isa-y) if it is supported by $(CC). The second argument is optional, and if supplied will be used if first argument is not supported. + ld-option + ld-option is used to check if $(CC) when used to link object files + supports the given option. An optional second option may be + specified if first option are not supported. + + Example: + #arch/i386/kernel/Makefile + vsyscall-flags += $(call ld-option, -Wl$(comma)--hash-style=sysv) + + In the above example vsyscall-flags will be assigned the option + -Wl$(comma)--hash-style=sysv if it is supported by $(CC). + The second argument is optional, and if supplied will be used + if first argument is not supported. + + as-instr + as-instr checks if the assembler reports a specific instruction + and then outputs either option1 or option2 + C escapes are supported in the test instruction + cc-option - cc-option is used to check if $(CC) support a given option, and not + cc-option is used to check if $(CC) supports a given option, and not supported to use an optional second option. Example: @@ -416,12 +435,12 @@ more details, with real examples. cflags-y += $(call cc-option,-march=pentium-mmx,-march=i586) In the above example cflags-y will be assigned the option - -march=pentium-mmx if supported by $(CC), otherwise -march-i586. - The second argument to cc-option is optional, and if omitted + -march=pentium-mmx if supported by $(CC), otherwise -march=i586. + The second argument to cc-option is optional, and if omitted, cflags-y will be assigned no value if first option is not supported. cc-option-yn - cc-option-yn is used to check if gcc supports a given option + cc-option-yn is used to check if gcc supports a given option and return 'y' if supported, otherwise 'n'. Example: @@ -429,32 +448,33 @@ more details, with real examples. biarch := $(call cc-option-yn, -m32) aflags-$(biarch) += -a32 cflags-$(biarch) += -m32 - - In the above example $(biarch) is set to y if $(CC) supports the -m32 - option. When $(biarch) equals to y the expanded variables $(aflags-y) - and $(cflags-y) will be assigned the values -a32 and -m32. + + In the above example, $(biarch) is set to y if $(CC) supports the -m32 + option. When $(biarch) equals 'y', the expanded variables $(aflags-y) + and $(cflags-y) will be assigned the values -a32 and -m32, + respectively. cc-option-align - gcc version >= 3.0 shifted type of options used to speify - alignment of functions, loops etc. $(cc-option-align) whrn used - as prefix to the align options will select the right prefix: + gcc versions >= 3.0 changed the type of options used to specify + alignment of functions, loops etc. $(cc-option-align), when used + as prefix to the align options, will select the right prefix: gcc < 3.00 cc-option-align = -malign gcc >= 3.00 cc-option-align = -falign - + Example: CFLAGS += $(cc-option-align)-functions=4 - In the above example the option -falign-functions=4 is used for - gcc >= 3.00. For gcc < 3.00 -malign-functions=4 is used. - + In the above example, the option -falign-functions=4 is used for + gcc >= 3.00. For gcc < 3.00, -malign-functions=4 is used. + cc-version - cc-version return a numerical version of the $(CC) compiler version. + cc-version returns a numerical version of the $(CC) compiler version. The format is <major><minor> where both are two digits. So for example gcc 3.41 would return 0341. cc-version is useful when a specific $(CC) version is faulty in one - area, for example the -mregparm=3 were broken in some gcc version + area, for example -mregparm=3 was broken in some gcc versions even though the option was accepted by gcc. Example: @@ -463,20 +483,20 @@ more details, with real examples. if [ $(call cc-version) -ge 0300 ] ; then \ echo "-mregparm=3"; fi ;) - In the above example -mregparm=3 is only used for gcc version greater + In the above example, -mregparm=3 is only used for gcc version greater than or equal to gcc 3.0. cc-ifversion - cc-ifversion test the version of $(CC) and equals last argument if + cc-ifversion tests the version of $(CC) and equals last argument if version expression is true. Example: #fs/reiserfs/Makefile EXTRA_CFLAGS := $(call cc-ifversion, -lt, 0402, -O1) - In this example EXTRA_CFLAGS will be assigned the value -O1 if the + In this example, EXTRA_CFLAGS will be assigned the value -O1 if the $(CC) version is less than 4.2. - cc-ifversion takes all the shell operators: + cc-ifversion takes all the shell operators: -eq, -ne, -lt, -le, -gt, and -ge The third parameter may be a text as in this example, but it may also be an expanded variable or a macro. @@ -492,7 +512,7 @@ The first step is to tell kbuild that a host program exists. This is done utilising the variable hostprogs-y. The second step is to add an explicit dependency to the executable. -This can be done in two ways. Either add the dependency in a rule, +This can be done in two ways. Either add the dependency in a rule, or utilise the variable $(always). Both possibilities are described in the following. @@ -509,28 +529,28 @@ Both possibilities are described in the following. Kbuild assumes in the above example that bin2hex is made from a single c-source file named bin2hex.c located in the same directory as the Makefile. - + --- 4.2 Composite Host Programs Host programs can be made up based on composite objects. The syntax used to define composite objects for host programs is similar to the syntax used for kernel objects. - $(<executeable>-objs) list all objects used to link the final + $(<executeable>-objs) lists all objects used to link the final executable. Example: #scripts/lxdialog/Makefile - hostprogs-y := lxdialog + hostprogs-y := lxdialog lxdialog-objs := checklist.o lxdialog.o Objects with extension .o are compiled from the corresponding .c - files. In the above example checklist.c is compiled to checklist.o + files. In the above example, checklist.c is compiled to checklist.o and lxdialog.c is compiled to lxdialog.o. - Finally the two .o files are linked to the executable, lxdialog. + Finally, the two .o files are linked to the executable, lxdialog. Note: The syntax <executable>-y is not permitted for host-programs. ---- 4.3 Defining shared libraries - +--- 4.3 Defining shared libraries + Objects with extension .so are considered shared libraries, and will be compiled as position independent objects. Kbuild provides support for shared libraries, but the usage @@ -543,7 +563,7 @@ Both possibilities are described in the following. hostprogs-y := conf conf-objs := conf.o libkconfig.so libkconfig-objs := expr.o type.o - + Shared libraries always require a corresponding -objs line, and in the example above the shared library libkconfig is composed by the two objects expr.o and type.o. @@ -564,7 +584,7 @@ Both possibilities are described in the following. In the example above the executable is composed of the C++ file qconf.cc - identified by $(qconf-cxxobjs). - + If qconf is composed by a mixture of .c and .cc files, then an additional line can be used to identify this. @@ -573,34 +593,35 @@ Both possibilities are described in the following. hostprogs-y := qconf qconf-cxxobjs := qconf.o qconf-objs := check.o - + --- 4.5 Controlling compiler options for host programs When compiling host programs, it is possible to set specific flags. The programs will always be compiled utilising $(HOSTCC) passed the options specified in $(HOSTCFLAGS). To set flags that will take effect for all host programs created - in that Makefile use the variable HOST_EXTRACFLAGS. + in that Makefile, use the variable HOST_EXTRACFLAGS. Example: #scripts/lxdialog/Makefile HOST_EXTRACFLAGS += -I/usr/include/ncurses - + To set specific flags for a single file the following construction is used: Example: #arch/ppc64/boot/Makefile HOSTCFLAGS_piggyback.o := -DKERNELBASE=$(KERNELBASE) - + It is also possible to specify additional options to the linker. - + Example: #scripts/kconfig/Makefile HOSTLOADLIBES_qconf := -L$(QTDIR)/lib - When linking qconf it will be passed the extra option "-L$(QTDIR)/lib". - + When linking qconf, it will be passed the extra option + "-L$(QTDIR)/lib". + --- 4.6 When host programs are actually built Kbuild will only build host-programs when they are referenced @@ -615,7 +636,7 @@ Both possibilities are described in the following. $(obj)/devlist.h: $(src)/pci.ids $(obj)/gen-devlist ( cd $(obj); ./gen-devlist ) < $< - The target $(obj)/devlist.h will not be built before + The target $(obj)/devlist.h will not be built before $(obj)/gen-devlist is updated. Note that references to the host programs in special rules must be prefixed with $(obj). @@ -634,7 +655,7 @@ Both possibilities are described in the following. --- 4.7 Using hostprogs-$(CONFIG_FOO) - A typcal pattern in a Kbuild file lok like this: + A typical pattern in a Kbuild file looks like this: Example: #scripts/Makefile @@ -642,13 +663,13 @@ Both possibilities are described in the following. Kbuild knows about both 'y' for built-in and 'm' for module. So if a config symbol evaluate to 'm', kbuild will still build - the binary. In other words Kbuild handle hostprogs-m exactly - like hostprogs-y. But only hostprogs-y is recommend used - when no CONFIG symbol are involved. + the binary. In other words, Kbuild handles hostprogs-m exactly + like hostprogs-y. But only hostprogs-y is recommended to be used + when no CONFIG symbols are involved. === 5 Kbuild clean infrastructure -"make clean" deletes most generated files in the src tree where the kernel +"make clean" deletes most generated files in the obj tree where the kernel is compiled. This includes generated files such as host programs. Kbuild knows targets listed in $(hostprogs-y), $(hostprogs-m), $(always), $(extra-y) and $(targets). They are all deleted during "make clean". @@ -666,7 +687,8 @@ When executing "make clean", the two files "devlist.h classlist.h" will be deleted. Kbuild will assume files to be in same relative directory as the Makefile except if an absolute path is specified (path starting with '/'). -To delete a directory hirachy use: +To delete a directory hierarchy use: + Example: #scripts/package/Makefile clean-dirs := $(objtree)/debian/ @@ -709,29 +731,29 @@ be visited during "make clean". The top level Makefile sets up the environment and does the preparation, before starting to descend down in the individual directories. -The top level makefile contains the generic part, whereas the -arch/$(ARCH)/Makefile contains what is required to set-up kbuild -to the said architecture. -To do so arch/$(ARCH)/Makefile sets a number of variables, and defines +The top level makefile contains the generic part, whereas +arch/$(ARCH)/Makefile contains what is required to set up kbuild +for said architecture. +To do so, arch/$(ARCH)/Makefile sets up a number of variables and defines a few targets. -When kbuild executes the following steps are followed (roughly): -1) Configuration of the kernel => produced .config +When kbuild executes, the following steps are followed (roughly): +1) Configuration of the kernel => produce .config 2) Store kernel version in include/linux/version.h 3) Symlink include/asm to include/asm-$(ARCH) 4) Updating all other prerequisites to the target prepare: - Additional prerequisites are specified in arch/$(ARCH)/Makefile 5) Recursively descend down in all directories listed in init-* core* drivers-* net-* libs-* and build all targets. - - The value of the above variables are extended in arch/$(ARCH)/Makefile. -6) All object files are then linked and the resulting file vmlinux is - located at the root of the src tree. + - The values of the above variables are expanded in arch/$(ARCH)/Makefile. +6) All object files are then linked and the resulting file vmlinux is + located at the root of the obj tree. The very first objects linked are listed in head-y, assigned by arch/$(ARCH)/Makefile. -7) Finally the architecture specific part does any required post processing +7) Finally, the architecture specific part does any required post processing and builds the final bootimage. - This includes building boot records - - Preparing initrd images and the like + - Preparing initrd images and thelike --- 6.1 Set variables to tweak the build to the architecture @@ -746,7 +768,7 @@ When kbuild executes the following steps are followed (roughly): LDFLAGS := -m elf_s390 Note: EXTRA_LDFLAGS and LDFLAGS_$@ can be used to further customise the flags used. See chapter 7. - + LDFLAGS_MODULE Options for $(LD) when linking modules LDFLAGS_MODULE is used to set specific flags for $(LD) when @@ -756,7 +778,7 @@ When kbuild executes the following steps are followed (roughly): LDFLAGS_vmlinux Options for $(LD) when linking vmlinux LDFLAGS_vmlinux is used to specify additional flags to pass to - the linker when linking the final vmlinux. + the linker when linking the final vmlinux image. LDFLAGS_vmlinux uses the LDFLAGS_$@ support. Example: @@ -766,7 +788,7 @@ When kbuild executes the following steps are followed (roughly): OBJCOPYFLAGS objcopy flags When $(call if_changed,objcopy) is used to translate a .o file, - then the flags specified in OBJCOPYFLAGS will be used. + the flags specified in OBJCOPYFLAGS will be used. $(call if_changed,objcopy) is often used to generate raw binaries on vmlinux. @@ -778,7 +800,7 @@ When kbuild executes the following steps are followed (roughly): $(obj)/image: vmlinux FORCE $(call if_changed,objcopy) - In this example the binary $(obj)/image is a binary version of + In this example, the binary $(obj)/image is a binary version of vmlinux. The usage of $(call if_changed,xxx) will be described later. AFLAGS $(AS) assembler flags @@ -795,7 +817,7 @@ When kbuild executes the following steps are followed (roughly): Default value - see top level Makefile Append or modify as required per architecture. - Often the CFLAGS variable depends on the configuration. + Often, the CFLAGS variable depends on the configuration. Example: #arch/i386/Makefile @@ -816,7 +838,7 @@ When kbuild executes the following steps are followed (roughly): ... - The first examples utilises the trick that a config option expands + The first example utilises the trick that a config option expands to 'y' when selected. CFLAGS_KERNEL $(CC) options specific for built-in @@ -829,18 +851,18 @@ When kbuild executes the following steps are followed (roughly): $(CFLAGS_MODULE) contains extra C compiler flags used to compile code for loadable kernel modules. - + --- 6.2 Add prerequisites to archprepare: - The archprepare: rule is used to list prerequisites that needs to be + The archprepare: rule is used to list prerequisites that need to be built before starting to descend down in the subdirectories. - This is usual header files containing assembler constants. + This is usually used for header files containing assembler constants. Example: #arch/arm/Makefile archprepare: maketools - In this example the file target maketools will be processed + In this example, the file target maketools will be processed before descending down in the subdirectories. See also chapter XXX-TODO that describe how kbuild supports generating offset header files. @@ -853,18 +875,19 @@ When kbuild executes the following steps are followed (roughly): corresponding arch-specific section for modules; the module-building machinery is all architecture-independent. - + head-y, init-y, core-y, libs-y, drivers-y, net-y - $(head-y) list objects to be linked first in vmlinux. - $(libs-y) list directories where a lib.a archive can be located. - The rest list directories where a built-in.o object file can be located. + $(head-y) lists objects to be linked first in vmlinux. + $(libs-y) lists directories where a lib.a archive can be located. + The rest lists directories where a built-in.o object file can be + located. $(init-y) objects will be located after $(head-y). Then the rest follows in this order: $(core-y), $(libs-y), $(drivers-y) and $(net-y). - The top level Makefile define values for all generic directories, + The top level Makefile defines values for all generic directories, and arch/$(ARCH)/Makefile only adds architecture specific directories. Example: @@ -901,27 +924,27 @@ When kbuild executes the following steps are followed (roughly): "$(Q)$(MAKE) $(build)=<dir>" is the recommended way to invoke make in a subdirectory. - There are no rules for naming of the architecture specific targets, + There are no rules for naming architecture specific targets, but executing "make help" will list all relevant targets. - To support this $(archhelp) must be defined. + To support this, $(archhelp) must be defined. Example: #arch/i386/Makefile define archhelp echo '* bzImage - Image (arch/$(ARCH)/boot/bzImage)' - endef + endif When make is executed without arguments, the first goal encountered will be built. In the top level Makefile the first goal present is all:. - An architecture shall always per default build a bootable image. - In "make help" the default goal is highlighted with a '*'. + An architecture shall always, per default, build a bootable image. + In "make help", the default goal is highlighted with a '*'. Add a new prerequisite to all: to select a default goal different from vmlinux. Example: #arch/i386/Makefile - all: bzImage + all: bzImage When "make" is executed without arguments, bzImage will be built. @@ -941,10 +964,10 @@ When kbuild executes the following steps are followed (roughly): #arch/i386/kernel/Makefile extra-y := head.o init_task.o - In this example extra-y is used to list object files that + In this example, extra-y is used to list object files that shall be built, but shall not be linked as part of built-in.o. - + --- 6.6 Commands useful for building a boot image Kbuild provides a few macros that are useful when building a @@ -958,8 +981,8 @@ When kbuild executes the following steps are followed (roughly): target: source(s) FORCE $(call if_changed,ld/objcopy/gzip) - When the rule is evaluated it is checked to see if any files - needs an update, or the commandline has changed since last + When the rule is evaluated, it is checked to see if any files + needs an update, or the command line has changed since the last invocation. The latter will force a rebuild if any options to the executable have changed. Any target that utilises if_changed must be listed in $(targets), @@ -977,8 +1000,8 @@ When kbuild executes the following steps are followed (roughly): #WRONG!# $(call if_changed, ld/objcopy/gzip) ld - Link target. Often LDFLAGS_$@ is used to set specific options to ld. - + Link target. Often, LDFLAGS_$@ is used to set specific options to ld. + objcopy Copy binary. Uses OBJCOPYFLAGS usually specified in arch/$(ARCH)/Makefile. @@ -996,10 +1019,10 @@ When kbuild executes the following steps are followed (roughly): $(obj)/setup $(obj)/bootsect: %: %.o FORCE $(call if_changed,ld) - In this example there are two possible targets, requiring different - options to the linker. the linker options are specified using the + In this example, there are two possible targets, requiring different + options to the linker. The linker options are specified using the LDFLAGS_$@ syntax - one for each potential target. - $(targets) are assinged all potential targets, herby kbuild knows + $(targets) are assinged all potential targets, by which kbuild knows the targets and will: 1) check for commandline changes 2) delete target during make clean @@ -1013,7 +1036,7 @@ When kbuild executes the following steps are followed (roughly): --- 6.7 Custom kbuild commands - When kbuild is executing with KBUILD_VERBOSE=0 then only a shorthand + When kbuild is executing with KBUILD_VERBOSE=0, then only a shorthand of a command is normally displayed. To enable this behaviour for custom commands kbuild requires two variables to be set: @@ -1031,34 +1054,34 @@ When kbuild executes the following steps are followed (roughly): $(call if_changed,image) @echo 'Kernel: $@ is ready' - When updating the $(obj)/bzImage target the line: + When updating the $(obj)/bzImage target, the line BUILD arch/i386/boot/bzImage will be displayed with "make KBUILD_VERBOSE=0". - + --- 6.8 Preprocessing linker scripts - When the vmlinux image is build the linker script: + When the vmlinux image is built, the linker script arch/$(ARCH)/kernel/vmlinux.lds is used. The script is a preprocessed variant of the file vmlinux.lds.S located in the same directory. - kbuild knows .lds file and includes a rule *lds.S -> *lds. - + kbuild knows .lds files and includes a rule *lds.S -> *lds. + Example: #arch/i386/kernel/Makefile always := vmlinux.lds - + #Makefile export CPPFLAGS_vmlinux.lds += -P -C -U$(ARCH) - - The assigment to $(always) is used to tell kbuild to build the - target: vmlinux.lds. - The assignment to $(CPPFLAGS_vmlinux.lds) tell kbuild to use the + + The assignment to $(always) is used to tell kbuild to build the + target vmlinux.lds. + The assignment to $(CPPFLAGS_vmlinux.lds) tells kbuild to use the specified options when building the target vmlinux.lds. - - When building the *.lds target kbuild used the variakles: + + When building the *.lds target, kbuild uses the variables: CPPFLAGS : Set in top-level Makefile EXTRA_CPPFLAGS : May be set in the kbuild makefile CPPFLAGS_$(@F) : Target specific flags. @@ -1123,9 +1146,17 @@ The top Makefile exports the following variables: $(INSTALL_MOD_PATH)/lib/modules/$(KERNELRELEASE). The user may override this value on the command line if desired. + INSTALL_MOD_STRIP + + If this variable is specified, will cause modules to be stripped + after they are installed. If INSTALL_MOD_STRIP is '1', then the + default option --strip-debug will be used. Otherwise, + INSTALL_MOD_STRIP will used as the option(s) to the strip command. + + === 8 Makefile language -The kernel Makefiles are designed to run with GNU Make. The Makefiles +The kernel Makefiles are designed to be run with GNU Make. The Makefiles use only the documented features of GNU Make, but they do use many GNU extensions. @@ -1147,10 +1178,13 @@ is the right choice. Original version made by Michael Elizabeth Chastain, <mailto:mec@shout.net> Updates by Kai Germaschewski <kai@tp1.ruhr-uni-bochum.de> Updates by Sam Ravnborg <sam@ravnborg.org> +Language QA by Jan Engelhardt <jengelh@gmx.de> === 10 TODO -- Describe how kbuild support shipped files with _shipped. +- Describe how kbuild supports shipped files with _shipped. - Generating offset header files. - Add more variables to section 7? + + diff --git a/Documentation/kbuild/modules.txt b/Documentation/kbuild/modules.txt index 61fc079eb966..2e7702e94a78 100644 --- a/Documentation/kbuild/modules.txt +++ b/Documentation/kbuild/modules.txt @@ -1,7 +1,7 @@ In this document you will find information about: - how to build external modules -- how to make your module use kbuild infrastructure +- how to make your module use the kbuild infrastructure - how kbuild will install a kernel - how to install modules in a non-standard location @@ -24,7 +24,7 @@ In this document you will find information about: --- 6.1 INSTALL_MOD_PATH --- 6.2 INSTALL_MOD_DIR === 7. Module versioning & Module.symvers - --- 7.1 Symbols fron the kernel (vmlinux + modules) + --- 7.1 Symbols from the kernel (vmlinux + modules) --- 7.2 Symbols and external modules --- 7.3 Symbols from another external module === 8. Tips & Tricks @@ -36,13 +36,13 @@ In this document you will find information about: kbuild includes functionality for building modules both within the kernel source tree and outside the kernel source tree. -The latter is usually referred to as external modules and is used -both during development and for modules that are not planned to be -included in the kernel tree. +The latter is usually referred to as external or "out-of-tree" +modules and is used both during development and for modules that +are not planned to be included in the kernel tree. What is covered within this file is mainly information to authors -of modules. The author of an external modules should supply -a makefile that hides most of the complexity so one only has to type +of modules. The author of an external module should supply +a makefile that hides most of the complexity, so one only has to type 'make' to build the module. A complete example will be present in chapter 4, "Creating a kbuild file for an external module". @@ -63,14 +63,15 @@ when building an external module. For the running kernel use: make -C /lib/modules/`uname -r`/build M=`pwd` - For the above command to succeed the kernel must have been built with - modules enabled. + For the above command to succeed, the kernel must have been + built with modules enabled. To install the modules that were just built: make -C <path-to-kernel> M=`pwd` modules_install - More complex examples later, the above should get you going. + More complex examples will be shown later, the above should + be enough to get you started. --- 2.2 Available targets @@ -89,13 +90,13 @@ when building an external module. Same functionality as if no target was specified. See description above. - make -C $KDIR M=$PWD modules_install + make -C $KDIR M=`pwd` modules_install Install the external module(s). Installation default is in /lib/modules/<kernel-version>/extra, but may be prefixed with INSTALL_MOD_PATH - see separate chapter. - make -C $KDIR M=$PWD clean + make -C $KDIR M=`pwd` clean Remove all generated files for the module - the kernel source directory is not modified. @@ -129,29 +130,28 @@ when building an external module. To make sure the kernel contains the information required to build external modules the target 'modules_prepare' must be used. - 'module_prepare' solely exists as a simple way to prepare - a kernel for building external modules. + 'module_prepare' exists solely as a simple way to prepare + a kernel source tree for building external modules. Note: modules_prepare will not build Module.symvers even if - CONFIG_MODULEVERSIONING is set. - Therefore a full kernel build needs to be executed to make - module versioning work. + CONFIG_MODULEVERSIONING is set. Therefore a full kernel build + needs to be executed to make module versioning work. --- 2.5 Building separate files for a module - It is possible to build single files which is part of a module. - This works equal for the kernel, a module and even for external - modules. + It is possible to build single files which are part of a module. + This works equally well for the kernel, a module and even for + external modules. Examples (module foo.ko, consist of bar.o, baz.o): make -C $KDIR M=`pwd` bar.lst make -C $KDIR M=`pwd` bar.o make -C $KDIR M=`pwd` foo.ko make -C $KDIR M=`pwd` / - + === 3. Example commands This example shows the actual commands to be executed when building an external module for the currently running kernel. -In the example below the distribution is supposed to use the +In the example below, the distribution is supposed to use the facility to locate output files for a kernel compile in a different directory than the kernel source - but the examples will also work when the source and the output files are mixed in the same directory. @@ -170,14 +170,14 @@ the following commands to build the module: O=/lib/modules/`uname-r`/build \ M=`pwd` -Then to install the module use the following command: +Then, to install the module use the following command: make -C /usr/src/`uname -r`/source \ O=/lib/modules/`uname-r`/build \ M=`pwd` \ modules_install -If one looks closely you will see that this is the same commands as +If you look closely you will see that this is the same command as listed before - with the directories spelled out. The above are rather long commands, and the following chapter @@ -230,7 +230,7 @@ following files: endif - In example 1 the check for KERNELRELEASE is used to separate + In example 1, the check for KERNELRELEASE is used to separate the two parts of the Makefile. kbuild will only see the two assignments whereas make will see everything except the two kbuild assignments. @@ -255,7 +255,7 @@ following files: echo "X" > 8123_bin_shipped - In example 2 we are down to two fairly simple files and for simple + In example 2, we are down to two fairly simple files and for simple files as used in this example the split is questionable. But some external modules use Makefiles of several hundred lines and here it really pays off to separate the kbuild part from the rest. @@ -282,9 +282,9 @@ following files: endif - The trick here is to include the Kbuild file from Makefile so - if an older version of kbuild picks up the Makefile the Kbuild - file will be included. + The trick here is to include the Kbuild file from Makefile, so + if an older version of kbuild picks up the Makefile, the Kbuild + file will be included. --- 4.2 Binary blobs included in a module @@ -301,18 +301,19 @@ following files: obj-m := 8123.o 8123-y := 8123_if.o 8123_pci.o 8123_bin.o - In example 4 there is no distinction between the ordinary .c/.h files + In example 4, there is no distinction between the ordinary .c/.h files and the binary file. But kbuild will pick up different rules to create the .o file. === 5. Include files -Include files are a necessity when a .c file uses something from another .c -files (not strictly in the sense of .c but if good programming practice is -used). Any module that consist of more than one .c file will have a .h file -for one of the .c files. -- If the .h file only describes a module internal interface then the .h file +Include files are a necessity when a .c file uses something from other .c +files (not strictly in the sense of C, but if good programming practice is +used). Any module that consists of more than one .c file will have a .h file +for one of the .c files. + +- If the .h file only describes a module internal interface, then the .h file shall be placed in the same directory as the .c files. - If the .h files describe an interface used by other parts of the kernel located in different directories, the .h files shall be located in @@ -323,11 +324,11 @@ under include/ such as include/scsi. Another exception is arch-specific .h files which are located under include/asm-$(ARCH)/*. External modules have a tendency to locate include files in a separate include/ -directory and therefore needs to deal with this in their kbuild file. +directory and therefore need to deal with this in their kbuild file. --- 5.1 How to include files from the kernel include dir - When a module needs to include a file from include/linux/ then one + When a module needs to include a file from include/linux/, then one just uses: #include <linux/modules.h> @@ -348,7 +349,7 @@ directory and therefore needs to deal with this in their kbuild file. The trick here is to use either EXTRA_CFLAGS (take effect for all .c files) or CFLAGS_$F.o (take effect only for a single file). - In our example if we move 8123_if.h to a subdirectory named include/ + In our example, if we move 8123_if.h to a subdirectory named include/ the resulting Kbuild file would look like: --> filename: Kbuild @@ -362,19 +363,19 @@ directory and therefore needs to deal with this in their kbuild file. --- 5.3 External modules using several directories - If an external module does not follow the usual kernel style but - decide to spread files over several directories then kbuild can - support this too. + If an external module does not follow the usual kernel style, but + decides to spread files over several directories, then kbuild can + handle this too. Consider the following example: - + | +- src/complex_main.c | +- hal/hardwareif.c | +- hal/include/hardwareif.h +- include/complex.h - - To build a single module named complex.ko we then need the following + + To build a single module named complex.ko, we then need the following kbuild file: Kbuild: @@ -387,12 +388,12 @@ directory and therefore needs to deal with this in their kbuild file. kbuild knows how to handle .o files located in another directory - - although this is NOT reccommended practice. The syntax is to specify + although this is NOT recommended practice. The syntax is to specify the directory relative to the directory where the Kbuild file is located. - To find the .h files we have to explicitly tell kbuild where to look - for the .h files. When kbuild executes current directory is always + To find the .h files, we have to explicitly tell kbuild where to look + for the .h files. When kbuild executes, the current directory is always the root of the kernel tree (argument to -C) and therefore we have to tell kbuild how to find the .h files using absolute paths. $(src) will specify the absolute path to the directory where the @@ -412,7 +413,7 @@ External modules are installed in the directory: --- 6.1 INSTALL_MOD_PATH - Above are the default directories, but as always some level of + Above are the default directories, but as always, some level of customization is possible. One can prefix the path using the variable INSTALL_MOD_PATH: @@ -420,17 +421,17 @@ External modules are installed in the directory: => Install dir: /frodo/lib/modules/$(KERNELRELEASE)/kernel INSTALL_MOD_PATH may be set as an ordinary shell variable or as in the - example above be specified on the command line when calling make. + example above, can be specified on the command line when calling make. INSTALL_MOD_PATH has effect both when installing modules included in the kernel as well as when installing external modules. --- 6.2 INSTALL_MOD_DIR - When installing external modules they are default installed in a + When installing external modules they are by default installed to a directory under /lib/modules/$(KERNELRELEASE)/extra, but one may wish to locate modules for a specific functionality in a separate - directory. For this purpose one can use INSTALL_MOD_DIR to specify an - alternative name than 'extra'. + directory. For this purpose, one can use INSTALL_MOD_DIR to specify an + alternative name to 'extra'. $ make INSTALL_MOD_DIR=gandalf -C KERNELDIR \ M=`pwd` modules_install @@ -444,16 +445,16 @@ Module versioning is enabled by the CONFIG_MODVERSIONS tag. Module versioning is used as a simple ABI consistency check. The Module versioning creates a CRC value of the full prototype for an exported symbol and when a module is loaded/used then the CRC values contained in the kernel are -compared with similar values in the module. If they are not equal then the +compared with similar values in the module. If they are not equal, then the kernel refuses to load the module. Module.symvers contains a list of all exported symbols from a kernel build. --- 7.1 Symbols fron the kernel (vmlinux + modules) - During a kernel build a file named Module.symvers will be generated. + During a kernel build, a file named Module.symvers will be generated. Module.symvers contains all exported symbols from the kernel and - compiled modules. For each symbols the corresponding CRC value + compiled modules. For each symbols, the corresponding CRC value is stored too. The syntax of the Module.symvers file is: @@ -461,27 +462,27 @@ Module.symvers contains a list of all exported symbols from a kernel build. Sample: 0x2d036834 scsi_remove_host drivers/scsi/scsi_mod - For a kernel build without CONFIG_MODVERSIONING enabled the crc + For a kernel build without CONFIG_MODVERSIONS enabled, the crc would read: 0x00000000 - Module.symvers serve two purposes. - 1) It list all exported symbols both from vmlinux and all modules - 2) It list CRC if CONFIG_MODVERSION is enabled + Module.symvers serves two purposes: + 1) It lists all exported symbols both from vmlinux and all modules + 2) It lists the CRC if CONFIG_MODVERSIONS is enabled --- 7.2 Symbols and external modules - When building an external module the build system needs access to + When building an external module, the build system needs access to the symbols from the kernel to check if all external symbols are defined. This is done in the MODPOST step and to obtain all - symbols modpost reads Module.symvers from the kernel. + symbols, modpost reads Module.symvers from the kernel. If a Module.symvers file is present in the directory where - the external module is being build this file will be read too. - During the MODPOST step a new Module.symvers file will be written - containing all exported symbols that was not defined in the kernel. - + the external module is being built, this file will be read too. + During the MODPOST step, a new Module.symvers file will be written + containing all exported symbols that were not defined in the kernel. + --- 7.3 Symbols from another external module - Sometimes one external module uses exported symbols from another + Sometimes, an external module uses exported symbols from another external module. Kbuild needs to have full knowledge on all symbols to avoid spitting out warnings about undefined symbols. Two solutions exist to let kbuild know all symbols of more than @@ -490,15 +491,15 @@ Module.symvers contains a list of all exported symbols from a kernel build. impractical in certain situations. Use a top-level Kbuild file - If you have two modules: 'foo', 'bar' and 'foo' needs symbols - from 'bar' then one can use a common top-level kbuild file so - both modules are compiled in same build. + If you have two modules: 'foo' and 'bar', and 'foo' needs + symbols from 'bar', then one can use a common top-level kbuild + file so both modules are compiled in same build. Consider following directory layout: ./foo/ <= contains the foo module ./bar/ <= contains the bar module The top-level Kbuild file would then look like: - + #./Kbuild: (this file may also be named Makefile) obj-y := foo/ bar/ @@ -509,23 +510,23 @@ Module.symvers contains a list of all exported symbols from a kernel build. knowledge on symbols from both modules. Use an extra Module.symvers file - When an external module is build a Module.symvers file is + When an external module is built, a Module.symvers file is generated containing all exported symbols which are not defined in the kernel. - To get access to symbols from module 'bar' one can copy the + To get access to symbols from module 'bar', one can copy the Module.symvers file from the compilation of the 'bar' module - to the directory where the 'foo' module is build. - During the module build kbuild will read the Module.symvers + to the directory where the 'foo' module is built. + During the module build, kbuild will read the Module.symvers file in the directory of the external module and when the - build is finished a new Module.symvers file is created + build is finished, a new Module.symvers file is created containing the sum of all symbols defined and not part of the kernel. - + === 8. Tips & Tricks --- 8.1 Testing for CONFIG_FOO_BAR - Modules often needs to check for certain CONFIG_ options to decide if + Modules often need to check for certain CONFIG_ options to decide if a specific feature shall be included in the module. When kbuild is used this is done by referencing the CONFIG_ variable directly. @@ -537,7 +538,7 @@ Module.symvers contains a list of all exported symbols from a kernel build. External modules have traditionally used grep to check for specific CONFIG_ settings directly in .config. This usage is broken. - As introduced before external modules shall use kbuild when building - and therefore can use the same methods as in-kernel modules when testing - for CONFIG_ definitions. + As introduced before, external modules shall use kbuild when building + and therefore can use the same methods as in-kernel modules when + testing for CONFIG_ definitions. diff --git a/Documentation/kdump/gdbmacros.txt b/Documentation/kdump/gdbmacros.txt index dcf5580380ab..9b9b454b048a 100644 --- a/Documentation/kdump/gdbmacros.txt +++ b/Documentation/kdump/gdbmacros.txt @@ -175,7 +175,7 @@ end document trapinfo Run info threads and lookup pid of thread #1 'trapinfo <pid>' will tell you by which trap & possibly - addresthe kernel paniced. + address the kernel panicked. end diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 212cf3c21abf..08bafa8c1caa 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt @@ -1,155 +1,325 @@ -Documentation for kdump - the kexec-based crash dumping solution +================================================================ +Documentation for Kdump - The kexec-based Crash Dumping Solution ================================================================ -DESIGN -====== +This document includes overview, setup and installation, and analysis +information. -Kdump uses kexec to reboot to a second kernel whenever a dump needs to be -taken. This second kernel is booted with very little memory. The first kernel -reserves the section of memory that the second kernel uses. This ensures that -on-going DMA from the first kernel does not corrupt the second kernel. +Overview +======== -All the necessary information about Core image is encoded in ELF format and -stored in reserved area of memory before crash. Physical address of start of -ELF header is passed to new kernel through command line parameter elfcorehdr=. +Kdump uses kexec to quickly boot to a dump-capture kernel whenever a +dump of the system kernel's memory needs to be taken (for example, when +the system panics). The system kernel's memory image is preserved across +the reboot and is accessible to the dump-capture kernel. -On i386, the first 640 KB of physical memory is needed to boot, irrespective -of where the kernel loads. Hence, this region is backed up by kexec just before -rebooting into the new kernel. +You can use common Linux commands, such as cp and scp, to copy the +memory image to a dump file on the local disk, or across the network to +a remote system. -In the second kernel, "old memory" can be accessed in two ways. +Kdump and kexec are currently supported on the x86, x86_64, and ppc64 +architectures. -- The first one is through a /dev/oldmem device interface. A capture utility - can read the device file and write out the memory in raw format. This is raw - dump of memory and analysis/capture tool should be intelligent enough to - determine where to look for the right information. ELF headers (elfcorehdr=) - can become handy here. +When the system kernel boots, it reserves a small section of memory for +the dump-capture kernel. This ensures that ongoing Direct Memory Access +(DMA) from the system kernel does not corrupt the dump-capture kernel. +The kexec -p command loads the dump-capture kernel into this reserved +memory. -- The second interface is through /proc/vmcore. This exports the dump as an ELF - format file which can be written out using any file copy command - (cp, scp, etc). Further, gdb can be used to perform limited debugging on - the dump file. This method ensures methods ensure that there is correct - ordering of the dump pages (corresponding to the first 640 KB that has been - relocated). +On x86 machines, the first 640 KB of physical memory is needed to boot, +regardless of where the kernel loads. Therefore, kexec backs up this +region just before rebooting into the dump-capture kernel. -SETUP -===== +All of the necessary information about the system kernel's core image is +encoded in the ELF format, and stored in a reserved area of memory +before a crash. The physical address of the start of the ELF header is +passed to the dump-capture kernel through the elfcorehdr= boot +parameter. + +With the dump-capture kernel, you can access the memory image, or "old +memory," in two ways: + +- Through a /dev/oldmem device interface. A capture utility can read the + device file and write out the memory in raw format. This is a raw dump + of memory. Analysis and capture tools must be intelligent enough to + determine where to look for the right information. + +- Through /proc/vmcore. This exports the dump as an ELF-format file that + you can write out using file copy commands such as cp or scp. Further, + you can use analysis tools such as the GNU Debugger (GDB) and the Crash + tool to debug the dump file. This method ensures that the dump pages are + correctly ordered. + + +Setup and Installation +====================== + +Install kexec-tools and the Kdump patch +--------------------------------------- + +1) Login as the root user. + +2) Download the kexec-tools user-space package from the following URL: + + http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz + +3) Unpack the tarball with the tar command, as follows: + + tar xvpzf kexec-tools-1.101.tar.gz + +4) Download the latest consolidated Kdump patch from the following URL: + + http://lse.sourceforge.net/kdump/ + + (This location is being used until all the user-space Kdump patches + are integrated with the kexec-tools package.) + +5) Change to the kexec-tools-1.101 directory, as follows: + + cd kexec-tools-1.101 + +6) Apply the consolidated patch to the kexec-tools-1.101 source tree + with the patch command, as follows. (Modify the path to the downloaded + patch as necessary.) + + patch -p1 < /path-to-kdump-patch/kexec-tools-1.101-kdump.patch + +7) Configure the package, as follows: + + ./configure + +8) Compile the package, as follows: + + make + +9) Install the package, as follows: + + make install + + +Download and build the system and dump-capture kernels +------------------------------------------------------ + +Download the mainline (vanilla) kernel source code (2.6.13-rc1 or newer) +from http://www.kernel.org. Two kernels must be built: a system kernel +and a dump-capture kernel. Use the following steps to configure these +kernels with the necessary kexec and Kdump features: + +System kernel +------------- + +1) Enable "kexec system call" in "Processor type and features." + + CONFIG_KEXEC=y + +2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo + filesystems." This is usually enabled by default. + + CONFIG_SYSFS=y + + Note that "sysfs file system support" might not appear in the "Pseudo + filesystems" menu if "Configure standard kernel features (for small + systems)" is not enabled in "General Setup." In this case, check the + .config file itself to ensure that sysfs is turned on, as follows: + + grep 'CONFIG_SYSFS' .config + +3) Enable "Compile the kernel with debug info" in "Kernel hacking." + + CONFIG_DEBUG_INFO=Y + + This causes the kernel to be built with debug symbols. The dump + analysis tools require a vmlinux with debug symbols in order to read + and analyze a dump file. + +4) Make and install the kernel and its modules. Update the boot loader + (such as grub, yaboot, or lilo) configuration files as necessary. + +5) Boot the system kernel with the boot parameter "crashkernel=Y@X", + where Y specifies how much memory to reserve for the dump-capture kernel + and X specifies the beginning of this reserved memory. For example, + "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory + starting at physical address 0x01000000 for the dump-capture kernel. + + On x86 and x86_64, use "crashkernel=64M@16M". + + On ppc64, use "crashkernel=128M@32M". + + +The dump-capture kernel +----------------------- -1) Download the upstream kexec-tools userspace package from - http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz. - - Apply the latest consolidated kdump patch on top of kexec-tools-1.101 - from http://lse.sourceforge.net/kdump/. This arrangment has been made - till all the userspace patches supporting kdump are integrated with - upstream kexec-tools userspace. - -2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernels. - Two kernels need to be built in order to get this feature working. - Following are the steps to properly configure the two kernels specific - to kexec and kdump features: - - A) First kernel or regular kernel: - ---------------------------------- - a) Enable "kexec system call" feature (in Processor type and features). - CONFIG_KEXEC=y - b) Enable "sysfs file system support" (in Pseudo filesystems). - CONFIG_SYSFS=y - c) make - d) Boot into first kernel with the command line parameter "crashkernel=Y@X". - Use appropriate values for X and Y. Y denotes how much memory to reserve - for the second kernel, and X denotes at what physical address the - reserved memory section starts. For example: "crashkernel=64M@16M". - - - B) Second kernel or dump capture kernel: - --------------------------------------- - a) For i386 architecture enable Highmem support - CONFIG_HIGHMEM=y - b) Enable "kernel crash dumps" feature (under "Processor type and features") - CONFIG_CRASH_DUMP=y - c) Make sure a suitable value for "Physical address where the kernel is - loaded" (under "Processor type and features"). By default this value - is 0x1000000 (16MB) and it should be same as X (See option d above), - e.g., 16 MB or 0x1000000. - CONFIG_PHYSICAL_START=0x1000000 - d) Enable "/proc/vmcore support" (Optional, under "Pseudo filesystems"). - CONFIG_PROC_VMCORE=y - -3) After booting to regular kernel or first kernel, load the second kernel - using the following command: - - kexec -p <second-kernel> --args-linux --elf32-core-headers - --append="root=<root-dev> init 1 irqpoll maxcpus=1" - - Notes: - ====== - i) <second-kernel> has to be a vmlinux image ie uncompressed elf image. - bzImage will not work, as of now. - ii) --args-linux has to be speicfied as if kexec it loading an elf image, - it needs to know that the arguments supplied are of linux type. - iii) By default ELF headers are stored in ELF64 format to support systems - with more than 4GB memory. Option --elf32-core-headers forces generation - of ELF32 headers. The reason for this option being, as of now gdb can - not open vmcore file with ELF64 headers on a 32 bit systems. So ELF32 - headers can be used if one has non-PAE systems and hence memory less - than 4GB. - iv) Specify "irqpoll" as command line parameter. This reduces driver - initialization failures in second kernel due to shared interrupts. - v) <root-dev> needs to be specified in a format corresponding to the root - device name in the output of mount command. - vi) If you have built the drivers required to mount root file system as - modules in <second-kernel>, then, specify - --initrd=<initrd-for-second-kernel>. - vii) Specify maxcpus=1 as, if during first kernel run, if panic happens on - non-boot cpus, second kernel doesn't seem to be boot up all the cpus. - The other option is to always built the second kernel without SMP - support ie CONFIG_SMP=n - -4) After successfully loading the second kernel as above, if a panic occurs - system reboots into the second kernel. A module can be written to force - the panic or "ALT-SysRq-c" can be used initiate a crash dump for testing - purposes. - -5) Once the second kernel has booted, write out the dump file using +1) Under "General setup," append "-kdump" to the current string in + "Local version." + +2) On x86, enable high memory support under "Processor type and + features": + + CONFIG_HIGHMEM64G=y + or + CONFIG_HIGHMEM4G + +3) On x86 and x86_64, disable symmetric multi-processing support + under "Processor type and features": + + CONFIG_SMP=n + (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line + when loading the dump-capture kernel, see section "Load the Dump-capture + Kernel".) + +4) On ppc64, disable NUMA support and enable EMBEDDED support: + + CONFIG_NUMA=n + CONFIG_EMBEDDED=y + CONFIG_EEH=N for the dump-capture kernel + +5) Enable "kernel crash dumps" support under "Processor type and + features": + + CONFIG_CRASH_DUMP=y + +6) Use a suitable value for "Physical address where the kernel is + loaded" (under "Processor type and features"). This only appears when + "kernel crash dumps" is enabled. By default this value is 0x1000000 + (16MB). It should be the same as X in the "crashkernel=Y@X" boot + parameter discussed above. + + On x86 and x86_64, use "CONFIG_PHYSICAL_START=0x1000000". + + On ppc64 the value is automatically set at 32MB when + CONFIG_CRASH_DUMP is set. + +6) Optionally enable "/proc/vmcore support" under "Filesystems" -> + "Pseudo filesystems". + + CONFIG_PROC_VMCORE=y + (CONFIG_PROC_VMCORE is set by default when CONFIG_CRASH_DUMP is selected.) + +7) Make and install the kernel and its modules. DO NOT add this kernel + to the boot loader configuration files. + + +Load the Dump-capture Kernel +============================ + +After booting to the system kernel, load the dump-capture kernel using +the following command: + + kexec -p <dump-capture-kernel> \ + --initrd=<initrd-for-dump-capture-kernel> --args-linux \ + --append="root=<root-dev> init 1 irqpoll" + + +Notes on loading the dump-capture kernel: + +* <dump-capture-kernel> must be a vmlinux image (that is, an + uncompressed ELF image). bzImage does not work at this time. + +* By default, the ELF headers are stored in ELF64 format to support + systems with more than 4GB memory. The --elf32-core-headers option can + be used to force the generation of ELF32 headers. This is necessary + because GDB currently cannot open vmcore files with ELF64 headers on + 32-bit systems. ELF32 headers can be used on non-PAE systems (that is, + less than 4GB of memory). + +* The "irqpoll" boot parameter reduces driver initialization failures + due to shared interrupts in the dump-capture kernel. + +* You must specify <root-dev> in the format corresponding to the root + device name in the output of mount command. + +* "init 1" boots the dump-capture kernel into single-user mode without + networking. If you want networking, use "init 3." + + +Kernel Panic +============ + +After successfully loading the dump-capture kernel as previously +described, the system will reboot into the dump-capture kernel if a +system crash is triggered. Trigger points are located in panic(), +die(), die_nmi() and in the sysrq handler (ALT-SysRq-c). + +The following conditions will execute a crash trigger point: + +If a hard lockup is detected and "NMI watchdog" is configured, the system +will boot into the dump-capture kernel ( die_nmi() ). + +If die() is called, and it happens to be a thread with pid 0 or 1, or die() +is called inside interrupt context or die() is called and panic_on_oops is set, +the system will boot into the dump-capture kernel. + +On powererpc systems when a soft-reset is generated, die() is called by all cpus and the system system will boot into the dump-capture kernel. + +For testing purposes, you can trigger a crash by using "ALT-SysRq-c", +"echo c > /proc/sysrq-trigger or write a module to force the panic. + +Write Out the Dump File +======================= + +After the dump-capture kernel is booted, write out the dump file with +the following command: cp /proc/vmcore <dump-file> - Dump memory can also be accessed as a /dev/oldmem device for a linear/raw - view. To create the device, type: +You can also access dumped memory as a /dev/oldmem device for a linear +and raw view. To create the device, use the following command: - mknod /dev/oldmem c 1 12 + mknod /dev/oldmem c 1 12 - Use "dd" with suitable options for count, bs and skip to access specific - portions of the dump. +Use the dd command with suitable options for count, bs, and skip to +access specific portions of the dump. - Entire memory: dd if=/dev/oldmem of=oldmem.001 +To see the entire memory, use the following command: + dd if=/dev/oldmem of=oldmem.001 -ANALYSIS + +Analysis ======== -Limited analysis can be done using gdb on the dump file copied out of -/proc/vmcore. Use vmlinux built with -g and run - gdb vmlinux <dump-file> +Before analyzing the dump image, you should reboot into a stable kernel. + +You can do limited analysis using GDB on the dump file copied out of +/proc/vmcore. Use the debug vmlinux built with -g and run the following +command: + + gdb vmlinux <dump-file> -Stack trace for the task on processor 0, register display, memory display -work fine. +Stack trace for the task on processor 0, register display, and memory +display work fine. -Note: gdb cannot analyse core files generated in ELF64 format for i386. +Note: GDB cannot analyze core files generated in ELF64 format for x86. +On systems with a maximum of 4GB of memory, you can generate +ELF32-format headers using the --elf32-core-headers kernel option on the +dump kernel. -Latest "crash" (crash-4.0-2.18) as available on Dave Anderson's site -http://people.redhat.com/~anderson/ works well with kdump format. +You can also use the Crash utility to analyze dump files in Kdump +format. Crash is available on Dave Anderson's site at the following URL: + http://people.redhat.com/~anderson/ + + +To Do +===== -TODO -==== -1) Provide a kernel pages filtering mechanism so that core file size is not - insane on systems having huge memory banks. -2) Relocatable kernel can help in maintaining multiple kernels for crashdump - and same kernel as the first kernel can be used to capture the dump. +1) Provide a kernel pages filtering mechanism, so core file size is not + extreme on systems with huge memory banks. +2) Relocatable kernel can help in maintaining multiple kernels for + crash_dump, and the same kernel as the system kernel can be used to + capture the dump. -CONTACT + +Contact ======= + Vivek Goyal (vgoyal@in.ibm.com) Maneesh Soni (maneesh@in.ibm.com) + + +Trademark +========= + +Linux is a trademark of Linus Torvalds in the United States, other +countries, or both. diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index bca6f389da66..137e993f4329 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -35,7 +35,6 @@ parameter is applicable: APM Advanced Power Management support is enabled. AX25 Appropriate AX.25 support is enabled. CD Appropriate CD support is enabled. - DEVFS devfs support is enabled. DRM Direct Rendering Management support is enabled. EDD BIOS Enhanced Disk Drive Services (EDD) is enabled EFI EFI Partitioning (GPT) is enabled @@ -61,6 +60,7 @@ parameter is applicable: MTD MTD support is enabled. NET Appropriate network support is enabled. NUMA NUMA support is enabled. + GENERIC_TIME The generic timeofday code is enabled. NFS Appropriate NFS support is enabled. OSS OSS sound support is enabled. PARIDE The ParIDE subsystem is enabled. @@ -110,6 +110,13 @@ be entered as an environment variable, whereas its absence indicates that it will appear as a kernel argument readable via /proc/cmdline by programs running once the system is up. +The number of kernel parameters is not limited, but the length of the +complete command line (parameters including spaces etc.) is limited to +a fixed number of characters. This limit depends on the architecture +and is between 256 and 4096 characters. It is defined in the file +./include/asm/setup.h as COMMAND_LINE_SIZE. + + 53c7xx= [HW,SCSI] Amiga SCSI controllers See header of drivers/scsi/53c7xx.c. See also Documentation/scsi/ncr53c7xx.txt. @@ -179,6 +186,11 @@ running once the system is up. override platform specific driver. See also Documentation/acpi-hotkey.txt. + acpi_pm_good [IA-32,X86-64] + Override the pmtimer bug detection: force the kernel + to assume that this machine's pmtimer latches its value + and always returns good values. + enable_timer_pin_1 [i386,x86-64] Enable PIN 1 of APIC timer Can be useful to work around chipset bugs @@ -341,10 +353,11 @@ running once the system is up. Value can be changed at runtime via /selinux/checkreqprot. - clock= [BUGS=IA-32,HW] gettimeofday timesource override. - Forces specified timesource (if avaliable) to be used - when calculating gettimeofday(). If specicified - timesource is not avalible, it defaults to PIT. + clock= [BUGS=IA-32, HW] gettimeofday clocksource override. + [Deprecated] + Forces specified clocksource (if avaliable) to be used + when calculating gettimeofday(). If specified + clocksource is not avalible, it defaults to PIT. Format: { pit | tsc | cyclone | pmtmr } disable_8254_timer @@ -429,13 +442,19 @@ running once the system is up. debug [KNL] Enable kernel debugging (events log level). + debug_locks_verbose= + [KNL] verbose self-tests + Format=<0|1> + Print debugging info while doing the locking API + self-tests. + We default to 0 (no extra messages), setting it to + 1 will print _a lot_ more information - normally + only useful to kernel developers. + decnet= [HW,NET] Format: <area>[,<node>] See also Documentation/networking/decnet.txt. - devfs= [DEVFS] - See Documentation/filesystems/devfs/boot-options. - dhash_entries= [KNL] Set number of hash buckets for dentry cache. @@ -561,8 +580,6 @@ running once the system is up. gscd= [HW,CD] Format: <io> - gt96100eth= [NET] MIPS GT96100 Advanced Communication Controller - gus= [HW,OSS] Format: <io>,<irq>,<dma>,<dma16> @@ -685,6 +702,12 @@ running once the system is up. ips= [HW,SCSI] Adaptec / IBM ServeRAID controller See header of drivers/scsi/ips.c. + ports= [IP_VS_FTP] IPVS ftp helper module + Default is 21. + Up to 8 (IP_VS_APP_MAX_PORTS) ports + may be specified. + Format: <port>,<port>.... + irqfixup [HW] When an interrupt is not handled search all handlers for it. Intended to get systems with badly broken @@ -1017,6 +1040,8 @@ running once the system is up. nocache [ARM] + nodelayacct [KNL] Disable per-task delay accounting + nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. noexec [IA-64] @@ -1220,7 +1245,11 @@ running once the system is up. bootloader. This is currently used on IXP2000 systems where the bus has to be configured a certain way for adjunct CPUs. - + noearly [X86] Don't do any early type 1 scanning. + This might help on some broken boards which + machine check when some devices' config space + is read. But various workarounds are disabled + and some IOMMU drivers will not work. pcmv= [HW,PCMCIA] BadgePAD 4 pd. [PARIDE] @@ -1302,7 +1331,7 @@ running once the system is up. pt. [PARIDE] See Documentation/paride.txt. - quiet= [KNL] Disable log messages + quiet [KNL] Disable most log messages r128= [HW,DRM] @@ -1343,6 +1372,14 @@ running once the system is up. reserve= [KNL,BUGS] Force the kernel to ignore some iomem area + reservetop= [IA-32] + Format: nn[KMG] + Reserves a hole at the top of the kernel virtual + address space. + + reset_devices [KNL] Force drivers to reset the underlying device + during initialization. + resume= [SWSUSP] Specify the partition device for software suspend @@ -1617,6 +1654,10 @@ running once the system is up. time Show timing data prefixed to each printk message line + clocksource= [GENERIC_TIME] Override the default clocksource + Override the default clocksource and use the clocksource + with the name specified. + tipar.timeout= [HW,PPT] Set communications timeout in tenths of a second (default 15). @@ -1658,6 +1699,10 @@ running once the system is up. usbhid.mousepoll= [USBHID] The interval which mice are to be polled at. + vdso= [IA-32] + vdso=1: enable VDSO (default) + vdso=0: disable VDSO mapping + video= [FB] Frame buffer configuration See Documentation/fb/modedb.txt. @@ -1674,9 +1719,14 @@ running once the system is up. decrease the size and leave more room for directly mapped kernel RAM. - vmhalt= [KNL,S390] + vmhalt= [KNL,S390] Perform z/VM CP command after system halt. + Format: <command> + + vmpanic= [KNL,S390] Perform z/VM CP command after kernel panic. + Format: <command> - vmpoff= [KNL,S390] + vmpoff= [KNL,S390] Perform z/VM CP command after power off. + Format: <command> waveartist= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2> diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt index 22488d791168..c1f64fdf84cb 100644 --- a/Documentation/keys-request-key.txt +++ b/Documentation/keys-request-key.txt @@ -3,16 +3,23 @@ =================== The key request service is part of the key retention service (refer to -Documentation/keys.txt). This document explains more fully how that the -requesting algorithm works. +Documentation/keys.txt). This document explains more fully how the requesting +algorithm works. The process starts by either the kernel requesting a service by calling -request_key(): +request_key*(): struct key *request_key(const struct key_type *type, const char *description, const char *callout_string); +or: + + struct key *request_key_with_auxdata(const struct key_type *type, + const char *description, + const char *callout_string, + void *aux); + Or by userspace invoking the request_key system call: key_serial_t request_key(const char *type, @@ -20,16 +27,26 @@ Or by userspace invoking the request_key system call: const char *callout_info, key_serial_t dest_keyring); -The main difference between the two access points is that the in-kernel -interface does not need to link the key to a keyring to prevent it from being -immediately destroyed. The kernel interface returns a pointer directly to the -key, and it's up to the caller to destroy the key. +The main difference between the access points is that the in-kernel interface +does not need to link the key to a keyring to prevent it from being immediately +destroyed. The kernel interface returns a pointer directly to the key, and +it's up to the caller to destroy the key. + +The request_key_with_auxdata() call is like the in-kernel request_key() call, +except that it permits auxiliary data to be passed to the upcaller (the default +is NULL). This is only useful for those key types that define their own upcall +mechanism rather than using /sbin/request-key. The userspace interface links the key to a keyring associated with the process to prevent the key from going away, and returns the serial number of the key to the caller. +The following example assumes that the key types involved don't define their +own upcall mechanisms. If they do, then those should be substituted for the +forking and execution of /sbin/request-key. + + =========== THE PROCESS =========== @@ -40,8 +57,8 @@ A request proceeds in the following manner: interface]. (2) request_key() searches the process's subscribed keyrings to see if there's - a suitable key there. If there is, it returns the key. If there isn't, and - callout_info is not set, an error is returned. Otherwise the process + a suitable key there. If there is, it returns the key. If there isn't, + and callout_info is not set, an error is returned. Otherwise the process proceeds to the next step. (3) request_key() sees that A doesn't have the desired key yet, so it creates @@ -62,7 +79,7 @@ A request proceeds in the following manner: instantiation. (7) The program may want to access another key from A's context (say a - Kerberos TGT key). It just requests the appropriate key, and the keyring + Kerberos TGT key). It just requests the appropriate key, and the keyring search notes that the session keyring has auth key V in its bottom level. This will permit it to then search the keyrings of process A with the @@ -79,10 +96,11 @@ A request proceeds in the following manner: (10) The program then exits 0 and request_key() deletes key V and returns key U to the caller. -This also extends further. If key W (step 7 above) didn't exist, key W would be -created uninstantiated, another auth key (X) would be created (as per step 3) -and another copy of /sbin/request-key spawned (as per step 4); but the context -specified by auth key X will still be process A, as it was in auth key V. +This also extends further. If key W (step 7 above) didn't exist, key W would +be created uninstantiated, another auth key (X) would be created (as per step +3) and another copy of /sbin/request-key spawned (as per step 4); but the +context specified by auth key X will still be process A, as it was in auth key +V. This is because process A's keyrings can't simply be attached to /sbin/request-key at the appropriate places because (a) execve will discard two @@ -118,17 +136,17 @@ A search of any particular keyring proceeds in the following fashion: (2) It considers all the non-keyring keys within that keyring and, if any key matches the criteria specified, calls key_permission(SEARCH) on it to see - if the key is allowed to be found. If it is, that key is returned; if + if the key is allowed to be found. If it is, that key is returned; if not, the search continues, and the error code is retained if of higher priority than the one currently set. (3) It then considers all the keyring-type keys in the keyring it's currently - searching. It calls key_permission(SEARCH) on each keyring, and if this + searching. It calls key_permission(SEARCH) on each keyring, and if this grants permission, it recurses, executing steps (2) and (3) on that keyring. The process stops immediately a valid key is found with permission granted to -use it. Any error from a previous match attempt is discarded and the key is +use it. Any error from a previous match attempt is discarded and the key is returned. When search_process_keyrings() is invoked, it performs the following searches @@ -153,7 +171,7 @@ The moment one succeeds, all pending errors are discarded and the found key is returned. Only if all these fail does the whole thing fail with the highest priority -error. Note that several errors may have come from LSM. +error. Note that several errors may have come from LSM. The error priority is: diff --git a/Documentation/keys.txt b/Documentation/keys.txt index 3bbe157b45e4..e373f0212843 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -241,25 +241,30 @@ The security class "key" has been added to SELinux so that mandatory access controls can be applied to keys created within various contexts. This support is preliminary, and is likely to change quite significantly in the near future. Currently, all of the basic permissions explained above are provided in SELinux -as well; SE Linux is simply invoked after all basic permission checks have been +as well; SELinux is simply invoked after all basic permission checks have been performed. -Each key is labeled with the same context as the task to which it belongs. -Typically, this is the same task that was running when the key was created. -The default keyrings are handled differently, but in a way that is very -intuitive: +The value of the file /proc/self/attr/keycreate influences the labeling of +newly-created keys. If the contents of that file correspond to an SELinux +security context, then the key will be assigned that context. Otherwise, the +key will be assigned the current context of the task that invoked the key +creation request. Tasks must be granted explicit permission to assign a +particular context to newly-created keys, using the "create" permission in the +key security class. - (*) The user and user session keyrings that are created when the user logs in - are currently labeled with the context of the login manager. - - (*) The keyrings associated with new threads are each labeled with the context - of their associated thread, and both session and process keyrings are - handled similarly. +The default keyrings associated with users will be labeled with the default +context of the user if and only if the login programs have been instrumented to +properly initialize keycreate during the login process. Otherwise, they will +be labeled with the context of the login program itself. Note, however, that the default keyrings associated with the root user are labeled with the default kernel context, since they are created early in the boot process, before root has a chance to log in. +The keyrings associated with new threads are each labeled with the context of +their associated thread, and both session and process keyrings are handled +similarly. + ================ NEW PROCFS FILES @@ -270,9 +275,17 @@ about the status of the key service: (*) /proc/keys - This lists all the keys on the system, giving information about their - type, description and permissions. The payload of the key is not available - this way: + This lists the keys that are currently viewable by the task reading the + file, giving information about their type, description and permissions. + It is not possible to view the payload of the key this way, though some + information about it may be given. + + The only keys included in the list are those that grant View permission to + the reading process whether or not it possesses them. Note that LSM + security checks are still performed, and may further filter out keys that + the current process is not authorised to view. + + The contents of the file look like this: SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY 00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4 @@ -300,7 +313,7 @@ about the status of the key service: (*) /proc/key-users This file lists the tracking data for each user that has at least one key - on the system. Such data includes quota information and statistics: + on the system. Such data includes quota information and statistics: [root@andromeda root]# cat /proc/key-users 0: 46 45/45 1/100 13/10000 @@ -767,6 +780,17 @@ payload contents" for more information. See also Documentation/keys-request-key.txt. +(*) To search for a key, passing auxiliary data to the upcaller, call: + + struct key *request_key_with_auxdata(const struct key_type *type, + const char *description, + const char *callout_string, + void *aux); + + This is identical to request_key(), except that the auxiliary data is + passed to the key_type->request_key() op if it exists. + + (*) When it is no longer required, the key should be released using: void key_put(struct key *key); @@ -1018,6 +1042,24 @@ The structure has a number of fields, some of which are mandatory: as might happen when the userspace buffer is accessed. + (*) int (*request_key)(struct key *key, struct key *authkey, const char *op, + void *aux); + + This method is optional. If provided, request_key() and + request_key_with_auxdata() will invoke this function rather than + upcalling to /sbin/request-key to operate upon a key of this type. + + The aux parameter is as passed to request_key_with_auxdata() or is NULL + otherwise. Also passed are the key to be operated upon, the + authorisation key for this operation and the operation type (currently + only "create"). + + This function should return only when the upcall is complete. Upon return + the authorisation key will be revoked, and the target key will be + negatively instantiated if it is still uninstantiated. The error will be + returned to the caller of request_key*(). + + ============================ REQUEST-KEY CALLBACK SERVICE ============================ diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt index 8d9bffbd192c..949f7b5a2053 100644 --- a/Documentation/kobject.txt +++ b/Documentation/kobject.txt @@ -247,7 +247,7 @@ the object-specific fields, which include: - default_attrs: Default attributes to be exported via sysfs when the object is registered.Note that the last attribute has to be initialized to NULL ! You can find a complete implementation - in drivers/block/genhd.c + in block/genhd.c Instances of struct kobj_type are not registered; only referenced by diff --git a/Documentation/lockdep-design.txt b/Documentation/lockdep-design.txt new file mode 100644 index 000000000000..00d93605bfd3 --- /dev/null +++ b/Documentation/lockdep-design.txt @@ -0,0 +1,197 @@ +Runtime locking correctness validator +===================================== + +started by Ingo Molnar <mingo@redhat.com> +additions by Arjan van de Ven <arjan@linux.intel.com> + +Lock-class +---------- + +The basic object the validator operates upon is a 'class' of locks. + +A class of locks is a group of locks that are logically the same with +respect to locking rules, even if the locks may have multiple (possibly +tens of thousands of) instantiations. For example a lock in the inode +struct is one class, while each inode has its own instantiation of that +lock class. + +The validator tracks the 'state' of lock-classes, and it tracks +dependencies between different lock-classes. The validator maintains a +rolling proof that the state and the dependencies are correct. + +Unlike an lock instantiation, the lock-class itself never goes away: when +a lock-class is used for the first time after bootup it gets registered, +and all subsequent uses of that lock-class will be attached to this +lock-class. + +State +----- + +The validator tracks lock-class usage history into 5 separate state bits: + +- 'ever held in hardirq context' [ == hardirq-safe ] +- 'ever held in softirq context' [ == softirq-safe ] +- 'ever held with hardirqs enabled' [ == hardirq-unsafe ] +- 'ever held with softirqs and hardirqs enabled' [ == softirq-unsafe ] + +- 'ever used' [ == !unused ] + +Single-lock state rules: +------------------------ + +A softirq-unsafe lock-class is automatically hardirq-unsafe as well. The +following states are exclusive, and only one of them is allowed to be +set for any lock-class: + + <hardirq-safe> and <hardirq-unsafe> + <softirq-safe> and <softirq-unsafe> + +The validator detects and reports lock usage that violate these +single-lock state rules. + +Multi-lock dependency rules: +---------------------------- + +The same lock-class must not be acquired twice, because this could lead +to lock recursion deadlocks. + +Furthermore, two locks may not be taken in different order: + + <L1> -> <L2> + <L2> -> <L1> + +because this could lead to lock inversion deadlocks. (The validator +finds such dependencies in arbitrary complexity, i.e. there can be any +other locking sequence between the acquire-lock operations, the +validator will still track all dependencies between locks.) + +Furthermore, the following usage based lock dependencies are not allowed +between any two lock-classes: + + <hardirq-safe> -> <hardirq-unsafe> + <softirq-safe> -> <softirq-unsafe> + +The first rule comes from the fact the a hardirq-safe lock could be +taken by a hardirq context, interrupting a hardirq-unsafe lock - and +thus could result in a lock inversion deadlock. Likewise, a softirq-safe +lock could be taken by an softirq context, interrupting a softirq-unsafe +lock. + +The above rules are enforced for any locking sequence that occurs in the +kernel: when acquiring a new lock, the validator checks whether there is +any rule violation between the new lock and any of the held locks. + +When a lock-class changes its state, the following aspects of the above +dependency rules are enforced: + +- if a new hardirq-safe lock is discovered, we check whether it + took any hardirq-unsafe lock in the past. + +- if a new softirq-safe lock is discovered, we check whether it took + any softirq-unsafe lock in the past. + +- if a new hardirq-unsafe lock is discovered, we check whether any + hardirq-safe lock took it in the past. + +- if a new softirq-unsafe lock is discovered, we check whether any + softirq-safe lock took it in the past. + +(Again, we do these checks too on the basis that an interrupt context +could interrupt _any_ of the irq-unsafe or hardirq-unsafe locks, which +could lead to a lock inversion deadlock - even if that lock scenario did +not trigger in practice yet.) + +Exception: Nested data dependencies leading to nested locking +------------------------------------------------------------- + +There are a few cases where the Linux kernel acquires more than one +instance of the same lock-class. Such cases typically happen when there +is some sort of hierarchy within objects of the same type. In these +cases there is an inherent "natural" ordering between the two objects +(defined by the properties of the hierarchy), and the kernel grabs the +locks in this fixed order on each of the objects. + +An example of such an object hieararchy that results in "nested locking" +is that of a "whole disk" block-dev object and a "partition" block-dev +object; the partition is "part of" the whole device and as long as one +always takes the whole disk lock as a higher lock than the partition +lock, the lock ordering is fully correct. The validator does not +automatically detect this natural ordering, as the locking rule behind +the ordering is not static. + +In order to teach the validator about this correct usage model, new +versions of the various locking primitives were added that allow you to +specify a "nesting level". An example call, for the block device mutex, +looks like this: + +enum bdev_bd_mutex_lock_class +{ + BD_MUTEX_NORMAL, + BD_MUTEX_WHOLE, + BD_MUTEX_PARTITION +}; + + mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION); + +In this case the locking is done on a bdev object that is known to be a +partition. + +The validator treats a lock that is taken in such a nested fasion as a +separate (sub)class for the purposes of validation. + +Note: When changing code to use the _nested() primitives, be careful and +check really thoroughly that the hiearchy is correctly mapped; otherwise +you can get false positives or false negatives. + +Proof of 100% correctness: +-------------------------- + +The validator achieves perfect, mathematical 'closure' (proof of locking +correctness) in the sense that for every simple, standalone single-task +locking sequence that occured at least once during the lifetime of the +kernel, the validator proves it with a 100% certainty that no +combination and timing of these locking sequences can cause any class of +lock related deadlock. [*] + +I.e. complex multi-CPU and multi-task locking scenarios do not have to +occur in practice to prove a deadlock: only the simple 'component' +locking chains have to occur at least once (anytime, in any +task/context) for the validator to be able to prove correctness. (For +example, complex deadlocks that would normally need more than 3 CPUs and +a very unlikely constellation of tasks, irq-contexts and timings to +occur, can be detected on a plain, lightly loaded single-CPU system as +well!) + +This radically decreases the complexity of locking related QA of the +kernel: what has to be done during QA is to trigger as many "simple" +single-task locking dependencies in the kernel as possible, at least +once, to prove locking correctness - instead of having to trigger every +possible combination of locking interaction between CPUs, combined with +every possible hardirq and softirq nesting scenario (which is impossible +to do in practice). + +[*] assuming that the validator itself is 100% correct, and no other + part of the system corrupts the state of the validator in any way. + We also assume that all NMI/SMM paths [which could interrupt + even hardirq-disabled codepaths] are correct and do not interfere + with the validator. We also assume that the 64-bit 'chain hash' + value is unique for every lock-chain in the system. Also, lock + recursion must not be higher than 20. + +Performance: +------------ + +The above rules require _massive_ amounts of runtime checking. If we did +that for every lock taken and for every irqs-enable event, it would +render the system practically unusably slow. The complexity of checking +is O(N^2), so even with just a few hundred lock-classes we'd have to do +tens of thousands of checks for every event. + +This problem is solved by checking any given 'locking scenario' (unique +sequence of locks taken after each other) only once. A simple stack of +held locks is maintained, and a lightweight 64-bit hash value is +calculated, which hash is unique for every lock chain. The hash value, +when the chain is validated for the first time, is then put into a hash +table, which hash-table can be checked in a lockfree manner. If the +locking chain occurs again later on, the hash table tells us that we +dont have to validate the chain again. diff --git a/Documentation/md.txt b/Documentation/md.txt index 03a13c462cf2..0668f9dc9d29 100644 --- a/Documentation/md.txt +++ b/Documentation/md.txt @@ -200,6 +200,17 @@ All md devices contain: This can be written only while the array is being assembled, not after it is started. + layout + The "layout" for the array for the particular level. This is + simply a number that is interpretted differently by different + levels. It can be written while assembling an array. + + resync_start + The point at which resync should start. If no resync is needed, + this will be a very large number. At array creation it will + default to 0, though starting the array as 'clean' will + set it much larger. + new_dev This file can be written but not read. The value written should be a block device number as major:minor. e.g. 8:0 @@ -207,6 +218,54 @@ All md devices contain: available. It will then appear at md/dev-XXX (depending on the name of the device) and further configuration is then possible. + safe_mode_delay + When an md array has seen no write requests for a certain period + of time, it will be marked as 'clean'. When another write + request arrive, the array is marked as 'dirty' before the write + commenses. This is known as 'safe_mode'. + The 'certain period' is controlled by this file which stores the + period as a number of seconds. The default is 200msec (0.200). + Writing a value of 0 disables safemode. + + array_state + This file contains a single word which describes the current + state of the array. In many cases, the state can be set by + writing the word for the desired state, however some states + cannot be explicitly set, and some transitions are not allowed. + + clear + No devices, no size, no level + Writing is equivalent to STOP_ARRAY ioctl + inactive + May have some settings, but array is not active + all IO results in error + When written, doesn't tear down array, but just stops it + suspended (not supported yet) + All IO requests will block. The array can be reconfigured. + Writing this, if accepted, will block until array is quiessent + readonly + no resync can happen. no superblocks get written. + write requests fail + read-auto + like readonly, but behaves like 'clean' on a write request. + + clean - no pending writes, but otherwise active. + When written to inactive array, starts without resync + If a write request arrives then + if metadata is known, mark 'dirty' and switch to 'active'. + if not known, block and switch to write-pending + If written to an active array that has pending writes, then fails. + active + fully active: IO and resync can be happening. + When written to inactive array, starts with resync + + write-pending + clean, but writes are blocked waiting for 'active' to be written. + + active-idle + like active, but no writes have been seen for a while (safe_mode_delay). + + sync_speed_min sync_speed_max This are similar to /proc/sys/dev/raid/speed_limit_{min,max} @@ -250,10 +309,18 @@ Each directory contains: faulty - device has been kicked from active use due to a detected fault in_sync - device is a fully in-sync member of the array + writemostly - device will only be subject to read + requests if there are no other options. + This applies only to raid1 arrays. spare - device is working, but not a full member. This includes spares that are in the process of being recoverred to This list make grow in future. + This can be written to. + Writing "faulty" simulates a failure on the device. + Writing "remove" removes the device from the array. + Writing "writemostly" sets the writemostly flag. + Writing "-writemostly" clears the writemostly flag. errors An approximate count of read errors that have been detected on diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 4710845dbac4..46b9b389df35 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -262,9 +262,14 @@ What is required is some way of intervening to instruct the compiler and the CPU to restrict the order. Memory barriers are such interventions. They impose a perceived partial -ordering between the memory operations specified on either side of the barrier. -They request that the sequence of memory events generated appears to other -parts of the system as if the barrier is effective on that CPU. +ordering over the memory operations on either side of the barrier. + +Such enforcement is important because the CPUs and other devices in a system +can use a variety of tricks to improve performance - including reordering, +deferral and combination of memory operations; speculative loads; speculative +branch prediction and various types of caching. Memory barriers are used to +override or suppress these tricks, allowing the code to sanely control the +interaction of multiple CPUs and/or devices. VARIETIES OF MEMORY BARRIER @@ -282,7 +287,7 @@ Memory barriers come in four basic varieties: A write barrier is a partial ordering on stores only; it is not required to have any effect on loads. - A CPU can be viewed as as commiting a sequence of store operations to the + A CPU can be viewed as committing a sequence of store operations to the memory system as time progresses. All stores before a write barrier will occur in the sequence _before_ all the stores after the write barrier. @@ -413,7 +418,7 @@ There are certain things that the Linux kernel memory barriers do not guarantee: indirect effect will be the order in which the second CPU sees the effects of the first CPU's accesses occur, but see the next point: - (*) There is no guarantee that the a CPU will see the correct order of effects + (*) There is no guarantee that a CPU will see the correct order of effects from a second CPU's accesses, even _if_ the second CPU uses a memory barrier, unless the first CPU _also_ uses a matching memory barrier (see the subsection on "SMP Barrier Pairing"). @@ -461,8 +466,8 @@ Whilst this may seem like a failure of coherency or causality maintenance, it isn't, and this behaviour can be observed on certain real CPUs (such as the DEC Alpha). -To deal with this, a data dependency barrier must be inserted between the -address load and the data load: +To deal with this, a data dependency barrier or better must be inserted +between the address load and the data load: CPU 1 CPU 2 =============== =============== @@ -484,7 +489,7 @@ lines. The pointer P might be stored in an odd-numbered cache line, and the variable B might be stored in an even-numbered cache line. Then, if the even-numbered bank of the reading CPU's cache is extremely busy while the odd-numbered bank is idle, one can see the new value of the pointer P (&B), -but the old value of the variable B (1). +but the old value of the variable B (2). Another example of where data dependency barriers might by required is where a @@ -597,7 +602,7 @@ Consider the following sequence of events: This sequence of events is committed to the memory coherence system in an order that the rest of the system might perceive as the unordered set of { STORE A, -STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E +STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E }: +-------+ : : @@ -744,7 +749,7 @@ some effectively random order, despite the write barrier issued by CPU 1: : : -If, however, a read barrier were to be placed between the load of E and the +If, however, a read barrier were to be placed between the load of B and the load of A on CPU 2: CPU 1 CPU 2 @@ -1010,10 +1015,9 @@ CPU from reordering them. There are some more advanced barrier functions: (*) set_mb(var, value) - (*) set_wmb(var, value) - These assign the value to the variable and then insert at least a write - barrier after it, depending on the function. They aren't guaranteed to + This assigns the value to the variable and then inserts at least a write + barrier after it, depending on the function. It isn't guaranteed to insert anything more than a compiler barrier in a UP compilation. @@ -1461,9 +1465,8 @@ instruction itself is complete. On a UP system - where this wouldn't be a problem - the smp_mb() is just a compiler barrier, thus making sure the compiler emits the instructions in the -right order without actually intervening in the CPU. Since there there's only -one CPU, that CPU's dependency ordering logic will take care of everything -else. +right order without actually intervening in the CPU. Since there's only one +CPU, that CPU's dependency ordering logic will take care of everything else. ATOMIC OPERATIONS @@ -1640,9 +1643,9 @@ functions: The PCI bus, amongst others, defines an I/O space concept - which on such CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O - space. However, it may also mapped as a virtual I/O space in the CPU's - memory map, particularly on those CPUs that don't support alternate - I/O spaces. + space. However, it may also be mapped as a virtual I/O space in the CPU's + memory map, particularly on those CPUs that don't support alternate I/O + spaces. Accesses to this space may be fully synchronous (as on i386), but intermediary bridges (such as the PCI host bridge) may not fully honour diff --git a/Documentation/mips/time.README b/Documentation/mips/time.README index 70bc0dd43d6d..69ddc5c14b79 100644 --- a/Documentation/mips/time.README +++ b/Documentation/mips/time.README @@ -65,7 +65,7 @@ the following functions or values: 1. (optional) set up RTC routines 2. (optional) calibrate and set the mips_counter_frequency - b) board_timer_setup - a function pointer. Invoked at the end of time_init() + b) plat_timer_setup - a function pointer. Invoked at the end of time_init() 1. (optional) over-ride any decisions made in time_init() 2. set up the irqaction for timer interrupt. 3. enable the timer interrupt @@ -116,19 +116,17 @@ Step 2: the machine setup() function If you supply board_time_init(), set the function poointer. - Set the function pointer board_timer_setup() (mandatory) - -Step 3: implement rtc routines, board_time_init() and board_timer_setup() +Step 3: implement rtc routines, board_time_init() and plat_timer_setup() if needed. - board_time_init() - + board_time_init() - a) (optional) set up RTC routines, b) (optional) calibrate and set the mips_counter_frequency (only needed if you intended to use fixed_rate_gettimeoffset or use cpu counter as timer interrupt source) - board_timer_setup() - + plat_timer_setup() - a) (optional) over-write any choices made above by time_init(). b) machine specific code should setup the timer irqaction. c) enable the timer interrupt diff --git a/Documentation/netlabel/00-INDEX b/Documentation/netlabel/00-INDEX new file mode 100644 index 000000000000..837bf35990e2 --- /dev/null +++ b/Documentation/netlabel/00-INDEX @@ -0,0 +1,10 @@ +00-INDEX + - this file. +cipso_ipv4.txt + - documentation on the IPv4 CIPSO protocol engine. +draft-ietf-cipso-ipsecurity-01.txt + - IETF draft of the CIPSO protocol, dated 16 July 1992. +introduction.txt + - NetLabel introduction, READ THIS FIRST. +lsm_interface.txt + - documentation on the NetLabel kernel security module API. diff --git a/Documentation/netlabel/cipso_ipv4.txt b/Documentation/netlabel/cipso_ipv4.txt new file mode 100644 index 000000000000..93dacb132c3c --- /dev/null +++ b/Documentation/netlabel/cipso_ipv4.txt @@ -0,0 +1,48 @@ +NetLabel CIPSO/IPv4 Protocol Engine +============================================================================== +Paul Moore, paul.moore@hp.com + +May 17, 2006 + + * Overview + +The NetLabel CIPSO/IPv4 protocol engine is based on the IETF Commercial IP +Security Option (CIPSO) draft from July 16, 1992. A copy of this draft can be +found in this directory, consult '00-INDEX' for the filename. While the IETF +draft never made it to an RFC standard it has become a de-facto standard for +labeled networking and is used in many trusted operating systems. + + * Outbound Packet Processing + +The CIPSO/IPv4 protocol engine applies the CIPSO IP option to packets by +adding the CIPSO label to the socket. This causes all packets leaving the +system through the socket to have the CIPSO IP option applied. The socket's +CIPSO label can be changed at any point in time, however, it is recommended +that it is set upon the socket's creation. The LSM can set the socket's CIPSO +label by using the NetLabel security module API; if the NetLabel "domain" is +configured to use CIPSO for packet labeling then a CIPSO IP option will be +generated and attached to the socket. + + * Inbound Packet Processing + +The CIPSO/IPv4 protocol engine validates every CIPSO IP option it finds at the +IP layer without any special handling required by the LSM. However, in order +to decode and translate the CIPSO label on the packet the LSM must use the +NetLabel security module API to extract the security attributes of the packet. +This is typically done at the socket layer using the 'socket_sock_rcv_skb()' +LSM hook. + + * Label Translation + +The CIPSO/IPv4 protocol engine contains a mechanism to translate CIPSO security +attributes such as sensitivity level and category to values which are +appropriate for the host. These mappings are defined as part of a CIPSO +Domain Of Interpretation (DOI) definition and are configured through the +NetLabel user space communication layer. Each DOI definition can have a +different security attribute mapping table. + + * Label Translation Cache + +The NetLabel system provides a framework for caching security attribute +mappings from the network labels to the corresponding LSM identifiers. The +CIPSO/IPv4 protocol engine supports this caching mechanism. diff --git a/Documentation/netlabel/draft-ietf-cipso-ipsecurity-01.txt b/Documentation/netlabel/draft-ietf-cipso-ipsecurity-01.txt new file mode 100644 index 000000000000..256c2c9d4f50 --- /dev/null +++ b/Documentation/netlabel/draft-ietf-cipso-ipsecurity-01.txt @@ -0,0 +1,791 @@ +IETF CIPSO Working Group +16 July, 1992 + + + + COMMERCIAL IP SECURITY OPTION (CIPSO 2.2) + + + +1. Status + +This Internet Draft provides the high level specification for a Commercial +IP Security Option (CIPSO). This draft reflects the version as approved by +the CIPSO IETF Working Group. Distribution of this memo is unlimited. + +This document is an Internet Draft. Internet Drafts are working documents +of the Internet Engineering Task Force (IETF), its Areas, and its Working +Groups. Note that other groups may also distribute working documents as +Internet Drafts. + +Internet Drafts are draft documents valid for a maximum of six months. +Internet Drafts may be updated, replaced, or obsoleted by other documents +at any time. It is not appropriate to use Internet Drafts as reference +material or to cite them other than as a "working draft" or "work in +progress." + +Please check the I-D abstract listing contained in each Internet Draft +directory to learn the current status of this or any other Internet Draft. + + + + +2. Background + +Currently the Internet Protocol includes two security options. One of +these options is the DoD Basic Security Option (BSO) (Type 130) which allows +IP datagrams to be labeled with security classifications. This option +provides sixteen security classifications and a variable number of handling +restrictions. To handle additional security information, such as security +categories or compartments, another security option (Type 133) exists and +is referred to as the DoD Extended Security Option (ESO). The values for +the fixed fields within these two options are administered by the Defense +Information Systems Agency (DISA). + +Computer vendors are now building commercial operating systems with +mandatory access controls and multi-level security. These systems are +no longer built specifically for a particular group in the defense or +intelligence communities. They are generally available commercial systems +for use in a variety of government and civil sector environments. + +The small number of ESO format codes can not support all the possible +applications of a commercial security option. The BSO and ESO were +designed to only support the United States DoD. CIPSO has been designed +to support multiple security policies. This Internet Draft provides the +format and procedures required to support a Mandatory Access Control +security policy. Support for additional security policies shall be +defined in future RFCs. + + + + +Internet Draft, Expires 15 Jan 93 [PAGE 1] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + + +3. CIPSO Format + +Option type: 134 (Class 0, Number 6, Copy on Fragmentation) +Option length: Variable + +This option permits security related information to be passed between +systems within a single Domain of Interpretation (DOI). A DOI is a +collection of systems which agree on the meaning of particular values +in the security option. An authority that has been assigned a DOI +identifier will define a mapping between appropriate CIPSO field values +and their human readable equivalent. This authority will distribute that +mapping to hosts within the authority's domain. These mappings may be +sensitive, therefore a DOI authority is not required to make these +mappings available to anyone other than the systems that are included in +the DOI. + +This option MUST be copied on fragmentation. This option appears at most +once in a datagram. All multi-octet fields in the option are defined to be +transmitted in network byte order. The format of this option is as follows: + ++----------+----------+------//------+-----------//---------+ +| 10000110 | LLLLLLLL | DDDDDDDDDDDD | TTTTTTTTTTTTTTTTTTTT | ++----------+----------+------//------+-----------//---------+ + + TYPE=134 OPTION DOMAIN OF TAGS + LENGTH INTERPRETATION + + + Figure 1. CIPSO Format + + +3.1 Type + +This field is 1 octet in length. Its value is 134. + + +3.2 Length + +This field is 1 octet in length. It is the total length of the option +including the type and length fields. With the current IP header length +restriction of 40 octets the value of this field MUST not exceed 40. + + +3.3 Domain of Interpretation Identifier + +This field is an unsigned 32 bit integer. The value 0 is reserved and MUST +not appear as the DOI identifier in any CIPSO option. Implementations +should assume that the DOI identifier field is not aligned on any particular +byte boundary. + +To conserve space in the protocol, security levels and categories are +represented by numbers rather than their ASCII equivalent. This requires +a mapping table within CIPSO hosts to map these numbers to their +corresponding ASCII representations. Non-related groups of systems may + + + +Internet Draft, Expires 15 Jan 93 [PAGE 2] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +have their own unique mappings. For example, one group of systems may +use the number 5 to represent Unclassified while another group may use the +number 1 to represent that same security level. The DOI identifier is used +to identify which mapping was used for the values within the option. + + +3.4 Tag Types + +A common format for passing security related information is necessary +for interoperability. CIPSO uses sets of "tags" to contain the security +information relevant to the data in the IP packet. Each tag begins with +a tag type identifier followed by the length of the tag and ends with the +actual security information to be passed. All multi-octet fields in a tag +are defined to be transmitted in network byte order. Like the DOI +identifier field in the CIPSO header, implementations should assume that +all tags, as well as fields within a tag, are not aligned on any particular +octet boundary. The tag types defined in this document contain alignment +bytes to assist alignment of some information, however alignment can not +be guaranteed if CIPSO is not the first IP option. + +CIPSO tag types 0 through 127 are reserved for defining standard tag +formats. Their definitions will be published in RFCs. Tag types whose +identifiers are greater than 127 are defined by the DOI authority and may +only be meaningful in certain Domains of Interpretation. For these tag +types, implementations will require the DOI identifier as well as the tag +number to determine the security policy and the format associated with the +tag. Use of tag types above 127 are restricted to closed networks where +interoperability with other networks will not be an issue. Implementations +that support a tag type greater than 127 MUST support at least one DOI that +requires only tag types 1 to 127. + +Tag type 0 is reserved. Tag types 1, 2, and 5 are defined in this +Internet Draft. Types 3 and 4 are reserved for work in progress. +The standard format for all current and future CIPSO tags is shown below: + ++----------+----------+--------//--------+ +| TTTTTTTT | LLLLLLLL | IIIIIIIIIIIIIIII | ++----------+----------+--------//--------+ + TAG TAG TAG + TYPE LENGTH INFORMATION + + Figure 2: Standard Tag Format + +In the three tag types described in this document, the length and count +restrictions are based on the current IP limitation of 40 octets for all +IP options. If the IP header is later expanded, then the length and count +restrictions specified in this document may increase to use the full area +provided for IP options. + + +3.4.1 Tag Type Classes + +Tag classes consist of tag types that have common processing requirements +and support the same security policy. The three tags defined in this +Internet Draft belong to the Mandatory Access Control (MAC) Sensitivity + + + +Internet Draft, Expires 15 Jan 93 [PAGE 3] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +class and support the MAC Sensitivity security policy. + + +3.4.2 Tag Type 1 + +This is referred to as the "bit-mapped" tag type. Tag type 1 is included +in the MAC Sensitivity tag type class. The format of this tag type is as +follows: + ++----------+----------+----------+----------+--------//---------+ +| 00000001 | LLLLLLLL | 00000000 | LLLLLLLL | CCCCCCCCCCCCCCCCC | ++----------+----------+----------+----------+--------//---------+ + + TAG TAG ALIGNMENT SENSITIVITY BIT MAP OF + TYPE LENGTH OCTET LEVEL CATEGORIES + + Figure 3. Tag Type 1 Format + + +3.4.2.1 Tag Type + +This field is 1 octet in length and has a value of 1. + + +3.4.2.2 Tag Length + +This field is 1 octet in length. It is the total length of the tag type +including the type and length fields. With the current IP header length +restriction of 40 bytes the value within this field is between 4 and 34. + + +3.4.2.3 Alignment Octet + +This field is 1 octet in length and always has the value of 0. Its purpose +is to align the category bitmap field on an even octet boundary. This will +speed many implementations including router implementations. + + +3.4.2.4 Sensitivity Level + +This field is 1 octet in length. Its value is from 0 to 255. The values +are ordered with 0 being the minimum value and 255 representing the maximum +value. + + +3.4.2.5 Bit Map of Categories + +The length of this field is variable and ranges from 0 to 30 octets. This +provides representation of categories 0 to 239. The ordering of the bits +is left to right or MSB to LSB. For example category 0 is represented by +the most significant bit of the first byte and category 15 is represented +by the least significant bit of the second byte. Figure 4 graphically +shows this ordering. Bit N is binary 1 if category N is part of the label +for the datagram, and bit N is binary 0 if category N is not part of the +label. Except for the optimized tag 1 format described in the next section, + + + +Internet Draft, Expires 15 Jan 93 [PAGE 4] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +minimal encoding SHOULD be used resulting in no trailing zero octets in the +category bitmap. + + octet 0 octet 1 octet 2 octet 3 octet 4 octet 5 + XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX . . . +bit 01234567 89111111 11112222 22222233 33333333 44444444 +number 012345 67890123 45678901 23456789 01234567 + + Figure 4. Ordering of Bits in Tag 1 Bit Map + + +3.4.2.6 Optimized Tag 1 Format + +Routers work most efficiently when processing fixed length fields. To +support these routers there is an optimized form of tag type 1. The format +does not change. The only change is to the category bitmap which is set to +a constant length of 10 octets. Trailing octets required to fill out the 10 +octets are zero filled. Ten octets, allowing for 80 categories, was chosen +because it makes the total length of the CIPSO option 20 octets. If CIPSO +is the only option then the option will be full word aligned and additional +filler octets will not be required. + + +3.4.3 Tag Type 2 + +This is referred to as the "enumerated" tag type. It is used to describe +large but sparsely populated sets of categories. Tag type 2 is in the MAC +Sensitivity tag type class. The format of this tag type is as follows: + ++----------+----------+----------+----------+-------------//-------------+ +| 00000010 | LLLLLLLL | 00000000 | LLLLLLLL | CCCCCCCCCCCCCCCCCCCCCCCCCC | ++----------+----------+----------+----------+-------------//-------------+ + + TAG TAG ALIGNMENT SENSITIVITY ENUMERATED + TYPE LENGTH OCTET LEVEL CATEGORIES + + Figure 5. Tag Type 2 Format + + +3.4.3.1 Tag Type + +This field is one octet in length and has a value of 2. + + +3.4.3.2 Tag Length + +This field is 1 octet in length. It is the total length of the tag type +including the type and length fields. With the current IP header length +restriction of 40 bytes the value within this field is between 4 and 34. + + +3.4.3.3 Alignment Octet + +This field is 1 octet in length and always has the value of 0. Its purpose +is to align the category field on an even octet boundary. This will + + + +Internet Draft, Expires 15 Jan 93 [PAGE 5] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +speed many implementations including router implementations. + + +3.4.3.4 Sensitivity Level + +This field is 1 octet in length. Its value is from 0 to 255. The values +are ordered with 0 being the minimum value and 255 representing the +maximum value. + + +3.4.3.5 Enumerated Categories + +In this tag, categories are represented by their actual value rather than +by their position within a bit field. The length of each category is 2 +octets. Up to 15 categories may be represented by this tag. Valid values +for categories are 0 to 65534. Category 65535 is not a valid category +value. The categories MUST be listed in ascending order within the tag. + + +3.4.4 Tag Type 5 + +This is referred to as the "range" tag type. It is used to represent +labels where all categories in a range, or set of ranges, are included +in the sensitivity label. Tag type 5 is in the MAC Sensitivity tag type +class. The format of this tag type is as follows: + ++----------+----------+----------+----------+------------//-------------+ +| 00000101 | LLLLLLLL | 00000000 | LLLLLLLL | Top/Bottom | Top/Bottom | ++----------+----------+----------+----------+------------//-------------+ + + TAG TAG ALIGNMENT SENSITIVITY CATEGORY RANGES + TYPE LENGTH OCTET LEVEL + + Figure 6. Tag Type 5 Format + + +3.4.4.1 Tag Type + +This field is one octet in length and has a value of 5. + + +3.4.4.2 Tag Length + +This field is 1 octet in length. It is the total length of the tag type +including the type and length fields. With the current IP header length +restriction of 40 bytes the value within this field is between 4 and 34. + + +3.4.4.3 Alignment Octet + +This field is 1 octet in length and always has the value of 0. Its purpose +is to align the category range field on an even octet boundary. This will +speed many implementations including router implementations. + + + + + +Internet Draft, Expires 15 Jan 93 [PAGE 6] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +3.4.4.4 Sensitivity Level + +This field is 1 octet in length. Its value is from 0 to 255. The values +are ordered with 0 being the minimum value and 255 representing the maximum +value. + + +3.4.4.5 Category Ranges + +A category range is a 4 octet field comprised of the 2 octet index of the +highest numbered category followed by the 2 octet index of the lowest +numbered category. These range endpoints are inclusive within the range of +categories. All categories within a range are included in the sensitivity +label. This tag may contain a maximum of 7 category pairs. The bottom +category endpoint for the last pair in the tag MAY be omitted and SHOULD be +assumed to be 0. The ranges MUST be non-overlapping and be listed in +descending order. Valid values for categories are 0 to 65534. Category +65535 is not a valid category value. + + +3.4.5 Minimum Requirements + +A CIPSO implementation MUST be capable of generating at least tag type 1 in +the non-optimized form. In addition, a CIPSO implementation MUST be able +to receive any valid tag type 1 even those using the optimized tag type 1 +format. + + +4. Configuration Parameters + +The configuration parameters defined below are required for all CIPSO hosts, +gateways, and routers that support multiple sensitivity labels. A CIPSO +host is defined to be the origination or destination system for an IP +datagram. A CIPSO gateway provides IP routing services between two or more +IP networks and may be required to perform label translations between +networks. A CIPSO gateway may be an enhanced CIPSO host or it may just +provide gateway services with no end system CIPSO capabilities. A CIPSO +router is a dedicated IP router that routes IP datagrams between two or more +IP networks. + +An implementation of CIPSO on a host MUST have the capability to reject a +datagram for reasons that the information contained can not be adequately +protected by the receiving host or if acceptance may result in violation of +the host or network security policy. In addition, a CIPSO gateway or router +MUST be able to reject datagrams going to networks that can not provide +adequate protection or may violate the network's security policy. To +provide this capability the following minimal set of configuration +parameters are required for CIPSO implementations: + +HOST_LABEL_MAX - This parameter contains the maximum sensitivity label that +a CIPSO host is authorized to handle. All datagrams that have a label +greater than this maximum MUST be rejected by the CIPSO host. This +parameter does not apply to CIPSO gateways or routers. This parameter need +not be defined explicitly as it can be implicitly derived from the +PORT_LABEL_MAX parameters for the associated interfaces. + + + +Internet Draft, Expires 15 Jan 93 [PAGE 7] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + + +HOST_LABEL_MIN - This parameter contains the minimum sensitivity label that +a CIPSO host is authorized to handle. All datagrams that have a label less +than this minimum MUST be rejected by the CIPSO host. This parameter does +not apply to CIPSO gateways or routers. This parameter need not be defined +explicitly as it can be implicitly derived from the PORT_LABEL_MIN +parameters for the associated interfaces. + +PORT_LABEL_MAX - This parameter contains the maximum sensitivity label for +all datagrams that may exit a particular network interface port. All +outgoing datagrams that have a label greater than this maximum MUST be +rejected by the CIPSO system. The label within this parameter MUST be +less than or equal to the label within the HOST_LABEL_MAX parameter. This +parameter does not apply to CIPSO hosts that support only one network port. + +PORT_LABEL_MIN - This parameter contains the minimum sensitivity label for +all datagrams that may exit a particular network interface port. All +outgoing datagrams that have a label less than this minimum MUST be +rejected by the CIPSO system. The label within this parameter MUST be +greater than or equal to the label within the HOST_LABEL_MIN parameter. +This parameter does not apply to CIPSO hosts that support only one network +port. + +PORT_DOI - This parameter is used to assign a DOI identifier value to a +particular network interface port. All CIPSO labels within datagrams +going out this port MUST use the specified DOI identifier. All CIPSO +hosts and gateways MUST support either this parameter, the NET_DOI +parameter, or the HOST_DOI parameter. + +NET_DOI - This parameter is used to assign a DOI identifier value to a +particular IP network address. All CIPSO labels within datagrams destined +for the particular IP network MUST use the specified DOI identifier. All +CIPSO hosts and gateways MUST support either this parameter, the PORT_DOI +parameter, or the HOST_DOI parameter. + +HOST_DOI - This parameter is used to assign a DOI identifier value to a +particular IP host address. All CIPSO labels within datagrams destined for +the particular IP host will use the specified DOI identifier. All CIPSO +hosts and gateways MUST support either this parameter, the PORT_DOI +parameter, or the NET_DOI parameter. + +This list represents the minimal set of configuration parameters required +to be compliant. Implementors are encouraged to add to this list to +provide enhanced functionality and control. For example, many security +policies may require both incoming and outgoing datagrams be checked against +the port and host label ranges. + + +4.1 Port Range Parameters + +The labels represented by the PORT_LABEL_MAX and PORT_LABEL_MIN parameters +MAY be in CIPSO or local format. Some CIPSO systems, such as routers, may +want to have the range parameters expressed in CIPSO format so that incoming +labels do not have to be converted to a local format before being compared +against the range. If multiple DOIs are supported by one of these CIPSO + + + +Internet Draft, Expires 15 Jan 93 [PAGE 8] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +systems then multiple port range parameters would be needed, one set for +each DOI supported on a particular port. + +The port range will usually represent the total set of labels that may +exist on the logical network accessed through the corresponding network +interface. It may, however, represent a subset of these labels that are +allowed to enter the CIPSO system. + + +4.2 Single Label CIPSO Hosts + +CIPSO implementations that support only one label are not required to +support the parameters described above. These limited implementations are +only required to support a NET_LABEL parameter. This parameter contains +the CIPSO label that may be inserted in datagrams that exit the host. In +addition, the host MUST reject any incoming datagram that has a label which +is not equivalent to the NET_LABEL parameter. + + +5. Handling Procedures + +This section describes the processing requirements for incoming and +outgoing IP datagrams. Just providing the correct CIPSO label format +is not enough. Assumptions will be made by one system on how a +receiving system will handle the CIPSO label. Wrong assumptions may +lead to non-interoperability or even a security incident. The +requirements described below represent the minimal set needed for +interoperability and that provide users some level of confidence. +Many other requirements could be added to increase user confidence, +however at the risk of restricting creativity and limiting vendor +participation. + + +5.1 Input Procedures + +All datagrams received through a network port MUST have a security label +associated with them, either contained in the datagram or assigned to the +receiving port. Without this label the host, gateway, or router will not +have the information it needs to make security decisions. This security +label will be obtained from the CIPSO if the option is present in the +datagram. See section 4.1.2 for handling procedures for unlabeled +datagrams. This label will be compared against the PORT (if appropriate) +and HOST configuration parameters defined in section 3. + +If any field within the CIPSO option, such as the DOI identifier, is not +recognized the IP datagram is discarded and an ICMP "parameter problem" +(type 12) is generated and returned. The ICMP code field is set to "bad +parameter" (code 0) and the pointer is set to the start of the CIPSO field +that is unrecognized. + +If the contents of the CIPSO are valid but the security label is +outside of the configured host or port label range, the datagram is +discarded and an ICMP "destination unreachable" (type 3) is generated +and returned. The code field of the ICMP is set to "communication with +destination network administratively prohibited" (code 9) or to + + + +Internet Draft, Expires 15 Jan 93 [PAGE 9] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +"communication with destination host administratively prohibited" +(code 10). The value of the code field used is dependent upon whether +the originator of the ICMP message is acting as a CIPSO host or a CIPSO +gateway. The recipient of the ICMP message MUST be able to handle either +value. The same procedure is performed if a CIPSO can not be added to an +IP packet because it is too large to fit in the IP options area. + +If the error is triggered by receipt of an ICMP message, the message +is discarded and no response is permitted (consistent with general ICMP +processing rules). + + +5.1.1 Unrecognized tag types + +The default condition for any CIPSO implementation is that an +unrecognized tag type MUST be treated as a "parameter problem" and +handled as described in section 4.1. A CIPSO implementation MAY allow +the system administrator to identify tag types that may safely be +ignored. This capability is an allowable enhancement, not a +requirement. + + +5.1.2 Unlabeled Packets + +A network port may be configured to not require a CIPSO label for all +incoming datagrams. For this configuration a CIPSO label must be +assigned to that network port and associated with all unlabeled IP +datagrams. This capability might be used for single level networks or +networks that have CIPSO and non-CIPSO hosts and the non-CIPSO hosts +all operate at the same label. + +If a CIPSO option is required and none is found, the datagram is +discarded and an ICMP "parameter problem" (type 12) is generated and +returned to the originator of the datagram. The code field of the ICMP +is set to "option missing" (code 1) and the ICMP pointer is set to 134 +(the value of the option type for the missing CIPSO option). + + +5.2 Output Procedures + +A CIPSO option MUST appear only once in a datagram. Only one tag type +from the MAC Sensitivity class MAY be included in a CIPSO option. Given +the current set of defined tag types, this means that CIPSO labels at +first will contain only one tag. + +All datagrams leaving a CIPSO system MUST meet the following condition: + + PORT_LABEL_MIN <= CIPSO label <= PORT_LABEL_MAX + +If this condition is not satisfied the datagram MUST be discarded. +If the CIPSO system only supports one port, the HOST_LABEL_MIN and the +HOST_LABEL_MAX parameters MAY be substituted for the PORT parameters in +the above condition. + +The DOI identifier to be used for all outgoing datagrams is configured by + + + +Internet Draft, Expires 15 Jan 93 [PAGE 10] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + +the administrator. If port level DOI identifier assignment is used, then +the PORT_DOI configuration parameter MUST contain the DOI identifier to +use. If network level DOI assignment is used, then the NET_DOI parameter +MUST contain the DOI identifier to use. And if host level DOI assignment +is employed, then the HOST_DOI parameter MUST contain the DOI identifier +to use. A CIPSO implementation need only support one level of DOI +assignment. + + +5.3 DOI Processing Requirements + +A CIPSO implementation MUST support at least one DOI and SHOULD support +multiple DOIs. System and network administrators are cautioned to +ensure that at least one DOI is common within an IP network to allow for +broadcasting of IP datagrams. + +CIPSO gateways MUST be capable of translating a CIPSO option from one +DOI to another when forwarding datagrams between networks. For +efficiency purposes this capability is only a desired feature for CIPSO +routers. + + +5.4 Label of ICMP Messages + +The CIPSO label to be used on all outgoing ICMP messages MUST be equivalent +to the label of the datagram that caused the ICMP message. If the ICMP was +generated due to a problem associated with the original CIPSO label then the +following responses are allowed: + + a. Use the CIPSO label of the original IP datagram + b. Drop the original datagram with no return message generated + +In most cases these options will have the same effect. If you can not +interpret the label or if it is outside the label range of your host or +interface then an ICMP message with the same label will probably not be +able to exit the system. + + +6. Assignment of DOI Identifier Numbers = + +Requests for assignment of a DOI identifier number should be addressed to +the Internet Assigned Numbers Authority (IANA). + + +7. Acknowledgements + +Much of the material in this RFC is based on (and copied from) work +done by Gary Winiger of Sun Microsystems and published as Commercial +IP Security Option at the INTEROP 89, Commercial IPSO Workshop. + + +8. Author's Address + +To submit mail for distribution to members of the IETF CIPSO Working +Group, send mail to: cipso@wdl1.wdl.loral.com. + + + +Internet Draft, Expires 15 Jan 93 [PAGE 11] + + + +CIPSO INTERNET DRAFT 16 July, 1992 + + + + +To be added to or deleted from this distribution, send mail to: +cipso-request@wdl1.wdl.loral.com. + + +9. References + +RFC 1038, "Draft Revised IP Security Option", M. St. Johns, IETF, January +1988. + +RFC 1108, "U.S. Department of Defense Security Options +for the Internet Protocol", Stephen Kent, IAB, 1 March, 1991. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Internet Draft, Expires 15 Jan 93 [PAGE 12] + + + diff --git a/Documentation/netlabel/introduction.txt b/Documentation/netlabel/introduction.txt new file mode 100644 index 000000000000..a4ffba1694c8 --- /dev/null +++ b/Documentation/netlabel/introduction.txt @@ -0,0 +1,46 @@ +NetLabel Introduction +============================================================================== +Paul Moore, paul.moore@hp.com + +August 2, 2006 + + * Overview + +NetLabel is a mechanism which can be used by kernel security modules to attach +security attributes to outgoing network packets generated from user space +applications and read security attributes from incoming network packets. It +is composed of three main components, the protocol engines, the communication +layer, and the kernel security module API. + + * Protocol Engines + +The protocol engines are responsible for both applying and retrieving the +network packet's security attributes. If any translation between the network +security attributes and those on the host are required then the protocol +engine will handle those tasks as well. Other kernel subsystems should +refrain from calling the protocol engines directly, instead they should use +the NetLabel kernel security module API described below. + +Detailed information about each NetLabel protocol engine can be found in this +directory, consult '00-INDEX' for filenames. + + * Communication Layer + +The communication layer exists to allow NetLabel configuration and monitoring +from user space. The NetLabel communication layer uses a message based +protocol built on top of the Generic NETLINK transport mechanism. The exact +formatting of these NetLabel messages as well as the Generic NETLINK family +names can be found in the the 'net/netlabel/' directory as comments in the +header files as well as in 'include/net/netlabel.h'. + + * Security Module API + +The purpose of the NetLabel security module API is to provide a protocol +independent interface to the underlying NetLabel protocol engines. In addition +to protocol independence, the security module API is designed to be completely +LSM independent which should allow multiple LSMs to leverage the same code +base. + +Detailed information about the NetLabel security module API can be found in the +'include/net/netlabel.h' header file as well as the 'lsm_interface.txt' file +found in this directory. diff --git a/Documentation/netlabel/lsm_interface.txt b/Documentation/netlabel/lsm_interface.txt new file mode 100644 index 000000000000..98dd9f7430f2 --- /dev/null +++ b/Documentation/netlabel/lsm_interface.txt @@ -0,0 +1,47 @@ +NetLabel Linux Security Module Interface +============================================================================== +Paul Moore, paul.moore@hp.com + +May 17, 2006 + + * Overview + +NetLabel is a mechanism which can set and retrieve security attributes from +network packets. It is intended to be used by LSM developers who want to make +use of a common code base for several different packet labeling protocols. +The NetLabel security module API is defined in 'include/net/netlabel.h' but a +brief overview is given below. + + * NetLabel Security Attributes + +Since NetLabel supports multiple different packet labeling protocols and LSMs +it uses the concept of security attributes to refer to the packet's security +labels. The NetLabel security attributes are defined by the +'netlbl_lsm_secattr' structure in the NetLabel header file. Internally the +NetLabel subsystem converts the security attributes to and from the correct +low-level packet label depending on the NetLabel build time and run time +configuration. It is up to the LSM developer to translate the NetLabel +security attributes into whatever security identifiers are in use for their +particular LSM. + + * NetLabel LSM Protocol Operations + +These are the functions which allow the LSM developer to manipulate the labels +on outgoing packets as well as read the labels on incoming packets. Functions +exist to operate both on sockets as well as the sk_buffs directly. These high +level functions are translated into low level protocol operations based on how +the administrator has configured the NetLabel subsystem. + + * NetLabel Label Mapping Cache Operations + +Depending on the exact configuration, translation between the network packet +label and the internal LSM security identifier can be time consuming. The +NetLabel label mapping cache is a caching mechanism which can be used to +sidestep much of this overhead once a mapping has been established. Once the +LSM has received a packet, used NetLabel to decode it's security attributes, +and translated the security attributes into a LSM internal identifier the LSM +can use the NetLabel caching functions to associate the LSM internal +identifier with the network packet's label. This means that in the future +when a incoming packet matches a cached value not only are the internal +NetLabel translation mechanisms bypassed but the LSM translation mechanisms are +bypassed as well which should result in a significant reduction in overhead. diff --git a/Documentation/networking/LICENSE.qla3xxx b/Documentation/networking/LICENSE.qla3xxx new file mode 100644 index 000000000000..2f2077e34d81 --- /dev/null +++ b/Documentation/networking/LICENSE.qla3xxx @@ -0,0 +1,46 @@ +Copyright (c) 2003-2006 QLogic Corporation +QLogic Linux Networking HBA Driver + +This program includes a device driver for Linux 2.6 that may be +distributed with QLogic hardware specific firmware binary file. +You may modify and redistribute the device driver code under the +GNU General Public License as published by the Free Software +Foundation (version 2 or a later version). + +You may redistribute the hardware specific firmware binary file +under the following terms: + + 1. Redistribution of source code (only if applicable), + must retain the above copyright notice, this list of + conditions and the following disclaimer. + + 2. Redistribution in binary form must reproduce the above + copyright notice, this list of conditions and the + following disclaimer in the documentation and/or other + materials provided with the distribution. + + 3. The name of QLogic Corporation may not be used to + endorse or promote products derived from this software + without specific prior written permission + +REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE, +THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY +EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A +PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR +BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED +TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON +ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT +CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR +OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT, +TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN +ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN +COMBINATION WITH THIS PROGRAM. + diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index afac780445cd..dc942eaf490f 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -192,6 +192,17 @@ or, for backwards compatibility, the option value. E.g., arp_interval Specifies the ARP link monitoring frequency in milliseconds. + + The ARP monitor works by periodically checking the slave + devices to determine whether they have sent or received + traffic recently (the precise criteria depends upon the + bonding mode, and the state of the slave). Regular traffic is + generated via ARP probes issued for the addresses specified by + the arp_ip_target option. + + This behavior can be modified by the arp_validate option, + below. + If ARP monitoring is used in an etherchannel compatible mode (modes 0 and 2), the switch should be configured in a mode that evenly distributes packets across all links. If the @@ -213,6 +224,54 @@ arp_ip_target maximum number of targets that can be specified is 16. The default value is no IP addresses. +arp_validate + + Specifies whether or not ARP probes and replies should be + validated in the active-backup mode. This causes the ARP + monitor to examine the incoming ARP requests and replies, and + only consider a slave to be up if it is receiving the + appropriate ARP traffic. + + Possible values are: + + none or 0 + + No validation is performed. This is the default. + + active or 1 + + Validation is performed only for the active slave. + + backup or 2 + + Validation is performed only for backup slaves. + + all or 3 + + Validation is performed for all slaves. + + For the active slave, the validation checks ARP replies to + confirm that they were generated by an arp_ip_target. Since + backup slaves do not typically receive these replies, the + validation performed for backup slaves is on the ARP request + sent out via the active slave. It is possible that some + switch or network configurations may result in situations + wherein the backup slaves do not receive the ARP requests; in + such a situation, validation of backup slaves must be + disabled. + + This option is useful in network configurations in which + multiple bonding hosts are concurrently issuing ARPs to one or + more targets beyond a common switch. Should the link between + the switch and target fail (but not the switch itself), the + probe traffic generated by the multiple bonding instances will + fool the standard ARP monitor into considering the links as + still up. Use of the arp_validate option can resolve this, as + the ARP monitor will only consider ARP requests and replies + associated with its own instance of bonding. + + This option was added in bonding version 3.1.0. + downdelay Specifies the time, in milliseconds, to wait before disabling diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index c45daabd3bfe..74563b38ffd9 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -1,7 +1,6 @@ DCCP protocol ============ -Last updated: 10 November 2005 Contents ======== @@ -42,8 +41,11 @@ Socket options DCCP_SOCKOPT_PACKET_SIZE is used for CCID3 to set default packet size for calculations. -DCCP_SOCKOPT_SERVICE sets the service. This is compulsory as per the -specification. If you don't set it you will get EPROTO. +DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of +service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, +the socket will fall back to 0 (which means that no meaningful service code +is present). Connecting sockets set at most one service option; for +listening sockets, multiple service codes can be specified. Notes ===== diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index d46338af6002..935e298f674a 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -102,9 +102,15 @@ inet_peer_gc_maxtime - INTEGER TCP variables: tcp_abc - INTEGER - Controls Appropriate Byte Count defined in RFC3465. If set to - 0 then does congestion avoid once per ack. 1 is conservative - value, and 2 is more agressive. + Controls Appropriate Byte Count (ABC) defined in RFC3465. + ABC is a way of increasing congestion window (cwnd) more slowly + in response to partial acknowledgments. + Possible values are: + 0 increase cwnd once per acknowledgment (no ABC) + 1 increase cwnd once per acknowledgment of full sized segment + 2 allow increase cwnd by two if acknowledgment is + of two segments to compensate for delayed acknowledgments. + Default: 0 (off) tcp_syn_retries - INTEGER Number of times initial SYNs for an active TCP connection attempt @@ -294,15 +300,15 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max Default: 87380*2 bytes. tcp_mem - vector of 3 INTEGERs: min, pressure, max - low: below this number of pages TCP is not bothered about its + min: below this number of pages TCP is not bothered about its memory appetite. pressure: when amount of memory allocated by TCP exceeds this number of pages, TCP moderates its memory consumption and enters memory pressure mode, which is exited when memory consumption falls - under "low". + under "min". - high: number of pages allowed for queueing by all TCP sockets. + max: number of pages allowed for queueing by all TCP sockets. Defaults are calculated at boot time from amount of available memory. @@ -369,6 +375,41 @@ tcp_slow_start_after_idle - BOOLEAN be timed out after an idle period. Default: 1 +CIPSOv4 Variables: + +cipso_cache_enable - BOOLEAN + If set, enable additions to and lookups from the CIPSO label mapping + cache. If unset, additions are ignored and lookups always result in a + miss. However, regardless of the setting the cache is still + invalidated when required when means you can safely toggle this on and + off and the cache will always be "safe". + Default: 1 + +cipso_cache_bucket_size - INTEGER + The CIPSO label cache consists of a fixed size hash table with each + hash bucket containing a number of cache entries. This variable limits + the number of entries in each hash bucket; the larger the value the + more CIPSO label mappings that can be cached. When the number of + entries in a given hash bucket reaches this limit adding new entries + causes the oldest entry in the bucket to be removed to make room. + Default: 10 + +cipso_rbm_optfmt - BOOLEAN + Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of + the CIPSO draft specification (see Documentation/netlabel for details). + This means that when set the CIPSO tag will be padded with empty + categories in order to make the packet data 32-bit aligned. + Default: 0 + +cipso_rbm_structvalid - BOOLEAN + If set, do a very strict check of the CIPSO option when + ip_options_compile() is called. If unset, relax the checks done during + ip_options_compile(). Either way is "safe" as errors are caught else + where in the CIPSO processing code but setting this to 0 (False) should + result in less work (i.e. it should be faster) but could cause problems + with other implementations that require strict checking. + Default: 0 + IP Variables: ip_local_port_range - 2 INTEGERS @@ -724,6 +765,9 @@ conf/all/forwarding - BOOLEAN This referred to as global forwarding. +proxy_ndp - BOOLEAN + Do proxy ndp. + conf/interface/*: Change special settings per interface. diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt new file mode 100644 index 000000000000..4ccdbca03811 --- /dev/null +++ b/Documentation/networking/ipvs-sysctl.txt @@ -0,0 +1,143 @@ +/proc/sys/net/ipv4/vs/* Variables: + +am_droprate - INTEGER + default 10 + + It sets the always mode drop rate, which is used in the mode 3 + of the drop_rate defense. + +amemthresh - INTEGER + default 1024 + + It sets the available memory threshold (in pages), which is + used in the automatic modes of defense. When there is no + enough available memory, the respective strategy will be + enabled and the variable is automatically set to 2, otherwise + the strategy is disabled and the variable is set to 1. + +cache_bypass - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If it is enabled, forward packets to the original destination + directly when no cache server is available and destination + address is not local (iph->daddr is RTN_UNICAST). It is mostly + used in transparent web cache cluster. + +debug_level - INTEGER + 0 - transmission error messages (default) + 1 - non-fatal error messages + 2 - configuration + 3 - destination trash + 4 - drop entry + 5 - service lookup + 6 - scheduling + 7 - connection new/expire, lookup and synchronization + 8 - state transition + 9 - binding destination, template checks and applications + 10 - IPVS packet transmission + 11 - IPVS packet handling (ip_vs_in/ip_vs_out) + 12 or more - packet traversal + + Only available when IPVS is compiled with the CONFIG_IPVS_DEBUG + + Higher debugging levels include the messages for lower debugging + levels, so setting debug level 2, includes level 0, 1 and 2 + messages. Thus, logging becomes more and more verbose the higher + the level. + +drop_entry - INTEGER + 0 - disabled (default) + + The drop_entry defense is to randomly drop entries in the + connection hash table, just in order to collect back some + memory for new connections. In the current code, the + drop_entry procedure can be activated every second, then it + randomly scans 1/32 of the whole and drops entries that are in + the SYN-RECV/SYNACK state, which should be effective against + syn-flooding attack. + + The valid values of drop_entry are from 0 to 3, where 0 means + that this strategy is always disabled, 1 and 2 mean automatic + modes (when there is no enough available memory, the strategy + is enabled and the variable is automatically set to 2, + otherwise the strategy is disabled and the variable is set to + 1), and 3 means that that the strategy is always enabled. + +drop_packet - INTEGER + 0 - disabled (default) + + The drop_packet defense is designed to drop 1/rate packets + before forwarding them to real servers. If the rate is 1, then + drop all the incoming packets. + + The value definition is the same as that of the drop_entry. In + the automatic mode, the rate is determined by the follow + formula: rate = amemthresh / (amemthresh - available_memory) + when available memory is less than the available memory + threshold. When the mode 3 is set, the always mode drop rate + is controlled by the /proc/sys/net/ipv4/vs/am_droprate. + +expire_nodest_conn - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + The default value is 0, the load balancer will silently drop + packets when its destination server is not available. It may + be useful, when user-space monitoring program deletes the + destination server (because of server overload or wrong + detection) and add back the server later, and the connections + to the server can continue. + + If this feature is enabled, the load balancer will expire the + connection immediately when a packet arrives and its + destination server is not available, then the client program + will be notified that the connection is closed. This is + equivalent to the feature some people requires to flush + connections when its destination is not available. + +expire_quiescent_template - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + When set to a non-zero value, the load balancer will expire + persistent templates when the destination server is quiescent. + This may be useful, when a user makes a destination server + quiescent by setting its weight to 0 and it is desired that + subsequent otherwise persistent connections are sent to a + different destination server. By default new persistent + connections are allowed to quiescent destination servers. + + If this feature is enabled, the load balancer will expire the + persistence template if it is to be used to schedule a new + connection and the destination server is quiescent. + +nat_icmp_send - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + It controls sending icmp error messages (ICMP_DEST_UNREACH) + for VS/NAT when the load balancer receives packets from real + servers but the connection entries don't exist. + +secure_tcp - INTEGER + 0 - disabled (default) + + The secure_tcp defense is to use a more complicated state + transition table and some possible short timeouts of each + state. In the VS/NAT, it delays the entering the ESTABLISHED + until the real server starts to send data and ACK packet + (after 3-way handshake). + + The value definition is the same as that of drop_entry or + drop_packet. + +sync_threshold - INTEGER + default 3 + + It sets synchronization threshold, which is the minimum number + of incoming packets that a connection needs to receive before + the connection will be synchronized. A connection will be + synchronized, every time the number of its incoming packets + modulus 50 equals the threshold. The range of the threshold is + from 0 to 49. diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index 278771c9ad99..18d385c068fc 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -74,7 +74,7 @@ Examples: pgset "pkt_size 9014" sets packet size to 9014 pgset "frags 5" packet will consist of 5 fragments pgset "count 200000" sets number of packets to send, set to zero - for continious sends untill explicitl stopped. + for continuous sends until explicitly stopped. pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds @@ -100,6 +100,7 @@ Examples: are: IPSRC_RND #IP Source is random (between min/max), IPDST_RND, UDPSRC_RND, UDPDST_RND, MACSRC_RND, MACDST_RND + MPLS_RND, VID_RND, SVID_RND pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then cycle through the port range. @@ -125,6 +126,21 @@ Examples: pgset "mpls 0" turn off mpls (or any invalid argument works too!) + pgset "vlan_id 77" set VLAN ID 0-4095 + pgset "vlan_p 3" set priority bit 0-7 (default 0) + pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "svlan_id 22" set SVLAN ID 0-4095 + pgset "svlan_p 3" set priority bit 0-7 (default 0) + pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "vlan_id 9999" > 4095 remove vlan and svlan tags + pgset "svlan 9999" > 4095 remove svlan tag + + + pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00) + pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00) + pgset stop aborts injection. Also, ^C aborts generator. diff --git a/Documentation/networking/secid.txt b/Documentation/networking/secid.txt new file mode 100644 index 000000000000..95ea06784333 --- /dev/null +++ b/Documentation/networking/secid.txt @@ -0,0 +1,14 @@ +flowi structure: + +The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate +the label of the flow. This label of the flow is currently used in selecting +matching labeled xfrm(s). + +If this is an outbound flow, the label is derived from the socket, if any, or +the incoming packet this flow is being generated as a response to (e.g. tcp +resets, timewait ack, etc.). It is also conceivable that the label could be +derived from other sources such as process context, device, etc., in special +cases, as may be appropriate. + +If this is an inbound flow, the label is derived from the IPSec security +associations, if any, used by the packet. diff --git a/Documentation/nfsroot.txt b/Documentation/nfsroot.txt index d56dc71d9430..3cc953cb288f 100644 --- a/Documentation/nfsroot.txt +++ b/Documentation/nfsroot.txt @@ -4,15 +4,16 @@ Mounting the root filesystem via NFS (nfsroot) Written 1996 by Gero Kuhlmann <gero@gkminix.han.de> Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz> Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org> +Updated 2006 by Horms <horms@verge.net.au> -If you want to use a diskless system, as an X-terminal or printer -server for example, you have to put your root filesystem onto a -non-disk device. This can either be a ramdisk (see initrd.txt in -this directory for further information) or a filesystem mounted -via NFS. The following text describes on how to use NFS for the -root filesystem. For the rest of this text 'client' means the +In order to use a diskless system, such as an X-terminal or printer server +for example, it is necessary for the root filesystem to be present on a +non-disk device. This may be an initramfs (see Documentation/filesystems/ +ramfs-rootfs-initramfs.txt), a ramdisk (see Documenation/initrd.txt) or a +filesystem mounted via NFS. The following text describes on how to use NFS +for the root filesystem. For the rest of this text 'client' means the diskless system, and 'server' means the NFS server. @@ -21,11 +22,13 @@ diskless system, and 'server' means the NFS server. 1.) Enabling nfsroot capabilities ----------------------------- -In order to use nfsroot you have to select support for NFS during -kernel configuration. Note that NFS cannot be loaded as a module -in this case. The configuration script will then ask you whether -you want to use nfsroot, and if yes what kind of auto configuration -system you want to use. Selecting both BOOTP and RARP is safe. +In order to use nfsroot, NFS client support needs to be selected as +built-in during configuration. Once this has been selected, the nfsroot +option will become available, which should also be selected. + +In the networking options, kernel level autoconfiguration can be selected, +along with the types of autoconfiguration to support. Selecting all of +DHCP, BOOTP and RARP is safe. @@ -33,11 +36,10 @@ system you want to use. Selecting both BOOTP and RARP is safe. 2.) Kernel command line ------------------- -When the kernel has been loaded by a boot loader (either by loadlin, -LILO or a network boot program) it has to be told what root fs device -to use, and where to find the server and the name of the directory -on the server to mount as root. This can be established by a couple -of kernel command line parameters: +When the kernel has been loaded by a boot loader (see below) it needs to be +told what root fs device to use. And in the case of nfsroot, where to find +both the server and the name of the directory on the server to mount as root. +This can be established using the following kernel command line parameters: root=/dev/nfs @@ -49,23 +51,21 @@ root=/dev/nfs nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] - If the `nfsroot' parameter is NOT given on the command line, the default - "/tftpboot/%s" will be used. + If the `nfsroot' parameter is NOT given on the command line, + the default "/tftpboot/%s" will be used. - <server-ip> Specifies the IP address of the NFS server. If this field - is not given, the default address as determined by the - `ip' variable (see below) is used. One use of this - parameter is for example to allow using different servers - for RARP and NFS. Usually you can leave this blank. + <server-ip> Specifies the IP address of the NFS server. + The default address is determined by the `ip' parameter + (see below). This parameter allows the use of different + servers for IP autoconfiguration and NFS. - <root-dir> Name of the directory on the server to mount as root. If - there is a "%s" token in the string, the token will be - replaced by the ASCII-representation of the client's IP - address. + <root-dir> Name of the directory on the server to mount as root. + If there is a "%s" token in the string, it will be + replaced by the ASCII-representation of the client's + IP address. <nfs-options> Standard NFS options. All options are separated by commas. - If the options field is not given, the following defaults - will be used: + The following defaults are used: port = as given by server portmap daemon rsize = 1024 wsize = 1024 @@ -81,129 +81,174 @@ nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf> This parameter tells the kernel how to configure IP addresses of devices - and also how to set up the IP routing table. It was originally called `nfsaddrs', - but now the boot-time IP configuration works independently of NFS, so it - was renamed to `ip' and the old name remained as an alias for compatibility - reasons. + and also how to set up the IP routing table. It was originally called + `nfsaddrs', but now the boot-time IP configuration works independently of + NFS, so it was renamed to `ip' and the old name remained as an alias for + compatibility reasons. If this parameter is missing from the kernel command line, all fields are assumed to be empty, and the defaults mentioned below apply. In general - this means that the kernel tries to configure everything using both - RARP and BOOTP (depending on what has been enabled during kernel confi- - guration, and if both what protocol answer got in first). + this means that the kernel tries to configure everything using + autoconfiguration. + + The <autoconf> parameter can appear alone as the value to the `ip' + parameter (without all the ':' characters before) in which case auto- + configuration is used. + + <client-ip> IP address of the client. - <client-ip> IP address of the client. If empty, the address will either - be determined by RARP or BOOTP. What protocol is used de- - pends on what has been enabled during kernel configuration - and on the <autoconf> parameter. If this parameter is not - empty, neither RARP nor BOOTP will be used. + Default: Determined using autoconfiguration. <server-ip> IP address of the NFS server. If RARP is used to determine the client address and this parameter is NOT empty only - replies from the specified server are accepted. To use - different RARP and NFS server, specify your RARP server - here (or leave it blank), and specify your NFS server in - the `nfsroot' parameter (see above). If this entry is blank - the address of the server is used which answered the RARP - or BOOTP request. - - <gw-ip> IP address of a gateway if the server is on a different - subnet. If this entry is empty no gateway is used and the - server is assumed to be on the local network, unless a - value has been received by BOOTP. - - <netmask> Netmask for local network interface. If this is empty, + replies from the specified server are accepted. + + Only required for for NFS root. That is autoconfiguration + will not be triggered if it is missing and NFS root is not + in operation. + + Default: Determined using autoconfiguration. + The address of the autoconfiguration server is used. + + <gw-ip> IP address of a gateway if the server is on a different subnet. + + Default: Determined using autoconfiguration. + + <netmask> Netmask for local network interface. If unspecified the netmask is derived from the client IP address assuming - classful addressing, unless overridden in BOOTP reply. + classful addressing. - <hostname> Name of the client. If empty, the client IP address is - used in ASCII notation, or the value received by BOOTP. + Default: Determined using autoconfiguration. - <device> Name of network device to use. If this is empty, all - devices are used for RARP and BOOTP requests, and the - first one we receive a reply on is configured. If you have - only one device, you can safely leave this blank. + <hostname> Name of the client. May be supplied by autoconfiguration, + but its absence will not trigger autoconfiguration. - <autoconf> Method to use for autoconfiguration. If this is either - 'rarp' or 'bootp', the specified protocol is used. - If the value is 'both' or empty, both protocols are used - so far as they have been enabled during kernel configura- - tion. 'off' means no autoconfiguration. + Default: Client IP address is used in ASCII notation. - The <autoconf> parameter can appear alone as the value to the `ip' - parameter (without all the ':' characters before) in which case auto- - configuration is used. + <device> Name of network device to use. + + Default: If the host only has one device, it is used. + Otherwise the device is determined using + autoconfiguration. This is done by sending + autoconfiguration requests out of all devices, + and using the device that received the first reply. + <autoconf> Method to use for autoconfiguration. In the case of options + which specify multiple autoconfiguration protocols, + requests are sent using all protocols, and the first one + to reply is used. + Only autoconfiguration protocols that have been compiled + into the kernel will be used, regardless of the value of + this option. + off or none: don't use autoconfiguration (default) + on or any: use any protocol available in the kernel + dhcp: use DHCP + bootp: use BOOTP + rarp: use RARP + both: use both BOOTP and RARP but not DHCP + (old option kept for backwards compatibility) -3.) Kernel loader - ------------- + Default: any -To get the kernel into memory different approaches can be used. They -depend on what facilities are available: -3.1) Writing the kernel onto a floppy using dd: - As always you can just write the kernel onto a floppy using dd, - but then it's not possible to use kernel command lines at all. - To substitute the 'root=' parameter, create a dummy device on any - linux system with major number 0 and minor number 255 using mknod: - mknod /dev/boot255 c 0 255 +3.) Boot Loader + ---------- - Then copy the kernel zImage file onto a floppy using dd: +To get the kernel into memory different approaches can be used. +They depend on various facilities being available: - dd if=/usr/src/linux/arch/i386/boot/zImage of=/dev/fd0 - And finally use rdev to set the root device: +3.1) Booting from a floppy using syslinux - rdev /dev/fd0 /dev/boot255 + When building kernels, an easy way to create a boot floppy that uses + syslinux is to use the zdisk or bzdisk make targets which use + and bzimage images respectively. Both targets accept the + FDARGS parameter which can be used to set the kernel command line. - You can then remove the dummy device /dev/boot255 again. There - is no real device available for it. - The other two kernel command line parameters cannot be substi- - tuted with rdev. Therefore, using this method the kernel will - by default use RARP and/or BOOTP, and if it gets an answer via - RARP will mount the directory /tftpboot/<client-ip>/ as its - root. If it got a BOOTP answer the directory name in that answer - is used. + e.g. + make bzdisk FDARGS="root=/dev/nfs" + + Note that the user running this command will need to have + access to the floppy drive device, /dev/fd0 + + For more information on syslinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + N.B: Previously it was possible to write a kernel directly to + a floppy using dd, configure the boot device using rdev, and + boot using the resulting floppy. Linux no longer supports this + method of booting. + +3.2) Booting from a cdrom using isolinux + + When building kernels, an easy way to create a bootable cdrom that + uses isolinux is to use the isoimage target which uses a bzimage + image. Like zdisk and bzdisk, this target accepts the FDARGS + parameter which can be used to set the kernel command line. + + e.g. + make isoimage FDARGS="root=/dev/nfs" + + The resulting iso image will be arch/<ARCH>/boot/image.iso + This can be written to a cdrom using a variety of tools including + cdrecord. + + e.g. + cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ 3.2) Using LILO - When using LILO you can specify all necessary command line - parameters with the 'append=' command in the LILO configuration - file. However, to use the 'root=' command you also need to - set up a dummy device as described in 3.1 above. For how to use - LILO and its 'append=' command please refer to the LILO - documentation. + When using LILO all the necessary command line parameters may be + specified using the 'append=' directive in the LILO configuration + file. + + However, to use the 'root=' directive you also need to create + a dummy root device, which may be removed after LILO is run. + + mknod /dev/boot255 c 0 255 + + For information on configuring LILO, please refer to its documentation. 3.3) Using GRUB - When you use GRUB, you simply append the parameters after the kernel - specification: "kernel <kernel> <parameters>" (without the quotes). + When using GRUB, kernel parameter are simply appended after the kernel + specification: kernel <kernel> <parameters> 3.4) Using loadlin - When you want to boot Linux from a DOS command prompt without - having a local hard disk to mount as root, you can use loadlin. - I was told that it works, but haven't used it myself yet. In - general you should be able to create a kernel command line simi- - lar to how LILO is doing it. Please refer to the loadlin docu- - mentation for further information. + loadlin may be used to boot Linux from a DOS command prompt without + requiring a local hard disk to mount as root. This has not been + thoroughly tested by the authors of this document, but in general + it should be possible configure the kernel command line similarly + to the configuration of LILO. + + Please refer to the loadlin documentation for further information. 3.5) Using a boot ROM - This is probably the most elegant way of booting a diskless - client. With a boot ROM the kernel gets loaded using the TFTP - protocol. As far as I know, no commercial boot ROMs yet - support booting Linux over the network, but there are two - free implementations of a boot ROM available on sunsite.unc.edu - and its mirrors. They are called 'netboot-nfs' and 'etherboot'. - Both contain everything you need to boot a diskless Linux client. + This is probably the most elegant way of booting a diskless client. + With a boot ROM the kernel is loaded using the TFTP protocol. The + authors of this document are not aware of any no commercial boot + ROMs that support booting Linux over the network. However, there + are two free implementations of a boot ROM, netboot-nfs and + etherboot, both of which are available on sunsite.unc.edu, and both + of which contain everything you need to boot a diskless Linux client. 3.6) Using pxelinux - Using pxelinux you specify the kernel you built with + Pxelinux may be used to boot linux using the PXE boot loader + which is present on many modern network cards. + + When using pxelinux, the kernel image is specified using "kernel <relative-path-below /tftpboot>". The nfsroot parameters are passed to the kernel by adding them to the "append" line. - You may perhaps also want to fine tune the console output, - see Documentation/serial-console.txt for serial console help. + It is common to use serial console in conjunction with pxeliunx, + see Documentation/serial-console.txt for more information. + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index b88ebe4d808c..7714f57caad5 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt @@ -116,6 +116,9 @@ FURTHER NOTES ON NO-MMU MMAP (*) A list of all the mappings on the system is visible through /proc/maps in no-MMU mode. + (*) A list of all the mappings in use by a process is visible through + /proc/<pid>/maps in no-MMU mode. + (*) Supplying MAP_FIXED or a requesting a particular mapping address will result in an error. @@ -125,6 +128,49 @@ FURTHER NOTES ON NO-MMU MMAP error will result if they don't. This is most likely to be encountered with character device files, pipes, fifos and sockets. + +========================== +INTERPROCESS SHARED MEMORY +========================== + +Both SYSV IPC SHM shared memory and POSIX shared memory is supported in NOMMU +mode. The former through the usual mechanism, the latter through files created +on ramfs or tmpfs mounts. + + +======= +FUTEXES +======= + +Futexes are supported in NOMMU mode if the arch supports them. An error will +be given if an address passed to the futex system call lies outside the +mappings made by a process or if the mapping in which the address lies does not +support futexes (such as an I/O chardev mapping). + + +============= +NO-MMU MREMAP +============= + +The mremap() function is partially supported. It may change the size of a +mapping, and may move it[*] if MREMAP_MAYMOVE is specified and if the new size +of the mapping exceeds the size of the slab object currently occupied by the +memory to which the mapping refers, or if a smaller slab object could be used. + +MREMAP_FIXED is not supported, though it is ignored if there's no change of +address and the object does not need to be moved. + +Shared mappings may not be moved. Shareable mappings may not be moved either, +even if they are not currently shared. + +The mremap() function must be given an exact match for base address and size of +a previously mapped object. It may not be used to create holes in existing +mappings, move parts of existing mappings or resize parts of mappings. It must +act on a complete mapping. + +[*] Not currently supported. + + ============================================ PROVIDING SHAREABLE CHARACTER DEVICE SUPPORT ============================================ diff --git a/Documentation/pci.txt b/Documentation/pci.txt index 3242e5c1ee9c..2b395e478961 100644 --- a/Documentation/pci.txt +++ b/Documentation/pci.txt @@ -225,7 +225,7 @@ Generic flavors of pci_request_region() are request_mem_region() Use these for address resources that are not described by "normal" PCI interfaces (e.g. BAR). - All interrupt handlers should be registered with SA_SHIRQ and use the devid + All interrupt handlers should be registered with IRQF_SHARED and use the devid to map IRQs to devices (remember that all PCI interrupts are shared). diff --git a/Documentation/pcieaer-howto.txt b/Documentation/pcieaer-howto.txt new file mode 100644 index 000000000000..16c251230c82 --- /dev/null +++ b/Documentation/pcieaer-howto.txt @@ -0,0 +1,253 @@ + The PCI Express Advanced Error Reporting Driver Guide HOWTO + T. Long Nguyen <tom.l.nguyen@intel.com> + Yanmin Zhang <yanmin.zhang@intel.com> + 07/29/2006 + + +1. Overview + +1.1 About this guide + +This guide describes the basics of the PCI Express Advanced Error +Reporting (AER) driver and provides information on how to use it, as +well as how to enable the drivers of endpoint devices to conform with +PCI Express AER driver. + +1.2 Copyright © Intel Corporation 2006. + +1.3 What is the PCI Express AER Driver? + +PCI Express error signaling can occur on the PCI Express link itself +or on behalf of transactions initiated on the link. PCI Express +defines two error reporting paradigms: the baseline capability and +the Advanced Error Reporting capability. The baseline capability is +required of all PCI Express components providing a minimum defined +set of error reporting requirements. Advanced Error Reporting +capability is implemented with a PCI Express advanced error reporting +extended capability structure providing more robust error reporting. + +The PCI Express AER driver provides the infrastructure to support PCI +Express Advanced Error Reporting capability. The PCI Express AER +driver provides three basic functions: + +- Gathers the comprehensive error information if errors occurred. +- Reports error to the users. +- Performs error recovery actions. + +AER driver only attaches root ports which support PCI-Express AER +capability. + + +2. User Guide + +2.1 Include the PCI Express AER Root Driver into the Linux Kernel + +The PCI Express AER Root driver is a Root Port service driver attached +to the PCI Express Port Bus driver. If a user wants to use it, the driver +has to be compiled. Option CONFIG_PCIEAER supports this capability. It +depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and +CONFIG_PCIEAER = y. + +2.2 Load PCI Express AER Root Driver +There is a case where a system has AER support in BIOS. Enabling the AER +Root driver and having AER support in BIOS may result unpredictable +behavior. To avoid this conflict, a successful load of the AER Root driver +requires ACPI _OSC support in the BIOS to allow the AER Root driver to +request for native control of AER. See the PCI FW 3.0 Specification for +details regarding OSC usage. Currently, lots of firmwares don't provide +_OSC support while they use PCI Express. To support such firmwares, +forceload, a parameter of type bool, could enable AER to continue to +be initiated although firmwares have no _OSC support. To enable the +walkaround, pls. add aerdriver.forceload=y to kernel boot parameter line +when booting kernel. Note that forceload=n by default. + +2.3 AER error output +When a PCI-E AER error is captured, an error message will be outputed to +console. If it's a correctable error, it is outputed as a warning. +Otherwise, it is printed as an error. So users could choose different +log level to filter out correctable error messages. + +Below shows an example. ++------ PCI-Express Device Error -----+ +Error Severity : Uncorrected (Fatal) +PCIE Bus Error type : Transaction Layer +Unsupported Request : First +Requester ID : 0500 +VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h +TLB Header: +04000001 00200a03 05010000 00050100 + +In the example, 'Requester ID' means the ID of the device who sends +the error message to root port. Pls. refer to pci express specs for +other fields. + + +3. Developer Guide + +To enable AER aware support requires a software driver to configure +the AER capability structure within its device and to provide callbacks. + +To support AER better, developers need understand how AER does work +firstly. + +PCI Express errors are classified into two types: correctable errors +and uncorrectable errors. This classification is based on the impacts +of those errors, which may result in degraded performance or function +failure. + +Correctable errors pose no impacts on the functionality of the +interface. The PCI Express protocol can recover without any software +intervention or any loss of data. These errors are detected and +corrected by hardware. Unlike correctable errors, uncorrectable +errors impact functionality of the interface. Uncorrectable errors +can cause a particular transaction or a particular PCI Express link +to be unreliable. Depending on those error conditions, uncorrectable +errors are further classified into non-fatal errors and fatal errors. +Non-fatal errors cause the particular transaction to be unreliable, +but the PCI Express link itself is fully functional. Fatal errors, on +the other hand, cause the link to be unreliable. + +When AER is enabled, a PCI Express device will automatically send an +error message to the PCIE root port above it when the device captures +an error. The Root Port, upon receiving an error reporting message, +internally processes and logs the error message in its PCI Express +capability structure. Error information being logged includes storing +the error reporting agent's requestor ID into the Error Source +Identification Registers and setting the error bits of the Root Error +Status Register accordingly. If AER error reporting is enabled in Root +Error Command Register, the Root Port generates an interrupt if an +error is detected. + +Note that the errors as described above are related to the PCI Express +hierarchy and links. These errors do not include any device specific +errors because device specific errors will still get sent directly to +the device driver. + +3.1 Configure the AER capability structure + +AER aware drivers of PCI Express component need change the device +control registers to enable AER. They also could change AER registers, +including mask and severity registers. Helper function +pci_enable_pcie_error_reporting could be used to enable AER. See +section 3.3. + +3.2. Provide callbacks + +3.2.1 callback reset_link to reset pci express link + +This callback is used to reset the pci express physical link when a +fatal error happens. The root port aer service driver provides a +default reset_link function, but different upstream ports might +have different specifications to reset pci express link, so all +upstream ports should provide their own reset_link functions. + +In struct pcie_port_service_driver, a new pointer, reset_link, is +added. + +pci_ers_result_t (*reset_link) (struct pci_dev *dev); + +Section 3.2.2.2 provides more detailed info on when to call +reset_link. + +3.2.2 PCI error-recovery callbacks + +The PCI Express AER Root driver uses error callbacks to coordinate +with downstream device drivers associated with a hierarchy in question +when performing error recovery actions. + +Data struct pci_driver has a pointer, err_handler, to point to +pci_error_handlers who consists of a couple of callback function +pointers. AER driver follows the rules defined in +pci-error-recovery.txt except pci express specific parts (e.g. +reset_link). Pls. refer to pci-error-recovery.txt for detailed +definitions of the callbacks. + +Below sections specify when to call the error callback functions. + +3.2.2.1 Correctable errors + +Correctable errors pose no impacts on the functionality of +the interface. The PCI Express protocol can recover without any +software intervention or any loss of data. These errors do not +require any recovery actions. The AER driver clears the device's +correctable error status register accordingly and logs these errors. + +3.2.2.2 Non-correctable (non-fatal and fatal) errors + +If an error message indicates a non-fatal error, performing link reset +at upstream is not required. The AER driver calls error_detected(dev, +pci_channel_io_normal) to all drivers associated within a hierarchy in +question. for example, +EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. +If Upstream port A captures an AER error, the hierarchy consists of +Downstream port B and EndPoint. + +A driver may return PCI_ERS_RESULT_CAN_RECOVER, +PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on +whether it can recover or the AER driver calls mmio_enabled as next. + +If an error message indicates a fatal error, kernel will broadcast +error_detected(dev, pci_channel_io_frozen) to all drivers within +a hierarchy in question. Then, performing link reset at upstream is +necessary. As different kinds of devices might use different approaches +to reset link, AER port service driver is required to provide the +function to reset link. Firstly, kernel looks for if the upstream +component has an aer driver. If it has, kernel uses the reset_link +callback of the aer driver. If the upstream component has no aer driver +and the port is downstream port, we will use the aer driver of the +root port who reports the AER error. As for upstream ports, +they should provide their own aer service drivers with reset_link +function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and +reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes +to mmio_enabled. + +3.3 helper functions + +3.3.1 int pci_find_aer_capability(struct pci_dev *dev); +pci_find_aer_capability locates the PCI Express AER capability +in the device configuration space. If the device doesn't support +PCI-Express AER, the function returns 0. + +3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); +pci_enable_pcie_error_reporting enables the device to send error +messages to root port when an error is detected. Note that devices +don't enable the error reporting by default, so device drivers need +call this function to enable it. + +3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); +pci_disable_pcie_error_reporting disables the device to send error +messages to root port when an error is detected. + +3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); +pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable +error status register. + +3.4 Frequent Asked Questions + +Q: What happens if a PCI Express device driver does not provide an +error recovery handler (pci_driver->err_handler is equal to NULL)? + +A: The devices attached with the driver won't be recovered. If the +error is fatal, kernel will print out warning messages. Please refer +to section 3 for more information. + +Q: What happens if an upstream port service driver does not provide +callback reset_link? + +A: Fatal error recovery will fail if the errors are reported by the +upstream ports who are attached by the service driver. + +Q: How does this infrastructure deal with driver that is not PCI +Express aware? + +A: This infrastructure calls the error callback functions of the +driver when an error happens. But if the driver is not aware of +PCI Express, the device might not report its own errors to root +port. + +Q: What modifications will that driver need to make it compatible +with the PCI Express AER Root driver? + +A: It could call the helper functions to enable AER in devices and +cleanup uncorrectable status register. Pls. refer to section 3.3. + diff --git a/Documentation/pcmcia/crc32hash.c b/Documentation/pcmcia/crc32hash.c new file mode 100644 index 000000000000..cbc36d299af8 --- /dev/null +++ b/Documentation/pcmcia/crc32hash.c @@ -0,0 +1,32 @@ +/* crc32hash.c - derived from linux/lib/crc32.c, GNU GPL v2 */ +/* Usage example: +$ ./crc32hash "Dual Speed" +*/ + +#include <string.h> +#include <stdio.h> +#include <ctype.h> +#include <stdlib.h> + +unsigned int crc32(unsigned char const *p, unsigned int len) +{ + int i; + unsigned int crc = 0; + while (len--) { + crc ^= *p++; + for (i = 0; i < 8; i++) + crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0); + } + return crc; +} + +int main(int argc, char **argv) { + unsigned int result; + if (argc != 2) { + printf("no string passed as argument\n"); + return -1; + } + result = crc32(argv[1], strlen(argv[1])); + printf("0x%x\n", result); + return 0; +} diff --git a/Documentation/pcmcia/devicetable.txt b/Documentation/pcmcia/devicetable.txt index 3351c0355143..199afd100cf2 100644 --- a/Documentation/pcmcia/devicetable.txt +++ b/Documentation/pcmcia/devicetable.txt @@ -27,37 +27,7 @@ pcmcia:m0149cC1ABf06pfn00fn00pa725B842DpbF1EFEE84pc0877B627pd00000000 The hex value after "pa" is the hash of product ID string 1, after "pb" for string 2 and so on. -Alternatively, you can use this small tool to determine the crc32 hash. -simply pass the string you want to evaluate as argument to this program, -e.g. +Alternatively, you can use crc32hash (see Documentation/pcmcia/crc32hash.c) +to determine the crc32 hash. Simply pass the string you want to evaluate +as argument to this program, e.g.: $ ./crc32hash "Dual Speed" - -------------------------------------------------------------------------- -/* crc32hash.c - derived from linux/lib/crc32.c, GNU GPL v2 */ -#include <string.h> -#include <stdio.h> -#include <ctype.h> -#include <stdlib.h> - -unsigned int crc32(unsigned char const *p, unsigned int len) -{ - int i; - unsigned int crc = 0; - while (len--) { - crc ^= *p++; - for (i = 0; i < 8; i++) - crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0); - } - return crc; -} - -int main(int argc, char **argv) { - unsigned int result; - if (argc != 2) { - printf("no string passed as argument\n"); - return -1; - } - result = crc32(argv[1], strlen(argv[1])); - printf("0x%x\n", result); - return 0; -} diff --git a/Documentation/pi-futex.txt b/Documentation/pi-futex.txt new file mode 100644 index 000000000000..5d61dacd21f6 --- /dev/null +++ b/Documentation/pi-futex.txt @@ -0,0 +1,121 @@ +Lightweight PI-futexes +---------------------- + +We are calling them lightweight for 3 reasons: + + - in the user-space fastpath a PI-enabled futex involves no kernel work + (or any other PI complexity) at all. No registration, no extra kernel + calls - just pure fast atomic ops in userspace. + + - even in the slowpath, the system call and scheduling pattern is very + similar to normal futexes. + + - the in-kernel PI implementation is streamlined around the mutex + abstraction, with strict rules that keep the implementation + relatively simple: only a single owner may own a lock (i.e. no + read-write lock support), only the owner may unlock a lock, no + recursive locking, etc. + +Priority Inheritance - why? +--------------------------- + +The short reply: user-space PI helps achieving/improving determinism for +user-space applications. In the best-case, it can help achieve +determinism and well-bound latencies. Even in the worst-case, PI will +improve the statistical distribution of locking related application +delays. + +The longer reply: +----------------- + +Firstly, sharing locks between multiple tasks is a common programming +technique that often cannot be replaced with lockless algorithms. As we +can see it in the kernel [which is a quite complex program in itself], +lockless structures are rather the exception than the norm - the current +ratio of lockless vs. locky code for shared data structures is somewhere +between 1:10 and 1:100. Lockless is hard, and the complexity of lockless +algorithms often endangers to ability to do robust reviews of said code. +I.e. critical RT apps often choose lock structures to protect critical +data structures, instead of lockless algorithms. Furthermore, there are +cases (like shared hardware, or other resource limits) where lockless +access is mathematically impossible. + +Media players (such as Jack) are an example of reasonable application +design with multiple tasks (with multiple priority levels) sharing +short-held locks: for example, a highprio audio playback thread is +combined with medium-prio construct-audio-data threads and low-prio +display-colory-stuff threads. Add video and decoding to the mix and +we've got even more priority levels. + +So once we accept that synchronization objects (locks) are an +unavoidable fact of life, and once we accept that multi-task userspace +apps have a very fair expectation of being able to use locks, we've got +to think about how to offer the option of a deterministic locking +implementation to user-space. + +Most of the technical counter-arguments against doing priority +inheritance only apply to kernel-space locks. But user-space locks are +different, there we cannot disable interrupts or make the task +non-preemptible in a critical section, so the 'use spinlocks' argument +does not apply (user-space spinlocks have the same priority inversion +problems as other user-space locking constructs). Fact is, pretty much +the only technique that currently enables good determinism for userspace +locks (such as futex-based pthread mutexes) is priority inheritance: + +Currently (without PI), if a high-prio and a low-prio task shares a lock +[this is a quite common scenario for most non-trivial RT applications], +even if all critical sections are coded carefully to be deterministic +(i.e. all critical sections are short in duration and only execute a +limited number of instructions), the kernel cannot guarantee any +deterministic execution of the high-prio task: any medium-priority task +could preempt the low-prio task while it holds the shared lock and +executes the critical section, and could delay it indefinitely. + +Implementation: +--------------- + +As mentioned before, the userspace fastpath of PI-enabled pthread +mutexes involves no kernel work at all - they behave quite similarly to +normal futex-based locks: a 0 value means unlocked, and a value==TID +means locked. (This is the same method as used by list-based robust +futexes.) Userspace uses atomic ops to lock/unlock these mutexes without +entering the kernel. + +To handle the slowpath, we have added two new futex ops: + + FUTEX_LOCK_PI + FUTEX_UNLOCK_PI + +If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to +TID fails], then FUTEX_LOCK_PI is called. The kernel does all the +remaining work: if there is no futex-queue attached to the futex address +yet then the code looks up the task that owns the futex [it has put its +own TID into the futex value], and attaches a 'PI state' structure to +the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware, +kernel-based synchronization object. The 'other' task is made the owner +of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the +futex value. Then this task tries to lock the rt-mutex, on which it +blocks. Once it returns, it has the mutex acquired, and it sets the +futex value to its own TID and returns. Userspace has no other work to +perform - it now owns the lock, and futex value contains +FUTEX_WAITERS|TID. + +If the unlock side fastpath succeeds, [i.e. userspace manages to do a +TID -> 0 atomic transition of the futex value], then no kernel work is +triggered. + +If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), +then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the +behalf of userspace - and it also unlocks the attached +pi_state->rt_mutex and thus wakes up any potential waiters. + +Note that under this approach, contrary to previous PI-futex approaches, +there is no prior 'registration' of a PI-futex. [which is not quite +possible anyway, due to existing ABI properties of pthread mutexes.] + +Also, under this scheme, 'robustness' and 'PI' are two orthogonal +properties of futexes, and all four combinations are possible: futex, +robust-futex, PI-futex, robust+PI-futex. + +More details about priority inheritance can be found in +Documentation/rtmutex.txt. diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt index fba1e05c47c7..d0e79d5820a5 100644 --- a/Documentation/power/devices.txt +++ b/Documentation/power/devices.txt @@ -1,208 +1,553 @@ +Most of the code in Linux is device drivers, so most of the Linux power +management code is also driver-specific. Most drivers will do very little; +others, especially for platforms with small batteries (like cell phones), +will do a lot. + +This writeup gives an overview of how drivers interact with system-wide +power management goals, emphasizing the models and interfaces that are +shared by everything that hooks up to the driver model core. Read it as +background for the domain-specific work you'd do with any specific driver. + + +Two Models for Device Power Management +====================================== +Drivers will use one or both of these models to put devices into low-power +states: + + System Sleep model: + Drivers can enter low power states as part of entering system-wide + low-power states like "suspend-to-ram", or (mostly for systems with + disks) "hibernate" (suspend-to-disk). + + This is something that device, bus, and class drivers collaborate on + by implementing various role-specific suspend and resume methods to + cleanly power down hardware and software subsystems, then reactivate + them without loss of data. + + Some drivers can manage hardware wakeup events, which make the system + leave that low-power state. This feature may be disabled using the + relevant /sys/devices/.../power/wakeup file; enabling it may cost some + power usage, but let the whole system enter low power states more often. + + Runtime Power Management model: + Drivers may also enter low power states while the system is running, + independently of other power management activity. Upstream drivers + will normally not know (or care) if the device is in some low power + state when issuing requests; the driver will auto-resume anything + that's needed when it gets a request. + + This doesn't have, or need much infrastructure; it's just something you + should do when writing your drivers. For example, clk_disable() unused + clocks as part of minimizing power drain for currently-unused hardware. + Of course, sometimes clusters of drivers will collaborate with each + other, which could involve task-specific power management. + +There's not a lot to be said about those low power states except that they +are very system-specific, and often device-specific. Also, that if enough +drivers put themselves into low power states (at "runtime"), the effect may be +the same as entering some system-wide low-power state (system sleep) ... and +that synergies exist, so that several drivers using runtime pm might put the +system into a state where even deeper power saving options are available. + +Most suspended devices will have quiesced all I/O: no more DMA or irqs, no +more data read or written, and requests from upstream drivers are no longer +accepted. A given bus or platform may have different requirements though. + +Examples of hardware wakeup events include an alarm from a real time clock, +network wake-on-LAN packets, keyboard or mouse activity, and media insertion +or removal (for PCMCIA, MMC/SD, USB, and so on). + + +Interfaces for Entering System Sleep States +=========================================== +Most of the programming interfaces a device driver needs to know about +relate to that first model: entering a system-wide low power state, +rather than just minimizing power consumption by one device. + + +Bus Driver Methods +------------------ +The core methods to suspend and resume devices reside in struct bus_type. +These are mostly of interest to people writing infrastructure for busses +like PCI or USB, or because they define the primitives that device drivers +may need to apply in domain-specific ways to their devices: -Device Power Management +struct bus_type { + ... + int (*suspend)(struct device *dev, pm_message_t state); + int (*suspend_late)(struct device *dev, pm_message_t state); + int (*resume_early)(struct device *dev); + int (*resume)(struct device *dev); +}; -Device power management encompasses two areas - the ability to save -state and transition a device to a low-power state when the system is -entering a low-power state; and the ability to transition a device to -a low-power state while the system is running (and independently of -any other power management activity). +Bus drivers implement those methods as appropriate for the hardware and +the drivers using it; PCI works differently from USB, and so on. Not many +people write bus drivers; most driver code is a "device driver" that +builds on top of bus-specific framework code. + +For more information on these driver calls, see the description later; +they are called in phases for every device, respecting the parent-child +sequencing in the driver model tree. Note that as this is being written, +only the suspend() and resume() are widely available; not many bus drivers +leverage all of those phases, or pass them down to lower driver levels. + + +/sys/devices/.../power/wakeup files +----------------------------------- +All devices in the driver model have two flags to control handling of +wakeup events, which are hardware signals that can force the device and/or +system out of a low power state. These are initialized by bus or device +driver code using device_init_wakeup(dev,can_wakeup). + +The "can_wakeup" flag just records whether the device (and its driver) can +physically support wakeup events. When that flag is clear, the sysfs +"wakeup" file is empty, and device_may_wakeup() returns false. + +For devices that can issue wakeup events, a separate flag controls whether +that device should try to use its wakeup mechanism. The initial value of +device_may_wakeup() will be true, so that the device's "wakeup" file holds +the value "enabled". Userspace can change that to "disabled" so that +device_may_wakeup() returns false; or change it back to "enabled" (so that +it returns true again). + + +EXAMPLE: PCI Device Driver Methods +----------------------------------- +PCI framework software calls these methods when the PCI device driver bound +to a device device has provided them: + +struct pci_driver { + ... + int (*suspend)(struct pci_device *pdev, pm_message_t state); + int (*suspend_late)(struct pci_device *pdev, pm_message_t state); + + int (*resume_early)(struct pci_device *pdev); + int (*resume)(struct pci_device *pdev); +}; +Drivers will implement those methods, and call PCI-specific procedures +like pci_set_power_state(), pci_enable_wake(), pci_save_state(), and +pci_restore_state() to manage PCI-specific mechanisms. (PCI config space +could be saved during driver probe, if it weren't for the fact that some +systems rely on userspace tweaking using setpci.) Devices are suspended +before their bridges enter low power states, and likewise bridges resume +before their devices. + + +Upper Layers of Driver Stacks +----------------------------- +Device drivers generally have at least two interfaces, and the methods +sketched above are the ones which apply to the lower level (nearer PCI, USB, +or other bus hardware). The network and block layers are examples of upper +level interfaces, as is a character device talking to userspace. + +Power management requests normally need to flow through those upper levels, +which often use domain-oriented requests like "blank that screen". In +some cases those upper levels will have power management intelligence that +relates to end-user activity, or other devices that work in cooperation. + +When those interfaces are structured using class interfaces, there is a +standard way to have the upper layer stop issuing requests to a given +class device (and restart later): + +struct class { + ... + int (*suspend)(struct device *dev, pm_message_t state); + int (*resume)(struct device *dev); +}; -Methods +Those calls are issued in specific phases of the process by which the +system enters a low power "suspend" state, or resumes from it. + + +Calling Drivers to Enter System Sleep States +============================================ +When the system enters a low power state, each device's driver is asked +to suspend the device by putting it into state compatible with the target +system state. That's usually some version of "off", but the details are +system-specific. Also, wakeup-enabled devices will usually stay partly +functional in order to wake the system. + +When the system leaves that low power state, the device's driver is asked +to resume it. The suspend and resume operations always go together, and +both are multi-phase operations. + +For simple drivers, suspend might quiesce the device using the class code +and then turn its hardware as "off" as possible with late_suspend. The +matching resume calls would then completely reinitialize the hardware +before reactivating its class I/O queues. + +More power-aware drivers drivers will use more than one device low power +state, either at runtime or during system sleep states, and might trigger +system wakeup events. + + +Call Sequence Guarantees +------------------------ +To ensure that bridges and similar links needed to talk to a device are +available when the device is suspended or resumed, the device tree is +walked in a bottom-up order to suspend devices. A top-down order is +used to resume those devices. + +The ordering of the device tree is defined by the order in which devices +get registered: a child can never be registered, probed or resumed before +its parent; and can't be removed or suspended after that parent. + +The policy is that the device tree should match hardware bus topology. +(Or at least the control bus, for devices which use multiple busses.) + + +Suspending Devices +------------------ +Suspending a given device is done in several phases. Suspending the +system always includes every phase, executing calls for every device +before the next phase begins. Not all busses or classes support all +these callbacks; and not all drivers use all the callbacks. + +The phases are seen by driver notifications issued in this order: + + 1 class.suspend(dev, message) is called after tasks are frozen, for + devices associated with a class that has such a method. This + method may sleep. + + Since I/O activity usually comes from such higher layers, this is + a good place to quiesce all drivers of a given type (and keep such + code out of those drivers). + + 2 bus.suspend(dev, message) is called next. This method may sleep, + and is often morphed into a device driver call with bus-specific + parameters and/or rules. + + This call should handle parts of device suspend logic that require + sleeping. It probably does work to quiesce the device which hasn't + been abstracted into class.suspend() or bus.suspend_late(). + + 3 bus.suspend_late(dev, message) is called with IRQs disabled, and + with only one CPU active. Until the bus.resume_early() phase + completes (see later), IRQs are not enabled again. This method + won't be exposed by all busses; for message based busses like USB, + I2C, or SPI, device interactions normally require IRQs. This bus + call may be morphed into a driver call with bus-specific parameters. + + This call might save low level hardware state that might otherwise + be lost in the upcoming low power state, and actually put the + device into a low power state ... so that in some cases the device + may stay partly usable until this late. This "late" call may also + help when coping with hardware that behaves badly. + +The pm_message_t parameter is currently used to refine those semantics +(described later). + +At the end of those phases, drivers should normally have stopped all I/O +transactions (DMA, IRQs), saved enough state that they can re-initialize +or restore previous state (as needed by the hardware), and placed the +device into a low-power state. On many platforms they will also use +clk_disable() to gate off one or more clock sources; sometimes they will +also switch off power supplies, or reduce voltages. Drivers which have +runtime PM support may already have performed some or all of the steps +needed to prepare for the upcoming system sleep state. + +When any driver sees that its device_can_wakeup(dev), it should make sure +to use the relevant hardware signals to trigger a system wakeup event. +For example, enable_irq_wake() might identify GPIO signals hooked up to +a switch or other external hardware, and pci_enable_wake() does something +similar for PCI's PME# signal. + +If a driver (or bus, or class) fails it suspend method, the system won't +enter the desired low power state; it will resume all the devices it's +suspended so far. + +Note that drivers may need to perform different actions based on the target +system lowpower/sleep state. At this writing, there are only platform +specific APIs through which drivers could determine those target states. + + +Device Low Power (suspend) States +--------------------------------- +Device low-power states aren't very standard. One device might only handle +"on" and "off, while another might support a dozen different versions of +"on" (how many engines are active?), plus a state that gets back to "on" +faster than from a full "off". + +Some busses define rules about what different suspend states mean. PCI +gives one example: after the suspend sequence completes, a non-legacy +PCI device may not perform DMA or issue IRQs, and any wakeup events it +issues would be issued through the PME# bus signal. Plus, there are +several PCI-standard device states, some of which are optional. + +In contrast, integrated system-on-chip processors often use irqs as the +wakeup event sources (so drivers would call enable_irq_wake) and might +be able to treat DMA completion as a wakeup event (sometimes DMA can stay +active too, it'd only be the CPU and some peripherals that sleep). + +Some details here may be platform-specific. Systems may have devices that +can be fully active in certain sleep states, such as an LCD display that's +refreshed using DMA while most of the system is sleeping lightly ... and +its frame buffer might even be updated by a DSP or other non-Linux CPU while +the Linux control processor stays idle. + +Moreover, the specific actions taken may depend on the target system state. +One target system state might allow a given device to be very operational; +another might require a hard shut down with re-initialization on resume. +And two different target systems might use the same device in different +ways; the aforementioned LCD might be active in one product's "standby", +but a different product using the same SOC might work differently. + + +Meaning of pm_message_t.event +----------------------------- +Parameters to suspend calls include the device affected and a message of +type pm_message_t, which has one field: the event. If driver does not +recognize the event code, suspend calls may abort the request and return +a negative errno. However, most drivers will be fine if they implement +PM_EVENT_SUSPEND semantics for all messages. + +The event codes are used to refine the goal of suspending the device, and +mostly matter when creating or resuming system memory image snapshots, as +used with suspend-to-disk: + + PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power + state. When used with system sleep states like "suspend-to-RAM" or + "standby", the upcoming resume() call will often be able to rely on + state kept in hardware, or issue system wakeup events. When used + instead with suspend-to-disk, few devices support this capability; + most are completely powered off. + + PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into + any low power mode. A system snapshot is about to be taken, often + followed by a call to the driver's resume() method. Neither wakeup + events nor DMA are allowed. + + PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume() + will restore a suspend-to-disk snapshot from a different kernel image. + Drivers that are smart enough to look at their hardware state during + resume() processing need that state to be correct ... a PRETHAW could + be used to invalidate that state (by resetting the device), like a + shutdown() invocation would before a kexec() or system halt. Other + drivers might handle this the same way as PM_EVENT_FREEZE. Neither + wakeup events nor DMA are allowed. + +To enter "standby" (ACPI S1) or "Suspend to RAM" (STR, ACPI S3) states, or +the similarly named APM states, only PM_EVENT_SUSPEND is used; for "Suspend +to Disk" (STD, hibernate, ACPI S4), all of those event codes are used. + +There's also PM_EVENT_ON, a value which never appears as a suspend event +but is sometimes used to record the "not suspended" device state. + + +Resuming Devices +---------------- +Resuming is done in multiple phases, much like suspending, with all +devices processing each phase's calls before the next phase begins. + +The phases are seen by driver notifications issued in this order: + + 1 bus.resume_early(dev) is called with IRQs disabled, and with + only one CPU active. As with bus.suspend_late(), this method + won't be supported on busses that require IRQs in order to + interact with devices. + + This reverses the effects of bus.suspend_late(). + + 2 bus.resume(dev) is called next. This may be morphed into a device + driver call with bus-specific parameters; implementations may sleep. + + This reverses the effects of bus.suspend(). + + 3 class.resume(dev) is called for devices associated with a class + that has such a method. Implementations may sleep. + + This reverses the effects of class.suspend(), and would usually + reactivate the device's I/O queue. + +At the end of those phases, drivers should normally be as functional as +they were before suspending: I/O can be performed using DMA and IRQs, and +the relevant clocks are gated on. The device need not be "fully on"; it +might be in a runtime lowpower/suspend state that acts as if it were. + +However, the details here may again be platform-specific. For example, +some systems support multiple "run" states, and the mode in effect at +the end of resume() might not be the one which preceded suspension. +That means availability of certain clocks or power supplies changed, +which could easily affect how a driver works. + + +Drivers need to be able to handle hardware which has been reset since the +suspend methods were called, for example by complete reinitialization. +This may be the hardest part, and the one most protected by NDA'd documents +and chip errata. It's simplest if the hardware state hasn't changed since +the suspend() was called, but that can't always be guaranteed. + +Drivers must also be prepared to notice that the device has been removed +while the system was powered off, whenever that's physically possible. +PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses +where common Linux platforms will see such removal. Details of how drivers +will notice and handle such removals are currently bus-specific, and often +involve a separate thread. -The methods to suspend and resume devices reside in struct bus_type: -struct bus_type { - ... - int (*suspend)(struct device * dev, pm_message_t state); - int (*resume)(struct device * dev); -}; +Note that the bus-specific runtime PM wakeup mechanism can exist, and might +be defined to share some of the same driver code as for system wakeup. For +example, a bus-specific device driver's resume() method might be used there, +so it wouldn't only be called from bus.resume() during system-wide wakeup. +See bus-specific information about how runtime wakeup events are handled. -Each bus driver is responsible implementing these methods, translating -the call into a bus-specific request and forwarding the call to the -bus-specific drivers. For example, PCI drivers implement suspend() and -resume() methods in struct pci_driver. The PCI core is simply -responsible for translating the pointers to PCI-specific ones and -calling the low-level driver. - -This is done to a) ease transition to the new power management methods -and leverage the existing PM code in various bus drivers; b) allow -buses to implement generic and default PM routines for devices, and c) -make the flow of execution obvious to the reader. - - -System Power Management - -When the system enters a low-power state, the device tree is walked in -a depth-first fashion to transition each device into a low-power -state. The ordering of the device tree is guaranteed by the order in -which devices get registered - children are never registered before -their ancestors, and devices are placed at the back of the list when -registered. By walking the list in reverse order, we are guaranteed to -suspend devices in the proper order. - -Devices are suspended once with interrupts enabled. Drivers are -expected to stop I/O transactions, save device state, and place the -device into a low-power state. Drivers may sleep, allocate memory, -etc. at will. - -Some devices are broken and will inevitably have problems powering -down or disabling themselves with interrupts enabled. For these -special cases, they may return -EAGAIN. This will put the device on a -list to be taken care of later. When interrupts are disabled, before -we enter the low-power state, their drivers are called again to put -their device to sleep. - -On resume, the devices that returned -EAGAIN will be called to power -themselves back on with interrupts disabled. Once interrupts have been -re-enabled, the rest of the drivers will be called to resume their -devices. On resume, a driver is responsible for powering back on each -device, restoring state, and re-enabling I/O transactions for that -device. +System Devices +-------------- System devices follow a slightly different API, which can be found in include/linux/sysdev.h drivers/base/sys.c -System devices will only be suspended with interrupts disabled, and -after all other devices have been suspended. On resume, they will be -resumed before any other devices, and also with interrupts disabled. +System devices will only be suspended with interrupts disabled, and after +all other devices have been suspended. On resume, they will be resumed +before any other devices, and also with interrupts disabled. +That is, IRQs are disabled, the suspend_late() phase begins, then the +sysdev_driver.suspend() phase, and the system enters a sleep state. Then +the sysdev_driver.resume() phase begins, followed by the resume_early() +phase, after which IRQs are enabled. -Runtime Power Management - -Many devices are able to dynamically power down while the system is -still running. This feature is useful for devices that are not being -used, and can offer significant power savings on a running system. - -In each device's directory, there is a 'power' directory, which -contains at least a 'state' file. Reading from this file displays what -power state the device is currently in. Writing to this file initiates -a transition to the specified power state, which must be a decimal in -the range 1-3, inclusive; or 0 for 'On'. +Code to actually enter and exit the system-wide low power state sometimes +involves hardware details that are only known to the boot firmware, and +may leave a CPU running software (from SRAM or flash memory) that monitors +the system and manages its wakeup sequence. -The PM core will call the ->suspend() method in the bus_type object -that the device belongs to if the specified state is not 0, or -->resume() if it is. -Nothing will happen if the specified state is the same state the -device is currently in. - -If the device is already in a low-power state, and the specified state -is another, but different, low-power state, the ->resume() method will -first be called to power the device back on, then ->suspend() will be -called again with the new state. - -The driver is responsible for saving the working state of the device -and putting it into the low-power state specified. If this was -successful, it returns 0, and the device's power_state field is -updated. - -The driver must take care to know whether or not it is able to -properly resume the device, including all step of reinitialization -necessary. (This is the hardest part, and the one most protected by -NDA'd documents). - -The driver must also take care not to suspend a device that is -currently in use. It is their responsibility to provide their own -exclusion mechanisms. - -The runtime power transition happens with interrupts enabled. If a -device cannot support being powered down with interrupts, it may -return -EAGAIN (as it would during a system power management -transition), but it will _not_ be called again, and the transaction -will fail. - -There is currently no way to know what states a device or driver -supports a priori. This will change in the future. - -pm_message_t meaning - -pm_message_t has two fields. event ("major"), and flags. If driver -does not know event code, it aborts the request, returning error. Some -drivers may need to deal with special cases based on the actual type -of suspend operation being done at the system level. This is why -there are flags. - -Event codes are: - -ON -- no need to do anything except special cases like broken -HW. - -# NOTIFICATION -- pretty much same as ON? - -FREEZE -- stop DMA and interrupts, and be prepared to reinit HW from -scratch. That probably means stop accepting upstream requests, the -actual policy of what to do with them being specific to a given -driver. It's acceptable for a network driver to just drop packets -while a block driver is expected to block the queue so no request is -lost. (Use IDE as an example on how to do that). FREEZE requires no -power state change, and it's expected for drivers to be able to -quickly transition back to operating state. - -SUSPEND -- like FREEZE, but also put hardware into low-power state. If -there's need to distinguish several levels of sleep, additional flag -is probably best way to do that. - -Transitions are only from a resumed state to a suspended state, never -between 2 suspended states. (ON -> FREEZE or ON -> SUSPEND can happen, -FREEZE -> SUSPEND or SUSPEND -> FREEZE can not). - -All events are: - -[NOTE NOTE NOTE: If you are driver author, you should not care; you -should only look at event, and ignore flags.] - -#Prepare for suspend -- userland is still running but we are going to -#enter suspend state. This gives drivers chance to load firmware from -#disk and store it in memory, or do other activities taht require -#operating userland, ability to kmalloc GFP_KERNEL, etc... All of these -#are forbiden once the suspend dance is started.. event = ON, flags = -#PREPARE_TO_SUSPEND - -Apm standby -- prepare for APM event. Quiesce devices to make life -easier for APM BIOS. event = FREEZE, flags = APM_STANDBY - -Apm suspend -- same as APM_STANDBY, but it we should probably avoid -spinning down disks. event = FREEZE, flags = APM_SUSPEND - -System halt, reboot -- quiesce devices to make life easier for BIOS. event -= FREEZE, flags = SYSTEM_HALT or SYSTEM_REBOOT - -System shutdown -- at least disks need to be spun down, or data may be -lost. Quiesce devices, just to make life easier for BIOS. event = -FREEZE, flags = SYSTEM_SHUTDOWN - -Kexec -- turn off DMAs and put hardware into some state where new -kernel can take over. event = FREEZE, flags = KEXEC - -Powerdown at end of swsusp -- very similar to SYSTEM_SHUTDOWN, except wake -may need to be enabled on some devices. This actually has at least 3 -subtypes, system can reboot, enter S4 and enter S5 at the end of -swsusp. event = FREEZE, flags = SWSUSP and one of SYSTEM_REBOOT, -SYSTEM_SHUTDOWN, SYSTEM_S4 - -Suspend to ram -- put devices into low power state. event = SUSPEND, -flags = SUSPEND_TO_RAM - -Freeze for swsusp snapshot -- stop DMA and interrupts. No need to put -devices into low power mode, but you must be able to reinitialize -device from scratch in resume method. This has two flavors, its done -once on suspending kernel, once on resuming kernel. event = FREEZE, -flags = DURING_SUSPEND or DURING_RESUME - -Device detach requested from /sys -- deinitialize device; proably same as -SYSTEM_SHUTDOWN, I do not understand this one too much. probably event -= FREEZE, flags = DEV_DETACH. - -#These are not really events sent: -# -#System fully on -- device is working normally; this is probably never -#passed to suspend() method... event = ON, flags = 0 -# -#Ready after resume -- userland is now running, again. Time to free any -#memory you ate during prepare to suspend... event = ON, flags = -#READY_AFTER_RESUME -# +Runtime Power Management +======================== +Many devices are able to dynamically power down while the system is still +running. This feature is useful for devices that are not being used, and +can offer significant power savings on a running system. These devices +often support a range of runtime power states, which might use names such +as "off", "sleep", "idle", "active", and so on. Those states will in some +cases (like PCI) be partially constrained by a bus the device uses, and will +usually include hardware states that are also used in system sleep states. + +However, note that if a driver puts a device into a runtime low power state +and the system then goes into a system-wide sleep state, it normally ought +to resume into that runtime low power state rather than "full on". Such +distinctions would be part of the driver-internal state machine for that +hardware; the whole point of runtime power management is to be sure that +drivers are decoupled in that way from the state machine governing phases +of the system-wide power/sleep state transitions. + + +Power Saving Techniques +----------------------- +Normally runtime power management is handled by the drivers without specific +userspace or kernel intervention, by device-aware use of techniques like: + + Using information provided by other system layers + - stay deeply "off" except between open() and close() + - if transceiver/PHY indicates "nobody connected", stay "off" + - application protocols may include power commands or hints + + Using fewer CPU cycles + - using DMA instead of PIO + - removing timers, or making them lower frequency + - shortening "hot" code paths + - eliminating cache misses + - (sometimes) offloading work to device firmware + + Reducing other resource costs + - gating off unused clocks in software (or hardware) + - switching off unused power supplies + - eliminating (or delaying/merging) IRQs + - tuning DMA to use word and/or burst modes + + Using device-specific low power states + - using lower voltages + - avoiding needless DMA transfers + +Read your hardware documentation carefully to see the opportunities that +may be available. If you can, measure the actual power usage and check +it against the budget established for your project. + + +Examples: USB hosts, system timer, system CPU +---------------------------------------------- +USB host controllers make interesting, if complex, examples. In many cases +these have no work to do: no USB devices are connected, or all of them are +in the USB "suspend" state. Linux host controller drivers can then disable +periodic DMA transfers that would otherwise be a constant power drain on the +memory subsystem, and enter a suspend state. In power-aware controllers, +entering that suspend state may disable the clock used with USB signaling, +saving a certain amount of power. + +The controller will be woken from that state (with an IRQ) by changes to the +signal state on the data lines of a given port, for example by an existing +peripheral requesting "remote wakeup" or by plugging a new peripheral. The +same wakeup mechanism usually works from "standby" sleep states, and on some +systems also from "suspend to RAM" (or even "suspend to disk") states. +(Except that ACPI may be involved instead of normal IRQs, on some hardware.) + +System devices like timers and CPUs may have special roles in the platform +power management scheme. For example, system timers using a "dynamic tick" +approach don't just save CPU cycles (by eliminating needless timer IRQs), +but they may also open the door to using lower power CPU "idle" states that +cost more than a jiffie to enter and exit. On x86 systems these are states +like "C3"; note that periodic DMA transfers from a USB host controller will +also prevent entry to a C3 state, much like a periodic timer IRQ. + +That kind of runtime mechanism interaction is common. "System On Chip" (SOC) +processors often have low power idle modes that can't be entered unless +certain medium-speed clocks (often 12 or 48 MHz) are gated off. When the +drivers gate those clocks effectively, then the system idle task may be able +to use the lower power idle modes and thereby increase battery life. + +If the CPU can have a "cpufreq" driver, there also may be opportunities +to shift to lower voltage settings and reduce the power cost of executing +a given number of instructions. (Without voltage adjustment, it's rare +for cpufreq to save much power; the cost-per-instruction must go down.) + + +/sys/devices/.../power/state files +================================== +For now you can also test some of this functionality using sysfs. + + DEPRECATED: USE "power/state" ONLY FOR DRIVER TESTING, AND + AVOID USING dev->power.power_state IN DRIVERS. + + THESE WILL BE REMOVED. IF THE "power/state" FILE GETS REPLACED, + IT WILL BECOME SOMETHING COUPLED TO THE BUS OR DRIVER. + +In each device's directory, there is a 'power' directory, which contains +at least a 'state' file. The value of this field is effectively boolean, +PM_EVENT_ON or PM_EVENT_SUSPEND. + + * Reading from this file displays a value corresponding to + the power.power_state.event field. All nonzero values are + displayed as "2", corresponding to a low power state; zero + is displayed as "0", corresponding to normal operation. + + * Writing to this file initiates a transition using the + specified event code number; only '0', '2', and '3' are + accepted (without a newline); '2' and '3' are both + mapped to PM_EVENT_SUSPEND. + +On writes, the PM core relies on that recorded event code and the device/bus +capabilities to determine whether it uses a partial suspend() or resume() +sequence to change things so that the recorded event corresponds to the +numeric parameter. + + - If the bus requires the irqs-disabled suspend_late()/resume_early() + phases, writes fail because those operations are not supported here. + + - If the recorded value is the expected value, nothing is done. + + - If the recorded value is nonzero, the device is partially resumed, + using the bus.resume() and/or class.resume() methods. + + - If the target value is nonzero, the device is partially suspended, + using the class.suspend() and/or bus.suspend() methods and the + PM_EVENT_SUSPEND message. + +Drivers have no way to tell whether their suspend() and resume() calls +have come through the sysfs power/state file or as part of entering a +system sleep state, except that when accessed through sysfs the normal +parent/child sequencing rules are ignored. Drivers (such as bus, bridge, +or hub drivers) which expose child devices may need to enforce those rules +on their own. diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt index 4117802af0f8..a66bec222b16 100644 --- a/Documentation/power/interface.txt +++ b/Documentation/power/interface.txt @@ -52,3 +52,18 @@ suspend image will be as small as possible. Reading from this file will display the current image size limit, which is set to 500 MB by default. + +/sys/power/pm_trace controls the code which saves the last PM event point in +the RTC across reboots, so that you can debug a machine that just hangs +during suspend (or more commonly, during resume). Namely, the RTC is only +used to save the last PM event point if this file contains '1'. Initially it +contains '0' which may be changed to '1' by writing a string representing a +nonzero integer into it. + +To use this debugging feature you should attempt to suspend the machine, then +reboot it and run + + dmesg -s 1000000 | grep 'hash matches' + +CAUTION: Using it will cause your machine's real-time (CMOS) clock to be +set to a random invalid time after a resume. diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt index 217e51768b87..5c0ba235f5a5 100644 --- a/Documentation/powerpc/booting-without-of.txt +++ b/Documentation/powerpc/booting-without-of.txt @@ -1136,10 +1136,10 @@ Sense and level information should be encoded as follows: Devices connected to openPIC-compatible controllers should encode sense and polarity as follows: - 0 = high to low edge sensitive type enabled + 0 = low to high edge sensitive type enabled 1 = active low level sensitive type enabled - 2 = low to high edge sensitive type enabled - 3 = active high level sensitive type enabled + 2 = active high level sensitive type enabled + 3 = high to low edge sensitive type enabled ISA PIC interrupt controllers should adhere to the ISA PIC encodings listed below: @@ -1196,7 +1196,7 @@ platforms are moved over to use the flattened-device-tree model. - model : Model of the device. Can be "TSEC", "eTSEC", or "FEC" - compatible : Should be "gianfar" - reg : Offset and length of the register set for the device - - address : List of bytes representing the ethernet address of + - mac-address : List of bytes representing the ethernet address of this controller - interrupts : <a b> where a is the interrupt number and b is a field that represents an encoding of the sense and level @@ -1216,7 +1216,7 @@ platforms are moved over to use the flattened-device-tree model. model = "TSEC"; compatible = "gianfar"; reg = <24000 1000>; - address = [ 00 E0 0C 00 73 00 ]; + mac-address = [ 00 E0 0C 00 73 00 ]; interrupts = <d 3 e 3 12 3>; interrupt-parent = <40000>; phy-handle = <2452000> @@ -1436,9 +1436,9 @@ platforms are moved over to use the flattened-device-tree model. interrupts = <1d 3>; interrupt-parent = <40000>; num-channels = <4>; - channel-fifo-len = <24>; + channel-fifo-len = <18>; exec-units-mask = <000000fe>; - descriptor-types-mask = <073f1127>; + descriptor-types-mask = <012b0ebf>; }; @@ -1498,7 +1498,7 @@ not necessary as they are usually the same as the root node. model = "TSEC"; compatible = "gianfar"; reg = <24000 1000>; - address = [ 00 E0 0C 00 73 00 ]; + mac-address = [ 00 E0 0C 00 73 00 ]; interrupts = <d 3 e 3 12 3>; interrupt-parent = <40000>; phy-handle = <2452000>; @@ -1511,7 +1511,7 @@ not necessary as they are usually the same as the root node. model = "TSEC"; compatible = "gianfar"; reg = <25000 1000>; - address = [ 00 E0 0C 00 73 01 ]; + mac-address = [ 00 E0 0C 00 73 01 ]; interrupts = <13 3 14 3 18 3>; interrupt-parent = <40000>; phy-handle = <2452001>; @@ -1524,7 +1524,7 @@ not necessary as they are usually the same as the root node. model = "FEC"; compatible = "gianfar"; reg = <26000 1000>; - address = [ 00 E0 0C 00 73 02 ]; + mac-address = [ 00 E0 0C 00 73 02 ]; interrupts = <19 3>; interrupt-parent = <40000>; phy-handle = <2452002>; diff --git a/Documentation/ramdisk.txt b/Documentation/ramdisk.txt index 7c25584e082c..52f75b7d51c2 100644 --- a/Documentation/ramdisk.txt +++ b/Documentation/ramdisk.txt @@ -6,7 +6,7 @@ Contents: 1) Overview 2) Kernel Command Line Parameters 3) Using "rdev -r" - 4) An Example of Creating a Compressed RAM Disk + 4) An Example of Creating a Compressed RAM Disk 1) Overview @@ -34,7 +34,7 @@ make it clearer. The original "ramdisk=<ram_size>" has been kept around for compatibility reasons, but it may be removed in the future. The new RAM disk also has the ability to load compressed RAM disk images, -allowing one to squeeze more programs onto an average installation or +allowing one to squeeze more programs onto an average installation or rescue floppy disk. @@ -51,7 +51,7 @@ default is 4096 (4 MB) (8192 (8 MB) on S390). =================== This parameter tells the RAM disk driver how many bytes to use per block. The -default is 512. +default is 1024 (BLOCK_SIZE). 3) Using "rdev -r" @@ -70,7 +70,7 @@ These numbers are no magical secrets, as seen below: ./arch/i386/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000 ./arch/i386/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000 -Consider a typical two floppy disk setup, where you will have the +Consider a typical two floppy disk setup, where you will have the kernel on disk one, and have already put a RAM disk image onto disk #2. Hence you want to set bits 0 to 13 as 0, meaning that your RAM disk @@ -97,12 +97,12 @@ Since the default start = 0 and the default prompt = 1, you could use: append = "load_ramdisk=1" -4) An Example of Creating a Compressed RAM Disk +4) An Example of Creating a Compressed RAM Disk ---------------------------------------------- To create a RAM disk image, you will need a spare block device to construct it on. This can be the RAM disk device itself, or an -unused disk partition (such as an unmounted swap partition). For this +unused disk partition (such as an unmounted swap partition). For this example, we will use the RAM disk device, "/dev/ram0". Note: This technique should not be done on a machine with less than 8 MB diff --git a/Documentation/robust-futexes.txt b/Documentation/robust-futexes.txt index df82d75245a0..76e8064b8c3a 100644 --- a/Documentation/robust-futexes.txt +++ b/Documentation/robust-futexes.txt @@ -95,7 +95,7 @@ comparison. If the thread has registered a list, then normally the list is empty. If the thread/process crashed or terminated in some incorrect way then the list might be non-empty: in this case the kernel carefully walks the list [not trusting it], and marks all locks that are owned by -this thread with the FUTEX_OWNER_DEAD bit, and wakes up one waiter (if +this thread with the FUTEX_OWNER_DIED bit, and wakes up one waiter (if any). The list is guaranteed to be private and per-thread at do_exit() time, diff --git a/Documentation/rt-mutex-design.txt b/Documentation/rt-mutex-design.txt new file mode 100644 index 000000000000..c472ffacc2f6 --- /dev/null +++ b/Documentation/rt-mutex-design.txt @@ -0,0 +1,781 @@ +# +# Copyright (c) 2006 Steven Rostedt +# Licensed under the GNU Free Documentation License, Version 1.2 +# + +RT-mutex implementation design +------------------------------ + +This document tries to describe the design of the rtmutex.c implementation. +It doesn't describe the reasons why rtmutex.c exists. For that please see +Documentation/rt-mutex.txt. Although this document does explain problems +that happen without this code, but that is in the concept to understand +what the code actually is doing. + +The goal of this document is to help others understand the priority +inheritance (PI) algorithm that is used, as well as reasons for the +decisions that were made to implement PI in the manner that was done. + + +Unbounded Priority Inversion +---------------------------- + +Priority inversion is when a lower priority process executes while a higher +priority process wants to run. This happens for several reasons, and +most of the time it can't be helped. Anytime a high priority process wants +to use a resource that a lower priority process has (a mutex for example), +the high priority process must wait until the lower priority process is done +with the resource. This is a priority inversion. What we want to prevent +is something called unbounded priority inversion. That is when the high +priority process is prevented from running by a lower priority process for +an undetermined amount of time. + +The classic example of unbounded priority inversion is were you have three +processes, let's call them processes A, B, and C, where A is the highest +priority process, C is the lowest, and B is in between. A tries to grab a lock +that C owns and must wait and lets C run to release the lock. But in the +meantime, B executes, and since B is of a higher priority than C, it preempts C, +but by doing so, it is in fact preempting A which is a higher priority process. +Now there's no way of knowing how long A will be sleeping waiting for C +to release the lock, because for all we know, B is a CPU hog and will +never give C a chance to release the lock. This is called unbounded priority +inversion. + +Here's a little ASCII art to show the problem. + + grab lock L1 (owned by C) + | +A ---+ + C preempted by B + | +C +----+ + +B +--------> + B now keeps A from running. + + +Priority Inheritance (PI) +------------------------- + +There are several ways to solve this issue, but other ways are out of scope +for this document. Here we only discuss PI. + +PI is where a process inherits the priority of another process if the other +process blocks on a lock owned by the current process. To make this easier +to understand, let's use the previous example, with processes A, B, and C again. + +This time, when A blocks on the lock owned by C, C would inherit the priority +of A. So now if B becomes runnable, it would not preempt C, since C now has +the high priority of A. As soon as C releases the lock, it loses its +inherited priority, and A then can continue with the resource that C had. + +Terminology +----------- + +Here I explain some terminology that is used in this document to help describe +the design that is used to implement PI. + +PI chain - The PI chain is an ordered series of locks and processes that cause + processes to inherit priorities from a previous process that is + blocked on one of its locks. This is described in more detail + later in this document. + +mutex - In this document, to differentiate from locks that implement + PI and spin locks that are used in the PI code, from now on + the PI locks will be called a mutex. + +lock - In this document from now on, I will use the term lock when + referring to spin locks that are used to protect parts of the PI + algorithm. These locks disable preemption for UP (when + CONFIG_PREEMPT is enabled) and on SMP prevents multiple CPUs from + entering critical sections simultaneously. + +spin lock - Same as lock above. + +waiter - A waiter is a struct that is stored on the stack of a blocked + process. Since the scope of the waiter is within the code for + a process being blocked on the mutex, it is fine to allocate + the waiter on the process's stack (local variable). This + structure holds a pointer to the task, as well as the mutex that + the task is blocked on. It also has the plist node structures to + place the task in the waiter_list of a mutex as well as the + pi_list of a mutex owner task (described below). + + waiter is sometimes used in reference to the task that is waiting + on a mutex. This is the same as waiter->task. + +waiters - A list of processes that are blocked on a mutex. + +top waiter - The highest priority process waiting on a specific mutex. + +top pi waiter - The highest priority process waiting on one of the mutexes + that a specific process owns. + +Note: task and process are used interchangeably in this document, mostly to + differentiate between two processes that are being described together. + + +PI chain +-------- + +The PI chain is a list of processes and mutexes that may cause priority +inheritance to take place. Multiple chains may converge, but a chain +would never diverge, since a process can't be blocked on more than one +mutex at a time. + +Example: + + Process: A, B, C, D, E + Mutexes: L1, L2, L3, L4 + + A owns: L1 + B blocked on L1 + B owns L2 + C blocked on L2 + C owns L3 + D blocked on L3 + D owns L4 + E blocked on L4 + +The chain would be: + + E->L4->D->L3->C->L2->B->L1->A + +To show where two chains merge, we could add another process F and +another mutex L5 where B owns L5 and F is blocked on mutex L5. + +The chain for F would be: + + F->L5->B->L1->A + +Since a process may own more than one mutex, but never be blocked on more than +one, the chains merge. + +Here we show both chains: + + E->L4->D->L3->C->L2-+ + | + +->B->L1->A + | + F->L5-+ + +For PI to work, the processes at the right end of these chains (or we may +also call it the Top of the chain) must be equal to or higher in priority +than the processes to the left or below in the chain. + +Also since a mutex may have more than one process blocked on it, we can +have multiple chains merge at mutexes. If we add another process G that is +blocked on mutex L2: + + G->L2->B->L1->A + +And once again, to show how this can grow I will show the merging chains +again. + + E->L4->D->L3->C-+ + +->L2-+ + | | + G-+ +->B->L1->A + | + F->L5-+ + + +Plist +----- + +Before I go further and talk about how the PI chain is stored through lists +on both mutexes and processes, I'll explain the plist. This is similar to +the struct list_head functionality that is already in the kernel. +The implementation of plist is out of scope for this document, but it is +very important to understand what it does. + +There are a few differences between plist and list, the most important one +being that plist is a priority sorted linked list. This means that the +priorities of the plist are sorted, such that it takes O(1) to retrieve the +highest priority item in the list. Obviously this is useful to store processes +based on their priorities. + +Another difference, which is important for implementation, is that, unlike +list, the head of the list is a different element than the nodes of a list. +So the head of the list is declared as struct plist_head and nodes that will +be added to the list are declared as struct plist_node. + + +Mutex Waiter List +----------------- + +Every mutex keeps track of all the waiters that are blocked on itself. The mutex +has a plist to store these waiters by priority. This list is protected by +a spin lock that is located in the struct of the mutex. This lock is called +wait_lock. Since the modification of the waiter list is never done in +interrupt context, the wait_lock can be taken without disabling interrupts. + + +Task PI List +------------ + +To keep track of the PI chains, each process has its own PI list. This is +a list of all top waiters of the mutexes that are owned by the process. +Note that this list only holds the top waiters and not all waiters that are +blocked on mutexes owned by the process. + +The top of the task's PI list is always the highest priority task that +is waiting on a mutex that is owned by the task. So if the task has +inherited a priority, it will always be the priority of the task that is +at the top of this list. + +This list is stored in the task structure of a process as a plist called +pi_list. This list is protected by a spin lock also in the task structure, +called pi_lock. This lock may also be taken in interrupt context, so when +locking the pi_lock, interrupts must be disabled. + + +Depth of the PI Chain +--------------------- + +The maximum depth of the PI chain is not dynamic, and could actually be +defined. But is very complex to figure it out, since it depends on all +the nesting of mutexes. Let's look at the example where we have 3 mutexes, +L1, L2, and L3, and four separate functions func1, func2, func3 and func4. +The following shows a locking order of L1->L2->L3, but may not actually +be directly nested that way. + +void func1(void) +{ + mutex_lock(L1); + + /* do anything */ + + mutex_unlock(L1); +} + +void func2(void) +{ + mutex_lock(L1); + mutex_lock(L2); + + /* do something */ + + mutex_unlock(L2); + mutex_unlock(L1); +} + +void func3(void) +{ + mutex_lock(L2); + mutex_lock(L3); + + /* do something else */ + + mutex_unlock(L3); + mutex_unlock(L2); +} + +void func4(void) +{ + mutex_lock(L3); + + /* do something again */ + + mutex_unlock(L3); +} + +Now we add 4 processes that run each of these functions separately. +Processes A, B, C, and D which run functions func1, func2, func3 and func4 +respectively, and such that D runs first and A last. With D being preempted +in func4 in the "do something again" area, we have a locking that follows: + +D owns L3 + C blocked on L3 + C owns L2 + B blocked on L2 + B owns L1 + A blocked on L1 + +And thus we have the chain A->L1->B->L2->C->L3->D. + +This gives us a PI depth of 4 (four processes), but looking at any of the +functions individually, it seems as though they only have at most a locking +depth of two. So, although the locking depth is defined at compile time, +it still is very difficult to find the possibilities of that depth. + +Now since mutexes can be defined by user-land applications, we don't want a DOS +type of application that nests large amounts of mutexes to create a large +PI chain, and have the code holding spin locks while looking at a large +amount of data. So to prevent this, the implementation not only implements +a maximum lock depth, but also only holds at most two different locks at a +time, as it walks the PI chain. More about this below. + + +Mutex owner and flags +--------------------- + +The mutex structure contains a pointer to the owner of the mutex. If the +mutex is not owned, this owner is set to NULL. Since all architectures +have the task structure on at least a four byte alignment (and if this is +not true, the rtmutex.c code will be broken!), this allows for the two +least significant bits to be used as flags. This part is also described +in Documentation/rt-mutex.txt, but will also be briefly described here. + +Bit 0 is used as the "Pending Owner" flag. This is described later. +Bit 1 is used as the "Has Waiters" flags. This is also described later + in more detail, but is set whenever there are waiters on a mutex. + + +cmpxchg Tricks +-------------- + +Some architectures implement an atomic cmpxchg (Compare and Exchange). This +is used (when applicable) to keep the fast path of grabbing and releasing +mutexes short. + +cmpxchg is basically the following function performed atomically: + +unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C) +{ + unsigned long T = *A; + if (*A == *B) { + *A = *C; + } + return T; +} +#define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c) + +This is really nice to have, since it allows you to only update a variable +if the variable is what you expect it to be. You know if it succeeded if +the return value (the old value of A) is equal to B. + +The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If +the architecture does not support CMPXCHG, then this macro is simply set +to fail every time. But if CMPXCHG is supported, then this will +help out extremely to keep the fast path short. + +The use of rt_mutex_cmpxchg with the flags in the owner field help optimize +the system for architectures that support it. This will also be explained +later in this document. + + +Priority adjustments +-------------------- + +The implementation of the PI code in rtmutex.c has several places that a +process must adjust its priority. With the help of the pi_list of a +process this is rather easy to know what needs to be adjusted. + +The functions implementing the task adjustments are rt_mutex_adjust_prio, +__rt_mutex_adjust_prio (same as the former, but expects the task pi_lock +to already be taken), rt_mutex_get_prio, and rt_mutex_setprio. + +rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio. + +rt_mutex_getprio returns the priority that the task should have. Either the +task's own normal priority, or if a process of a higher priority is waiting on +a mutex owned by the task, then that higher priority should be returned. +Since the pi_list of a task holds an order by priority list of all the top +waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs +to compare the top pi waiter to its own normal priority, and return the higher +priority back. + +(Note: if looking at the code, you will notice that the lower number of + prio is returned. This is because the prio field in the task structure + is an inverse order of the actual priority. So a "prio" of 5 is + of higher priority than a "prio" of 10.) + +__rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the +result does not equal the task's current priority, then rt_mutex_setprio +is called to adjust the priority of the task to the new priority. +Note that rt_mutex_setprio is defined in kernel/sched.c to implement the +actual change in priority. + +It is interesting to note that __rt_mutex_adjust_prio can either increase +or decrease the priority of the task. In the case that a higher priority +process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio +would increase/boost the task's priority. But if a higher priority task +were for some reason to leave the mutex (timeout or signal), this same function +would decrease/unboost the priority of the task. That is because the pi_list +always contains the highest priority task that is waiting on a mutex owned +by the task, so we only need to compare the priority of that top pi waiter +to the normal priority of the given task. + + +High level overview of the PI chain walk +---------------------------------------- + +The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain. + +The implementation has gone through several iterations, and has ended up +with what we believe is the best. It walks the PI chain by only grabbing +at most two locks at a time, and is very efficient. + +The rt_mutex_adjust_prio_chain can be used either to boost or lower process +priorities. + +rt_mutex_adjust_prio_chain is called with a task to be checked for PI +(de)boosting (the owner of a mutex that a process is blocking on), a flag to +check for deadlocking, the mutex that the task owns, and a pointer to a waiter +that is the process's waiter struct that is blocked on the mutex (although this +parameter may be NULL for deboosting). + +For this explanation, I will not mention deadlock detection. This explanation +will try to stay at a high level. + +When this function is called, there are no locks held. That also means +that the state of the owner and lock can change when entered into this function. + +Before this function is called, the task has already had rt_mutex_adjust_prio +performed on it. This means that the task is set to the priority that it +should be at, but the plist nodes of the task's waiter have not been updated +with the new priorities, and that this task may not be in the proper locations +in the pi_lists and wait_lists that the task is blocked on. This function +solves all that. + +A loop is entered, where task is the owner to be checked for PI changes that +was passed by parameter (for the first iteration). The pi_lock of this task is +taken to prevent any more changes to the pi_list of the task. This also +prevents new tasks from completing the blocking on a mutex that is owned by this +task. + +If the task is not blocked on a mutex then the loop is exited. We are at +the top of the PI chain. + +A check is now done to see if the original waiter (the process that is blocked +on the current mutex) is the top pi waiter of the task. That is, is this +waiter on the top of the task's pi_list. If it is not, it either means that +there is another process higher in priority that is blocked on one of the +mutexes that the task owns, or that the waiter has just woken up via a signal +or timeout and has left the PI chain. In either case, the loop is exited, since +we don't need to do any more changes to the priority of the current task, or any +task that owns a mutex that this current task is waiting on. A priority chain +walk is only needed when a new top pi waiter is made to a task. + +The next check sees if the task's waiter plist node has the priority equal to +the priority the task is set at. If they are equal, then we are done with +the loop. Remember that the function started with the priority of the +task adjusted, but the plist nodes that hold the task in other processes +pi_lists have not been adjusted. + +Next, we look at the mutex that the task is blocked on. The mutex's wait_lock +is taken. This is done by a spin_trylock, because the locking order of the +pi_lock and wait_lock goes in the opposite direction. If we fail to grab the +lock, the pi_lock is released, and we restart the loop. + +Now that we have both the pi_lock of the task as well as the wait_lock of +the mutex the task is blocked on, we update the task's waiter's plist node +that is located on the mutex's wait_list. + +Now we release the pi_lock of the task. + +Next the owner of the mutex has its pi_lock taken, so we can update the +task's entry in the owner's pi_list. If the task is the highest priority +process on the mutex's wait_list, then we remove the previous top waiter +from the owner's pi_list, and replace it with the task. + +Note: It is possible that the task was the current top waiter on the mutex, + in which case the task is not yet on the pi_list of the waiter. This + is OK, since plist_del does nothing if the plist node is not on any + list. + +If the task was not the top waiter of the mutex, but it was before we +did the priority updates, that means we are deboosting/lowering the +task. In this case, the task is removed from the pi_list of the owner, +and the new top waiter is added. + +Lastly, we unlock both the pi_lock of the task, as well as the mutex's +wait_lock, and continue the loop again. On the next iteration of the +loop, the previous owner of the mutex will be the task that will be +processed. + +Note: One might think that the owner of this mutex might have changed + since we just grab the mutex's wait_lock. And one could be right. + The important thing to remember is that the owner could not have + become the task that is being processed in the PI chain, since + we have taken that task's pi_lock at the beginning of the loop. + So as long as there is an owner of this mutex that is not the same + process as the tasked being worked on, we are OK. + + Looking closely at the code, one might be confused. The check for the + end of the PI chain is when the task isn't blocked on anything or the + task's waiter structure "task" element is NULL. This check is + protected only by the task's pi_lock. But the code to unlock the mutex + sets the task's waiter structure "task" element to NULL with only + the protection of the mutex's wait_lock, which was not taken yet. + Isn't this a race condition if the task becomes the new owner? + + The answer is No! The trick is the spin_trylock of the mutex's + wait_lock. If we fail that lock, we release the pi_lock of the + task and continue the loop, doing the end of PI chain check again. + + In the code to release the lock, the wait_lock of the mutex is held + the entire time, and it is not let go when we grab the pi_lock of the + new owner of the mutex. So if the switch of a new owner were to happen + after the check for end of the PI chain and the grabbing of the + wait_lock, the unlocking code would spin on the new owner's pi_lock + but never give up the wait_lock. So the PI chain loop is guaranteed to + fail the spin_trylock on the wait_lock, release the pi_lock, and + try again. + + If you don't quite understand the above, that's OK. You don't have to, + unless you really want to make a proof out of it ;) + + +Pending Owners and Lock stealing +-------------------------------- + +One of the flags in the owner field of the mutex structure is "Pending Owner". +What this means is that an owner was chosen by the process releasing the +mutex, but that owner has yet to wake up and actually take the mutex. + +Why is this important? Why can't we just give the mutex to another process +and be done with it? + +The PI code is to help with real-time processes, and to let the highest +priority process run as long as possible with little latencies and delays. +If a high priority process owns a mutex that a lower priority process is +blocked on, when the mutex is released it would be given to the lower priority +process. What if the higher priority process wants to take that mutex again. +The high priority process would fail to take that mutex that it just gave up +and it would need to boost the lower priority process to run with full +latency of that critical section (since the low priority process just entered +it). + +There's no reason a high priority process that gives up a mutex should be +penalized if it tries to take that mutex again. If the new owner of the +mutex has not woken up yet, there's no reason that the higher priority process +could not take that mutex away. + +To solve this, we introduced Pending Ownership and Lock Stealing. When a +new process is given a mutex that it was blocked on, it is only given +pending ownership. This means that it's the new owner, unless a higher +priority process comes in and tries to grab that mutex. If a higher priority +process does come along and wants that mutex, we let the higher priority +process "steal" the mutex from the pending owner (only if it is still pending) +and continue with the mutex. + + +Taking of a mutex (The walk through) +------------------------------------ + +OK, now let's take a look at the detailed walk through of what happens when +taking a mutex. + +The first thing that is tried is the fast taking of the mutex. This is +done when we have CMPXCHG enabled (otherwise the fast taking automatically +fails). Only when the owner field of the mutex is NULL can the lock be +taken with the CMPXCHG and nothing else needs to be done. + +If there is contention on the lock, whether it is owned or pending owner +we go about the slow path (rt_mutex_slowlock). + +The slow path function is where the task's waiter structure is created on +the stack. This is because the waiter structure is only needed for the +scope of this function. The waiter structure holds the nodes to store +the task on the wait_list of the mutex, and if need be, the pi_list of +the owner. + +The wait_lock of the mutex is taken since the slow path of unlocking the +mutex also takes this lock. + +We then call try_to_take_rt_mutex. This is where the architecture that +does not implement CMPXCHG would always grab the lock (if there's no +contention). + +try_to_take_rt_mutex is used every time the task tries to grab a mutex in the +slow path. The first thing that is done here is an atomic setting of +the "Has Waiters" flag of the mutex's owner field. Yes, this could really +be false, because if the the mutex has no owner, there are no waiters and +the current task also won't have any waiters. But we don't have the lock +yet, so we assume we are going to be a waiter. The reason for this is to +play nice for those architectures that do have CMPXCHG. By setting this flag +now, the owner of the mutex can't release the mutex without going into the +slow unlock path, and it would then need to grab the wait_lock, which this +code currently holds. So setting the "Has Waiters" flag forces the owner +to synchronize with this code. + +Now that we know that we can't have any races with the owner releasing the +mutex, we check to see if we can take the ownership. This is done if the +mutex doesn't have a owner, or if we can steal the mutex from a pending +owner. Let's look at the situations we have here. + + 1) Has owner that is pending + ---------------------------- + + The mutex has a owner, but it hasn't woken up and the mutex flag + "Pending Owner" is set. The first check is to see if the owner isn't the + current task. This is because this function is also used for the pending + owner to grab the mutex. When a pending owner wakes up, it checks to see + if it can take the mutex, and this is done if the owner is already set to + itself. If so, we succeed and leave the function, clearing the "Pending + Owner" bit. + + If the pending owner is not current, we check to see if the current priority is + higher than the pending owner. If not, we fail the function and return. + + There's also something special about a pending owner. That is a pending owner + is never blocked on a mutex. So there is no PI chain to worry about. It also + means that if the mutex doesn't have any waiters, there's no accounting needed + to update the pending owner's pi_list, since we only worry about processes + blocked on the current mutex. + + If there are waiters on this mutex, and we just stole the ownership, we need + to take the top waiter, remove it from the pi_list of the pending owner, and + add it to the current pi_list. Note that at this moment, the pending owner + is no longer on the list of waiters. This is fine, since the pending owner + would add itself back when it realizes that it had the ownership stolen + from itself. When the pending owner tries to grab the mutex, it will fail + in try_to_take_rt_mutex if the owner field points to another process. + + 2) No owner + ----------- + + If there is no owner (or we successfully stole the lock), we set the owner + of the mutex to current, and set the flag of "Has Waiters" if the current + mutex actually has waiters, or we clear the flag if it doesn't. See, it was + OK that we set that flag early, since now it is cleared. + + 3) Failed to grab ownership + --------------------------- + + The most interesting case is when we fail to take ownership. This means that + there exists an owner, or there's a pending owner with equal or higher + priority than the current task. + +We'll continue on the failed case. + +If the mutex has a timeout, we set up a timer to go off to break us out +of this mutex if we failed to get it after a specified amount of time. + +Now we enter a loop that will continue to try to take ownership of the mutex, or +fail from a timeout or signal. + +Once again we try to take the mutex. This will usually fail the first time +in the loop, since it had just failed to get the mutex. But the second time +in the loop, this would likely succeed, since the task would likely be +the pending owner. + +If the mutex is TASK_INTERRUPTIBLE a check for signals and timeout is done +here. + +The waiter structure has a "task" field that points to the task that is blocked +on the mutex. This field can be NULL the first time it goes through the loop +or if the task is a pending owner and had it's mutex stolen. If the "task" +field is NULL then we need to set up the accounting for it. + +Task blocks on mutex +-------------------- + +The accounting of a mutex and process is done with the waiter structure of +the process. The "task" field is set to the process, and the "lock" field +to the mutex. The plist nodes are initialized to the processes current +priority. + +Since the wait_lock was taken at the entry of the slow lock, we can safely +add the waiter to the wait_list. If the current process is the highest +priority process currently waiting on this mutex, then we remove the +previous top waiter process (if it exists) from the pi_list of the owner, +and add the current process to that list. Since the pi_list of the owner +has changed, we call rt_mutex_adjust_prio on the owner to see if the owner +should adjust its priority accordingly. + +If the owner is also blocked on a lock, and had its pi_list changed +(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead +and run rt_mutex_adjust_prio_chain on the owner, as described earlier. + +Now all locks are released, and if the current process is still blocked on a +mutex (waiter "task" field is not NULL), then we go to sleep (call schedule). + +Waking up in the loop +--------------------- + +The schedule can then wake up for a few reasons. + 1) we were given pending ownership of the mutex. + 2) we received a signal and was TASK_INTERRUPTIBLE + 3) we had a timeout and was TASK_INTERRUPTIBLE + +In any of these cases, we continue the loop and once again try to grab the +ownership of the mutex. If we succeed, we exit the loop, otherwise we continue +and on signal and timeout, will exit the loop, or if we had the mutex stolen +we just simply add ourselves back on the lists and go back to sleep. + +Note: For various reasons, because of timeout and signals, the steal mutex + algorithm needs to be careful. This is because the current process is + still on the wait_list. And because of dynamic changing of priorities, + especially on SCHED_OTHER tasks, the current process can be the + highest priority task on the wait_list. + +Failed to get mutex on Timeout or Signal +---------------------------------------- + +If a timeout or signal occurred, the waiter's "task" field would not be +NULL and the task needs to be taken off the wait_list of the mutex and perhaps +pi_list of the owner. If this process was a high priority process, then +the rt_mutex_adjust_prio_chain needs to be executed again on the owner, +but this time it will be lowering the priorities. + + +Unlocking the Mutex +------------------- + +The unlocking of a mutex also has a fast path for those architectures with +CMPXCHG. Since the taking of a mutex on contention always sets the +"Has Waiters" flag of the mutex's owner, we use this to know if we need to +take the slow path when unlocking the mutex. If the mutex doesn't have any +waiters, the owner field of the mutex would equal the current process and +the mutex can be unlocked by just replacing the owner field with NULL. + +If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), +the slow unlock path is taken. + +The first thing done in the slow unlock path is to take the wait_lock of the +mutex. This synchronizes the locking and unlocking of the mutex. + +A check is made to see if the mutex has waiters or not. On architectures that +do not have CMPXCHG, this is the location that the owner of the mutex will +determine if a waiter needs to be awoken or not. On architectures that +do have CMPXCHG, that check is done in the fast path, but it is still needed +in the slow path too. If a waiter of a mutex woke up because of a signal +or timeout between the time the owner failed the fast path CMPXCHG check and +the grabbing of the wait_lock, the mutex may not have any waiters, thus the +owner still needs to make this check. If there are no waiters than the mutex +owner field is set to NULL, the wait_lock is released and nothing more is +needed. + +If there are waiters, then we need to wake one up and give that waiter +pending ownership. + +On the wake up code, the pi_lock of the current owner is taken. The top +waiter of the lock is found and removed from the wait_list of the mutex +as well as the pi_list of the current owner. The task field of the new +pending owner's waiter structure is set to NULL, and the owner field of the +mutex is set to the new owner with the "Pending Owner" bit set, as well +as the "Has Waiters" bit if there still are other processes blocked on the +mutex. + +The pi_lock of the previous owner is released, and the new pending owner's +pi_lock is taken. Remember that this is the trick to prevent the race +condition in rt_mutex_adjust_prio_chain from adding itself as a waiter +on the mutex. + +We now clear the "pi_blocked_on" field of the new pending owner, and if +the mutex still has waiters pending, we add the new top waiter to the pi_list +of the pending owner. + +Finally we unlock the pi_lock of the pending owner and wake it up. + + +Contact +------- + +For updates on this document, please email Steven Rostedt <rostedt@goodmis.org> + + +Credits +------- + +Author: Steven Rostedt <rostedt@goodmis.org> + +Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap + +Updates +------- + +This document was originally written for 2.6.17-rc3-mm1 diff --git a/Documentation/rt-mutex.txt b/Documentation/rt-mutex.txt new file mode 100644 index 000000000000..243393d882ee --- /dev/null +++ b/Documentation/rt-mutex.txt @@ -0,0 +1,79 @@ +RT-mutex subsystem with PI support +---------------------------------- + +RT-mutexes with priority inheritance are used to support PI-futexes, +which enable pthread_mutex_t priority inheritance attributes +(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details +about PI-futexes.] + +This technology was developed in the -rt tree and streamlined for +pthread_mutex support. + +Basic principles: +----------------- + +RT-mutexes extend the semantics of simple mutexes by the priority +inheritance protocol. + +A low priority owner of a rt-mutex inherits the priority of a higher +priority waiter until the rt-mutex is released. If the temporarily +boosted owner blocks on a rt-mutex itself it propagates the priority +boosting to the owner of the other rt_mutex it gets blocked on. The +priority boosting is immediately removed once the rt_mutex has been +unlocked. + +This approach allows us to shorten the block of high-prio tasks on +mutexes which protect shared resources. Priority inheritance is not a +magic bullet for poorly designed applications, but it allows +well-designed applications to use userspace locks in critical parts of +an high priority thread, without losing determinism. + +The enqueueing of the waiters into the rtmutex waiter list is done in +priority order. For same priorities FIFO order is chosen. For each +rtmutex, only the top priority waiter is enqueued into the owner's +priority waiters list. This list too queues in priority order. Whenever +the top priority waiter of a task changes (for example it timed out or +got a signal), the priority of the owner task is readjusted. [The +priority enqueueing is handled by "plists", see include/linux/plist.h +for more details.] + +RT-mutexes are optimized for fastpath operations and have no internal +locking overhead when locking an uncontended mutex or unlocking a mutex +without waiters. The optimized fastpath operations require cmpxchg +support. [If that is not available then the rt-mutex internal spinlock +is used] + +The state of the rt-mutex is tracked via the owner field of the rt-mutex +structure: + +rt_mutex->owner holds the task_struct pointer of the owner. Bit 0 and 1 +are used to keep track of the "owner is pending" and "rtmutex has +waiters" state. + + owner bit1 bit0 + NULL 0 0 mutex is free (fast acquire possible) + NULL 0 1 invalid state + NULL 1 0 Transitional state* + NULL 1 1 invalid state + taskpointer 0 0 mutex is held (fast release possible) + taskpointer 0 1 task is pending owner + taskpointer 1 0 mutex is held and has waiters + taskpointer 1 1 task is pending owner and mutex has waiters + +Pending-ownership handling is a performance optimization: +pending-ownership is assigned to the first (highest priority) waiter of +the mutex, when the mutex is released. The thread is woken up and once +it starts executing it can acquire the mutex. Until the mutex is taken +by it (bit 0 is cleared) a competing higher priority thread can "steal" +the mutex which puts the woken up thread back on the waiters list. + +The pending-ownership optimization is especially important for the +uninterrupted workflow of high-prio tasks which repeatedly +takes/releases locks that have lower-prio waiters. Without this +optimization the higher-prio thread would ping-pong to the lower-prio +task [because at unlock time we always assign a new owner]. + +(*) The "mutex has waiters" bit gets set to take the lock. If the lock +doesn't already have an owner, this bit is quickly cleared if there are +no waiters. So this is a transitional state to synchronize with looking +at the owner field of the mutex and the mutex owner releasing the lock. diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt index 95d17b3e2eee..2a58f985795a 100644 --- a/Documentation/rtc.txt +++ b/Documentation/rtc.txt @@ -44,8 +44,10 @@ normal timer interrupt, which is 100Hz. Programming and/or enabling interrupt frequencies greater than 64Hz is only allowed by root. This is perhaps a bit conservative, but we don't want an evil user generating lots of IRQs on a slow 386sx-16, where it might have -a negative impact on performance. Note that the interrupt handler is only -a few lines of code to minimize any possibility of this effect. +a negative impact on performance. This 64Hz limit can be changed by writing +a different value to /proc/sys/dev/rtc/max-user-freq. Note that the +interrupt handler is only a few lines of code to minimize any possibility +of this effect. Also, if the kernel time is synchronized with an external source, the kernel will write the time back to the CMOS clock every 11 minutes. In @@ -81,6 +83,7 @@ that will be using this driver. */ #include <stdio.h> +#include <stdlib.h> #include <linux/rtc.h> #include <sys/ioctl.h> #include <sys/time.h> diff --git a/Documentation/scsi/ChangeLog.arcmsr b/Documentation/scsi/ChangeLog.arcmsr new file mode 100644 index 000000000000..162c47fdf45f --- /dev/null +++ b/Documentation/scsi/ChangeLog.arcmsr @@ -0,0 +1,56 @@ +************************************************************************** +** History +** +** REV# DATE NAME DESCRIPTION +** 1.00.00.00 3/31/2004 Erich Chen First release +** 1.10.00.04 7/28/2004 Erich Chen modify for ioctl +** 1.10.00.06 8/28/2004 Erich Chen modify for 2.6.x +** 1.10.00.08 9/28/2004 Erich Chen modify for x86_64 +** 1.10.00.10 10/10/2004 Erich Chen bug fix for SMP & ioctl +** 1.20.00.00 11/29/2004 Erich Chen bug fix with arcmsr_bus_reset when PHY error +** 1.20.00.02 12/09/2004 Erich Chen bug fix with over 2T bytes RAID Volume +** 1.20.00.04 1/09/2005 Erich Chen fits for Debian linux kernel version 2.2.xx +** 1.20.00.05 2/20/2005 Erich Chen cleanly as look like a Linux driver at 2.6.x +** thanks for peoples kindness comment +** Kornel Wieliczek +** Christoph Hellwig +** Adrian Bunk +** Andrew Morton +** Christoph Hellwig +** James Bottomley +** Arjan van de Ven +** 1.20.00.06 3/12/2005 Erich Chen fix with arcmsr_pci_unmap_dma "unsigned long" cast, +** modify PCCB POOL allocated by "dma_alloc_coherent" +** (Kornel Wieliczek's comment) +** 1.20.00.07 3/23/2005 Erich Chen bug fix with arcmsr_scsi_host_template_init +** occur segmentation fault, +** if RAID adapter does not on PCI slot +** and modprobe/rmmod this driver twice. +** bug fix enormous stack usage (Adrian Bunk's comment) +** 1.20.00.08 6/23/2005 Erich Chen bug fix with abort command, +** in case of heavy loading when sata cable +** working on low quality connection +** 1.20.00.09 9/12/2005 Erich Chen bug fix with abort command handling, firmware version check +** and firmware update notify for hardware bug fix +** 1.20.00.10 9/23/2005 Erich Chen enhance sysfs function for change driver's max tag Q number. +** add DMA_64BIT_MASK for backward compatible with all 2.6.x +** add some useful message for abort command +** add ioctl code 'ARCMSR_IOCTL_FLUSH_ADAPTER_CACHE' +** customer can send this command for sync raid volume data +** 1.20.00.11 9/29/2005 Erich Chen by comment of Arjan van de Ven fix incorrect msleep redefine +** cast off sizeof(dma_addr_t) condition for 64bit pci_set_dma_mask +** 1.20.00.12 9/30/2005 Erich Chen bug fix with 64bit platform's ccbs using if over 4G system memory +** change 64bit pci_set_consistent_dma_mask into 32bit +** increcct adapter count if adapter initialize fail. +** miss edit at arcmsr_build_ccb.... +** psge += sizeof(struct _SG64ENTRY *) => +** psge += sizeof(struct _SG64ENTRY) +** 64 bits sg entry would be incorrectly calculated +** thanks Kornel Wieliczek give me kindly notify +** and detail description +** 1.20.00.13 11/15/2005 Erich Chen scheduling pending ccb with FIFO +** change the architecture of arcmsr command queue list +** for linux standard list +** enable usage of pci message signal interrupt +** follow Randy.Danlup kindness suggestion cleanup this code +**************************************************************************
\ No newline at end of file diff --git a/Documentation/scsi/ChangeLog.megaraid b/Documentation/scsi/ChangeLog.megaraid index c173806c91fa..a056bbe67c7e 100644 --- a/Documentation/scsi/ChangeLog.megaraid +++ b/Documentation/scsi/ChangeLog.megaraid @@ -1,3 +1,126 @@ +Release Date : Fri May 19 09:31:45 EST 2006 - Seokmann Ju <sju@lsil.com> +Current Version : 2.20.4.9 (scsi module), 2.20.2.6 (cmm module) +Older Version : 2.20.4.8 (scsi module), 2.20.2.6 (cmm module) + +1. Fixed a bug in megaraid_init_mbox(). + Customer reported "garbage in file on x86_64 platform". + Root Cause: the driver registered controllers as 64-bit DMA capable + for those which are not support it. + Fix: Made change in the function inserting identification machanism + identifying 64-bit DMA capable controllers. + + > -----Original Message----- + > From: Vasily Averin [mailto:vvs@sw.ru] + > Sent: Thursday, May 04, 2006 2:49 PM + > To: linux-scsi@vger.kernel.org; Kolli, Neela; Mukker, Atul; + > Ju, Seokmann; Bagalkote, Sreenivas; + > James.Bottomley@SteelEye.com; devel@openvz.org + > Subject: megaraid_mbox: garbage in file + > + > Hello all, + > + > I've investigated customers claim on the unstable work of + > their node and found a + > strange effect: reading from some files leads to the + > "attempt to access beyond end of device" messages. + > + > I've checked filesystem, memory on the node, motherboard BIOS + > version, but it + > does not help and issue still has been reproduced by simple + > file reading. + > + > Reproducer is simple: + > + > echo 0xffffffff >/proc/sys/dev/scsi/logging_level ; + > cat /vz/private/101/root/etc/ld.so.cache >/tmp/ttt ; + > echo 0 >/proc/sys/dev/scsi/logging + > + > It leads to the following messages in dmesg + > + > sd_init_command: disk=sda, block=871769260, count=26 + > sda : block=871769260 + > sda : reading 26/26 512 byte blocks. + > scsi_add_timer: scmd: f79ed980, time: 7500, (c02b1420) + > sd 0:1:0:0: send 0xf79ed980 sd 0:1:0:0: + > command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00 + > buffer = 0xf7cfb540, bufflen = 13312, done = 0xc0366b40, + > queuecommand 0xc0344010 + > leaving scsi_dispatch_cmnd() + > scsi_delete_timer: scmd: f79ed980, rtn: 1 + > sd 0:1:0:0: done 0xf79ed980 SUCCESS 0 sd 0:1:0:0: + > command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00 + > scsi host busy 1 failed 0 + > sd 0:1:0:0: Notifying upper driver of completion (result 0) + > sd_rw_intr: sda: res=0x0 + > 26 sectors total, 13312 bytes done. + > use_sg is 4 + > attempt to access beyond end of device + > sda6: rw=0, want=1044134458, limit=951401367 + > Buffer I/O error on device sda6, logical block 522067228 + > attempt to access beyond end of device + +2. When INQUIRY with EVPD bit set issued to the MegaRAID controller, + system memory gets corrupted. + Root Cause: MegaRAID F/W handle the INQUIRY with EVPD bit set + incorrectly. + Fix: MegaRAID F/W has fixed the problem and being process of release, + soon. Meanwhile, driver will filter out the request. + +3. One of member in the data structure of the driver leads unaligne + issue on 64-bit platform. + Customer reporeted "kernel unaligned access addrss" issue when + application communicates with MegaRAID HBA driver. + Root Cause: in uioc_t structure, one of member had misaligned and it + led system to display the error message. + Fix: A patch submitted to community from following folk. + + > -----Original Message----- + > From: linux-scsi-owner@vger.kernel.org + > [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Sakurai Hiroomi + > Sent: Wednesday, July 12, 2006 4:20 AM + > To: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org + > Subject: Re: Help: strange messages from kernel on IA64 platform + > + > Hi, + > + > I saw same message. + > + > When GAM(Global Array Manager) is started, The following + > message output. + > kernel: kernel unaligned access to 0xe0000001fe1080d4, + > ip=0xa000000200053371 + > + > The uioc structure used by ioctl is defined by packed, + > the allignment of each member are disturbed. + > In a 64 bit structure, the allignment of member doesn't fit 64 bit + > boundary. this causes this messages. + > In a 32 bit structure, we don't see the message because the allinment + > of member fit 32 bit boundary even if packed is specified. + > + > patch + > I Add 32 bit dummy member to fit 64 bit boundary. I tested. + > We confirmed this patch fix the problem by IA64 server. + > + > ************************************************************** + > **************** + > --- linux-2.6.9/drivers/scsi/megaraid/megaraid_ioctl.h.orig + > 2006-04-03 17:13:03.000000000 +0900 + > +++ linux-2.6.9/drivers/scsi/megaraid/megaraid_ioctl.h + > 2006-04-03 17:14:09.000000000 +0900 + > @@ -132,6 +132,10 @@ + > /* Driver Data: */ + > void __user * user_data; + > uint32_t user_data_len; + > + + > + /* 64bit alignment */ + > + uint32_t pad_0xBC; + > + + > mraid_passthru_t __user *user_pthru; + > + > mraid_passthru_t *pthru32; + > ************************************************************** + > **************** + Release Date : Mon Apr 11 12:27:22 EST 2006 - Seokmann Ju <sju@lsil.com> Current Version : 2.20.4.8 (scsi module), 2.20.2.6 (cmm module) Older Version : 2.20.4.7 (scsi module), 2.20.2.6 (cmm module) diff --git a/Documentation/scsi/ChangeLog.megaraid_sas b/Documentation/scsi/ChangeLog.megaraid_sas index 0a85a7e8120e..d9e5960dafd5 100644 --- a/Documentation/scsi/ChangeLog.megaraid_sas +++ b/Documentation/scsi/ChangeLog.megaraid_sas @@ -1,4 +1,20 @@ +1 Release Date : Sun May 14 22:49:52 PDT 2006 - Sumant Patro <Sumant.Patro@lsil.com> +2 Current Version : 00.00.03.01 +3 Older Version : 00.00.02.04 + +i. Added support for ZCR controller. + + New device id 0x413 added. + +ii. Bug fix : Disable controller interrupt before firing INIT cmd to FW. + + Interrupt is enabled after required initialization is over. + This is done to ensure that driver is ready to handle interrupts when + it is generated by the controller. + + -Sumant Patro <Sumant.Patro@lsil.com> + 1 Release Date : Wed Feb 03 14:31:44 PST 2006 - Sumant Patro <Sumant.Patro@lsil.com> 2 Current Version : 00.00.02.04 3 Older Version : 00.00.02.04 diff --git a/Documentation/scsi/aacraid.txt b/Documentation/scsi/aacraid.txt index be55670851a4..ee03678c8029 100644 --- a/Documentation/scsi/aacraid.txt +++ b/Documentation/scsi/aacraid.txt @@ -11,38 +11,43 @@ the original). Supported Cards/Chipsets ------------------------- PCI ID (pci.ids) OEM Product - 9005:0285:9005:028a Adaptec 2020ZCR (Skyhawk) - 9005:0285:9005:028e Adaptec 2020SA (Skyhawk) - 9005:0285:9005:028b Adaptec 2025ZCR (Terminator) - 9005:0285:9005:028f Adaptec 2025SA (Terminator) - 9005:0285:9005:0286 Adaptec 2120S (Crusader) - 9005:0286:9005:028d Adaptec 2130S (Lancer) + 9005:0283:9005:0283 Adaptec Catapult (3210S with arc firmware) + 9005:0284:9005:0284 Adaptec Tomcat (3410S with arc firmware) 9005:0285:9005:0285 Adaptec 2200S (Vulcan) + 9005:0285:9005:0286 Adaptec 2120S (Crusader) 9005:0285:9005:0287 Adaptec 2200S (Vulcan-2m) + 9005:0285:9005:0288 Adaptec 3230S (Harrier) + 9005:0285:9005:0289 Adaptec 3240S (Tornado) + 9005:0285:9005:028a Adaptec 2020ZCR (Skyhawk) + 9005:0285:9005:028b Adaptec 2025ZCR (Terminator) 9005:0286:9005:028c Adaptec 2230S (Lancer) 9005:0286:9005:028c Adaptec 2230SLP (Lancer) - 9005:0285:9005:0296 Adaptec 2240S (SabreExpress) + 9005:0286:9005:028d Adaptec 2130S (Lancer) + 9005:0285:9005:028e Adaptec 2020SA (Skyhawk) + 9005:0285:9005:028f Adaptec 2025SA (Terminator) 9005:0285:9005:0290 Adaptec 2410SA (Jaguar) - 9005:0285:9005:0293 Adaptec 21610SA (Corsair-16) 9005:0285:103c:3227 Adaptec 2610SA (Bearcat HP release) + 9005:0285:9005:0293 Adaptec 21610SA (Corsair-16) + 9005:0285:9005:0296 Adaptec 2240S (SabreExpress) 9005:0285:9005:0292 Adaptec 2810SA (Corsair-8) 9005:0285:9005:0294 Adaptec Prowler - 9005:0286:9005:029d Adaptec 2420SA (Intruder HP release) - 9005:0286:9005:029c Adaptec 2620SA (Intruder) - 9005:0286:9005:029b Adaptec 2820SA (Intruder) - 9005:0286:9005:02a7 Adaptec 2830SA (Skyray) - 9005:0286:9005:02a8 Adaptec 2430SA (Skyray) - 9005:0285:9005:0288 Adaptec 3230S (Harrier) - 9005:0285:9005:0289 Adaptec 3240S (Tornado) - 9005:0285:9005:0298 Adaptec 4000SAS (BlackBird) 9005:0285:9005:0297 Adaptec 4005SAS (AvonPark) + 9005:0285:9005:0298 Adaptec 4000SAS (BlackBird) 9005:0285:9005:0299 Adaptec 4800SAS (Marauder-X) 9005:0285:9005:029a Adaptec 4805SAS (Marauder-E) + 9005:0286:9005:029b Adaptec 2820SA (Intruder) + 9005:0286:9005:029c Adaptec 2620SA (Intruder) + 9005:0286:9005:029d Adaptec 2420SA (Intruder HP release) 9005:0286:9005:02a2 Adaptec 3800SAS (Hurricane44) + 9005:0286:9005:02a7 Adaptec 3805SAS (Hurricane80) + 9005:0286:9005:02a8 Adaptec 3400SAS (Hurricane40) + 9005:0286:9005:02ac Adaptec 1800SAS (Typhoon44) + 9005:0286:9005:02b3 Adaptec 2400SAS (Hurricane40lm) + 9005:0285:9005:02b5 Adaptec ASR5800 (Voodoo44) + 9005:0285:9005:02b6 Adaptec ASR5805 (Voodoo80) + 9005:0285:9005:02b7 Adaptec ASR5808 (Voodoo08) 1011:0046:9005:0364 Adaptec 5400S (Mustang) 1011:0046:9005:0365 Adaptec 5400S (Mustang) - 9005:0283:9005:0283 Adaptec Catapult (3210S with arc firmware) - 9005:0284:9005:0284 Adaptec Tomcat (3410S with arc firmware) 9005:0287:9005:0800 Adaptec Themisto (Jupiter) 9005:0200:9005:0200 Adaptec Themisto (Jupiter) 9005:0286:9005:0800 Adaptec Callisto (Jupiter) @@ -64,18 +69,20 @@ Supported Cards/Chipsets 9005:0285:9005:0290 IBM ServeRAID 7t (Jaguar) 9005:0285:1014:02F2 IBM ServeRAID 8i (AvonPark) 9005:0285:1014:0312 IBM ServeRAID 8i (AvonParkLite) - 9005:0286:1014:9580 IBM ServeRAID 8k/8k-l8 (Aurora) 9005:0286:1014:9540 IBM ServeRAID 8k/8k-l4 (AuroraLite) - 9005:0286:9005:029f ICP ICP9014R0 (Lancer) + 9005:0286:1014:9580 IBM ServeRAID 8k/8k-l8 (Aurora) + 9005:0286:1014:034d IBM ServeRAID 8s (Hurricane) 9005:0286:9005:029e ICP ICP9024R0 (Lancer) + 9005:0286:9005:029f ICP ICP9014R0 (Lancer) 9005:0286:9005:02a0 ICP ICP9047MA (Lancer) 9005:0286:9005:02a1 ICP ICP9087MA (Lancer) + 9005:0286:9005:02a3 ICP ICP5445AU (Hurricane44) 9005:0286:9005:02a4 ICP ICP9085LI (Marauder-X) 9005:0286:9005:02a5 ICP ICP5085BR (Marauder-E) - 9005:0286:9005:02a3 ICP ICP5445AU (Hurricane44) 9005:0286:9005:02a6 ICP ICP9067MA (Intruder-6) - 9005:0286:9005:02a9 ICP ICP5087AU (Skyray) - 9005:0286:9005:02aa ICP ICP5047AU (Skyray) + 9005:0286:9005:02a9 ICP ICP5085AU (Hurricane80) + 9005:0286:9005:02aa ICP ICP5045AU (Hurricane40) + 9005:0286:9005:02b4 ICP ICP5045AL (Hurricane40lm) People ------------------------- diff --git a/Documentation/scsi/arcmsr_spec.txt b/Documentation/scsi/arcmsr_spec.txt new file mode 100644 index 000000000000..5e0042340fd3 --- /dev/null +++ b/Documentation/scsi/arcmsr_spec.txt @@ -0,0 +1,574 @@ +******************************************************************************* +** ARECA FIRMWARE SPEC +******************************************************************************* +** Usage of IOP331 adapter +** (All In/Out is in IOP331's view) +** 1. Message 0 --> InitThread message and retrun code +** 2. Doorbell is used for RS-232 emulation +** inDoorBell : bit0 -- data in ready +** (DRIVER DATA WRITE OK) +** bit1 -- data out has been read +** (DRIVER DATA READ OK) +** outDooeBell: bit0 -- data out ready +** (IOP331 DATA WRITE OK) +** bit1 -- data in has been read +** (IOP331 DATA READ OK) +** 3. Index Memory Usage +** offset 0xf00 : for RS232 out (request buffer) +** offset 0xe00 : for RS232 in (scratch buffer) +** offset 0xa00 : for inbound message code message_rwbuffer +** (driver send to IOP331) +** offset 0xa00 : for outbound message code message_rwbuffer +** (IOP331 send to driver) +** 4. RS-232 emulation +** Currently 128 byte buffer is used +** 1st uint32_t : Data length (1--124) +** Byte 4--127 : Max 124 bytes of data +** 5. PostQ +** All SCSI Command must be sent through postQ: +** (inbound queue port) Request frame must be 32 bytes aligned +** #bit27--bit31 => flag for post ccb +** #bit0--bit26 => real address (bit27--bit31) of post arcmsr_cdb +** bit31 : +** 0 : 256 bytes frame +** 1 : 512 bytes frame +** bit30 : +** 0 : normal request +** 1 : BIOS request +** bit29 : reserved +** bit28 : reserved +** bit27 : reserved +** --------------------------------------------------------------------------- +** (outbount queue port) Request reply +** #bit27--bit31 +** => flag for reply +** #bit0--bit26 +** => real address (bit27--bit31) of reply arcmsr_cdb +** bit31 : must be 0 (for this type of reply) +** bit30 : reserved for BIOS handshake +** bit29 : reserved +** bit28 : +** 0 : no error, ignore AdapStatus/DevStatus/SenseData +** 1 : Error, error code in AdapStatus/DevStatus/SenseData +** bit27 : reserved +** 6. BIOS request +** All BIOS request is the same with request from PostQ +** Except : +** Request frame is sent from configuration space +** offset: 0x78 : Request Frame (bit30 == 1) +** offset: 0x18 : writeonly to generate +** IRQ to IOP331 +** Completion of request: +** (bit30 == 0, bit28==err flag) +** 7. Definition of SGL entry (structure) +** 8. Message1 Out - Diag Status Code (????) +** 9. Message0 message code : +** 0x00 : NOP +** 0x01 : Get Config +** ->offset 0xa00 :for outbound message code message_rwbuffer +** (IOP331 send to driver) +** Signature 0x87974060(4) +** Request len 0x00000200(4) +** numbers of queue 0x00000100(4) +** SDRAM Size 0x00000100(4)-->256 MB +** IDE Channels 0x00000008(4) +** vendor 40 bytes char +** model 8 bytes char +** FirmVer 16 bytes char +** Device Map 16 bytes char +** FirmwareVersion DWORD <== Added for checking of +** new firmware capability +** 0x02 : Set Config +** ->offset 0xa00 :for inbound message code message_rwbuffer +** (driver send to IOP331) +** Signature 0x87974063(4) +** UPPER32 of Request Frame (4)-->Driver Only +** 0x03 : Reset (Abort all queued Command) +** 0x04 : Stop Background Activity +** 0x05 : Flush Cache +** 0x06 : Start Background Activity +** (re-start if background is halted) +** 0x07 : Check If Host Command Pending +** (Novell May Need This Function) +** 0x08 : Set controller time +** ->offset 0xa00 : for inbound message code message_rwbuffer +** (driver to IOP331) +** byte 0 : 0xaa <-- signature +** byte 1 : 0x55 <-- signature +** byte 2 : year (04) +** byte 3 : month (1..12) +** byte 4 : date (1..31) +** byte 5 : hour (0..23) +** byte 6 : minute (0..59) +** byte 7 : second (0..59) +******************************************************************************* +******************************************************************************* +** RS-232 Interface for Areca Raid Controller +** The low level command interface is exclusive with VT100 terminal +** -------------------------------------------------------------------- +** 1. Sequence of command execution +** -------------------------------------------------------------------- +** (A) Header : 3 bytes sequence (0x5E, 0x01, 0x61) +** (B) Command block : variable length of data including length, +** command code, data and checksum byte +** (C) Return data : variable length of data +** -------------------------------------------------------------------- +** 2. Command block +** -------------------------------------------------------------------- +** (A) 1st byte : command block length (low byte) +** (B) 2nd byte : command block length (high byte) +** note ..command block length shouldn't > 2040 bytes, +** length excludes these two bytes +** (C) 3rd byte : command code +** (D) 4th and following bytes : variable length data bytes +** depends on command code +** (E) last byte : checksum byte (sum of 1st byte until last data byte) +** -------------------------------------------------------------------- +** 3. Command code and associated data +** -------------------------------------------------------------------- +** The following are command code defined in raid controller Command +** code 0x10--0x1? are used for system level management, +** no password checking is needed and should be implemented in separate +** well controlled utility and not for end user access. +** Command code 0x20--0x?? always check the password, +** password must be entered to enable these command. +** enum +** { +** GUI_SET_SERIAL=0x10, +** GUI_SET_VENDOR, +** GUI_SET_MODEL, +** GUI_IDENTIFY, +** GUI_CHECK_PASSWORD, +** GUI_LOGOUT, +** GUI_HTTP, +** GUI_SET_ETHERNET_ADDR, +** GUI_SET_LOGO, +** GUI_POLL_EVENT, +** GUI_GET_EVENT, +** GUI_GET_HW_MONITOR, +** // GUI_QUICK_CREATE=0x20, (function removed) +** GUI_GET_INFO_R=0x20, +** GUI_GET_INFO_V, +** GUI_GET_INFO_P, +** GUI_GET_INFO_S, +** GUI_CLEAR_EVENT, +** GUI_MUTE_BEEPER=0x30, +** GUI_BEEPER_SETTING, +** GUI_SET_PASSWORD, +** GUI_HOST_INTERFACE_MODE, +** GUI_REBUILD_PRIORITY, +** GUI_MAX_ATA_MODE, +** GUI_RESET_CONTROLLER, +** GUI_COM_PORT_SETTING, +** GUI_NO_OPERATION, +** GUI_DHCP_IP, +** GUI_CREATE_PASS_THROUGH=0x40, +** GUI_MODIFY_PASS_THROUGH, +** GUI_DELETE_PASS_THROUGH, +** GUI_IDENTIFY_DEVICE, +** GUI_CREATE_RAIDSET=0x50, +** GUI_DELETE_RAIDSET, +** GUI_EXPAND_RAIDSET, +** GUI_ACTIVATE_RAIDSET, +** GUI_CREATE_HOT_SPARE, +** GUI_DELETE_HOT_SPARE, +** GUI_CREATE_VOLUME=0x60, +** GUI_MODIFY_VOLUME, +** GUI_DELETE_VOLUME, +** GUI_START_CHECK_VOLUME, +** GUI_STOP_CHECK_VOLUME +** }; +** Command description : +** GUI_SET_SERIAL : Set the controller serial# +** byte 0,1 : length +** byte 2 : command code 0x10 +** byte 3 : password length (should be 0x0f) +** byte 4-0x13 : should be "ArEcATecHnoLogY" +** byte 0x14--0x23 : Serial number string (must be 16 bytes) +** GUI_SET_VENDOR : Set vendor string for the controller +** byte 0,1 : length +** byte 2 : command code 0x11 +** byte 3 : password length (should be 0x08) +** byte 4-0x13 : should be "ArEcAvAr" +** byte 0x14--0x3B : vendor string (must be 40 bytes) +** GUI_SET_MODEL : Set the model name of the controller +** byte 0,1 : length +** byte 2 : command code 0x12 +** byte 3 : password length (should be 0x08) +** byte 4-0x13 : should be "ArEcAvAr" +** byte 0x14--0x1B : model string (must be 8 bytes) +** GUI_IDENTIFY : Identify device +** byte 0,1 : length +** byte 2 : command code 0x13 +** return "Areca RAID Subsystem " +** GUI_CHECK_PASSWORD : Verify password +** byte 0,1 : length +** byte 2 : command code 0x14 +** byte 3 : password length +** byte 4-0x?? : user password to be checked +** GUI_LOGOUT : Logout GUI (force password checking on next command) +** byte 0,1 : length +** byte 2 : command code 0x15 +** GUI_HTTP : HTTP interface (reserved for Http proxy service)(0x16) +** +** GUI_SET_ETHERNET_ADDR : Set the ethernet MAC address +** byte 0,1 : length +** byte 2 : command code 0x17 +** byte 3 : password length (should be 0x08) +** byte 4-0x13 : should be "ArEcAvAr" +** byte 0x14--0x19 : Ethernet MAC address (must be 6 bytes) +** GUI_SET_LOGO : Set logo in HTTP +** byte 0,1 : length +** byte 2 : command code 0x18 +** byte 3 : Page# (0/1/2/3) (0xff --> clear OEM logo) +** byte 4/5/6/7 : 0x55/0xaa/0xa5/0x5a +** byte 8 : TITLE.JPG data (each page must be 2000 bytes) +** note page0 1st 2 byte must be +** actual length of the JPG file +** GUI_POLL_EVENT : Poll If Event Log Changed +** byte 0,1 : length +** byte 2 : command code 0x19 +** GUI_GET_EVENT : Read Event +** byte 0,1 : length +** byte 2 : command code 0x1a +** byte 3 : Event Page (0:1st page/1/2/3:last page) +** GUI_GET_HW_MONITOR : Get HW monitor data +** byte 0,1 : length +** byte 2 : command code 0x1b +** byte 3 : # of FANs(example 2) +** byte 4 : # of Voltage sensor(example 3) +** byte 5 : # of temperature sensor(example 2) +** byte 6 : # of power +** byte 7/8 : Fan#0 (RPM) +** byte 9/10 : Fan#1 +** byte 11/12 : Voltage#0 original value in *1000 +** byte 13/14 : Voltage#0 value +** byte 15/16 : Voltage#1 org +** byte 17/18 : Voltage#1 +** byte 19/20 : Voltage#2 org +** byte 21/22 : Voltage#2 +** byte 23 : Temp#0 +** byte 24 : Temp#1 +** byte 25 : Power indicator (bit0 : power#0, +** bit1 : power#1) +** byte 26 : UPS indicator +** GUI_QUICK_CREATE : Quick create raid/volume set +** byte 0,1 : length +** byte 2 : command code 0x20 +** byte 3/4/5/6 : raw capacity +** byte 7 : raid level +** byte 8 : stripe size +** byte 9 : spare +** byte 10/11/12/13: device mask (the devices to create raid/volume) +** This function is removed, application like +** to implement quick create function +** need to use GUI_CREATE_RAIDSET and GUI_CREATE_VOLUMESET function. +** GUI_GET_INFO_R : Get Raid Set Information +** byte 0,1 : length +** byte 2 : command code 0x20 +** byte 3 : raidset# +** typedef struct sGUI_RAIDSET +** { +** BYTE grsRaidSetName[16]; +** DWORD grsCapacity; +** DWORD grsCapacityX; +** DWORD grsFailMask; +** BYTE grsDevArray[32]; +** BYTE grsMemberDevices; +** BYTE grsNewMemberDevices; +** BYTE grsRaidState; +** BYTE grsVolumes; +** BYTE grsVolumeList[16]; +** BYTE grsRes1; +** BYTE grsRes2; +** BYTE grsRes3; +** BYTE grsFreeSegments; +** DWORD grsRawStripes[8]; +** DWORD grsRes4; +** DWORD grsRes5; // Total to 128 bytes +** DWORD grsRes6; // Total to 128 bytes +** } sGUI_RAIDSET, *pGUI_RAIDSET; +** GUI_GET_INFO_V : Get Volume Set Information +** byte 0,1 : length +** byte 2 : command code 0x21 +** byte 3 : volumeset# +** typedef struct sGUI_VOLUMESET +** { +** BYTE gvsVolumeName[16]; // 16 +** DWORD gvsCapacity; +** DWORD gvsCapacityX; +** DWORD gvsFailMask; +** DWORD gvsStripeSize; +** DWORD gvsNewFailMask; +** DWORD gvsNewStripeSize; +** DWORD gvsVolumeStatus; +** DWORD gvsProgress; // 32 +** sSCSI_ATTR gvsScsi; +** BYTE gvsMemberDisks; +** BYTE gvsRaidLevel; // 8 +** BYTE gvsNewMemberDisks; +** BYTE gvsNewRaidLevel; +** BYTE gvsRaidSetNumber; +** BYTE gvsRes0; // 4 +** BYTE gvsRes1[4]; // 64 bytes +** } sGUI_VOLUMESET, *pGUI_VOLUMESET; +** GUI_GET_INFO_P : Get Physical Drive Information +** byte 0,1 : length +** byte 2 : command code 0x22 +** byte 3 : drive # (from 0 to max-channels - 1) +** typedef struct sGUI_PHY_DRV +** { +** BYTE gpdModelName[40]; +** BYTE gpdSerialNumber[20]; +** BYTE gpdFirmRev[8]; +** DWORD gpdCapacity; +** DWORD gpdCapacityX; // Reserved for expansion +** BYTE gpdDeviceState; +** BYTE gpdPioMode; +** BYTE gpdCurrentUdmaMode; +** BYTE gpdUdmaMode; +** BYTE gpdDriveSelect; +** BYTE gpdRaidNumber; // 0xff if not belongs to a raid set +** sSCSI_ATTR gpdScsi; +** BYTE gpdReserved[40]; // Total to 128 bytes +** } sGUI_PHY_DRV, *pGUI_PHY_DRV; +** GUI_GET_INFO_S : Get System Information +** byte 0,1 : length +** byte 2 : command code 0x23 +** typedef struct sCOM_ATTR +** { +** BYTE comBaudRate; +** BYTE comDataBits; +** BYTE comStopBits; +** BYTE comParity; +** BYTE comFlowControl; +** } sCOM_ATTR, *pCOM_ATTR; +** typedef struct sSYSTEM_INFO +** { +** BYTE gsiVendorName[40]; +** BYTE gsiSerialNumber[16]; +** BYTE gsiFirmVersion[16]; +** BYTE gsiBootVersion[16]; +** BYTE gsiMbVersion[16]; +** BYTE gsiModelName[8]; +** BYTE gsiLocalIp[4]; +** BYTE gsiCurrentIp[4]; +** DWORD gsiTimeTick; +** DWORD gsiCpuSpeed; +** DWORD gsiICache; +** DWORD gsiDCache; +** DWORD gsiScache; +** DWORD gsiMemorySize; +** DWORD gsiMemorySpeed; +** DWORD gsiEvents; +** BYTE gsiMacAddress[6]; +** BYTE gsiDhcp; +** BYTE gsiBeeper; +** BYTE gsiChannelUsage; +** BYTE gsiMaxAtaMode; +** BYTE gsiSdramEcc; // 1:if ECC enabled +** BYTE gsiRebuildPriority; +** sCOM_ATTR gsiComA; // 5 bytes +** sCOM_ATTR gsiComB; // 5 bytes +** BYTE gsiIdeChannels; +** BYTE gsiScsiHostChannels; +** BYTE gsiIdeHostChannels; +** BYTE gsiMaxVolumeSet; +** BYTE gsiMaxRaidSet; +** BYTE gsiEtherPort; // 1:if ether net port supported +** BYTE gsiRaid6Engine; // 1:Raid6 engine supported +** BYTE gsiRes[75]; +** } sSYSTEM_INFO, *pSYSTEM_INFO; +** GUI_CLEAR_EVENT : Clear System Event +** byte 0,1 : length +** byte 2 : command code 0x24 +** GUI_MUTE_BEEPER : Mute current beeper +** byte 0,1 : length +** byte 2 : command code 0x30 +** GUI_BEEPER_SETTING : Disable beeper +** byte 0,1 : length +** byte 2 : command code 0x31 +** byte 3 : 0->disable, 1->enable +** GUI_SET_PASSWORD : Change password +** byte 0,1 : length +** byte 2 : command code 0x32 +** byte 3 : pass word length ( must <= 15 ) +** byte 4 : password (must be alpha-numerical) +** GUI_HOST_INTERFACE_MODE : Set host interface mode +** byte 0,1 : length +** byte 2 : command code 0x33 +** byte 3 : 0->Independent, 1->cluster +** GUI_REBUILD_PRIORITY : Set rebuild priority +** byte 0,1 : length +** byte 2 : command code 0x34 +** byte 3 : 0/1/2/3 (low->high) +** GUI_MAX_ATA_MODE : Set maximum ATA mode to be used +** byte 0,1 : length +** byte 2 : command code 0x35 +** byte 3 : 0/1/2/3 (133/100/66/33) +** GUI_RESET_CONTROLLER : Reset Controller +** byte 0,1 : length +** byte 2 : command code 0x36 +** *Response with VT100 screen (discard it) +** GUI_COM_PORT_SETTING : COM port setting +** byte 0,1 : length +** byte 2 : command code 0x37 +** byte 3 : 0->COMA (term port), +** 1->COMB (debug port) +** byte 4 : 0/1/2/3/4/5/6/7 +** (1200/2400/4800/9600/19200/38400/57600/115200) +** byte 5 : data bit +** (0:7 bit, 1:8 bit : must be 8 bit) +** byte 6 : stop bit (0:1, 1:2 stop bits) +** byte 7 : parity (0:none, 1:off, 2:even) +** byte 8 : flow control +** (0:none, 1:xon/xoff, 2:hardware => must use none) +** GUI_NO_OPERATION : No operation +** byte 0,1 : length +** byte 2 : command code 0x38 +** GUI_DHCP_IP : Set DHCP option and local IP address +** byte 0,1 : length +** byte 2 : command code 0x39 +** byte 3 : 0:dhcp disabled, 1:dhcp enabled +** byte 4/5/6/7 : IP address +** GUI_CREATE_PASS_THROUGH : Create pass through disk +** byte 0,1 : length +** byte 2 : command code 0x40 +** byte 3 : device # +** byte 4 : scsi channel (0/1) +** byte 5 : scsi id (0-->15) +** byte 6 : scsi lun (0-->7) +** byte 7 : tagged queue (1 : enabled) +** byte 8 : cache mode (1 : enabled) +** byte 9 : max speed (0/1/2/3/4, +** async/20/40/80/160 for scsi) +** (0/1/2/3/4, 33/66/100/133/150 for ide ) +** GUI_MODIFY_PASS_THROUGH : Modify pass through disk +** byte 0,1 : length +** byte 2 : command code 0x41 +** byte 3 : device # +** byte 4 : scsi channel (0/1) +** byte 5 : scsi id (0-->15) +** byte 6 : scsi lun (0-->7) +** byte 7 : tagged queue (1 : enabled) +** byte 8 : cache mode (1 : enabled) +** byte 9 : max speed (0/1/2/3/4, +** async/20/40/80/160 for scsi) +** (0/1/2/3/4, 33/66/100/133/150 for ide ) +** GUI_DELETE_PASS_THROUGH : Delete pass through disk +** byte 0,1 : length +** byte 2 : command code 0x42 +** byte 3 : device# to be deleted +** GUI_IDENTIFY_DEVICE : Identify Device +** byte 0,1 : length +** byte 2 : command code 0x43 +** byte 3 : Flash Method +** (0:flash selected, 1:flash not selected) +** byte 4/5/6/7 : IDE device mask to be flashed +** note .... no response data available +** GUI_CREATE_RAIDSET : Create Raid Set +** byte 0,1 : length +** byte 2 : command code 0x50 +** byte 3/4/5/6 : device mask +** byte 7-22 : raidset name (if byte 7 == 0:use default) +** GUI_DELETE_RAIDSET : Delete Raid Set +** byte 0,1 : length +** byte 2 : command code 0x51 +** byte 3 : raidset# +** GUI_EXPAND_RAIDSET : Expand Raid Set +** byte 0,1 : length +** byte 2 : command code 0x52 +** byte 3 : raidset# +** byte 4/5/6/7 : device mask for expansion +** byte 8/9/10 : (8:0 no change, 1 change, 0xff:terminate, +** 9:new raid level, +** 10:new stripe size +** 0/1/2/3/4/5->4/8/16/32/64/128K ) +** byte 11/12/13 : repeat for each volume in the raidset +** GUI_ACTIVATE_RAIDSET : Activate incomplete raid set +** byte 0,1 : length +** byte 2 : command code 0x53 +** byte 3 : raidset# +** GUI_CREATE_HOT_SPARE : Create hot spare disk +** byte 0,1 : length +** byte 2 : command code 0x54 +** byte 3/4/5/6 : device mask for hot spare creation +** GUI_DELETE_HOT_SPARE : Delete hot spare disk +** byte 0,1 : length +** byte 2 : command code 0x55 +** byte 3/4/5/6 : device mask for hot spare deletion +** GUI_CREATE_VOLUME : Create volume set +** byte 0,1 : length +** byte 2 : command code 0x60 +** byte 3 : raidset# +** byte 4-19 : volume set name +** (if byte4 == 0, use default) +** byte 20-27 : volume capacity (blocks) +** byte 28 : raid level +** byte 29 : stripe size +** (0/1/2/3/4/5->4/8/16/32/64/128K) +** byte 30 : channel +** byte 31 : ID +** byte 32 : LUN +** byte 33 : 1 enable tag +** byte 34 : 1 enable cache +** byte 35 : speed +** (0/1/2/3/4->async/20/40/80/160 for scsi) +** (0/1/2/3/4->33/66/100/133/150 for IDE ) +** byte 36 : 1 to select quick init +** +** GUI_MODIFY_VOLUME : Modify volume Set +** byte 0,1 : length +** byte 2 : command code 0x61 +** byte 3 : volumeset# +** byte 4-19 : new volume set name +** (if byte4 == 0, not change) +** byte 20-27 : new volume capacity (reserved) +** byte 28 : new raid level +** byte 29 : new stripe size +** (0/1/2/3/4/5->4/8/16/32/64/128K) +** byte 30 : new channel +** byte 31 : new ID +** byte 32 : new LUN +** byte 33 : 1 enable tag +** byte 34 : 1 enable cache +** byte 35 : speed +** (0/1/2/3/4->async/20/40/80/160 for scsi) +** (0/1/2/3/4->33/66/100/133/150 for IDE ) +** GUI_DELETE_VOLUME : Delete volume set +** byte 0,1 : length +** byte 2 : command code 0x62 +** byte 3 : volumeset# +** GUI_START_CHECK_VOLUME : Start volume consistency check +** byte 0,1 : length +** byte 2 : command code 0x63 +** byte 3 : volumeset# +** GUI_STOP_CHECK_VOLUME : Stop volume consistency check +** byte 0,1 : length +** byte 2 : command code 0x64 +** --------------------------------------------------------------------- +** 4. Returned data +** --------------------------------------------------------------------- +** (A) Header : 3 bytes sequence (0x5E, 0x01, 0x61) +** (B) Length : 2 bytes +** (low byte 1st, excludes length and checksum byte) +** (C) status or data : +** <1> If length == 1 ==> 1 byte status code +** #define GUI_OK 0x41 +** #define GUI_RAIDSET_NOT_NORMAL 0x42 +** #define GUI_VOLUMESET_NOT_NORMAL 0x43 +** #define GUI_NO_RAIDSET 0x44 +** #define GUI_NO_VOLUMESET 0x45 +** #define GUI_NO_PHYSICAL_DRIVE 0x46 +** #define GUI_PARAMETER_ERROR 0x47 +** #define GUI_UNSUPPORTED_COMMAND 0x48 +** #define GUI_DISK_CONFIG_CHANGED 0x49 +** #define GUI_INVALID_PASSWORD 0x4a +** #define GUI_NO_DISK_SPACE 0x4b +** #define GUI_CHECKSUM_ERROR 0x4c +** #define GUI_PASSWORD_REQUIRED 0x4d +** <2> If length > 1 ==> +** data block returned from controller +** and the contents depends on the command code +** (E) Checksum : checksum of length and status or data byte +************************************************************************** diff --git a/Documentation/scsi/libsas.txt b/Documentation/scsi/libsas.txt new file mode 100644 index 000000000000..9e2078b2a615 --- /dev/null +++ b/Documentation/scsi/libsas.txt @@ -0,0 +1,484 @@ +SAS Layer +--------- + +The SAS Layer is a management infrastructure which manages +SAS LLDDs. It sits between SCSI Core and SAS LLDDs. The +layout is as follows: while SCSI Core is concerned with +SAM/SPC issues, and a SAS LLDD+sequencer is concerned with +phy/OOB/link management, the SAS layer is concerned with: + + * SAS Phy/Port/HA event management (LLDD generates, + SAS Layer processes), + * SAS Port management (creation/destruction), + * SAS Domain discovery and revalidation, + * SAS Domain device management, + * SCSI Host registration/unregistration, + * Device registration with SCSI Core (SAS) or libata + (SATA), and + * Expander management and exporting expander control + to user space. + +A SAS LLDD is a PCI device driver. It is concerned with +phy/OOB management, and vendor specific tasks and generates +events to the SAS layer. + +The SAS Layer does most SAS tasks as outlined in the SAS 1.1 +spec. + +The sas_ha_struct describes the SAS LLDD to the SAS layer. +Most of it is used by the SAS Layer but a few fields need to +be initialized by the LLDDs. + +After initializing your hardware, from the probe() function +you call sas_register_ha(). It will register your LLDD with +the SCSI subsystem, creating a SCSI host and it will +register your SAS driver with the sysfs SAS tree it creates. +It will then return. Then you enable your phys to actually +start OOB (at which point your driver will start calling the +notify_* event callbacks). + +Structure descriptions: + +struct sas_phy -------------------- +Normally this is statically embedded to your driver's +phy structure: + struct my_phy { + blah; + struct sas_phy sas_phy; + bleh; + }; +And then all the phys are an array of my_phy in your HA +struct (shown below). + +Then as you go along and initialize your phys you also +initialize the sas_phy struct, along with your own +phy structure. + +In general, the phys are managed by the LLDD and the ports +are managed by the SAS layer. So the phys are initialized +and updated by the LLDD and the ports are initialized and +updated by the SAS layer. + +There is a scheme where the LLDD can RW certain fields, +and the SAS layer can only read such ones, and vice versa. +The idea is to avoid unnecessary locking. + +enabled -- must be set (0/1) +id -- must be set [0,MAX_PHYS) +class, proto, type, role, oob_mode, linkrate -- must be set +oob_mode -- you set this when OOB has finished and then notify +the SAS Layer. + +sas_addr -- this normally points to an array holding the sas +address of the phy, possibly somewhere in your my_phy +struct. + +attached_sas_addr -- set this when you (LLDD) receive an +IDENTIFY frame or a FIS frame, _before_ notifying the SAS +layer. The idea is that sometimes the LLDD may want to fake +or provide a different SAS address on that phy/port and this +allows it to do this. At best you should copy the sas +address from the IDENTIFY frame or maybe generate a SAS +address for SATA directly attached devices. The Discover +process may later change this. + +frame_rcvd -- this is where you copy the IDENTIFY/FIS frame +when you get it; you lock, copy, set frame_rcvd_size and +unlock the lock, and then call the event. It is a pointer +since there's no way to know your hw frame size _exactly_, +so you define the actual array in your phy struct and let +this pointer point to it. You copy the frame from your +DMAable memory to that area holding the lock. + +sas_prim -- this is where primitives go when they're +received. See sas.h. Grab the lock, set the primitive, +release the lock, notify. + +port -- this points to the sas_port if the phy belongs +to a port -- the LLDD only reads this. It points to the +sas_port this phy is part of. Set by the SAS Layer. + +ha -- may be set; the SAS layer sets it anyway. + +lldd_phy -- you should set this to point to your phy so you +can find your way around faster when the SAS layer calls one +of your callbacks and passes you a phy. If the sas_phy is +embedded you can also use container_of -- whatever you +prefer. + + +struct sas_port -------------------- +The LLDD doesn't set any fields of this struct -- it only +reads them. They should be self explanatory. + +phy_mask is 32 bit, this should be enough for now, as I +haven't heard of a HA having more than 8 phys. + +lldd_port -- I haven't found use for that -- maybe other +LLDD who wish to have internal port representation can make +use of this. + + +struct sas_ha_struct -------------------- +It normally is statically declared in your own LLDD +structure describing your adapter: +struct my_sas_ha { + blah; + struct sas_ha_struct sas_ha; + struct my_phy phys[MAX_PHYS]; + struct sas_port sas_ports[MAX_PHYS]; /* (1) */ + bleh; +}; + +(1) If your LLDD doesn't have its own port representation. + +What needs to be initialized (sample function given below). + +pcidev +sas_addr -- since the SAS layer doesn't want to mess with + memory allocation, etc, this points to statically + allocated array somewhere (say in your host adapter + structure) and holds the SAS address of the host + adapter as given by you or the manufacturer, etc. +sas_port +sas_phy -- an array of pointers to structures. (see + note above on sas_addr). + These must be set. See more notes below. +num_phys -- the number of phys present in the sas_phy array, + and the number of ports present in the sas_port + array. There can be a maximum num_phys ports (one per + port) so we drop the num_ports, and only use + num_phys. + +The event interface: + + /* LLDD calls these to notify the class of an event. */ + void (*notify_ha_event)(struct sas_ha_struct *, enum ha_event); + void (*notify_port_event)(struct sas_phy *, enum port_event); + void (*notify_phy_event)(struct sas_phy *, enum phy_event); + +When sas_register_ha() returns, those are set and can be +called by the LLDD to notify the SAS layer of such events +the SAS layer. + +The port notification: + + /* The class calls these to notify the LLDD of an event. */ + void (*lldd_port_formed)(struct sas_phy *); + void (*lldd_port_deformed)(struct sas_phy *); + +If the LLDD wants notification when a port has been formed +or deformed it sets those to a function satisfying the type. + +A SAS LLDD should also implement at least one of the Task +Management Functions (TMFs) described in SAM: + + /* Task Management Functions. Must be called from process context. */ + int (*lldd_abort_task)(struct sas_task *); + int (*lldd_abort_task_set)(struct domain_device *, u8 *lun); + int (*lldd_clear_aca)(struct domain_device *, u8 *lun); + int (*lldd_clear_task_set)(struct domain_device *, u8 *lun); + int (*lldd_I_T_nexus_reset)(struct domain_device *); + int (*lldd_lu_reset)(struct domain_device *, u8 *lun); + int (*lldd_query_task)(struct sas_task *); + +For more information please read SAM from T10.org. + +Port and Adapter management: + + /* Port and Adapter management */ + int (*lldd_clear_nexus_port)(struct sas_port *); + int (*lldd_clear_nexus_ha)(struct sas_ha_struct *); + +A SAS LLDD should implement at least one of those. + +Phy management: + + /* Phy management */ + int (*lldd_control_phy)(struct sas_phy *, enum phy_func); + +lldd_ha -- set this to point to your HA struct. You can also +use container_of if you embedded it as shown above. + +A sample initialization and registration function +can look like this (called last thing from probe()) +*but* before you enable the phys to do OOB: + +static int register_sas_ha(struct my_sas_ha *my_ha) +{ + int i; + static struct sas_phy *sas_phys[MAX_PHYS]; + static struct sas_port *sas_ports[MAX_PHYS]; + + my_ha->sas_ha.sas_addr = &my_ha->sas_addr[0]; + + for (i = 0; i < MAX_PHYS; i++) { + sas_phys[i] = &my_ha->phys[i].sas_phy; + sas_ports[i] = &my_ha->sas_ports[i]; + } + + my_ha->sas_ha.sas_phy = sas_phys; + my_ha->sas_ha.sas_port = sas_ports; + my_ha->sas_ha.num_phys = MAX_PHYS; + + my_ha->sas_ha.lldd_port_formed = my_port_formed; + + my_ha->sas_ha.lldd_dev_found = my_dev_found; + my_ha->sas_ha.lldd_dev_gone = my_dev_gone; + + my_ha->sas_ha.lldd_max_execute_num = lldd_max_execute_num; (1) + + my_ha->sas_ha.lldd_queue_size = ha_can_queue; + my_ha->sas_ha.lldd_execute_task = my_execute_task; + + my_ha->sas_ha.lldd_abort_task = my_abort_task; + my_ha->sas_ha.lldd_abort_task_set = my_abort_task_set; + my_ha->sas_ha.lldd_clear_aca = my_clear_aca; + my_ha->sas_ha.lldd_clear_task_set = my_clear_task_set; + my_ha->sas_ha.lldd_I_T_nexus_reset= NULL; (2) + my_ha->sas_ha.lldd_lu_reset = my_lu_reset; + my_ha->sas_ha.lldd_query_task = my_query_task; + + my_ha->sas_ha.lldd_clear_nexus_port = my_clear_nexus_port; + my_ha->sas_ha.lldd_clear_nexus_ha = my_clear_nexus_ha; + + my_ha->sas_ha.lldd_control_phy = my_control_phy; + + return sas_register_ha(&my_ha->sas_ha); +} + +(1) This is normally a LLDD parameter, something of the +lines of a task collector. What it tells the SAS Layer is +whether the SAS layer should run in Direct Mode (default: +value 0 or 1) or Task Collector Mode (value greater than 1). + +In Direct Mode, the SAS Layer calls Execute Task as soon as +it has a command to send to the SDS, _and_ this is a single +command, i.e. not linked. + +Some hardware (e.g. aic94xx) has the capability to DMA more +than one task at a time (interrupt) from host memory. Task +Collector Mode is an optional feature for HAs which support +this in their hardware. (Again, it is completely optional +even if your hardware supports it.) + +In Task Collector Mode, the SAS Layer would do _natural_ +coalescing of tasks and at the appropriate moment it would +call your driver to DMA more than one task in a single HA +interrupt. DMBS may want to use this by insmod/modprobe +setting the lldd_max_execute_num to something greater than +1. + +(2) SAS 1.1 does not define I_T Nexus Reset TMF. + +Events +------ + +Events are _the only way_ a SAS LLDD notifies the SAS layer +of anything. There is no other method or way a LLDD to tell +the SAS layer of anything happening internally or in the SAS +domain. + +Phy events: + PHYE_LOSS_OF_SIGNAL, (C) + PHYE_OOB_DONE, + PHYE_OOB_ERROR, (C) + PHYE_SPINUP_HOLD. + +Port events, passed on a _phy_: + PORTE_BYTES_DMAED, (M) + PORTE_BROADCAST_RCVD, (E) + PORTE_LINK_RESET_ERR, (C) + PORTE_TIMER_EVENT, (C) + PORTE_HARD_RESET. + +Host Adapter event: + HAE_RESET + +A SAS LLDD should be able to generate + - at least one event from group C (choice), + - events marked M (mandatory) are mandatory (only one), + - events marked E (expander) if it wants the SAS layer + to handle domain revalidation (only one such). + - Unmarked events are optional. + +Meaning: + +HAE_RESET -- when your HA got internal error and was reset. + +PORTE_BYTES_DMAED -- on receiving an IDENTIFY/FIS frame +PORTE_BROADCAST_RCVD -- on receiving a primitive +PORTE_LINK_RESET_ERR -- timer expired, loss of signal, loss +of DWS, etc. (*) +PORTE_TIMER_EVENT -- DWS reset timeout timer expired (*) +PORTE_HARD_RESET -- Hard Reset primitive received. + +PHYE_LOSS_OF_SIGNAL -- the device is gone (*) +PHYE_OOB_DONE -- OOB went fine and oob_mode is valid +PHYE_OOB_ERROR -- Error while doing OOB, the device probably +got disconnected. (*) +PHYE_SPINUP_HOLD -- SATA is present, COMWAKE not sent. + +(*) should set/clear the appropriate fields in the phy, + or alternatively call the inlined sas_phy_disconnected() + which is just a helper, from their tasklet. + +The Execute Command SCSI RPC: + + int (*lldd_execute_task)(struct sas_task *, int num, + unsigned long gfp_flags); + +Used to queue a task to the SAS LLDD. @task is the tasks to +be executed. @num should be the number of tasks being +queued at this function call (they are linked listed via +task::list), @gfp_mask should be the gfp_mask defining the +context of the caller. + +This function should implement the Execute Command SCSI RPC, +or if you're sending a SCSI Task as linked commands, you +should also use this function. + +That is, when lldd_execute_task() is called, the command(s) +go out on the transport *immediately*. There is *no* +queuing of any sort and at any level in a SAS LLDD. + +The use of task::list is two-fold, one for linked commands, +the other discussed below. + +It is possible to queue up more than one task at a time, by +initializing the list element of struct sas_task, and +passing the number of tasks enlisted in this manner in num. + +Returns: -SAS_QUEUE_FULL, -ENOMEM, nothing was queued; + 0, the task(s) were queued. + +If you want to pass num > 1, then either +A) you're the only caller of this function and keep track + of what you've queued to the LLDD, or +B) you know what you're doing and have a strategy of + retrying. + +As opposed to queuing one task at a time (function call), +batch queuing of tasks, by having num > 1, greatly +simplifies LLDD code, sequencer code, and _hardware design_, +and has some performance advantages in certain situations +(DBMS). + +The LLDD advertises if it can take more than one command at +a time at lldd_execute_task(), by setting the +lldd_max_execute_num parameter (controlled by "collector" +module parameter in aic94xx SAS LLDD). + +You should leave this to the default 1, unless you know what +you're doing. + +This is a function of the LLDD, to which the SAS layer can +cater to. + +int lldd_queue_size + The host adapter's queue size. This is the maximum +number of commands the lldd can have pending to domain +devices on behalf of all upper layers submitting through +lldd_execute_task(). + +You really want to set this to something (much) larger than +1. + +This _really_ has absolutely nothing to do with queuing. +There is no queuing in SAS LLDDs. + +struct sas_task { + dev -- the device this task is destined to + list -- must be initialized (INIT_LIST_HEAD) + task_proto -- _one_ of enum sas_proto + scatter -- pointer to scatter gather list array + num_scatter -- number of elements in scatter + total_xfer_len -- total number of bytes expected to be transfered + data_dir -- PCI_DMA_... + task_done -- callback when the task has finished execution +}; + +When an external entity, entity other than the LLDD or the +SAS Layer, wants to work with a struct domain_device, it +_must_ call kobject_get() when getting a handle on the +device and kobject_put() when it is done with the device. + +This does two things: + A) implements proper kfree() for the device; + B) increments/decrements the kref for all players: + domain_device + all domain_device's ... (if past an expander) + port + host adapter + pci device + and up the ladder, etc. + +DISCOVERY +--------- + +The sysfs tree has the following purposes: + a) It shows you the physical layout of the SAS domain at + the current time, i.e. how the domain looks in the + physical world right now. + b) Shows some device parameters _at_discovery_time_. + +This is a link to the tree(1) program, very useful in +viewing the SAS domain: +ftp://mama.indstate.edu/linux/tree/ +I expect user space applications to actually create a +graphical interface of this. + +That is, the sysfs domain tree doesn't show or keep state if +you e.g., change the meaning of the READY LED MEANING +setting, but it does show you the current connection status +of the domain device. + +Keeping internal device state changes is responsibility of +upper layers (Command set drivers) and user space. + +When a device or devices are unplugged from the domain, this +is reflected in the sysfs tree immediately, and the device(s) +removed from the system. + +The structure domain_device describes any device in the SAS +domain. It is completely managed by the SAS layer. A task +points to a domain device, this is how the SAS LLDD knows +where to send the task(s) to. A SAS LLDD only reads the +contents of the domain_device structure, but it never creates +or destroys one. + +Expander management from User Space +----------------------------------- + +In each expander directory in sysfs, there is a file called +"smp_portal". It is a binary sysfs attribute file, which +implements an SMP portal (Note: this is *NOT* an SMP port), +to which user space applications can send SMP requests and +receive SMP responses. + +Functionality is deceptively simple: + +1. Build the SMP frame you want to send. The format and layout + is described in the SAS spec. Leave the CRC field equal 0. +open(2) +2. Open the expander's SMP portal sysfs file in RW mode. +write(2) +3. Write the frame you built in 1. +read(2) +4. Read the amount of data you expect to receive for the frame you built. + If you receive different amount of data you expected to receive, + then there was some kind of error. +close(2) +All this process is shown in detail in the function do_smp_func() +and its callers, in the file "expander_conf.c". + +The kernel functionality is implemented in the file +"sas_expander.c". + +The program "expander_conf.c" implements this. It takes one +argument, the sysfs file name of the SMP portal to the +expander, and gives expander information, including routing +tables. + +The SMP portal gives you complete control of the expander, +so please be careful. diff --git a/Documentation/scsi/ppa.txt b/Documentation/scsi/ppa.txt index 0dac88d86d87..5d9223bc1bd5 100644 --- a/Documentation/scsi/ppa.txt +++ b/Documentation/scsi/ppa.txt @@ -12,5 +12,3 @@ http://www.torque.net/parport/ Email list for Linux Parport linux-parport@torque.net -Email for problems with ZIP or ZIP Plus drivers -campbell@torque.net diff --git a/Documentation/scsi/tmscsim.txt b/Documentation/scsi/tmscsim.txt index e165229adf50..df7a02bfb5bf 100644 --- a/Documentation/scsi/tmscsim.txt +++ b/Documentation/scsi/tmscsim.txt @@ -109,7 +109,7 @@ than the 33.33 MHz being in the PCI spec. If you want to share the IRQ with another device and the driver refuses to do so, you might succeed with changing the DC390_IRQ type in tmscsim.c to -SA_SHIRQ | SA_INTERRUPT. +IRQF_SHARED | IRQF_DISABLED. 3.Features diff --git a/Documentation/seclvl.txt b/Documentation/seclvl.txt deleted file mode 100644 index 97274d122d0e..000000000000 --- a/Documentation/seclvl.txt +++ /dev/null @@ -1,97 +0,0 @@ -BSD Secure Levels Linux Security Module -Michael A. Halcrow <mike@halcrow.us> - - -Introduction - -Under the BSD Secure Levels security model, sets of policies are -associated with levels. Levels range from -1 to 2, with -1 being the -weakest and 2 being the strongest. These security policies are -enforced at the kernel level, so not even the superuser is able to -disable or circumvent them. This hardens the machine against attackers -who gain root access to the system. - - -Levels and Policies - -Level -1 (Permanently Insecure): - - Cannot increase the secure level - -Level 0 (Insecure): - - Cannot ptrace the init process - -Level 1 (Default): - - /dev/mem and /dev/kmem are read-only - - IMMUTABLE and APPEND extended attributes, if set, may not be unset - - Cannot load or unload kernel modules - - Cannot write directly to a mounted block device - - Cannot perform raw I/O operations - - Cannot perform network administrative tasks - - Cannot setuid any file - -Level 2 (Secure): - - Cannot decrement the system time - - Cannot write to any block device, whether mounted or not - - Cannot unmount any mounted filesystems - - -Compilation - -To compile the BSD Secure Levels LSM, seclvl.ko, enable the -SECURITY_SECLVL configuration option. This is found under Security -options -> BSD Secure Levels in the kernel configuration menu. - - -Basic Usage - -Once the machine is in a running state, with all the necessary modules -loaded and all the filesystems mounted, you can load the seclvl.ko -module: - -# insmod seclvl.ko - -The module defaults to secure level 1, except when compiled directly -into the kernel, in which case it defaults to secure level 0. To raise -the secure level to 2, the administrator writes ``2'' to the -seclvl/seclvl file under the sysfs mount point (assumed to be /sys in -these examples): - -# echo -n "2" > /sys/seclvl/seclvl - -Alternatively, you can initialize the module at secure level 2 with -the initlvl module parameter: - -# insmod seclvl.ko initlvl=2 - -At this point, it is impossible to remove the module or reduce the -secure level. If the administrator wishes to have the option of doing -so, he must provide a module parameter, sha1_passwd, that specifies -the SHA1 hash of the password that can be used to reduce the secure -level to 0. - -To generate this SHA1 hash, the administrator can use OpenSSL: - -# echo -n "boogabooga" | openssl sha1 -abeda4e0f33defa51741217592bf595efb8d289c - -In order to use password-instigated secure level reduction, the SHA1 -crypto module must be loaded or compiled into the kernel: - -# insmod sha1.ko - -The administrator can then insmod the seclvl module, including the -SHA1 hash of the password: - -# insmod seclvl.ko - sha1_passwd=abeda4e0f33defa51741217592bf595efb8d289c - -To reduce the secure level, write the password to seclvl/passwd under -your sysfs mount point: - -# echo -n "boogabooga" > /sys/seclvl/passwd - -The September 2004 edition of Sys Admin Magazine has an article about -the BSD Secure Levels LSM. I encourage you to refer to that article -for a more in-depth treatment of this security module: - -http://www.samag.com/documents/s=9304/sam0409a/0409a.htm diff --git a/Documentation/sh/new-machine.txt b/Documentation/sh/new-machine.txt index eb2dd2e6993b..73988e0d112b 100644 --- a/Documentation/sh/new-machine.txt +++ b/Documentation/sh/new-machine.txt @@ -41,11 +41,6 @@ Board-specific code: | .. more boards here ... -It should also be noted that each board is required to have some certain -headers. At the time of this writing, io.h is the only thing that needs -to be provided for each board, and can generally just reference generic -functions (with the exception of isa_port2addr). - Next, for companion chips: . `-- arch @@ -104,12 +99,13 @@ and then populate that with sub-directories for each member of the family. Both the Solution Engine and the hp6xx boards are an example of this. After you have setup your new arch/sh/boards/ directory, remember that you -also must add a directory in include/asm-sh for headers localized to this -board. In order to interoperate seamlessly with the build system, it's best -to have this directory the same as the arch/sh/boards/ directory name, -though if your board is again part of a family, the build system has ways -of dealing with this, and you can feel free to name the directory after -the family member itself. +should also add a directory in include/asm-sh for headers localized to this +board (if there are going to be more than one). In order to interoperate +seamlessly with the build system, it's best to have this directory the same +as the arch/sh/boards/ directory name, though if your board is again part of +a family, the build system has ways of dealing with this (via incdir-y +overloading), and you can feel free to name the directory after the family +member itself. There are a few things that each board is required to have, both in the arch/sh/boards and the include/asm-sh/ heirarchy. In order to better @@ -122,6 +118,7 @@ might look something like: * arch/sh/boards/vapor/setup.c - Setup code for imaginary board */ #include <linux/init.h> +#include <asm/rtc.h> /* for board_time_init() */ const char *get_system_type(void) { @@ -152,79 +149,57 @@ int __init platform_setup(void) } Our new imaginary board will also have to tie into the machvec in order for it -to be of any use. Currently the machvec is slowly on its way out, but is still -required for the time being. As such, let us take a look at what needs to be -done for the machvec assignment. +to be of any use. machvec functions fall into a number of categories: - I/O functions to IO memory (inb etc) and PCI/main memory (readb etc). - - I/O remapping functions (ioremap etc) - - some initialisation functions - - a 'heartbeat' function - - some miscellaneous flags - -The tree can be built in two ways: - - as a fully generic build. All drivers are linked in, and all functions - go through the machvec - - as a machine specific build. In this case only the required drivers - will be linked in, and some macros may be redefined to not go through - the machvec where performance is important (in particular IO functions). - -There are three ways in which IO can be performed: - - none at all. This is really only useful for the 'unknown' machine type, - which us designed to run on a machine about which we know nothing, and - so all all IO instructions do nothing. - - fully custom. In this case all IO functions go to a machine specific - set of functions which can do what they like - - a generic set of functions. These will cope with most situations, - and rely on a single function, mv_port2addr, which is called through the - machine vector, and converts an IO address into a memory address, which - can be read from/written to directly. - -Thus adding a new machine involves the following steps (I will assume I am -adding a machine called vapor): - - - add a new file include/asm-sh/vapor/io.h which contains prototypes for + - I/O mapping functions (ioport_map, ioport_unmap, etc). + - a 'heartbeat' function. + - PCI and IRQ initialization routines. + - Consistent allocators (for boards that need special allocators, + particularly for allocating out of some board-specific SRAM for DMA + handles). + +There are machvec functions added and removed over time, so always be sure to +consult include/asm-sh/machvec.h for the current state of the machvec. + +The kernel will automatically wrap in generic routines for undefined function +pointers in the machvec at boot time, as machvec functions are referenced +unconditionally throughout most of the tree. Some boards have incredibly +sparse machvecs (such as the dreamcast and sh03), whereas others must define +virtually everything (rts7751r2d). + +Adding a new machine is relatively trivial (using vapor as an example): + +If the board-specific definitions are quite minimalistic, as is the case for +the vast majority of boards, simply having a single board-specific header is +sufficient. + + - add a new file include/asm-sh/vapor.h which contains prototypes for any machine specific IO functions prefixed with the machine name, for example vapor_inb. These will be needed when filling out the machine vector. - This is the minimum that is required, however there are ample - opportunities to optimise this. In particular, by making the prototypes - inline function definitions, it is possible to inline the function when - building machine specific versions. Note that the machine vector - functions will still be needed, so that a module built for a generic - setup can be loaded. - - - add a new file arch/sh/boards/vapor/mach.c. This contains the definition - of the machine vector. When building the machine specific version, this - will be the real machine vector (via an alias), while in the generic - version is used to initialise the machine vector, and then freed, by - making it initdata. This should be defined as: - - struct sh_machine_vector mv_vapor __initmv = { - .mv_name = "vapor", - } - ALIAS_MV(vapor) - - - finally add a file arch/sh/boards/vapor/io.c, which contains - definitions of the machine specific io functions. - -A note about initialisation functions. Three initialisation functions are -provided in the machine vector: - - mv_arch_init - called very early on from setup_arch - - mv_init_irq - called from init_IRQ, after the generic SH interrupt - initialisation - - mv_init_pci - currently not used - -Any other remaining functions which need to be called at start up can be -added to the list using the __initcalls macro (or module_init if the code -can be built as a module). Many generic drivers probe to see if the device -they are targeting is present, however this may not always be appropriate, -so a flag can be added to the machine vector which will be set on those -machines which have the hardware in question, reducing the probe to a -single conditional. + Note that these prototypes are generated automatically by setting + __IO_PREFIX to something sensible. A typical example would be: + + #define __IO_PREFIX vapor + #include <asm/io_generic.h> + + somewhere in the board-specific header. Any boards being ported that still + have a legacy io.h should remove it entirely and switch to the new model. + + - Add machine vector definitions to the board's setup.c. At a bare minimum, + this must be defined as something like: + + struct sh_machine_vector mv_vapor __initmv = { + .mv_name = "vapor", + }; + ALIAS_MV(vapor) + + - finally add a file arch/sh/boards/vapor/io.c, which contains definitions of + the machine specific io functions (if there are enough to warrant it). 3. Hooking into the Build System ================================ @@ -303,4 +278,3 @@ which will in turn copy the defconfig for this board, run it through oldconfig (prompting you for any new options since the time of creation), and start you on your way to having a functional kernel for your new board. - diff --git a/Documentation/sh/register-banks.txt b/Documentation/sh/register-banks.txt new file mode 100644 index 000000000000..a6719f2f6594 --- /dev/null +++ b/Documentation/sh/register-banks.txt @@ -0,0 +1,33 @@ + Notes on register bank usage in the kernel + ========================================== + +Introduction +------------ + +The SH-3 and SH-4 CPU families traditionally include a single partial register +bank (selected by SR.RB, only r0 ... r7 are banked), whereas other families +may have more full-featured banking or simply no such capabilities at all. + +SR.RB banking +------------- + +In the case of this type of banking, banked registers are mapped directly to +r0 ... r7 if SR.RB is set to the bank we are interested in, otherwise ldc/stc +can still be used to reference the banked registers (as r0_bank ... r7_bank) +when in the context of another bank. The developer must keep the SR.RB value +in mind when writing code that utilizes these banked registers, for obvious +reasons. Userspace is also not able to poke at the bank1 values, so these can +be used rather effectively as scratch registers by the kernel. + +Presently the kernel uses several of these registers. + + - r0_bank, r1_bank (referenced as k0 and k1, used for scratch + registers when doing exception handling). + - r2_bank (used to track the EXPEVT/INTEVT code) + - Used by do_IRQ() and friends for doing irq mapping based off + of the interrupt exception vector jump table offset + - r6_bank (global interrupt mask) + - The SR.IMASK interrupt handler makes use of this to set the + interrupt priority level (used by local_irq_enable()) + - r7_bank (current) + diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index 87d76a5c73d0..e6b57dd46a4f 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -472,6 +472,22 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. The power-management is supported. + Module snd-darla20 + ------------------ + + Module for Echoaudio Darla20 + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + + Module snd-darla24 + ------------------ + + Module for Echoaudio Darla24 + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + Module snd-dt019x ----------------- @@ -499,6 +515,14 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. The power-management is supported. + Module snd-echo3g + ----------------- + + Module for Echoaudio 3G cards (Gina3G/Layla3G) + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + Module snd-emu10k1 ------------------ @@ -657,6 +681,22 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. The power-management is supported. + Module snd-gina20 + ----------------- + + Module for Echoaudio Gina20 + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + + Module snd-gina24 + ----------------- + + Module for Echoaudio Gina24 + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + Module snd-gusclassic --------------------- @@ -718,6 +758,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. position_fix - Fix DMA pointer (0 = auto, 1 = none, 2 = POSBUF, 3 = FIFO size) single_cmd - Use single immediate commands to communicate with codecs (for debugging only) + disable_msi - Disable Message Signaled Interrupt (MSI) This module supports one card and autoprobe. @@ -738,11 +779,16 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. 6stack-digout 6-jack with a SPDIF out w810 3-jack z71v 3-jack (HP shared SPDIF) - asus 3-jack + asus 3-jack (ASUS Mobo) + asus-w1v ASUS W1V + asus-dig ASUS with SPDIF out + asus-dig2 ASUS with SPDIF out (using GPIO2) uniwill 3-jack F1734 2-jack lg LG laptop (m1 express dual) - lg-lw LG LW20 laptop + lg-lw LG LW20/LW25 laptop + tcl TCL S700 + clevo Clevo laptops (m520G, m665n) test for testing/debugging purpose, almost all controls can be adjusted. Appearing only when compiled with $CONFIG_SND_DEBUG=y @@ -750,6 +796,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ALC260 hp HP machines + hp-3013 HP machines (3013-variant) fujitsu Fujitsu S7020 acer Acer TravelMate basic fixed pin assignment (old default model) @@ -757,18 +804,32 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ALC262 fujitsu Fujitsu Laptop + hp-bpc HP xw4400/6400/8400/9400 laptops + benq Benq ED8 basic fixed pin assignment w/o SPDIF auto auto-config reading BIOS (default) - ALC882/883/885 + ALC882/885 3stack-dig 3-jack with SPDIF I/O 6stck-dig 6-jack digital with SPDIF I/O + arima Arima W820Di1 + auto auto-config reading BIOS (default) + + ALC883/888 + 3stack-dig 3-jack with SPDIF I/O + 6stack-dig 6-jack digital with SPDIF I/O + 3stack-6ch 3-jack 6-channel + 3stack-6ch-dig 3-jack 6-channel with SPDIF I/O + 6stack-dig-demo 6-jack digital for Intel demo board + acer Acer laptops (Travelmate 3012WTMi, Aspire 5600, etc) auto auto-config reading BIOS (default) - ALC861 + ALC861/660 3stack 3-jack 3stack-dig 3-jack with SPDIF I/O 6stack-dig 6-jack with SPDIF I/O + 3stack-660 3-jack (for ALC660) + uniwill-m31 Uniwill M31 laptop auto auto-config reading BIOS (default) CMI9880 @@ -797,10 +858,21 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. 3stack-dig ditto with SPDIF laptop 3-jack with hp-jack automute laptop-dig ditto with SPDIF - auto auto-confgi reading BIOS (default) + auto auto-config reading BIOS (default) + + STAC9200/9205/9220/9221/9254 + ref Reference board + 3stack D945 3stack + 5stack D945 5stack + SPDIF + + STAC9227/9228/9229/927x + ref Reference board + 3stack D965 3stack + 5stack D965 5stack + SPDIF - STAC7661(?) + STAC9872 vaio Setup for VAIO FE550G/SZ110 + vaio-ar Setup for VAIO AR If the default configuration doesn't work and one of the above matches with your device, report it together with the PCI @@ -937,6 +1009,30 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. driver isn't configured properly or you want to try another type for testing. + Module snd-indigo + ----------------- + + Module for Echoaudio Indigo + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + + Module snd-indigodj + ------------------- + + Module for Echoaudio Indigo DJ + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + + Module snd-indigoio + ------------------- + + Module for Echoaudio Indigo IO + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + Module snd-intel8x0 ------------------- @@ -1036,6 +1132,22 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. This module supports multiple cards. + Module snd-layla20 + ------------------ + + Module for Echoaudio Layla20 + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + + Module snd-layla24 + ------------------ + + Module for Echoaudio Layla24 + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + Module snd-maestro3 ------------------- @@ -1056,6 +1168,14 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. The power-management is supported. + Module snd-mia + --------------- + + Module for Echoaudio Mia + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + Module snd-miro --------------- @@ -1088,6 +1208,14 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. When no hotplug fw loader is available, you need to load the firmware via mixartloader utility in alsa-tools package. + Module snd-mona + --------------- + + Module for Echoaudio Mona + + This module supports multiple cards. + The driver requires the firmware loader support on kernel. + Module snd-mpu401 ----------------- @@ -1111,6 +1239,14 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module supports only 1 card. This module has no enable option. + Module snd-mts64 + ---------------- + + Module for Ego Systems (ESI) Miditerminal 4140 + + This module supports multiple devices. + Requires parport (CONFIG_PARPORT). + Module snd-nm256 ---------------- diff --git a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl index 635cbb94357c..4807ef79a94d 100644 --- a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl +++ b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl @@ -1054,9 +1054,8 @@ <para> For a device which allows hotplugging, you can use - <function>snd_card_free_in_thread</function>. This one will - postpone the destruction and wait in a kernel-thread until all - devices are closed. + <function>snd_card_free_when_closed</function>. This one will + postpone the destruction until all devices are closed. </para> </section> @@ -1149,7 +1148,7 @@ } chip->port = pci_resource_start(pci, 0); if (request_irq(pci->irq, snd_mychip_interrupt, - SA_INTERRUPT|SA_SHIRQ, "My Chip", chip)) { + IRQF_DISABLED|IRQF_SHARED, "My Chip", chip)) { printk(KERN_ERR "cannot grab irq %d\n", pci->irq); snd_mychip_free(chip); return -EBUSY; @@ -1172,7 +1171,7 @@ } /* PCI IDs */ - static struct pci_device_id snd_mychip_ids[] __devinitdata = { + static struct pci_device_id snd_mychip_ids[] = { { PCI_VENDOR_ID_FOO, PCI_DEVICE_ID_BAR, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0, }, .... @@ -1323,7 +1322,7 @@ <programlisting> <![CDATA[ if (request_irq(pci->irq, snd_mychip_interrupt, - SA_INTERRUPT|SA_SHIRQ, "My Chip", chip)) { + IRQF_DISABLED|IRQF_SHARED, "My Chip", chip)) { printk(KERN_ERR "cannot grab irq %d\n", pci->irq); snd_mychip_free(chip); return -EBUSY; @@ -1342,7 +1341,7 @@ <para> On the PCI bus, the interrupts can be shared. Thus, - <constant>SA_SHIRQ</constant> is given as the interrupt flag of + <constant>IRQF_SHARED</constant> is given as the interrupt flag of <function>request_irq()</function>. </para> @@ -1565,7 +1564,7 @@ <informalexample> <programlisting> <![CDATA[ - static struct pci_device_id snd_mychip_ids[] __devinitdata = { + static struct pci_device_id snd_mychip_ids[] = { { PCI_VENDOR_ID_FOO, PCI_DEVICE_ID_BAR, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0, }, .... @@ -3048,7 +3047,7 @@ struct _snd_pcm_runtime { </para> <para> - If you aquire a spinlock in the interrupt handler, and the + If you acquire a spinlock in the interrupt handler, and the lock is used in other pcm callbacks, too, then you have to release the lock before calling <function>snd_pcm_period_elapsed()</function>, because diff --git a/Documentation/sparc/sbus_drivers.txt b/Documentation/sparc/sbus_drivers.txt index 876195dc2aef..4b9351624f13 100644 --- a/Documentation/sparc/sbus_drivers.txt +++ b/Documentation/sparc/sbus_drivers.txt @@ -25,42 +25,84 @@ the bits necessary to run your device. The most commonly used members of this structure, and their typical usage, will be detailed below. - Here is how probing is performed by an SBUS driver -under Linux: + Here is a piece of skeleton code for perofming a device +probe in an SBUS driverunder Linux: - static void init_one_mydevice(struct sbus_dev *sdev) + static int __devinit mydevice_probe_one(struct sbus_dev *sdev) { + struct mysdevice *mp = kzalloc(sizeof(*mp), GFP_KERNEL); + + if (!mp) + return -ENODEV; + + ... + dev_set_drvdata(&sdev->ofdev.dev, mp); + return 0; ... } - static int mydevice_match(struct sbus_dev *sdev) + static int __devinit mydevice_probe(struct of_device *dev, + const struct of_device_id *match) { - if (some_criteria(sdev)) - return 1; - return 0; + struct sbus_dev *sdev = to_sbus_device(&dev->dev); + + return mydevice_probe_one(sdev); } - static void mydevice_probe(void) + static int __devexit mydevice_remove(struct of_device *dev) { - struct sbus_bus *sbus; - struct sbus_dev *sdev; + struct sbus_dev *sdev = to_sbus_device(&dev->dev); + struct mydevice *mp = dev_get_drvdata(&dev->dev); - for_each_sbus(sbus) { - for_each_sbusdev(sdev, sbus) { - if (mydevice_match(sdev)) - init_one_mydevice(sdev); - } - } + return mydevice_remove_one(sdev, mp); } - All this does is walk through all SBUS devices in the -system, checks each to see if it is of the type which -your driver is written for, and if so it calls the init -routine to attach the device and prepare to drive it. + static struct of_device_id mydevice_match[] = { + { + .name = "mydevice", + }, + {}, + }; + + MODULE_DEVICE_TABLE(of, mydevice_match); - "init_one_mydevice" might do things like allocate software -state structures, map in I/O registers, place the hardware -into an initialized state, etc. + static struct of_platform_driver mydevice_driver = { + .name = "mydevice", + .match_table = mydevice_match, + .probe = mydevice_probe, + .remove = __devexit_p(mydevice_remove), + }; + + static int __init mydevice_init(void) + { + return of_register_driver(&mydevice_driver, &sbus_bus_type); + } + + static void __exit mydevice_exit(void) + { + of_unregister_driver(&mydevice_driver); + } + + module_init(mydevice_init); + module_exit(mydevice_exit); + + The mydevice_match table is a series of entries which +describes what SBUS devices your driver is meant for. In the +simplest case you specify a string for the 'name' field. Every +SBUS device with a 'name' property matching your string will +be passed one-by-one to your .probe method. + + You should store away your device private state structure +pointer in the drvdata area so that you can retrieve it later on +in your .remove method. + + Any memory allocated, registers mapped, IRQs registered, +etc. must be undone by your .remove method so that all resources +of your device are relased by the time it returns. + + You should _NOT_ use the for_each_sbus(), for_each_sbusdev(), +and for_all_sbusdev() interfaces. They are deprecated, will be +removed, and no new driver should reference them ever. Mapping and Accessing I/O Registers @@ -263,10 +305,3 @@ discussed above and plus it handles both PCI and SBUS boards. Lance driver abuses consistent mappings for data transfer. It is a nifty trick which we do not particularly recommend... Just check it out and know that it's legal. - - Bad examples, do NOT use - - drivers/video/cgsix.c - This one uses result of sbus_ioremap as if it is an address. -This does NOT work on sparc64 and therefore is broken. We will -convert it at a later date. diff --git a/Documentation/sparse.txt b/Documentation/sparse.txt index 5a311c38dd1a..f9c99c9a54f9 100644 --- a/Documentation/sparse.txt +++ b/Documentation/sparse.txt @@ -69,10 +69,10 @@ recompiled, or use "make C=2" to run sparse on the files whether they need to be recompiled or not. The latter is a fast way to check the whole tree if you have already built it. -The optional make variable CF can be used to pass arguments to sparse. The -build system passes -Wbitwise to sparse automatically. To perform endianness -checks, you may define __CHECK_ENDIAN__: +The optional make variable CHECKFLAGS can be used to pass arguments to sparse. +The build system passes -Wbitwise to sparse automatically. To perform +endianness checks, you may define __CHECK_ENDIAN__: - make C=2 CF="-D__CHECK_ENDIAN__" + make C=2 CHECKFLAGS="-D__CHECK_ENDIAN__" These checks are disabled by default as they generate a host of warnings. diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index 0b62c62142cf..5c3a51905969 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt @@ -25,6 +25,7 @@ Currently, these files are in /proc/sys/fs: - inode-state - overflowuid - overflowgid +- suid_dumpable - super-max - super-nr @@ -131,6 +132,25 @@ The default is 65534. ============================================================== +suid_dumpable: + +This value can be used to query and set the core dump mode for setuid +or otherwise protected/tainted binaries. The modes are + +0 - (default) - traditional behaviour. Any process which has changed + privilege levels or is execute only will not be dumped +1 - (debug) - all processes dump core when possible. The core dump is + owned by the current user and no security is applied. This is + intended for system debugging situations only. Ptrace is unchecked. +2 - (suidsafe) - any binary which normally would not be dumped is dumped + readable by root only. This allows the end user to remove + such a dump but not access it directly. For security reasons + core dumps in this mode will not overwrite one another or + other files. This mode is appropriate when adminstrators are + attempting to debug problems in a normal environment. + +============================================================== + super-max & super-nr: These numbers control the maximum number of superblocks, and diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index b0c7ab93dcb9..89bf8c20a586 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -50,7 +50,6 @@ show up in /proc/sys/kernel: - shmmax [ sysv ipc ] - shmmni - stop-a [ SPARC only ] -- suid_dumpable - sysrq ==> Documentation/sysrq.txt - tainted - threads-max @@ -211,9 +210,8 @@ Controls the kernel's behaviour when an oops or BUG is encountered. 0: try to continue operation -1: delay a few seconds (to give klogd time to record the oops output) and - then panic. If the `panic' sysctl is also non-zero then the machine will - be rebooted. +1: panic immediatly. If the `panic' sysctl is also non-zero then the + machine will be rebooted. ============================================================== @@ -311,25 +309,6 @@ kernel. This value defaults to SHMMAX. ============================================================== -suid_dumpable: - -This value can be used to query and set the core dump mode for setuid -or otherwise protected/tainted binaries. The modes are - -0 - (default) - traditional behaviour. Any process which has changed - privilege levels or is execute only will not be dumped -1 - (debug) - all processes dump core when possible. The core dump is - owned by the current user and no security is applied. This is - intended for system debugging situations only. Ptrace is unchecked. -2 - (suidsafe) - any binary which normally would not be dumped is dumped - readable by root only. This allows the end user to remove - such a dump but not access it directly. For security reasons - core dumps in this mode will not overwrite one another or - other files. This mode is appropriate when adminstrators are - attempting to debug problems in a normal environment. - -============================================================== - tainted: Non-zero if the kernel has been tainted. Numeric values, which diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 2dc246af4885..20d0d797f539 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -28,7 +28,8 @@ Currently, these files are in /proc/sys/vm: - block_dump - drop-caches - zone_reclaim_mode -- zone_reclaim_interval +- min_unmapped_ratio +- min_slab_ratio - panic_on_oom ============================================================== @@ -138,7 +139,6 @@ This is value ORed together of 1 = Zone reclaim on 2 = Zone reclaim writes dirty pages out 4 = Zone reclaim swaps pages -8 = Also do a global slab reclaim pass zone_reclaim_mode is set during bootup to 1 if it is determined that pages from remote zones will cause a measurable performance reduction. The @@ -162,22 +162,36 @@ Allowing regular swap effectively restricts allocations to the local node unless explicitly overridden by memory policies or cpuset configurations. -It may be advisable to allow slab reclaim if the system makes heavy -use of files and builds up large slab caches. However, the slab -shrink operation is global, may take a long time and free slabs -in all nodes of the system. +============================================================= + +min_unmapped_ratio: + +This is available only on NUMA kernels. + +A percentage of the total pages in each zone. Zone reclaim will only +occur if more than this percentage of pages are file backed and unmapped. +This is to insure that a minimal amount of local pages is still available for +file I/O even if the node is overallocated. + +The default is 1 percent. + +============================================================= -================================================================ +min_slab_ratio: -zone_reclaim_interval: +This is available only on NUMA kernels. -The time allowed for off node allocations after zone reclaim -has failed to reclaim enough pages to allow a local allocation. +A percentage of the total pages in each zone. On Zone reclaim +(fallback from the local zone occurs) slabs will be reclaimed if more +than this percentage of pages in a zone are reclaimable slab pages. +This insures that the slab growth stays under control even in NUMA +systems that rarely perform global reclaim. -Time is set in seconds and set by default to 30 seconds. +The default is 5 percent. -Reduce the interval if undesired off node allocations occur. However, too -frequent scans will have a negative impact onoff node allocation performance. +Note that slab reclaim is triggered in a per zone / node fashion. +The process of reclaiming slab memory is currently not node specific +and may not be fast. ============================================================= diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index ad0bedf678b3..e0188a23fd5e 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt @@ -115,8 +115,9 @@ trojan program is running at console and which could grab your password when you would try to login. It will kill all programs on given console and thus letting you make sure that the login prompt you see is actually the one from init, not some trojan program. -IMPORTANT:In its true form it is not a true SAK like the one in :IMPORTANT -IMPORTANT:c2 compliant systems, and it should be mistook as such. :IMPORTANT +IMPORTANT: In its true form it is not a true SAK like the one in a :IMPORTANT +IMPORTANT: c2 compliant system, and it should not be mistaken as :IMPORTANT +IMPORTANT: such. :IMPORTANT It seems other find it useful as (System Attention Key) which is useful when you want to exit a program that will not let you switch consoles. (For example, X or a svgalib program.) diff --git a/Documentation/tty.txt b/Documentation/tty.txt index 8ff7bc2a0811..dab56604745d 100644 --- a/Documentation/tty.txt +++ b/Documentation/tty.txt @@ -80,13 +80,6 @@ receive_buf() - Hand buffers of bytes from the driver to the ldisc for processing. Semantics currently rather mysterious 8( -receive_room() - Can be called by the driver layer at any time when - the ldisc is opened. The ldisc must be able to - handle the reported amount of data at that instant. - Synchronization between active receive_buf and - receive_room calls is down to the driver not the - ldisc. Must not sleep. - write_wakeup() - May be called at any point between open and close. The TTY_DO_WRITE_WAKEUP flag indicates if a call is needed but always races versus calls. Thus the diff --git a/Documentation/usb/error-codes.txt b/Documentation/usb/error-codes.txt index 867f4c38f356..39c68f8c4e6c 100644 --- a/Documentation/usb/error-codes.txt +++ b/Documentation/usb/error-codes.txt @@ -98,13 +98,13 @@ one or more packets could finish before an error stops further endpoint I/O. error, a failure to respond (often caused by device disconnect), or some other fault. --ETIMEDOUT (**) No response packet received within the prescribed +-ETIME (**) No response packet received within the prescribed bus turn-around time. This error may instead be reported as -EPROTO or -EILSEQ. - Note that the synchronous USB message functions - also use this code to indicate timeout expired - before the transfer completed. +-ETIMEDOUT Synchronous USB message functions use this code + to indicate timeout expired before the transfer + completed, and no other error was reported by HC. -EPIPE (**) Endpoint stalled. For non-control endpoints, reset this status with usb_clear_halt(). @@ -163,6 +163,3 @@ usb_get_*/usb_set_*(): usb_control_msg(): usb_bulk_msg(): -ETIMEDOUT Timeout expired before the transfer completed. - In the future this code may change to -ETIME, - whose definition is a closer match to this sort - of error. diff --git a/Documentation/usb/proc_usb_info.txt b/Documentation/usb/proc_usb_info.txt index f86550fe38ee..22c5331260ca 100644 --- a/Documentation/usb/proc_usb_info.txt +++ b/Documentation/usb/proc_usb_info.txt @@ -59,7 +59,7 @@ bind to an interface (or perhaps several) using an ioctl call. You would issue more ioctls to the device to communicate to it using control, bulk, or other kinds of USB transfers. The IOCTLs are listed in the <linux/usbdevice_fs.h> file, and at this writing the -source code (linux/drivers/usb/devio.c) is the primary reference +source code (linux/drivers/usb/core/devio.c) is the primary reference for how to access devices through those files. Note that since by default these BBB/DDD files are writable only by diff --git a/Documentation/usb/usb-help.txt b/Documentation/usb/usb-help.txt index b7c324973695..a7408593829f 100644 --- a/Documentation/usb/usb-help.txt +++ b/Documentation/usb/usb-help.txt @@ -5,8 +5,7 @@ For USB help other than the readme files that are located in Documentation/usb/*, see the following: Linux-USB project: http://www.linux-usb.org - mirrors at http://www.suse.cz/development/linux-usb/ - and http://usb.in.tum.de/linux-usb/ + mirrors at http://usb.in.tum.de/linux-usb/ and http://it.linux-usb.org Linux USB Guide: http://linux-usb.sourceforge.net Linux-USB device overview (working devices and drivers): diff --git a/Documentation/usb/usb-serial.txt b/Documentation/usb/usb-serial.txt index f001cd93b79b..a2dee6e6190d 100644 --- a/Documentation/usb/usb-serial.txt +++ b/Documentation/usb/usb-serial.txt @@ -399,10 +399,10 @@ REINER SCT cyberJack pinpad/e-com USB chipcard reader Prolific PL2303 Driver - This driver support any device that has the PL2303 chip from Prolific + This driver supports any device that has the PL2303 chip from Prolific in it. This includes a number of single port USB to serial converters and USB GPS devices. Devices from Aten (the UC-232) and - IO-Data work with this driver. + IO-Data work with this driver, as does the DCU-11 mobile-phone cable. For any questions or problems with this driver, please contact Greg Kroah-Hartman at greg@kroah.com @@ -433,6 +433,11 @@ Options supported: See http://www.uuhaus.de/linux/palmconnect.html for up-to-date information on this driver. +AIRcable USB Dongle Bluetooth driver + If there is the cdc_acm driver loaded in the system, you will find that the + cdc_acm claims the device before AIRcable can. This is simply corrected + by unloading both modules and then loading the aircable module before + cdc_acm module Generic Serial driver diff --git a/Documentation/video4linux/CARDLIST.bttv b/Documentation/video4linux/CARDLIST.bttv index b72706c58a44..4efa4645885f 100644 --- a/Documentation/video4linux/CARDLIST.bttv +++ b/Documentation/video4linux/CARDLIST.bttv @@ -87,7 +87,7 @@ 86 -> Osprey 101/151 w/ svid 87 -> Osprey 200/201/250/251 88 -> Osprey 200/250 [0070:ff01] - 89 -> Osprey 210/220 + 89 -> Osprey 210/220/230 90 -> Osprey 500 [0070:ff02] 91 -> Osprey 540 [0070:ff04] 92 -> Osprey 2000 [0070:ff03] @@ -111,7 +111,7 @@ 110 -> IVC-100 [ff00:a132] 111 -> IVC-120G [ff00:a182,ff01:a182,ff02:a182,ff03:a182,ff04:a182,ff05:a182,ff06:a182,ff07:a182,ff08:a182,ff09:a182,ff0a:a182,ff0b:a182,ff0c:a182,ff0d:a182,ff0e:a182,ff0f:a182] 112 -> pcHDTV HD-2000 TV [7063:2000] -113 -> Twinhan DST + clones [11bd:0026,1822:0001,270f:fc00] +113 -> Twinhan DST + clones [11bd:0026,1822:0001,270f:fc00,1822:0026] 114 -> Winfast VC100 [107d:6607] 115 -> Teppro TEV-560/InterVision IV-560 116 -> SIMUS GVC1100 [aa6a:82b2] diff --git a/Documentation/video4linux/CARDLIST.cx88 b/Documentation/video4linux/CARDLIST.cx88 index 3b39a91b24bd..00d9a1f2a54c 100644 --- a/Documentation/video4linux/CARDLIST.cx88 +++ b/Documentation/video4linux/CARDLIST.cx88 @@ -15,7 +15,7 @@ 14 -> KWorld/VStream XPert DVB-T [17de:08a6] 15 -> DViCO FusionHDTV DVB-T1 [18ac:db00] 16 -> KWorld LTV883RF - 17 -> DViCO FusionHDTV 3 Gold-Q [18ac:d810] + 17 -> DViCO FusionHDTV 3 Gold-Q [18ac:d810,18ac:d800] 18 -> Hauppauge Nova-T DVB-T [0070:9002,0070:9001] 19 -> Conexant DVB-T reference design [14f1:0187] 20 -> Provideo PV259 [1540:2580] @@ -40,8 +40,14 @@ 39 -> KWorld DVB-S 100 [17de:08b2] 40 -> Hauppauge WinTV-HVR1100 DVB-T/Hybrid [0070:9400,0070:9402] 41 -> Hauppauge WinTV-HVR1100 DVB-T/Hybrid (Low Profile) [0070:9800,0070:9802] - 42 -> digitalnow DNTV Live! DVB-T Pro [1822:0025] + 42 -> digitalnow DNTV Live! DVB-T Pro [1822:0025,1822:0019] 43 -> KWorld/VStream XPert DVB-T with cx22702 [17de:08a1] 44 -> DViCO FusionHDTV DVB-T Dual Digital [18ac:db50,18ac:db54] 45 -> KWorld HardwareMpegTV XPert [17de:0840] 46 -> DViCO FusionHDTV DVB-T Hybrid [18ac:db40,18ac:db44] + 47 -> pcHDTV HD5500 HDTV [7063:5500] + 48 -> Kworld MCE 200 Deluxe [17de:0841] + 49 -> PixelView PlayTV P7000 [1554:4813] + 50 -> NPG Tech Real TV FM Top 10 [14f1:0842] + 51 -> WinFast DTV2000 H [107d:665e] + 52 -> Geniatech DVB-S [14f1:0084] diff --git a/Documentation/video4linux/CARDLIST.saa7134 b/Documentation/video4linux/CARDLIST.saa7134 index bca50903233f..9068b669f5ee 100644 --- a/Documentation/video4linux/CARDLIST.saa7134 +++ b/Documentation/video4linux/CARDLIST.saa7134 @@ -93,3 +93,4 @@ 92 -> AVerMedia A169 B1 [1461:6360] 93 -> Medion 7134 Bridge #2 [16be:0005] 94 -> LifeView FlyDVB-T Hybrid Cardbus [5168:3306,5168:3502] + 95 -> LifeView FlyVIDEO3000 (NTSC) [5169:0138] diff --git a/Documentation/video4linux/CARDLIST.tuner b/Documentation/video4linux/CARDLIST.tuner index 1bcdac67dd8c..44134f04b82a 100644 --- a/Documentation/video4linux/CARDLIST.tuner +++ b/Documentation/video4linux/CARDLIST.tuner @@ -62,7 +62,7 @@ tuner=60 - Thomson DTT 761X (ATSC/NTSC) tuner=61 - Tena TNF9533-D/IF/TNF9533-B/DF tuner=62 - Philips TEA5767HN FM Radio tuner=63 - Philips FMD1216ME MK3 Hybrid Tuner -tuner=64 - LG TDVS-H062F/TUA6034 +tuner=64 - LG TDVS-H06xF tuner=65 - Ymec TVF66T5-B/DFF tuner=66 - LG TALN series tuner=67 - Philips TD1316 Hybrid Tuner @@ -71,3 +71,4 @@ tuner=69 - Tena TNF 5335 and similar models tuner=70 - Samsung TCPN 2121P30A tuner=71 - Xceive xc3028 tuner=72 - Thomson FE6600 +tuner=73 - Samsung TCPG 6121P30A diff --git a/Documentation/video4linux/CQcam.txt b/Documentation/video4linux/CQcam.txt index 464e4cec94cb..ade8651e2443 100644 --- a/Documentation/video4linux/CQcam.txt +++ b/Documentation/video4linux/CQcam.txt @@ -185,207 +185,10 @@ this work is documented at the video4linux2 site listed below. 9.0 --- A sample program using v4lgrabber, -This program is a simple image grabber that will copy a frame from the +v4lgrab is a simple image grabber that will copy a frame from the first video device, /dev/video0 to standard output in portable pixmap -format (.ppm) Using this like: 'v4lgrab | convert - c-qcam.jpg' -produced this picture of me at - http://mug.sys.virginia.edu/~drf5n/extras/c-qcam.jpg - --------------------- 8< ---------------- 8< ----------------------------- - -/* Simple Video4Linux image grabber. */ -/* - * Video4Linux Driver Test/Example Framegrabbing Program - * - * Compile with: - * gcc -s -Wall -Wstrict-prototypes v4lgrab.c -o v4lgrab - * Use as: - * v4lgrab >image.ppm - * - * Copyright (C) 1998-05-03, Phil Blundell <philb@gnu.org> - * Copied from http://www.tazenda.demon.co.uk/phil/vgrabber.c - * with minor modifications (Dave Forrest, drf5n@virginia.edu). - * - */ - -#include <unistd.h> -#include <sys/types.h> -#include <sys/stat.h> -#include <fcntl.h> -#include <stdio.h> -#include <sys/ioctl.h> -#include <stdlib.h> - -#include <linux/types.h> -#include <linux/videodev.h> - -#define FILE "/dev/video0" - -/* Stole this from tvset.c */ - -#define READ_VIDEO_PIXEL(buf, format, depth, r, g, b) \ -{ \ - switch (format) \ - { \ - case VIDEO_PALETTE_GREY: \ - switch (depth) \ - { \ - case 4: \ - case 6: \ - case 8: \ - (r) = (g) = (b) = (*buf++ << 8);\ - break; \ - \ - case 16: \ - (r) = (g) = (b) = \ - *((unsigned short *) buf); \ - buf += 2; \ - break; \ - } \ - break; \ - \ - \ - case VIDEO_PALETTE_RGB565: \ - { \ - unsigned short tmp = *(unsigned short *)buf; \ - (r) = tmp&0xF800; \ - (g) = (tmp<<5)&0xFC00; \ - (b) = (tmp<<11)&0xF800; \ - buf += 2; \ - } \ - break; \ - \ - case VIDEO_PALETTE_RGB555: \ - (r) = (buf[0]&0xF8)<<8; \ - (g) = ((buf[0] << 5 | buf[1] >> 3)&0xF8)<<8; \ - (b) = ((buf[1] << 2 ) & 0xF8)<<8; \ - buf += 2; \ - break; \ - \ - case VIDEO_PALETTE_RGB24: \ - (r) = buf[0] << 8; (g) = buf[1] << 8; \ - (b) = buf[2] << 8; \ - buf += 3; \ - break; \ - \ - default: \ - fprintf(stderr, \ - "Format %d not yet supported\n", \ - format); \ - } \ -} - -int get_brightness_adj(unsigned char *image, long size, int *brightness) { - long i, tot = 0; - for (i=0;i<size*3;i++) - tot += image[i]; - *brightness = (128 - tot/(size*3))/3; - return !((tot/(size*3)) >= 126 && (tot/(size*3)) <= 130); -} - -int main(int argc, char ** argv) -{ - int fd = open(FILE, O_RDONLY), f; - struct video_capability cap; - struct video_window win; - struct video_picture vpic; - - unsigned char *buffer, *src; - int bpp = 24, r, g, b; - unsigned int i, src_depth; - - if (fd < 0) { - perror(FILE); - exit(1); - } - - if (ioctl(fd, VIDIOCGCAP, &cap) < 0) { - perror("VIDIOGCAP"); - fprintf(stderr, "(" FILE " not a video4linux device?)\n"); - close(fd); - exit(1); - } - - if (ioctl(fd, VIDIOCGWIN, &win) < 0) { - perror("VIDIOCGWIN"); - close(fd); - exit(1); - } - - if (ioctl(fd, VIDIOCGPICT, &vpic) < 0) { - perror("VIDIOCGPICT"); - close(fd); - exit(1); - } - - if (cap.type & VID_TYPE_MONOCHROME) { - vpic.depth=8; - vpic.palette=VIDEO_PALETTE_GREY; /* 8bit grey */ - if(ioctl(fd, VIDIOCSPICT, &vpic) < 0) { - vpic.depth=6; - if(ioctl(fd, VIDIOCSPICT, &vpic) < 0) { - vpic.depth=4; - if(ioctl(fd, VIDIOCSPICT, &vpic) < 0) { - fprintf(stderr, "Unable to find a supported capture format.\n"); - close(fd); - exit(1); - } - } - } - } else { - vpic.depth=24; - vpic.palette=VIDEO_PALETTE_RGB24; - - if(ioctl(fd, VIDIOCSPICT, &vpic) < 0) { - vpic.palette=VIDEO_PALETTE_RGB565; - vpic.depth=16; - - if(ioctl(fd, VIDIOCSPICT, &vpic)==-1) { - vpic.palette=VIDEO_PALETTE_RGB555; - vpic.depth=15; - - if(ioctl(fd, VIDIOCSPICT, &vpic)==-1) { - fprintf(stderr, "Unable to find a supported capture format.\n"); - return -1; - } - } - } - } - - buffer = malloc(win.width * win.height * bpp); - if (!buffer) { - fprintf(stderr, "Out of memory.\n"); - exit(1); - } - - do { - int newbright; - read(fd, buffer, win.width * win.height * bpp); - f = get_brightness_adj(buffer, win.width * win.height, &newbright); - if (f) { - vpic.brightness += (newbright << 8); - if(ioctl(fd, VIDIOCSPICT, &vpic)==-1) { - perror("VIDIOSPICT"); - break; - } - } - } while (f); - - fprintf(stdout, "P6\n%d %d 255\n", win.width, win.height); - - src = buffer; - - for (i = 0; i < win.width * win.height; i++) { - READ_VIDEO_PIXEL(src, vpic.palette, src_depth, r, g, b); - fputc(r>>8, stdout); - fputc(g>>8, stdout); - fputc(b>>8, stdout); - } - - close(fd); - return 0; -} --------------------- 8< ---------------- 8< ----------------------------- +format (.ppm) To produce .jpg output, you can use it like this: +'v4lgrab | convert - c-qcam.jpg' 10.0 --- Other Information diff --git a/Documentation/video4linux/README.pvrusb2 b/Documentation/video4linux/README.pvrusb2 new file mode 100644 index 000000000000..c73a32c34528 --- /dev/null +++ b/Documentation/video4linux/README.pvrusb2 @@ -0,0 +1,212 @@ + +$Id$ +Mike Isely <isely@pobox.com> + + pvrusb2 driver + +Background: + + This driver is intended for the "Hauppauge WinTV PVR USB 2.0", which + is a USB 2.0 hosted TV Tuner. This driver is a work in progress. + Its history started with the reverse-engineering effort by Björn + Danielsson <pvrusb2@dax.nu> whose web page can be found here: + + http://pvrusb2.dax.nu/ + + From there Aurelien Alleaume <slts@free.fr> began an effort to + create a video4linux compatible driver. I began with Aurelien's + last known snapshot and evolved the driver to the state it is in + here. + + More information on this driver can be found at: + + http://www.isely.net/pvrusb2.html + + + This driver has a strong separation of layers. They are very + roughly: + + 1a. Low level wire-protocol implementation with the device. + + 1b. I2C adaptor implementation and corresponding I2C client drivers + implemented elsewhere in V4L. + + 1c. High level hardware driver implementation which coordinates all + activities that ensure correct operation of the device. + + 2. A "context" layer which manages instancing of driver, setup, + tear-down, arbitration, and interaction with high level + interfaces appropriately as devices are hotplugged in the + system. + + 3. High level interfaces which glue the driver to various published + Linux APIs (V4L, sysfs, maybe DVB in the future). + + The most important shearing layer is between the top 2 layers. A + lot of work went into the driver to ensure that any kind of + conceivable API can be laid on top of the core driver. (Yes, the + driver internally leverages V4L to do its work but that really has + nothing to do with the API published by the driver to the outside + world.) The architecture allows for different APIs to + simultaneously access the driver. I have a strong sense of fairness + about APIs and also feel that it is a good design principle to keep + implementation and interface isolated from each other. Thus while + right now the V4L high level interface is the most complete, the + sysfs high level interface will work equally well for similar + functions, and there's no reason I see right now why it shouldn't be + possible to produce a DVB high level interface that can sit right + alongside V4L. + + NOTE: Complete documentation on the pvrusb2 driver is contained in + the html files within the doc directory; these are exactly the same + as what is on the web site at the time. Browse those files + (especially the FAQ) before asking questions. + + +Building + + To build these modules essentially amounts to just running "Make", + but you need the kernel source tree nearby and you will likely also + want to set a few controlling environment variables first in order + to link things up with that source tree. Please see the Makefile + here for comments that explain how to do that. + + +Source file list / functional overview: + + (Note: The term "module" used below generally refers to loosely + defined functional units within the pvrusb2 driver and bears no + relation to the Linux kernel's concept of a loadable module.) + + pvrusb2-audio.[ch] - This is glue logic that resides between this + driver and the msp3400.ko I2C client driver (which is found + elsewhere in V4L). + + pvrusb2-context.[ch] - This module implements the context for an + instance of the driver. Everything else eventually ties back to + or is otherwise instanced within the data structures implemented + here. Hotplugging is ultimately coordinated here. All high level + interfaces tie into the driver through this module. This module + helps arbitrate each interface's access to the actual driver core, + and is designed to allow concurrent access through multiple + instances of multiple interfaces (thus you can for example change + the tuner's frequency through sysfs while simultaneously streaming + video through V4L out to an instance of mplayer). + + pvrusb2-debug.h - This header defines a printk() wrapper and a mask + of debugging bit definitions for the various kinds of debug + messages that can be enabled within the driver. + + pvrusb2-debugifc.[ch] - This module implements a crude command line + oriented debug interface into the driver. Aside from being part + of the process for implementing manual firmware extraction (see + the pvrusb2 web site mentioned earlier), probably I'm the only one + who has ever used this. It is mainly a debugging aid. + + pvrusb2-eeprom.[ch] - This is glue logic that resides between this + driver the tveeprom.ko module, which is itself implemented + elsewhere in V4L. + + pvrusb2-encoder.[ch] - This module implements all protocol needed to + interact with the Conexant mpeg2 encoder chip within the pvrusb2 + device. It is a crude echo of corresponding logic in ivtv, + however the design goals (strict isolation) and physical layer + (proxy through USB instead of PCI) are enough different that this + implementation had to be completely different. + + pvrusb2-hdw-internal.h - This header defines the core data structure + in the driver used to track ALL internal state related to control + of the hardware. Nobody outside of the core hardware-handling + modules should have any business using this header. All external + access to the driver should be through one of the high level + interfaces (e.g. V4L, sysfs, etc), and in fact even those high + level interfaces are restricted to the API defined in + pvrusb2-hdw.h and NOT this header. + + pvrusb2-hdw.h - This header defines the full internal API for + controlling the hardware. High level interfaces (e.g. V4L, sysfs) + will work through here. + + pvrusb2-hdw.c - This module implements all the various bits of logic + that handle overall control of a specific pvrusb2 device. + (Policy, instantiation, and arbitration of pvrusb2 devices fall + within the jurisdiction of pvrusb-context not here). + + pvrusb2-i2c-chips-*.c - These modules implement the glue logic to + tie together and configure various I2C modules as they attach to + the I2C bus. There are two versions of this file. The "v4l2" + version is intended to be used in-tree alongside V4L, where we + implement just the logic that makes sense for a pure V4L + environment. The "all" version is intended for use outside of + V4L, where we might encounter other possibly "challenging" modules + from ivtv or older kernel snapshots (or even the support modules + in the standalone snapshot). + + pvrusb2-i2c-cmd-v4l1.[ch] - This module implements generic V4L1 + compatible commands to the I2C modules. It is here where state + changes inside the pvrusb2 driver are translated into V4L1 + commands that are in turn send to the various I2C modules. + + pvrusb2-i2c-cmd-v4l2.[ch] - This module implements generic V4L2 + compatible commands to the I2C modules. It is here where state + changes inside the pvrusb2 driver are translated into V4L2 + commands that are in turn send to the various I2C modules. + + pvrusb2-i2c-core.[ch] - This module provides an implementation of a + kernel-friendly I2C adaptor driver, through which other external + I2C client drivers (e.g. msp3400, tuner, lirc) may connect and + operate corresponding chips within the the pvrusb2 device. It is + through here that other V4L modules can reach into this driver to + operate specific pieces (and those modules are in turn driven by + glue logic which is coordinated by pvrusb2-hdw, doled out by + pvrusb2-context, and then ultimately made available to users + through one of the high level interfaces). + + pvrusb2-io.[ch] - This module implements a very low level ring of + transfer buffers, required in order to stream data from the + device. This module is *very* low level. It only operates the + buffers and makes no attempt to define any policy or mechanism for + how such buffers might be used. + + pvrusb2-ioread.[ch] - This module layers on top of pvrusb2-io.[ch] + to provide a streaming API usable by a read() system call style of + I/O. Right now this is the only layer on top of pvrusb2-io.[ch], + however the underlying architecture here was intended to allow for + other styles of I/O to be implemented with additonal modules, like + mmap()'ed buffers or something even more exotic. + + pvrusb2-main.c - This is the top level of the driver. Module level + and USB core entry points are here. This is our "main". + + pvrusb2-sysfs.[ch] - This is the high level interface which ties the + pvrusb2 driver into sysfs. Through this interface you can do + everything with the driver except actually stream data. + + pvrusb2-tuner.[ch] - This is glue logic that resides between this + driver and the tuner.ko I2C client driver (which is found + elsewhere in V4L). + + pvrusb2-util.h - This header defines some common macros used + throughout the driver. These macros are not really specific to + the driver, but they had to go somewhere. + + pvrusb2-v4l2.[ch] - This is the high level interface which ties the + pvrusb2 driver into video4linux. It is through here that V4L + applications can open and operate the driver in the usual V4L + ways. Note that **ALL** V4L functionality is published only + through here and nowhere else. + + pvrusb2-video-*.[ch] - This is glue logic that resides between this + driver and the saa711x.ko I2C client driver (which is found + elsewhere in V4L). Note that saa711x.ko used to be known as + saa7115.ko in ivtv. There are two versions of this; one is + selected depending on the particular saa711[5x].ko that is found. + + pvrusb2.h - This header contains compile time tunable parameters + (and at the moment the driver has very little that needs to be + tuned). + + + -Mike Isely + isely@pobox.com + diff --git a/Documentation/video4linux/Zoran b/Documentation/video4linux/Zoran index be9f21b84555..040a2c841ae9 100644 --- a/Documentation/video4linux/Zoran +++ b/Documentation/video4linux/Zoran @@ -33,6 +33,21 @@ Inputs/outputs: Composite and S-video Norms: PAL, SECAM (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps) Card number: 7 +AverMedia 6 Eyes AVS6EYES: +* Zoran zr36067 PCI controller +* Zoran zr36060 MJPEG codec +* Samsung ks0127 TV decoder +* Conexant bt866 TV encoder +Drivers to use: videodev, i2c-core, i2c-algo-bit, + videocodec, ks0127, bt866, zr36060, zr36067 +Inputs/outputs: Six physical inputs. 1-6 are composite, + 1-2, 3-4, 5-6 doubles as S-video, + 1-3 triples as component. + One composite output. +Norms: PAL, SECAM (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps) +Card number: 8 +Not autodetected, card=8 is necessary. + Linux Media Labs LML33: * Zoran zr36067 PCI controller * Zoran zr36060 MJPEG codec @@ -192,6 +207,10 @@ Micronas vpx3220a TV decoder was introduced in 1996, is used in the DC30 and DC30+ and can handle: PAL B/G/H/I, PAL N, PAL M, NTSC M, NTSC 44, PAL 60, SECAM,NTSC Comb +Samsung ks0127 TV decoder +is used in the AVS6EYES card and +can handle: NTSC-M/N/44, PAL-M/N/B/G/H/I/D/K/L and SECAM + =========================== 1.2 What the TV encoder can do an what not @@ -221,6 +240,10 @@ ITT mse3000 TV encoder was introduced in 1991, is used in the DC10 old can generate: PAL , NTSC , SECAM +Conexant bt866 TV encoder +is used in AVS6EYES, and +can generate: NTSC/PAL, PALM, PALN + The adv717x, should be able to produce PAL N. But you find nothing PAL N specific in the registers. Seem that you have to reuse a other standard to generate PAL N, maybe it would work if you use the PAL M settings. diff --git a/Documentation/video4linux/bttv/CONTRIBUTORS b/Documentation/video4linux/bttv/CONTRIBUTORS index aef49db8847d..8aad6dd93d6b 100644 --- a/Documentation/video4linux/bttv/CONTRIBUTORS +++ b/Documentation/video4linux/bttv/CONTRIBUTORS @@ -1,4 +1,4 @@ -Contributors to bttv: +Contributors to bttv: Michael Chu <mmchu@pobox.com> AverMedia fix and more flexible card recognition @@ -8,8 +8,8 @@ Alan Cox <alan@redhat.com> Chris Kleitsch Hardware I2C - -Gerd Knorr <kraxel@cs.tu-berlin.de> + +Gerd Knorr <kraxel@cs.tu-berlin.de> Radio card (ITT sound processor) bigfoot <bigfoot@net-way.net> @@ -18,7 +18,7 @@ Ragnar Hojland Espinosa <ragnar@macula.net> + many more (please mail me if you are missing in this list and would - like to be mentioned) + like to be mentioned) diff --git a/Documentation/video4linux/cx2341x/fw-calling.txt b/Documentation/video4linux/cx2341x/fw-calling.txt new file mode 100644 index 000000000000..8d21181de537 --- /dev/null +++ b/Documentation/video4linux/cx2341x/fw-calling.txt @@ -0,0 +1,69 @@ +This page describes how to make calls to the firmware api. + +How to call +=========== + +The preferred calling convention is known as the firmware mailbox. The +mailboxes are basically a fixed length array that serves as the call-stack. + +Firmware mailboxes can be located by searching the encoder and decoder memory +for a 16 byte signature. That signature will be located on a 256-byte boundary. + +Signature: +0x78, 0x56, 0x34, 0x12, 0x12, 0x78, 0x56, 0x34, +0x34, 0x12, 0x78, 0x56, 0x56, 0x34, 0x12, 0x78 + +The firmware implements 20 mailboxes of 20 32-bit words. The first 10 are +reserved for API calls. The second 10 are used by the firmware for event +notification. + + Index Name + ----- ---- + 0 Flags + 1 Command + 2 Return value + 3 Timeout + 4-19 Parameter/Result + + +The flags are defined in the following table. The direction is from the +perspective of the firmware. + + Bit Direction Purpose + --- --------- ------- + 2 O Firmware has processed the command. + 1 I Driver has finished setting the parameters. + 0 I Driver is using this mailbox. + + +The command is a 32-bit enumerator. The API specifics may be found in the +fw-*-api.txt documents. + +The return value is a 32-bit enumerator. Only two values are currently defined: +0=success and -1=command undefined. + +There are 16 parameters/results 32-bit fields. The driver populates these fields +with values for all the parameters required by the call. The driver overwrites +these fields with result values returned by the call. The API specifics may be +found in the fw-*-api.txt documents. + +The timeout value protects the card from a hung driver thread. If the driver +doesn't handle the completed call within the timeout specified, the firmware +will reset that mailbox. + +To make an API call, the driver iterates over each mailbox looking for the +first one available (bit 0 has been cleared). The driver sets that bit, fills +in the command enumerator, the timeout value and any required parameters. The +driver then sets the parameter ready bit (bit 1). The firmware scans the +mailboxes for pending commands, processes them, sets the result code, populates +the result value array with that call's return values and sets the call +complete bit (bit 2). Once bit 2 is set, the driver should retrieve the results +and clear all the flags. If the driver does not perform this task within the +time set in the timeout register, the firmware will reset that mailbox. + +Event notifications are sent from the firmware to the host. The host tells the +firmware which events it is interested in via an API call. That call tells the +firmware which notification mailbox to use. The firmware signals the host via +an interrupt. Only the 16 Results fields are used, the Flags, Command, Return +value and Timeout words are not used. + diff --git a/Documentation/video4linux/cx2341x/fw-decoder-api.txt b/Documentation/video4linux/cx2341x/fw-decoder-api.txt new file mode 100644 index 000000000000..9df4fb3ea0f2 --- /dev/null +++ b/Documentation/video4linux/cx2341x/fw-decoder-api.txt @@ -0,0 +1,319 @@ +Decoder firmware API description +================================ + +Note: this API is part of the decoder firmware, so it's cx23415 only. + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_PING_FW +Enum 0/0x00 +Description + This API call does nothing. It may be used to check if the firmware + is responding. + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_START_PLAYBACK +Enum 1/0x01 +Description + Begin or resume playback. +Param[0] + 0 based frame number in GOP to begin playback from. +Param[1] + Specifies the number of muted audio frames to play before normal + audio resumes. + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_STOP_PLAYBACK +Enum 2/0x02 +Description + Ends playback and clears all decoder buffers. If PTS is not zero, + playback stops at specified PTS. +Param[0] + Display 0=last frame, 1=black +Param[1] + PTS low +Param[2] + PTS high + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_PLAYBACK_SPEED +Enum 3/0x03 +Description + Playback stream at speed other than normal. There are two modes of + operation: + Smooth: host transfers entire stream and firmware drops unused + frames. + Coarse: host drops frames based on indexing as required to achieve + desired speed. +Param[0] + Bitmap: + 0:7 0 normal + 1 fast only "1.5 times" + n nX fast, 1/nX slow + 30 Framedrop: + '0' during 1.5 times play, every other B frame is dropped + '1' during 1.5 times play, stream is unchanged (bitrate + must not exceed 8mbps) + 31 Speed: + '0' slow + '1' fast +Param[1] + Direction: 0=forward, 1=reverse +Param[2] + Picture mask: + 1=I frames + 3=I, P frames + 7=I, P, B frames +Param[3] + B frames per GOP (for reverse play only) +Param[4] + Mute audio: 0=disable, 1=enable +Param[5] + Display 0=frame, 1=field +Param[6] + Specifies the number of muted audio frames to play before normal audio + resumes. + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_STEP_VIDEO +Enum 5/0x05 +Description + Each call to this API steps the playback to the next unit defined below + in the current playback direction. +Param[0] + 0=frame, 1=top field, 2=bottom field + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_DMA_BLOCK_SIZE +Enum 8/0x08 +Description + Set DMA transfer block size. Counterpart to API 0xC9 +Param[0] + DMA transfer block size in bytes. A different size may be specified + when issuing the DMA transfer command. + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_GET_XFER_INFO +Enum 9/0x09 +Description + This API call may be used to detect an end of stream condtion. +Result[0] + Stream type +Result[1] + Address offset +Result[2] + Maximum bytes to transfer +Result[3] + Buffer fullness + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_GET_DMA_STATUS +Enum 10/0x0A +Description + Status of the last DMA transfer +Result[0] + Bit 1 set means transfer complete + Bit 2 set means DMA error + Bit 3 set means linked list error +Result[1] + DMA type: 0=MPEG, 1=OSD, 2=YUV + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SCHED_DMA_FROM_HOST +Enum 11/0x0B +Description + Setup DMA from host operation. Counterpart to API 0xCC +Param[0] + Memory address of link list +Param[1] + Total # of bytes to transfer +Param[2] + DMA type (0=MPEG, 1=OSD, 2=YUV) + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_PAUSE_PLAYBACK +Enum 13/0x0D +Description + Freeze playback immediately. In this mode, when internal buffers are + full, no more data will be accepted and data request IRQs will be + masked. +Param[0] + Display: 0=last frame, 1=black + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_HALT_FW +Enum 14/0x0E +Description + The firmware is halted and no further API calls are serviced until + the firmware is uploaded again. + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_STANDARD +Enum 16/0x10 +Description + Selects display standard +Param[0] + 0=NTSC, 1=PAL + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_GET_VERSION +Enum 17/0x11 +Description + Returns decoder firmware version information +Result[0] + Version bitmask: + Bits 0:15 build + Bits 16:23 minor + Bits 24:31 major + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_STREAM_INPUT +Enum 20/0x14 +Description + Select decoder stream input port +Param[0] + 0=memory (default), 1=streaming + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_GET_TIMING_INFO +Enum 21/0x15 +Description + Returns timing information from start of playback +Result[0] + Frame count by decode order +Result[1] + Video PTS bits 0:31 by display order +Result[2] + Video PTS bit 32 by display order +Result[3] + SCR bits 0:31 by display order +Result[4] + SCR bit 32 by display order + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_AUDIO_MODE +Enum 22/0x16 +Description + Select audio mode +Param[0] + Dual mono mode action +Param[1] + Stereo mode action: + 0=Stereo, 1=Left, 2=Right, 3=Mono, 4=Swap, -1=Unchanged + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_EVENT_NOTIFICATION +Enum 23/0x17 +Description + Setup firmware to notify the host about a particular event. + Counterpart to API 0xD5 +Param[0] + Event: 0=Audio mode change between stereo and dual channel +Param[1] + Notification 0=disabled, 1=enabled +Param[2] + Interrupt bit +Param[3] + Mailbox slot, -1 if no mailbox required. + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_DISPLAY_BUFFERS +Enum 24/0x18 +Description + Number of display buffers. To decode all frames in reverse playback you + must use nine buffers. +Param[0] + 0=six buffers, 1=nine buffers + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_EXTRACT_VBI +Enum 25/0x19 +Description + Extracts VBI data +Param[0] + 0=extract from extension & user data, 1=extract from private packets +Result[0] + VBI table location +Result[1] + VBI table size + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_DECODER_SOURCE +Enum 26/0x1A +Description + Selects decoder source. Ensure that the parameters passed to this + API match the encoder settings. +Param[0] + Mode: 0=MPEG from host, 1=YUV from encoder, 2=YUV from host +Param[1] + YUV picture width +Param[2] + YUV picture height +Param[3] + Bitmap: see Param[0] of API 0xBD + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_AUDIO_OUTPUT +Enum 27/0x1B +Description + Select audio output format +Param[0] + Bitmask: + 0:1 Data size: + '00' 16 bit + '01' 20 bit + '10' 24 bit + 2:7 Unused + 8:9 Mode: + '00' 2 channels + '01' 4 channels + '10' 6 channels + '11' 6 channels with one line data mode + (for left justified MSB first mode, 20 bit only) + 10:11 Unused + 12:13 Channel format: + '00' right justified MSB first mode + '01' left justified MSB first mode + '10' I2S mode + 14:15 Unused + 16:21 Right justify bit count + 22:31 Unused + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_AV_DELAY +Enum 28/0x1C +Description + Set audio/video delay in 90Khz ticks +Param[0] + 0=A/V in sync, negative=audio lags, positive=video lags + +------------------------------------------------------------------------------- + +Name CX2341X_DEC_SET_PREBUFFERING +Enum 30/0x1E +Description + Decoder prebuffering, when enabled up to 128KB are buffered for + streams <8mpbs or 640KB for streams >8mbps +Param[0] + 0=off, 1=on diff --git a/Documentation/video4linux/cx2341x/fw-dma.txt b/Documentation/video4linux/cx2341x/fw-dma.txt new file mode 100644 index 000000000000..8123e262d5b6 --- /dev/null +++ b/Documentation/video4linux/cx2341x/fw-dma.txt @@ -0,0 +1,94 @@ +This page describes the structures and procedures used by the cx2341x DMA +engine. + +Introduction +============ + +The cx2341x PCI interface is busmaster capable. This means it has a DMA +engine to efficiently transfer large volumes of data between the card and main +memory without requiring help from a CPU. Like most hardware, it must operate +on contiguous physical memory. This is difficult to come by in large quantities +on virtual memory machines. + +Therefore, it also supports a technique called "scatter-gather". The card can +transfer multiple buffers in one operation. Instead of allocating one large +contiguous buffer, the driver can allocate several smaller buffers. + +In practice, I've seen the average transfer to be roughly 80K, but transfers +above 128K were not uncommon, particularly at startup. The 128K figure is +important, because that is the largest block that the kernel can normally +allocate. Even still, 128K blocks are hard to come by, so the driver writer is +urged to choose a smaller block size and learn the scatter-gather technique. + +Mailbox #10 is reserved for DMA transfer information. + +Flow +==== + +This section describes, in general, the order of events when handling DMA +transfers. Detailed information follows this section. + +- The card raises the Encoder interrupt. +- The driver reads the transfer type, offset and size from Mailbox #10. +- The driver constructs the scatter-gather array from enough free dma buffers + to cover the size. +- The driver schedules the DMA transfer via the ScheduleDMAtoHost API call. +- The card raises the DMA Complete interrupt. +- The driver checks the DMA status register for any errors. +- The driver post-processes the newly transferred buffers. + +NOTE! It is possible that the Encoder and DMA Complete interrupts get raised +simultaneously. (End of the last, start of the next, etc.) + +Mailbox #10 +=========== + +The Flags, Command, Return Value and Timeout fields are ignored. + +Name: Mailbox #10 +Results[0]: Type: 0: MPEG. +Results[1]: Offset: The position relative to the card's memory space. +Results[2]: Size: The exact number of bytes to transfer. + +My speculation is that since the StartCapture API has a capture type of "RAW" +available, that the type field will have other values that correspond to YUV +and PCM data. + +Scatter-Gather Array +==================== + +The scatter-gather array is a contiguously allocated block of memory that +tells the card the source and destination of each data-block to transfer. +Card "addresses" are derived from the offset supplied by Mailbox #10. Host +addresses are the physical memory location of the target DMA buffer. + +Each S-G array element is a struct of three 32-bit words. The first word is +the source address, the second is the destination address. Both take up the +entire 32 bits. The lowest 16 bits of the third word is the transfer byte +count. The high-bit of the third word is the "last" flag. The last-flag tells +the card to raise the DMA_DONE interrupt. From hard personal experience, if +you forget to set this bit, the card will still "work" but the stream will +most likely get corrupted. + +The transfer count must be a multiple of 256. Therefore, the driver will need +to track how much data in the target buffer is valid and deal with it +accordingly. + +Array Element: + +- 32-bit Source Address +- 32-bit Destination Address +- 16-bit reserved (high bit is the last flag) +- 16-bit byte count + +DMA Transfer Status +=================== + +Register 0x0004 holds the DMA Transfer Status: + +Bit +4 Scatter-Gather array error +3 DMA write error +2 DMA read error +1 write completed +0 read completed diff --git a/Documentation/video4linux/cx2341x/fw-encoder-api.txt b/Documentation/video4linux/cx2341x/fw-encoder-api.txt new file mode 100644 index 000000000000..001c68644b08 --- /dev/null +++ b/Documentation/video4linux/cx2341x/fw-encoder-api.txt @@ -0,0 +1,694 @@ +Encoder firmware API description +================================ + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_PING_FW +Enum 128/0x80 +Description + Does nothing. Can be used to check if the firmware is responding. + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_START_CAPTURE +Enum 129/0x81 +Description + Commences the capture of video, audio and/or VBI data. All encoding + parameters must be initialized prior to this API call. Captures frames + continuously or until a predefined number of frames have been captured. +Param[0] + Capture stream type: + 0=MPEG + 1=Raw + 2=Raw passthrough + 3=VBI + +Param[1] + Bitmask: + Bit 0 when set, captures YUV + Bit 1 when set, captures PCM audio + Bit 2 when set, captures VBI (same as param[0]=3) + Bit 3 when set, the capture destination is the decoder + (same as param[0]=2) + Bit 4 when set, the capture destination is the host + Note: this parameter is only meaningful for RAW capture type. + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_STOP_CAPTURE +Enum 130/0x82 +Description + Ends a capture in progress +Param[0] + 0=stop at end of GOP (generates IRQ) + 1=stop immediate (no IRQ) +Param[1] + Stream type to stop, see param[0] of API 0x81 +Param[2] + Subtype, see param[1] of API 0x81 + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_AUDIO_ID +Enum 137/0x89 +Description + Assigns the transport stream ID of the encoded audio stream +Param[0] + Audio Stream ID + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_VIDEO_ID +Enum 139/0x8B +Description + Set video transport stream ID +Param[0] + Video stream ID + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_PCR_ID +Enum 141/0x8D +Description + Assigns the transport stream ID for PCR packets +Param[0] + PCR Stream ID + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_FRAME_RATE +Enum 143/0x8F +Description + Set video frames per second. Change occurs at start of new GOP. +Param[0] + 0=30fps + 1=25fps + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_FRAME_SIZE +Enum 145/0x91 +Description + Select video stream encoding resolution. +Param[0] + Height in lines. Default 480 +Param[1] + Width in pixels. Default 720 + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_BIT_RATE +Enum 149/0x95 +Description + Assign average video stream bitrate. Note on the last three params: + Param[3] and [4] seem to be always 0, param [5] doesn't seem to be used. +Param[0] + 0=variable bitrate, 1=constant bitrate +Param[1] + bitrate in bits per second +Param[2] + peak bitrate in bits per second, divided by 400 +Param[3] + Mux bitrate in bits per second, divided by 400. May be 0 (default). +Param[4] + Rate Control VBR Padding +Param[5] + VBV Buffer used by encoder + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_GOP_PROPERTIES +Enum 151/0x97 +Description + Setup the GOP structure +Param[0] + GOP size (maximum is 34) +Param[1] + Number of B frames between the I and P frame, plus 1. + For example: IBBPBBPBBPBB --> GOP size: 12, number of B frames: 2+1 = 3 + Note that GOP size must be a multiple of (B-frames + 1). + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_ASPECT_RATIO +Enum 153/0x99 +Description + Sets the encoding aspect ratio. Changes in the aspect ratio take effect + at the start of the next GOP. +Param[0] + '0000' forbidden + '0001' 1:1 square + '0010' 4:3 + '0011' 16:9 + '0100' 2.21:1 + '0101' reserved + .... + '1111' reserved + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_DNR_FILTER_MODE +Enum 155/0x9B +Description + Assign Dynamic Noise Reduction operating mode +Param[0] + Bit0: Spatial filter, set=auto, clear=manual + Bit1: Temporal filter, set=auto, clear=manual +Param[1] + Median filter: + 0=Disabled + 1=Horizontal + 2=Vertical + 3=Horiz/Vert + 4=Diagonal + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_DNR_FILTER_PROPS +Enum 157/0x9D +Description + These Dynamic Noise Reduction filter values are only meaningful when + the respective filter is set to "manual" (See API 0x9B) +Param[0] + Spatial filter: default 0, range 0:15 +Param[1] + Temporal filter: default 0, range 0:31 + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_CORING_LEVELS +Enum 159/0x9F +Description + Assign Dynamic Noise Reduction median filter properties. +Param[0] + Threshold above which the luminance median filter is enabled. + Default: 0, range 0:255 +Param[1] + Threshold below which the luminance median filter is enabled. + Default: 255, range 0:255 +Param[2] + Threshold above which the chrominance median filter is enabled. + Default: 0, range 0:255 +Param[3] + Threshold below which the chrominance median filter is enabled. + Default: 255, range 0:255 + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_SPATIAL_FILTER_TYPE +Enum 161/0xA1 +Description + Assign spatial prefilter parameters +Param[0] + Luminance filter + 0=Off + 1=1D Horizontal + 2=1D Vertical + 3=2D H/V Separable (default) + 4=2D Symmetric non-separable +Param[1] + Chrominance filter + 0=Off + 1=1D Horizontal (default) + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_3_2_PULLDOWN +Enum 177/0xB1 +Description + 3:2 pulldown properties +Param[0] + 0=enabled + 1=disabled + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_VBI_LINE +Enum 183/0xB7 +Description + Selects VBI line number. +Param[0] + Bits 0:4 line number + Bit 31 0=top_field, 1=bottom_field + Bits 0:31 all set specifies "all lines" +Param[1] + VBI line information features: 0=disabled, 1=enabled +Param[2] + Slicing: 0=None, 1=Closed Caption + Almost certainly not implemented. Set to 0. +Param[3] + Luminance samples in this line. + Almost certainly not implemented. Set to 0. +Param[4] + Chrominance samples in this line + Almost certainly not implemented. Set to 0. + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_STREAM_TYPE +Enum 185/0xB9 +Description + Assign stream type + Note: Transport stream is not working in recent firmwares. + And in older firmwares the timestamps in the TS seem to be + unreliable. +Param[0] + 0=Program stream + 1=Transport stream + 2=MPEG1 stream + 3=PES A/V stream + 5=PES Video stream + 7=PES Audio stream + 10=DVD stream + 11=VCD stream + 12=SVCD stream + 13=DVD_S1 stream + 14=DVD_S2 stream + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_OUTPUT_PORT +Enum 187/0xBB +Description + Assign stream output port. Normally 0 when the data is copied through + the PCI bus (DMA), and 1 when the data is streamed to another chip + (pvrusb and cx88-blackbird). +Param[0] + 0=Memory (default) + 1=Streaming + 2=Serial +Param[1] + Unknown, but leaving this to 0 seems to work best. Indications are that + this might have to do with USB support, although passing anything but 0 + onl breaks things. + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_AUDIO_PROPERTIES +Enum 189/0xBD +Description + Set audio stream properties, may be called while encoding is in progress. + Note: all bitfields are consistent with ISO11172 documentation except + bits 2:3 which ISO docs define as: + '11' Layer I + '10' Layer II + '01' Layer III + '00' Undefined + This discrepancy may indicate a possible error in the documentation. + Testing indicated that only Layer II is actually working, and that + the minimum bitrate should be 192 kbps. +Param[0] + Bitmask: + 0:1 '00' 44.1Khz + '01' 48Khz + '10' 32Khz + '11' reserved + + 2:3 '01'=Layer I + '10'=Layer II + + 4:7 Bitrate: + Index | Layer I | Layer II + ------+-------------+------------ + '0000' | free format | free format + '0001' | 32 kbit/s | 32 kbit/s + '0010' | 64 kbit/s | 48 kbit/s + '0011' | 96 kbit/s | 56 kbit/s + '0100' | 128 kbit/s | 64 kbit/s + '0101' | 160 kbit/s | 80 kbit/s + '0110' | 192 kbit/s | 96 kbit/s + '0111' | 224 kbit/s | 112 kbit/s + '1000' | 256 kbit/s | 128 kbit/s + '1001' | 288 kbit/s | 160 kbit/s + '1010' | 320 kbit/s | 192 kbit/s + '1011' | 352 kbit/s | 224 kbit/s + '1100' | 384 kbit/s | 256 kbit/s + '1101' | 416 kbit/s | 320 kbit/s + '1110' | 448 kbit/s | 384 kbit/s + Note: For Layer II, not all combinations of total bitrate + and mode are allowed. See ISO11172-3 3-Annex B, Table 3-B.2 + + 8:9 '00'=Stereo + '01'=JointStereo + '10'=Dual + '11'=Mono + Note: testing seems to indicate that Mono and possibly + JointStereo are not working (default to stereo). + Dual does work, though. + + 10:11 Mode Extension used in joint_stereo mode. + In Layer I and II they indicate which subbands are in + intensity_stereo. All other subbands are coded in stereo. + '00' subbands 4-31 in intensity_stereo, bound==4 + '01' subbands 8-31 in intensity_stereo, bound==8 + '10' subbands 12-31 in intensity_stereo, bound==12 + '11' subbands 16-31 in intensity_stereo, bound==16 + + 12:13 Emphasis: + '00' None + '01' 50/15uS + '10' reserved + '11' CCITT J.17 + + 14 CRC: + '0' off + '1' on + + 15 Copyright: + '0' off + '1' on + + 16 Generation: + '0' copy + '1' original + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_HALT_FW +Enum 195/0xC3 +Description + The firmware is halted and no further API calls are serviced until the + firmware is uploaded again. + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_GET_VERSION +Enum 196/0xC4 +Description + Returns the version of the encoder firmware. +Result[0] + Version bitmask: + Bits 0:15 build + Bits 16:23 minor + Bits 24:31 major + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_GOP_CLOSURE +Enum 197/0xC5 +Description + Assigns the GOP open/close property. +Param[0] + 0=Open + 1=Closed + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_GET_SEQ_END +Enum 198/0xC6 +Description + Obtains the sequence end code of the encoder's buffer. When a capture + is started a number of interrupts are still generated, the last of + which will have Result[0] set to 1 and Result[1] will contain the size + of the buffer. +Result[0] + State of the transfer (1 if last buffer) +Result[1] + If Result[0] is 1, this contains the size of the last buffer, undefined + otherwise. + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_PGM_INDEX_INFO +Enum 199/0xC7 +Description + Sets the Program Index Information. +Param[0] + Picture Mask: + 0=No index capture + 1=I frames + 3=I,P frames + 7=I,P,B frames +Param[1] + Elements requested (up to 400) +Result[0] + Offset in SDF memory of the table. +Result[1] + Number of allocated elements up to a maximum of Param[1] + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_VBI_CONFIG +Enum 200/0xC8 +Description + Configure VBI settings +Param[0] + Bitmap: + 0 Mode '0' Sliced, '1' Raw + 1:3 Insertion: + '000' insert in extension & user data + '001' insert in private packets + '010' separate stream and user data + '111' separate stream and private data + 8:15 Stream ID (normally 0xBD) +Param[1] + Frames per interrupt (max 8). Only valid in raw mode. +Param[2] + Total raw VBI frames. Only valid in raw mode. +Param[3] + Start codes +Param[4] + Stop codes +Param[5] + Lines per frame +Param[6] + Byte per line +Result[0] + Observed frames per interrupt in raw mode only. Rage 1 to Param[1] +Result[1] + Observed number of frames in raw mode. Range 1 to Param[2] +Result[2] + Memory offset to start or raw VBI data + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_DMA_BLOCK_SIZE +Enum 201/0xC9 +Description + Set DMA transfer block size +Param[0] + DMA transfer block size in bytes or frames. When unit is bytes, + supported block sizes are 2^7, 2^8 and 2^9 bytes. +Param[1] + Unit: 0=bytes, 1=frames + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_GET_PREV_DMA_INFO_MB_10 +Enum 202/0xCA +Description + Returns information on the previous DMA transfer in conjunction with + bit 27 of the interrupt mask. Uses mailbox 10. +Result[0] + Type of stream +Result[1] + Address Offset +Result[2] + Maximum size of transfer + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_GET_PREV_DMA_INFO_MB_9 +Enum 203/0xCB +Description + Returns information on the previous DMA transfer in conjunction with + bit 27 of the interrupt mask. Uses mailbox 9. +Result[0] + Status bits: + Bit 0 set indicates transfer complete + Bit 2 set indicates transfer error + Bit 4 set indicates linked list error +Result[1] + DMA type +Result[2] + Presentation Time Stamp bits 0..31 +Result[3] + Presentation Time Stamp bit 32 + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SCHED_DMA_TO_HOST +Enum 204/0xCC +Description + Setup DMA to host operation +Param[0] + Memory address of link list +Param[1] + Length of link list (wtf: what units ???) +Param[2] + DMA type (0=MPEG) + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_INITIALIZE_INPUT +Enum 205/0xCD +Description + Initializes the video input + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_FRAME_DROP_RATE +Enum 208/0xD0 +Description + For each frame captured, skip specified number of frames. +Param[0] + Number of frames to skip + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_PAUSE_ENCODER +Enum 210/0xD2 +Description + During a pause condition, all frames are dropped instead of being encoded. +Param[0] + 0=Pause encoding + 1=Continue encoding + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_REFRESH_INPUT +Enum 211/0xD3 +Description + Refreshes the video input + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_COPYRIGHT +Enum 212/0xD4 +Description + Sets stream copyright property +Param[0] + 0=Stream is not copyrighted + 1=Stream is copyrighted + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_EVENT_NOTIFICATION +Enum 213/0xD5 +Description + Setup firmware to notify the host about a particular event. Host must + unmask the interrupt bit. +Param[0] + Event (0=refresh encoder input) +Param[1] + Notification 0=disabled 1=enabled +Param[2] + Interrupt bit +Param[3] + Mailbox slot, -1 if no mailbox required. + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_NUM_VSYNC_LINES +Enum 214/0xD6 +Description + Depending on the analog video decoder used, this assigns the number + of lines for field 1 and 2. +Param[0] + Field 1 number of lines: + 0x00EF for SAA7114 + 0x00F0 for SAA7115 + 0x0105 for Micronas +Param[1] + Field 2 number of lines: + 0x00EF for SAA7114 + 0x00F0 for SAA7115 + 0x0106 for Micronas + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_SET_PLACEHOLDER +Enum 215/0xD7 +Description + Provides a mechanism of inserting custom user data in the MPEG stream. +Param[0] + 0=extension & user data + 1=private packet with stream ID 0xBD +Param[1] + Rate at which to insert data, in units of frames (for private packet) + or GOPs (for ext. & user data) +Param[2] + Number of data DWORDs (below) to insert +Param[3] + Custom data 0 +Param[4] + Custom data 1 +Param[5] + Custom data 2 +Param[6] + Custom data 3 +Param[7] + Custom data 4 +Param[8] + Custom data 5 +Param[9] + Custom data 6 +Param[10] + Custom data 7 +Param[11] + Custom data 8 + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_MUTE_VIDEO +Enum 217/0xD9 +Description + Video muting +Param[0] + Bit usage: + 0 '0'=video not muted + '1'=video muted, creates frames with the YUV color defined below + 1:7 Unused + 8:15 V chrominance information + 16:23 U chrominance information + 24:31 Y luminance information + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_MUTE_AUDIO +Enum 218/0xDA +Description + Audio muting +Param[0] + 0=audio not muted + 1=audio muted (produces silent mpeg audio stream) + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_UNKNOWN +Enum 219/0xDB +Description + Unknown API, it's used by Hauppauge though. +Param[0] + 0 This is the value Hauppauge uses, Unknown what it means. + +------------------------------------------------------------------------------- + +Name CX2341X_ENC_MISC +Enum 220/0xDC +Description + Miscellaneous actions. Not known for 100% what it does. It's really a + sort of ioctl call. The first parameter is a command number, the second + the value. +Param[0] + Command number: + 1=set initial SCR value when starting encoding. + 2=set quality mode (apparently some test setting). + 3=setup advanced VIM protection handling (supposedly only for the cx23416 + for raw YUV). + Actually it looks like this should be 0 for saa7114/5 based card and 1 + for cx25840 based cards. + 4=generate artificial PTS timestamps + 5=USB flush mode + 6=something to do with the quantization matrix + 7=set navigation pack insertion for DVD + 8=enable scene change detection (seems to be a failure) + 9=set history parameters of the video input module + 10=set input field order of VIM + 11=set quantization matrix + 12=reset audio interface + 13=set audio volume delay + 14=set audio delay + +Param[1] + Command value. diff --git a/Documentation/video4linux/cx2341x/fw-memory.txt b/Documentation/video4linux/cx2341x/fw-memory.txt new file mode 100644 index 000000000000..ef0aad3f88fc --- /dev/null +++ b/Documentation/video4linux/cx2341x/fw-memory.txt @@ -0,0 +1,141 @@ +This document describes the cx2341x memory map and documents some of the register +space. + +Warning! This information was figured out from searching through the memory and +registers, this information may not be correct and is certainly not complete, and +was not derived from anything more than searching through the memory space with +commands like: + + ivtvctl -O min=0x02000000,max=0x020000ff + +So take this as is, I'm always searching for more stuff, it's a large +register space :-). + +Memory Map +========== + +The cx2341x exposes its entire 64M memory space to the PCI host via the PCI BAR0 +(Base Address Register 0). The addresses here are offsets relative to the +address held in BAR0. + +0x00000000-0x00ffffff Encoder memory space +0x00000000-0x0003ffff Encode.rom + ???-??? MPEG buffer(s) + ???-??? Raw video capture buffer(s) + ???-??? Raw audio capture buffer(s) + ???-??? Display buffers (6 or 9) + +0x01000000-0x01ffffff Decoder memory space +0x01000000-0x0103ffff Decode.rom + ???-??? MPEG buffers(s) +0x0114b000-0x0115afff Audio.rom (deprecated?) + +0x02000000-0x0200ffff Register Space + +Registers +========= + +The registers occupy the 64k space starting at the 0x02000000 offset from BAR0. +All of these registers are 32 bits wide. + +DMA Registers 0x000-0xff: + + 0x00 - Control: + 0=reset/cancel, 1=read, 2=write, 4=stop + 0x04 - DMA status: + 1=read busy, 2=write busy, 4=read error, 8=write error, 16=link list error + 0x08 - pci DMA pointer for read link list + 0x0c - pci DMA pointer for write link list + 0x10 - read/write DMA enable: + 1=read enable, 2=write enable + 0x14 - always 0xffffffff, if set any lower instability occurs, 0x00 crashes + 0x18 - ?? + 0x1c - always 0x20 or 32, smaller values slow down DMA transactions + 0x20 - always value of 0x780a010a + 0x24-0x3c - usually just random values??? + 0x40 - Interrupt status + 0x44 - Write a bit here and shows up in Interrupt status 0x40 + 0x48 - Interrupt Mask + 0x4C - always value of 0xfffdffff, + if changed to 0xffffffff DMA write interrupts break. + 0x50 - always 0xffffffff + 0x54 - always 0xffffffff (0x4c, 0x50, 0x54 seem like interrupt masks, are + 3 processors on chip, Java ones, VPU, SPU, APU, maybe these are the + interrupt masks???). + 0x60-0x7C - random values + 0x80 - first write linked list reg, for Encoder Memory addr + 0x84 - first write linked list reg, for pci memory addr + 0x88 - first write linked list reg, for length of buffer in memory addr + (|0x80000000 or this for last link) + 0x8c-0xcc - rest of write linked list reg, 8 sets of 3 total, DMA goes here + from linked list addr in reg 0x0c, firmware must push through or + something. + 0xe0 - first (and only) read linked list reg, for pci memory addr + 0xe4 - first (and only) read linked list reg, for Decoder memory addr + 0xe8 - first (and only) read linked list reg, for length of buffer + 0xec-0xff - Nothing seems to be in these registers, 0xec-f4 are 0x00000000. + +Memory locations for Encoder Buffers 0x700-0x7ff: + +These registers show offsets of memory locations pertaining to each +buffer area used for encoding, have to shift them by <<1 first. + +0x07F8: Encoder SDRAM refresh +0x07FC: Encoder SDRAM pre-charge + +Memory locations for Decoder Buffers 0x800-0x8ff: + +These registers show offsets of memory locations pertaining to each +buffer area used for decoding, have to shift them by <<1 first. + +0x08F8: Decoder SDRAM refresh +0x08FC: Decoder SDRAM pre-charge + +Other memory locations: + +0x2800: Video Display Module control +0x2D00: AO (audio output?) control +0x2D24: Bytes Flushed +0x7000: LSB I2C write clock bit (inverted) +0x7004: LSB I2C write data bit (inverted) +0x7008: LSB I2C read clock bit +0x700c: LSB I2C read data bit +0x9008: GPIO get input state +0x900c: GPIO set output state +0x9020: GPIO direction (Bit7 (GPIO 0..7) - 0:input, 1:output) +0x9050: SPU control +0x9054: Reset HW blocks +0x9058: VPU control +0xA018: Bit6: interrupt pending? +0xA064: APU command + + +Interrupt Status Register +========================= + +The definition of the bits in the interrupt status register 0x0040, and the +interrupt mask 0x0048. If a bit is cleared in the mask, then we want our ISR to +execute. + +Bit +31 Encoder Start Capture +30 Encoder EOS +29 Encoder VBI capture +28 Encoder Video Input Module reset event +27 Encoder DMA complete +26 +25 Decoder copy protect detection event +24 Decoder audio mode change detection event +23 +22 Decoder data request +21 Decoder I-Frame? done +20 Decoder DMA complete +19 Decoder VBI re-insertion +18 Decoder DMA err (linked-list bad) + +Missing +Encoder API call completed +Decoder API call completed +Encoder API post(?) +Decoder API post(?) +Decoder VTRACE event diff --git a/Documentation/video4linux/cx2341x/fw-osd-api.txt b/Documentation/video4linux/cx2341x/fw-osd-api.txt new file mode 100644 index 000000000000..da98ae30a37a --- /dev/null +++ b/Documentation/video4linux/cx2341x/fw-osd-api.txt @@ -0,0 +1,342 @@ +OSD firmware API description +============================ + +Note: this API is part of the decoder firmware, so it's cx23415 only. + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_GET_FRAMEBUFFER +Enum 65/0x41 +Description + Return base and length of contiguous OSD memory. +Result[0] + OSD base address +Result[1] + OSD length + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_GET_PIXEL_FORMAT +Enum 66/0x42 +Description + Query OSD format +Result[0] + 0=8bit index, 4=AlphaRGB 8:8:8:8 + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_PIXEL_FORMAT +Enum 67/0x43 +Description + Assign pixel format +Param[0] + 0=8bit index, 4=AlphaRGB 8:8:8:8 + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_GET_STATE +Enum 68/0x44 +Description + Query OSD state +Result[0] + Bit 0 0=off, 1=on + Bits 1:2 alpha control + Bits 3:5 pixel format + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_STATE +Enum 69/0x45 +Description + OSD switch +Param[0] + 0=off, 1=on + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_GET_OSD_COORDS +Enum 70/0x46 +Description + Retrieve coordinates of OSD area blended with video +Result[0] + OSD buffer address +Result[1] + Stride in pixels +Result[2] + Lines in OSD buffer +Result[3] + Horizontal offset in buffer +Result[4] + Vertical offset in buffer + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_OSD_COORDS +Enum 71/0x47 +Description + Assign the coordinates of the OSD area to blend with video +Param[0] + buffer address +Param[1] + buffer stride in pixels +Param[2] + lines in buffer +Param[3] + horizontal offset +Param[4] + vertical offset + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_GET_SCREEN_COORDS +Enum 72/0x48 +Description + Retrieve OSD screen area coordinates +Result[0] + top left horizontal offset +Result[1] + top left vertical offset +Result[2] + bottom right hotizontal offset +Result[3] + bottom right vertical offset + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_SCREEN_COORDS +Enum 73/0x49 +Description + Assign the coordinates of the screen area to blend with video +Param[0] + top left horizontal offset +Param[1] + top left vertical offset +Param[2] + bottom left horizontal offset +Param[3] + bottom left vertical offset + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_GET_GLOBAL_ALPHA +Enum 74/0x4A +Description + Retrieve OSD global alpha +Result[0] + global alpha: 0=off, 1=on +Result[1] + bits 0:7 global alpha + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_GLOBAL_ALPHA +Enum 75/0x4B +Description + Update global alpha +Param[0] + global alpha: 0=off, 1=on +Param[1] + global alpha (8 bits) +Param[2] + local alpha: 0=on, 1=off + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_BLEND_COORDS +Enum 78/0x4C +Description + Move start of blending area within display buffer +Param[0] + horizontal offset in buffer +Param[1] + vertical offset in buffer + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_GET_FLICKER_STATE +Enum 79/0x4F +Description + Retrieve flicker reduction module state +Result[0] + flicker state: 0=off, 1=on + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_FLICKER_STATE +Enum 80/0x50 +Description + Set flicker reduction module state +Param[0] + State: 0=off, 1=on + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_BLT_COPY +Enum 82/0x52 +Description + BLT copy +Param[0] +'0000' zero +'0001' ~destination AND ~source +'0010' ~destination AND source +'0011' ~destination +'0100' destination AND ~source +'0101' ~source +'0110' destination XOR source +'0111' ~destination OR ~source +'1000' ~destination AND ~source +'1001' destination XNOR source +'1010' source +'1011' ~destination OR source +'1100' destination +'1101' destination OR ~source +'1110' destination OR source +'1111' one + +Param[1] + Resulting alpha blending + '01' source_alpha + '10' destination_alpha + '11' source_alpha*destination_alpha+1 + (zero if both source and destination alpha are zero) +Param[2] + '00' output_pixel = source_pixel + + '01' if source_alpha=0: + output_pixel = destination_pixel + if 256 > source_alpha > 1: + output_pixel = ((source_alpha + 1)*source_pixel + + (255 - source_alpha)*destination_pixel)/256 + + '10' if destination_alpha=0: + output_pixel = source_pixel + if 255 > destination_alpha > 0: + output_pixel = ((255 - destination_alpha)*source_pixel + + (destination_alpha + 1)*destination_pixel)/256 + + '11' if source_alpha=0: + source_temp = 0 + if source_alpha=255: + source_temp = source_pixel*256 + if 255 > source_alpha > 0: + source_temp = source_pixel*(source_alpha + 1) + if destination_alpha=0: + destination_temp = 0 + if destination_alpha=255: + destination_temp = destination_pixel*256 + if 255 > destination_alpha > 0: + destination_temp = destination_pixel*(destination_alpha + 1) + output_pixel = (source_temp + destination_temp)/256 +Param[3] + width +Param[4] + height +Param[5] + destination pixel mask +Param[6] + destination rectangle start address +Param[7] + destination stride in dwords +Param[8] + source stride in dwords +Param[9] + source rectangle start address + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_BLT_FILL +Enum 83/0x53 +Description + BLT fill color +Param[0] + Same as Param[0] on API 0x52 +Param[1] + Same as Param[1] on API 0x52 +Param[2] + Same as Param[2] on API 0x52 +Param[3] + width +Param[4] + height +Param[5] + destination pixel mask +Param[6] + destination rectangle start address +Param[7] + destination stride in dwords +Param[8] + color fill value + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_BLT_TEXT +Enum 84/0x54 +Description + BLT for 8 bit alpha text source +Param[0] + Same as Param[0] on API 0x52 +Param[1] + Same as Param[1] on API 0x52 +Param[2] + Same as Param[2] on API 0x52 +Param[3] + width +Param[4] + height +Param[5] + destination pixel mask +Param[6] + destination rectangle start address +Param[7] + destination stride in dwords +Param[8] + source stride in dwords +Param[9] + source rectangle start address +Param[10] + color fill value + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_FRAMEBUFFER_WINDOW +Enum 86/0x56 +Description + Positions the main output window on the screen. The coordinates must be + such that the entire window fits on the screen. +Param[0] + window width +Param[1] + window height +Param[2] + top left window corner horizontal offset +Param[3] + top left window corner vertical offset + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_CHROMA_KEY +Enum 96/0x60 +Description + Chroma key switch and color +Param[0] + state: 0=off, 1=on +Param[1] + color + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_GET_ALPHA_CONTENT_INDEX +Enum 97/0x61 +Description + Retrieve alpha content index +Result[0] + alpha content index, Range 0:15 + +------------------------------------------------------------------------------- + +Name CX2341X_OSD_SET_ALPHA_CONTENT_INDEX +Enum 98/0x62 +Description + Assign alpha content index +Param[0] + alpha content index, range 0:15 diff --git a/Documentation/video4linux/cx2341x/fw-upload.txt b/Documentation/video4linux/cx2341x/fw-upload.txt new file mode 100644 index 000000000000..60c502ce3215 --- /dev/null +++ b/Documentation/video4linux/cx2341x/fw-upload.txt @@ -0,0 +1,49 @@ +This document describes how to upload the cx2341x firmware to the card. + +How to find +=========== + +See the web pages of the various projects that uses this chip for information +on how to obtain the firmware. + +The firmware stored in a Windows driver can be detected as follows: + +- Each firmware image is 256k bytes. +- The 1st 32-bit word of the Encoder image is 0x0000da7 +- The 1st 32-bit word of the Decoder image is 0x00003a7 +- The 2nd 32-bit word of both images is 0xaa55bb66 + +How to load +=========== + +- Issue the FWapi command to stop the encoder if it is running. Wait for the + command to complete. +- Issue the FWapi command to stop the decoder if it is running. Wait for the + command to complete. +- Issue the I2C command to the digitizer to stop emitting VSYNC events. +- Issue the FWapi command to halt the encoder's firmware. +- Sleep for 10ms. +- Issue the FWapi command to halt the decoder's firmware. +- Sleep for 10ms. +- Write 0x00000000 to register 0x2800 to stop the Video Display Module. +- Write 0x00000005 to register 0x2D00 to stop the AO (audio output?). +- Write 0x00000000 to register 0xA064 to ping? the APU. +- Write 0xFFFFFFFE to register 0x9058 to stop the VPU. +- Write 0xFFFFFFFF to register 0x9054 to reset the HW blocks. +- Write 0x00000001 to register 0x9050 to stop the SPU. +- Sleep for 10ms. +- Write 0x0000001A to register 0x07FC to init the Encoder SDRAM's pre-charge. +- Write 0x80000640 to register 0x07F8 to init the Encoder SDRAM's refresh to 1us. +- Write 0x0000001A to register 0x08FC to init the Decoder SDRAM's pre-charge. +- Write 0x80000640 to register 0x08F8 to init the Decoder SDRAM's refresh to 1us. +- Sleep for 512ms. (600ms is recommended) +- Transfer the encoder's firmware image to offset 0 in Encoder memory space. +- Transfer the decoder's firmware image to offset 0 in Decoder memory space. +- Use a read-modify-write operation to Clear bit 0 of register 0x9050 to + re-enable the SPU. +- Sleep for 1 second. +- Use a read-modify-write operation to Clear bits 3 and 0 of register 0x9058 + to re-enable the VPU. +- Sleep for 1 second. +- Issue status API commands to both firmware images to verify. + diff --git a/Documentation/video4linux/cx88/hauppauge-wintv-cx88-ir.txt b/Documentation/video4linux/cx88/hauppauge-wintv-cx88-ir.txt new file mode 100644 index 000000000000..93fec32a1188 --- /dev/null +++ b/Documentation/video4linux/cx88/hauppauge-wintv-cx88-ir.txt @@ -0,0 +1,54 @@ +The controls for the mux are GPIO [0,1] for source, and GPIO 2 for muting. + +GPIO0 GPIO1 + 0 0 TV Audio + 1 0 FM radio + 0 1 Line-In + 1 1 Mono tuner bypass or CD passthru (tuner specific) + +GPIO 16(i believe) is tied to the IR port (if present). + +------------------------------------------------------------------------------------ + +>From the data sheet: + Register 24'h20004 PCI Interrupt Status + bit [18] IR_SMP_INT Set when 32 input samples have been collected over + gpio[16] pin into GP_SAMPLE register. + +What's missing from the data sheet: + +Setup 4KHz sampling rate (roughly 2x oversampled; good enough for our RC5 +compat remote) +set register 0x35C050 to 0xa80a80 + +enable sampling +set register 0x35C054 to 0x5 + +Of course, enable the IRQ bit 18 in the interrupt mask register .(and +provide for a handler) + +GP_SAMPLE register is at 0x35C058 + +Bits are then right shifted into the GP_SAMPLE register at the specified +rate; you get an interrupt when a full DWORD is recieved. +You need to recover the actual RC5 bits out of the (oversampled) IR sensor +bits. (Hint: look for the 0/1and 1/0 crossings of the RC5 bi-phase data) An +actual raw RC5 code will span 2-3 DWORDS, depending on the actual alignment. + +I'm pretty sure when no IR signal is present the receiver is always in a +marking state(1); but stray light, etc can cause intermittent noise values +as well. Remember, this is a free running sample of the IR receiver state +over time, so don't assume any sample starts at any particular place. + +http://www.atmel.com/dyn/resources/prod_documents/doc2817.pdf +This data sheet (google search) seems to have a lovely description of the +RC5 basics + +http://users.pandora.be/nenya/electronics/rc5/ and more data + +http://www.ee.washington.edu/circuit_archive/text/ir_decode.txt +and even a reference to how to decode a bi-phase data stream. + +http://www.xs4all.nl/~sbp/knowledge/ir/rc5.htm +still more info + diff --git a/Documentation/video4linux/et61x251.txt b/Documentation/video4linux/et61x251.txt index 29340282ab5f..cd584f20a997 100644 --- a/Documentation/video4linux/et61x251.txt +++ b/Documentation/video4linux/et61x251.txt @@ -1,9 +1,9 @@ - ET61X[12]51 PC Camera Controllers - Driver for Linux - ================================= + ET61X[12]51 PC Camera Controllers + Driver for Linux + ================================= - - Documentation - + - Documentation - Index @@ -156,46 +156,46 @@ Name: video_nr Type: short array (min = 0, max = 64) Syntax: <-1|n[,...]> Description: Specify V4L2 minor mode number: - -1 = use next available - n = use minor number n - You can specify up to 64 cameras this way. - For example: - video_nr=-1,2,-1 would assign minor number 2 to the second - registered camera and use auto for the first one and for every - other camera. + -1 = use next available + n = use minor number n + You can specify up to 64 cameras this way. + For example: + video_nr=-1,2,-1 would assign minor number 2 to the second + registered camera and use auto for the first one and for every + other camera. Default: -1 ------------------------------------------------------------------------------- Name: force_munmap Type: bool array (min = 0, max = 64) Syntax: <0|1[,...]> Description: Force the application to unmap previously mapped buffer memory - before calling any VIDIOC_S_CROP or VIDIOC_S_FMT ioctl's. Not - all the applications support this feature. This parameter is - specific for each detected camera. - 0 = do not force memory unmapping - 1 = force memory unmapping (save memory) + before calling any VIDIOC_S_CROP or VIDIOC_S_FMT ioctl's. Not + all the applications support this feature. This parameter is + specific for each detected camera. + 0 = do not force memory unmapping + 1 = force memory unmapping (save memory) Default: 0 ------------------------------------------------------------------------------- Name: frame_timeout Type: uint array (min = 0, max = 64) Syntax: <n[,...]> Description: Timeout for a video frame in seconds. This parameter is - specific for each detected camera. This parameter can be - changed at runtime thanks to the /sys filesystem interface. + specific for each detected camera. This parameter can be + changed at runtime thanks to the /sys filesystem interface. Default: 2 ------------------------------------------------------------------------------- Name: debug Type: ushort Syntax: <n> Description: Debugging information level, from 0 to 3: - 0 = none (use carefully) - 1 = critical errors - 2 = significant informations - 3 = more verbose messages - Level 3 is useful for testing only, when only one device - is used at the same time. It also shows some more informations - about the hardware being detected. This module parameter can be - changed at runtime thanks to the /sys filesystem interface. + 0 = none (use carefully) + 1 = critical errors + 2 = significant informations + 3 = more verbose messages + Level 3 is useful for testing only, when only one device + is used at the same time. It also shows some more informations + about the hardware being detected. This module parameter can be + changed at runtime thanks to the /sys filesystem interface. Default: 2 ------------------------------------------------------------------------------- diff --git a/Documentation/video4linux/ibmcam.txt b/Documentation/video4linux/ibmcam.txt index 4a40a2e99451..397a94eb77b8 100644 --- a/Documentation/video4linux/ibmcam.txt +++ b/Documentation/video4linux/ibmcam.txt @@ -21,7 +21,7 @@ Internal interface: Video For Linux (V4L) Supported controls: - by V4L: Contrast, Brightness, Color, Hue - by driver options: frame rate, lighting conditions, video format, - default picture settings, sharpness. + default picture settings, sharpness. SUPPORTED CAMERAS: @@ -191,66 +191,66 @@ init_model2_sat Integer 0..255 [0x34] init_model2_sat=65 init_model2_yb Integer 0..255 [0xa0] init_model2_yb=200 debug You don't need this option unless you are a developer. - If you are a developer then you will see in the code - what values do what. 0=off. + If you are a developer then you will see in the code + what values do what. 0=off. flags This is a bit mask, and you can combine any number of - bits to produce what you want. Usually you don't want - any of extra features this option provides: - - FLAGS_RETRY_VIDIOCSYNC 1 This bit allows to retry failed - VIDIOCSYNC ioctls without failing. - Will work with xawtv, will not - with xrealproducer. Default is - not set. - FLAGS_MONOCHROME 2 Activates monochrome (b/w) mode. - FLAGS_DISPLAY_HINTS 4 Shows colored pixels which have - magic meaning to developers. - FLAGS_OVERLAY_STATS 8 Shows tiny numbers on screen, - useful only for debugging. - FLAGS_FORCE_TESTPATTERN 16 Shows blue screen with numbers. - FLAGS_SEPARATE_FRAMES 32 Shows each frame separately, as - it was received from the camera. - Default (not set) is to mix the - preceding frame in to compensate - for occasional loss of Isoc data - on high frame rates. - FLAGS_CLEAN_FRAMES 64 Forces "cleanup" of each frame - prior to use; relevant only if - FLAGS_SEPARATE_FRAMES is set. - Default is not to clean frames, - this is a little faster but may - produce flicker if frame rate is - too high and Isoc data gets lost. - FLAGS_NO_DECODING 128 This flag turns the video stream - decoder off, and dumps the raw - Isoc data from the camera into - the reading process. Useful to - developers, but not to users. + bits to produce what you want. Usually you don't want + any of extra features this option provides: + + FLAGS_RETRY_VIDIOCSYNC 1 This bit allows to retry failed + VIDIOCSYNC ioctls without failing. + Will work with xawtv, will not + with xrealproducer. Default is + not set. + FLAGS_MONOCHROME 2 Activates monochrome (b/w) mode. + FLAGS_DISPLAY_HINTS 4 Shows colored pixels which have + magic meaning to developers. + FLAGS_OVERLAY_STATS 8 Shows tiny numbers on screen, + useful only for debugging. + FLAGS_FORCE_TESTPATTERN 16 Shows blue screen with numbers. + FLAGS_SEPARATE_FRAMES 32 Shows each frame separately, as + it was received from the camera. + Default (not set) is to mix the + preceding frame in to compensate + for occasional loss of Isoc data + on high frame rates. + FLAGS_CLEAN_FRAMES 64 Forces "cleanup" of each frame + prior to use; relevant only if + FLAGS_SEPARATE_FRAMES is set. + Default is not to clean frames, + this is a little faster but may + produce flicker if frame rate is + too high and Isoc data gets lost. + FLAGS_NO_DECODING 128 This flag turns the video stream + decoder off, and dumps the raw + Isoc data from the camera into + the reading process. Useful to + developers, but not to users. framerate This setting controls frame rate of the camera. This is - an approximate setting (in terms of "worst" ... "best") - because camera changes frame rate depending on amount - of light available. Setting 0 is slowest, 6 is fastest. - Beware - fast settings are very demanding and may not - work well with all video sizes. Be conservative. + an approximate setting (in terms of "worst" ... "best") + because camera changes frame rate depending on amount + of light available. Setting 0 is slowest, 6 is fastest. + Beware - fast settings are very demanding and may not + work well with all video sizes. Be conservative. hue_correction This highly optional setting allows to adjust the - hue of the image in a way slightly different from - what usual "hue" control does. Both controls affect - YUV colorspace: regular "hue" control adjusts only - U component, and this "hue_correction" option similarly - adjusts only V component. However usually it is enough - to tweak only U or V to compensate for colored light or - color temperature; this option simply allows more - complicated correction when and if it is necessary. + hue of the image in a way slightly different from + what usual "hue" control does. Both controls affect + YUV colorspace: regular "hue" control adjusts only + U component, and this "hue_correction" option similarly + adjusts only V component. However usually it is enough + to tweak only U or V to compensate for colored light or + color temperature; this option simply allows more + complicated correction when and if it is necessary. init_brightness These settings specify _initial_ values which will be init_contrast used to set up the camera. If your V4L application has init_color its own controls to adjust the picture then these init_hue controls will be used too. These options allow you to - preconfigure the camera when it gets connected, before - any V4L application connects to it. Good for webcams. + preconfigure the camera when it gets connected, before + any V4L application connects to it. Good for webcams. init_model2_rg These initial settings alter color balance of the init_model2_rg2 camera on hardware level. All four settings may be used @@ -258,47 +258,47 @@ init_model2_sat to tune the camera to specific lighting conditions. These init_model2_yb settings only apply to Model 2 cameras. lighting This option selects one of three hardware-defined - photosensitivity settings of the camera. 0=bright light, - 1=Medium (default), 2=Low light. This setting affects - frame rate: the dimmer the lighting the lower the frame - rate (because longer exposition time is needed). The - Model 2 cameras allow values more than 2 for this option, - thus enabling extremely high sensitivity at cost of frame - rate, color saturation and imaging sensor noise. + photosensitivity settings of the camera. 0=bright light, + 1=Medium (default), 2=Low light. This setting affects + frame rate: the dimmer the lighting the lower the frame + rate (because longer exposition time is needed). The + Model 2 cameras allow values more than 2 for this option, + thus enabling extremely high sensitivity at cost of frame + rate, color saturation and imaging sensor noise. sharpness This option controls smoothing (noise reduction) - made by camera. Setting 0 is most smooth, setting 6 - is most sharp. Be aware that CMOS sensor used in the - camera is pretty noisy, so if you choose 6 you will - be greeted with "snowy" image. Default is 4. Model 2 - cameras do not support this feature. + made by camera. Setting 0 is most smooth, setting 6 + is most sharp. Be aware that CMOS sensor used in the + camera is pretty noisy, so if you choose 6 you will + be greeted with "snowy" image. Default is 4. Model 2 + cameras do not support this feature. size This setting chooses one of several image sizes that are - supported by this driver. Cameras may support more, but - it's difficult to reverse-engineer all formats. - Following video sizes are supported: - - size=0 128x96 (Model 1 only) - size=1 160x120 - size=2 176x144 - size=3 320x240 (Model 2 only) - size=4 352x240 (Model 2 only) - size=5 352x288 - size=6 640x480 (Model 3 only) - - The 352x288 is the native size of the Model 1 sensor - array, so it's the best resolution the camera can - yield. The best resolution of Model 2 is 176x144, and - larger images are produced by stretching the bitmap. - Model 3 has sensor with 640x480 grid, and it works too, - but the frame rate will be exceptionally low (1-2 FPS); - it may be still OK for some applications, like security. - Choose the image size you need. The smaller image can - support faster frame rate. Default is 352x288. + supported by this driver. Cameras may support more, but + it's difficult to reverse-engineer all formats. + Following video sizes are supported: + + size=0 128x96 (Model 1 only) + size=1 160x120 + size=2 176x144 + size=3 320x240 (Model 2 only) + size=4 352x240 (Model 2 only) + size=5 352x288 + size=6 640x480 (Model 3 only) + + The 352x288 is the native size of the Model 1 sensor + array, so it's the best resolution the camera can + yield. The best resolution of Model 2 is 176x144, and + larger images are produced by stretching the bitmap. + Model 3 has sensor with 640x480 grid, and it works too, + but the frame rate will be exceptionally low (1-2 FPS); + it may be still OK for some applications, like security. + Choose the image size you need. The smaller image can + support faster frame rate. Default is 352x288. For more information and the Troubleshooting FAQ visit this URL: - http://www.linux-usb.org/ibmcam/ + http://www.linux-usb.org/ibmcam/ WHAT NEEDS TO BE DONE: diff --git a/Documentation/video4linux/ov511.txt b/Documentation/video4linux/ov511.txt index 142741e3c578..79af610d4ba5 100644 --- a/Documentation/video4linux/ov511.txt +++ b/Documentation/video4linux/ov511.txt @@ -81,7 +81,7 @@ MODULE PARAMETERS: TYPE: integer (Boolean) DEFAULT: 1 DESC: Brightness is normally under automatic control and can't be set - manually by the video app. Set to 0 for manual control. + manually by the video app. Set to 0 for manual control. NAME: autogain TYPE: integer (Boolean) @@ -97,13 +97,13 @@ MODULE PARAMETERS: TYPE: integer (0-6) DEFAULT: 3 DESC: Sets the threshold for printing debug messages. The higher the value, - the more is printed. The levels are cumulative, and are as follows: - 0=no debug messages - 1=init/detection/unload and other significant messages - 2=some warning messages - 3=config/control function calls - 4=most function calls and data parsing messages - 5=highly repetitive mesgs + the more is printed. The levels are cumulative, and are as follows: + 0=no debug messages + 1=init/detection/unload and other significant messages + 2=some warning messages + 3=config/control function calls + 4=most function calls and data parsing messages + 5=highly repetitive mesgs NAME: snapshot TYPE: integer (Boolean) @@ -116,24 +116,24 @@ MODULE PARAMETERS: TYPE: integer (1-4 for OV511, 1-31 for OV511+) DEFAULT: 1 DESC: Number of cameras allowed to stream simultaneously on a single bus. - Values higher than 1 reduce the data rate of each camera, allowing two - or more to be used at once. If you have a complicated setup involving - both OV511 and OV511+ cameras, trial-and-error may be necessary for - finding the optimum setting. + Values higher than 1 reduce the data rate of each camera, allowing two + or more to be used at once. If you have a complicated setup involving + both OV511 and OV511+ cameras, trial-and-error may be necessary for + finding the optimum setting. NAME: compress TYPE: integer (Boolean) DEFAULT: 0 DESC: Set this to 1 to turn on the camera's compression engine. This can - potentially increase the frame rate at the expense of quality, if you - have a fast CPU. You must load the proper compression module for your - camera before starting your application (ov511_decomp or ov518_decomp). + potentially increase the frame rate at the expense of quality, if you + have a fast CPU. You must load the proper compression module for your + camera before starting your application (ov511_decomp or ov518_decomp). NAME: testpat TYPE: integer (Boolean) DEFAULT: 0 DESC: This configures the camera's sensor to transmit a colored test-pattern - instead of an image. This does not work correctly yet. + instead of an image. This does not work correctly yet. NAME: dumppix TYPE: integer (0-2) diff --git a/Documentation/video4linux/sn9c102.txt b/Documentation/video4linux/sn9c102.txt index 142920bc011f..1d20895b4354 100644 --- a/Documentation/video4linux/sn9c102.txt +++ b/Documentation/video4linux/sn9c102.txt @@ -1,9 +1,9 @@ - SN9C10x PC Camera Controllers - Driver for Linux - ============================= + SN9C10x PC Camera Controllers + Driver for Linux + ============================= - - Documentation - + - Documentation - Index @@ -176,46 +176,46 @@ Name: video_nr Type: short array (min = 0, max = 64) Syntax: <-1|n[,...]> Description: Specify V4L2 minor mode number: - -1 = use next available - n = use minor number n - You can specify up to 64 cameras this way. - For example: - video_nr=-1,2,-1 would assign minor number 2 to the second - recognized camera and use auto for the first one and for every - other camera. + -1 = use next available + n = use minor number n + You can specify up to 64 cameras this way. + For example: + video_nr=-1,2,-1 would assign minor number 2 to the second + recognized camera and use auto for the first one and for every + other camera. Default: -1 ------------------------------------------------------------------------------- Name: force_munmap Type: bool array (min = 0, max = 64) Syntax: <0|1[,...]> Description: Force the application to unmap previously mapped buffer memory - before calling any VIDIOC_S_CROP or VIDIOC_S_FMT ioctl's. Not - all the applications support this feature. This parameter is - specific for each detected camera. - 0 = do not force memory unmapping - 1 = force memory unmapping (save memory) + before calling any VIDIOC_S_CROP or VIDIOC_S_FMT ioctl's. Not + all the applications support this feature. This parameter is + specific for each detected camera. + 0 = do not force memory unmapping + 1 = force memory unmapping (save memory) Default: 0 ------------------------------------------------------------------------------- Name: frame_timeout Type: uint array (min = 0, max = 64) Syntax: <n[,...]> Description: Timeout for a video frame in seconds. This parameter is - specific for each detected camera. This parameter can be - changed at runtime thanks to the /sys filesystem interface. + specific for each detected camera. This parameter can be + changed at runtime thanks to the /sys filesystem interface. Default: 2 ------------------------------------------------------------------------------- Name: debug Type: ushort Syntax: <n> Description: Debugging information level, from 0 to 3: - 0 = none (use carefully) - 1 = critical errors - 2 = significant informations - 3 = more verbose messages - Level 3 is useful for testing only, when only one device - is used. It also shows some more informations about the - hardware being detected. This parameter can be changed at - runtime thanks to the /sys filesystem interface. + 0 = none (use carefully) + 1 = critical errors + 2 = significant informations + 3 = more verbose messages + Level 3 is useful for testing only, when only one device + is used. It also shows some more informations about the + hardware being detected. This parameter can be changed at + runtime thanks to the /sys filesystem interface. Default: 2 ------------------------------------------------------------------------------- @@ -280,24 +280,24 @@ Byte # Value Description 0x04 0xC4 Frame synchronisation pattern. 0x05 0x96 Frame synchronisation pattern. 0x06 0xXX Unknown meaning. The exact value depends on the chip; - possible values are 0x00, 0x01 and 0x20. + possible values are 0x00, 0x01 and 0x20. 0x07 0xXX Variable value, whose bits are ff00uzzc, where ff is a - frame counter, u is unknown, zz is a size indicator - (00 = VGA, 01 = SIF, 10 = QSIF) and c stands for - "compression enabled" (1 = yes, 0 = no). + frame counter, u is unknown, zz is a size indicator + (00 = VGA, 01 = SIF, 10 = QSIF) and c stands for + "compression enabled" (1 = yes, 0 = no). 0x08 0xXX Brightness sum inside Auto-Exposure area (low-byte). 0x09 0xXX Brightness sum inside Auto-Exposure area (high-byte). - For a pure white image, this number will be equal to 500 - times the area of the specified AE area. For images - that are not pure white, the value scales down according - to relative whiteness. + For a pure white image, this number will be equal to 500 + times the area of the specified AE area. For images + that are not pure white, the value scales down according + to relative whiteness. 0x0A 0xXX Brightness sum outside Auto-Exposure area (low-byte). 0x0B 0xXX Brightness sum outside Auto-Exposure area (high-byte). - For a pure white image, this number will be equal to 125 - times the area outside of the specified AE area. For - images that are not pure white, the value scales down - according to relative whiteness. - according to relative whiteness. + For a pure white image, this number will be equal to 125 + times the area outside of the specified AE area. For + images that are not pure white, the value scales down + according to relative whiteness. + according to relative whiteness. The following bytes are used by the SN9C103 bridge only: diff --git a/Documentation/video4linux/v4lgrab.c b/Documentation/video4linux/v4lgrab.c new file mode 100644 index 000000000000..079b628481cf --- /dev/null +++ b/Documentation/video4linux/v4lgrab.c @@ -0,0 +1,192 @@ +/* Simple Video4Linux image grabber. */ +/* + * Video4Linux Driver Test/Example Framegrabbing Program + * + * Compile with: + * gcc -s -Wall -Wstrict-prototypes v4lgrab.c -o v4lgrab + * Use as: + * v4lgrab >image.ppm + * + * Copyright (C) 1998-05-03, Phil Blundell <philb@gnu.org> + * Copied from http://www.tazenda.demon.co.uk/phil/vgrabber.c + * with minor modifications (Dave Forrest, drf5n@virginia.edu). + * + */ + +#include <unistd.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <stdio.h> +#include <sys/ioctl.h> +#include <stdlib.h> + +#include <linux/types.h> +#include <linux/videodev.h> + +#define FILE "/dev/video0" + +/* Stole this from tvset.c */ + +#define READ_VIDEO_PIXEL(buf, format, depth, r, g, b) \ +{ \ + switch (format) \ + { \ + case VIDEO_PALETTE_GREY: \ + switch (depth) \ + { \ + case 4: \ + case 6: \ + case 8: \ + (r) = (g) = (b) = (*buf++ << 8);\ + break; \ + \ + case 16: \ + (r) = (g) = (b) = \ + *((unsigned short *) buf); \ + buf += 2; \ + break; \ + } \ + break; \ + \ + \ + case VIDEO_PALETTE_RGB565: \ + { \ + unsigned short tmp = *(unsigned short *)buf; \ + (r) = tmp&0xF800; \ + (g) = (tmp<<5)&0xFC00; \ + (b) = (tmp<<11)&0xF800; \ + buf += 2; \ + } \ + break; \ + \ + case VIDEO_PALETTE_RGB555: \ + (r) = (buf[0]&0xF8)<<8; \ + (g) = ((buf[0] << 5 | buf[1] >> 3)&0xF8)<<8; \ + (b) = ((buf[1] << 2 ) & 0xF8)<<8; \ + buf += 2; \ + break; \ + \ + case VIDEO_PALETTE_RGB24: \ + (r) = buf[0] << 8; (g) = buf[1] << 8; \ + (b) = buf[2] << 8; \ + buf += 3; \ + break; \ + \ + default: \ + fprintf(stderr, \ + "Format %d not yet supported\n", \ + format); \ + } \ +} + +int get_brightness_adj(unsigned char *image, long size, int *brightness) { + long i, tot = 0; + for (i=0;i<size*3;i++) + tot += image[i]; + *brightness = (128 - tot/(size*3))/3; + return !((tot/(size*3)) >= 126 && (tot/(size*3)) <= 130); +} + +int main(int argc, char ** argv) +{ + int fd = open(FILE, O_RDONLY), f; + struct video_capability cap; + struct video_window win; + struct video_picture vpic; + + unsigned char *buffer, *src; + int bpp = 24, r, g, b; + unsigned int i, src_depth; + + if (fd < 0) { + perror(FILE); + exit(1); + } + + if (ioctl(fd, VIDIOCGCAP, &cap) < 0) { + perror("VIDIOGCAP"); + fprintf(stderr, "(" FILE " not a video4linux device?)\n"); + close(fd); + exit(1); + } + + if (ioctl(fd, VIDIOCGWIN, &win) < 0) { + perror("VIDIOCGWIN"); + close(fd); + exit(1); + } + + if (ioctl(fd, VIDIOCGPICT, &vpic) < 0) { + perror("VIDIOCGPICT"); + close(fd); + exit(1); + } + + if (cap.type & VID_TYPE_MONOCHROME) { + vpic.depth=8; + vpic.palette=VIDEO_PALETTE_GREY; /* 8bit grey */ + if(ioctl(fd, VIDIOCSPICT, &vpic) < 0) { + vpic.depth=6; + if(ioctl(fd, VIDIOCSPICT, &vpic) < 0) { + vpic.depth=4; + if(ioctl(fd, VIDIOCSPICT, &vpic) < 0) { + fprintf(stderr, "Unable to find a supported capture format.\n"); + close(fd); + exit(1); + } + } + } + } else { + vpic.depth=24; + vpic.palette=VIDEO_PALETTE_RGB24; + + if(ioctl(fd, VIDIOCSPICT, &vpic) < 0) { + vpic.palette=VIDEO_PALETTE_RGB565; + vpic.depth=16; + + if(ioctl(fd, VIDIOCSPICT, &vpic)==-1) { + vpic.palette=VIDEO_PALETTE_RGB555; + vpic.depth=15; + + if(ioctl(fd, VIDIOCSPICT, &vpic)==-1) { + fprintf(stderr, "Unable to find a supported capture format.\n"); + return -1; + } + } + } + } + + buffer = malloc(win.width * win.height * bpp); + if (!buffer) { + fprintf(stderr, "Out of memory.\n"); + exit(1); + } + + do { + int newbright; + read(fd, buffer, win.width * win.height * bpp); + f = get_brightness_adj(buffer, win.width * win.height, &newbright); + if (f) { + vpic.brightness += (newbright << 8); + if(ioctl(fd, VIDIOCSPICT, &vpic)==-1) { + perror("VIDIOSPICT"); + break; + } + } + } while (f); + + fprintf(stdout, "P6\n%d %d 255\n", win.width, win.height); + + src = buffer; + + for (i = 0; i < win.width * win.height; i++) { + READ_VIDEO_PIXEL(src, vpic.palette, src_depth, r, g, b); + fputc(r>>8, stdout); + fputc(g>>8, stdout); + fputc(b>>8, stdout); + } + + close(fd); + return 0; +} diff --git a/Documentation/video4linux/w9968cf.txt b/Documentation/video4linux/w9968cf.txt index 3b704f2aae6d..0d53ce774b01 100644 --- a/Documentation/video4linux/w9968cf.txt +++ b/Documentation/video4linux/w9968cf.txt @@ -1,9 +1,9 @@ - W996[87]CF JPEG USB Dual Mode Camera Chip - Driver for Linux 2.6 (basic version) - ========================================= + W996[87]CF JPEG USB Dual Mode Camera Chip + Driver for Linux 2.6 (basic version) + ========================================= - - Documentation - + - Documentation - Index @@ -188,57 +188,57 @@ Name: ovmod_load Type: bool Syntax: <0|1> Description: Automatic 'ovcamchip' module loading: 0 disabled, 1 enabled. - If enabled, 'insmod' searches for the required 'ovcamchip' - module in the system, according to its configuration, and - loads that module automatically. This action is performed as - once soon as the 'w9968cf' module is loaded into memory. + If enabled, 'insmod' searches for the required 'ovcamchip' + module in the system, according to its configuration, and + loads that module automatically. This action is performed as + once soon as the 'w9968cf' module is loaded into memory. Default: 1 Note: The kernel must be compiled with the CONFIG_KMOD option - enabled for the 'ovcamchip' module to be loaded and for - this parameter to be present. + enabled for the 'ovcamchip' module to be loaded and for + this parameter to be present. ------------------------------------------------------------------------------- Name: simcams Type: int Syntax: <n> Description: Number of cameras allowed to stream simultaneously. - n may vary from 0 to 32. + n may vary from 0 to 32. Default: 32 ------------------------------------------------------------------------------- Name: video_nr Type: int array (min = 0, max = 32) Syntax: <-1|n[,...]> Description: Specify V4L minor mode number. - -1 = use next available - n = use minor number n - You can specify up to 32 cameras this way. - For example: - video_nr=-1,2,-1 would assign minor number 2 to the second - recognized camera and use auto for the first one and for every - other camera. + -1 = use next available + n = use minor number n + You can specify up to 32 cameras this way. + For example: + video_nr=-1,2,-1 would assign minor number 2 to the second + recognized camera and use auto for the first one and for every + other camera. Default: -1 ------------------------------------------------------------------------------- Name: packet_size Type: int array (min = 0, max = 32) Syntax: <n[,...]> Description: Specify the maximum data payload size in bytes for alternate - settings, for each device. n is scaled between 63 and 1023. + settings, for each device. n is scaled between 63 and 1023. Default: 1023 ------------------------------------------------------------------------------- Name: max_buffers Type: int array (min = 0, max = 32) Syntax: <n[,...]> Description: For advanced users. - Specify the maximum number of video frame buffers to allocate - for each device, from 2 to 32. + Specify the maximum number of video frame buffers to allocate + for each device, from 2 to 32. Default: 2 ------------------------------------------------------------------------------- Name: double_buffer Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: Hardware double buffering: 0 disabled, 1 enabled. - It should be enabled if you want smooth video output: if you - obtain out of sync. video, disable it, or try to - decrease the 'clockdiv' module parameter value. + It should be enabled if you want smooth video output: if you + obtain out of sync. video, disable it, or try to + decrease the 'clockdiv' module parameter value. Default: 1 for every device. ------------------------------------------------------------------------------- Name: clamping @@ -251,9 +251,9 @@ Name: filter_type Type: int array (min = 0, max = 32) Syntax: <0|1|2[,...]> Description: Video filter type. - 0 none, 1 (1-2-1) 3-tap filter, 2 (2-3-6-3-2) 5-tap filter. - The filter is used to reduce noise and aliasing artifacts - produced by the CCD or CMOS image sensor. + 0 none, 1 (1-2-1) 3-tap filter, 2 (2-3-6-3-2) 5-tap filter. + The filter is used to reduce noise and aliasing artifacts + produced by the CCD or CMOS image sensor. Default: 0 for every device. ------------------------------------------------------------------------------- Name: largeview @@ -266,9 +266,9 @@ Name: upscaling Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: Software scaling (for non-compressed video only): - 0 disabled, 1 enabled. - Disable it if you have a slow CPU or you don't have enough - memory. + 0 disabled, 1 enabled. + Disable it if you have a slow CPU or you don't have enough + memory. Default: 0 for every device. Note: If 'w9968cf-vpp' is not present, this parameter is set to 0. ------------------------------------------------------------------------------- @@ -276,36 +276,36 @@ Name: decompression Type: int array (min = 0, max = 32) Syntax: <0|1|2[,...]> Description: Software video decompression: - 0 = disables decompression - (doesn't allow formats needing decompression). - 1 = forces decompression - (allows formats needing decompression only). - 2 = allows any permitted formats. - Formats supporting (de)compressed video are YUV422P and - YUV420P/YUV420 in any resolutions where width and height are - multiples of 16. + 0 = disables decompression + (doesn't allow formats needing decompression). + 1 = forces decompression + (allows formats needing decompression only). + 2 = allows any permitted formats. + Formats supporting (de)compressed video are YUV422P and + YUV420P/YUV420 in any resolutions where width and height are + multiples of 16. Default: 2 for every device. Note: If 'w9968cf-vpp' is not present, forcing decompression is not - allowed; in this case this parameter is set to 2. + allowed; in this case this parameter is set to 2. ------------------------------------------------------------------------------- Name: force_palette Type: int array (min = 0, max = 32) Syntax: <0|9|10|13|15|8|7|1|6|3|4|5[,...]> Description: Force picture palette. - In order: - 0 = Off - allows any of the following formats: - 9 = UYVY 16 bpp - Original video, compression disabled - 10 = YUV420 12 bpp - Original video, compression enabled - 13 = YUV422P 16 bpp - Original video, compression enabled - 15 = YUV420P 12 bpp - Original video, compression enabled - 8 = YUVY 16 bpp - Software conversion from UYVY - 7 = YUV422 16 bpp - Software conversion from UYVY - 1 = GREY 8 bpp - Software conversion from UYVY - 6 = RGB555 16 bpp - Software conversion from UYVY - 3 = RGB565 16 bpp - Software conversion from UYVY - 4 = RGB24 24 bpp - Software conversion from UYVY - 5 = RGB32 32 bpp - Software conversion from UYVY - When not 0, this parameter will override 'decompression'. + In order: + 0 = Off - allows any of the following formats: + 9 = UYVY 16 bpp - Original video, compression disabled + 10 = YUV420 12 bpp - Original video, compression enabled + 13 = YUV422P 16 bpp - Original video, compression enabled + 15 = YUV420P 12 bpp - Original video, compression enabled + 8 = YUVY 16 bpp - Software conversion from UYVY + 7 = YUV422 16 bpp - Software conversion from UYVY + 1 = GREY 8 bpp - Software conversion from UYVY + 6 = RGB555 16 bpp - Software conversion from UYVY + 3 = RGB565 16 bpp - Software conversion from UYVY + 4 = RGB24 24 bpp - Software conversion from UYVY + 5 = RGB32 32 bpp - Software conversion from UYVY + When not 0, this parameter will override 'decompression'. Default: 0 for every device. Initial palette is 9 (UYVY). Note: If 'w9968cf-vpp' is not present, this parameter is set to 9. ------------------------------------------------------------------------------- @@ -313,77 +313,77 @@ Name: force_rgb Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: Read RGB video data instead of BGR: - 1 = use RGB component ordering. - 0 = use BGR component ordering. - This parameter has effect when using RGBX palettes only. + 1 = use RGB component ordering. + 0 = use BGR component ordering. + This parameter has effect when using RGBX palettes only. Default: 0 for every device. ------------------------------------------------------------------------------- Name: autobright Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: Image sensor automatically changes brightness: - 0 = no, 1 = yes + 0 = no, 1 = yes Default: 0 for every device. ------------------------------------------------------------------------------- Name: autoexp Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: Image sensor automatically changes exposure: - 0 = no, 1 = yes + 0 = no, 1 = yes Default: 1 for every device. ------------------------------------------------------------------------------- Name: lightfreq Type: int array (min = 0, max = 32) Syntax: <50|60[,...]> Description: Light frequency in Hz: - 50 for European and Asian lighting, 60 for American lighting. + 50 for European and Asian lighting, 60 for American lighting. Default: 50 for every device. ------------------------------------------------------------------------------- Name: bandingfilter Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: Banding filter to reduce effects of fluorescent - lighting: - 0 disabled, 1 enabled. - This filter tries to reduce the pattern of horizontal - light/dark bands caused by some (usually fluorescent) lighting. + lighting: + 0 disabled, 1 enabled. + This filter tries to reduce the pattern of horizontal + light/dark bands caused by some (usually fluorescent) lighting. Default: 0 for every device. ------------------------------------------------------------------------------- Name: clockdiv Type: int array (min = 0, max = 32) Syntax: <-1|n[,...]> Description: Force pixel clock divisor to a specific value (for experts): - n may vary from 0 to 127. - -1 for automatic value. - See also the 'double_buffer' module parameter. + n may vary from 0 to 127. + -1 for automatic value. + See also the 'double_buffer' module parameter. Default: -1 for every device. ------------------------------------------------------------------------------- Name: backlight Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: Objects are lit from behind: - 0 = no, 1 = yes + 0 = no, 1 = yes Default: 0 for every device. ------------------------------------------------------------------------------- Name: mirror Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: Reverse image horizontally: - 0 = no, 1 = yes + 0 = no, 1 = yes Default: 0 for every device. ------------------------------------------------------------------------------- Name: monochrome Type: bool array (min = 0, max = 32) Syntax: <0|1[,...]> Description: The image sensor is monochrome: - 0 = no, 1 = yes + 0 = no, 1 = yes Default: 0 for every device. ------------------------------------------------------------------------------- Name: brightness Type: long array (min = 0, max = 32) Syntax: <n[,...]> Description: Set picture brightness (0-65535). - This parameter has no effect if 'autobright' is enabled. + This parameter has no effect if 'autobright' is enabled. Default: 31000 for every device. ------------------------------------------------------------------------------- Name: hue @@ -414,23 +414,23 @@ Name: debug Type: int Syntax: <n> Description: Debugging information level, from 0 to 6: - 0 = none (use carefully) - 1 = critical errors - 2 = significant informations - 3 = configuration or general messages - 4 = warnings - 5 = called functions - 6 = function internals - Level 5 and 6 are useful for testing only, when only one - device is used. + 0 = none (use carefully) + 1 = critical errors + 2 = significant informations + 3 = configuration or general messages + 4 = warnings + 5 = called functions + 6 = function internals + Level 5 and 6 are useful for testing only, when only one + device is used. Default: 2 ------------------------------------------------------------------------------- Name: specific_debug Type: bool Syntax: <0|1> Description: Enable or disable specific debugging messages: - 0 = print messages concerning every level <= 'debug' level. - 1 = print messages concerning the level indicated by 'debug'. + 0 = print messages concerning every level <= 'debug' level. + 1 = print messages concerning the level indicated by 'debug'. Default: 0 ------------------------------------------------------------------------------- diff --git a/Documentation/video4linux/zc0301.txt b/Documentation/video4linux/zc0301.txt index f55262c6733b..f406f5e80046 100644 --- a/Documentation/video4linux/zc0301.txt +++ b/Documentation/video4linux/zc0301.txt @@ -1,9 +1,9 @@ - ZC0301 Image Processor and Control Chip - Driver for Linux - ======================================= + ZC0301 and ZC0301P Image Processor and Control Chip + Driver for Linux + =================================================== - - Documentation - + - Documentation - Index @@ -51,13 +51,13 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 4. Overview and features ======================== -This driver supports the video interface of the devices mounting the ZC0301 -Image Processor and Control Chip. +This driver supports the video interface of the devices mounting the ZC0301 or +ZC0301P Image Processors and Control Chips. The driver relies on the Video4Linux2 and USB core modules. It has been designed to run properly on SMP systems as well. -The latest version of the ZC0301 driver can be found at the following URL: +The latest version of the ZC0301[P] driver can be found at the following URL: http://www.linux-projects.org/ Some of the features of the driver are: @@ -117,7 +117,7 @@ supported by the USB Audio driver thanks to the ALSA API: And finally: - # USB Multimedia devices + # V4L USB devices # CONFIG_USB_ZC0301=m @@ -146,46 +146,46 @@ Name: video_nr Type: short array (min = 0, max = 64) Syntax: <-1|n[,...]> Description: Specify V4L2 minor mode number: - -1 = use next available - n = use minor number n - You can specify up to 64 cameras this way. - For example: - video_nr=-1,2,-1 would assign minor number 2 to the second - registered camera and use auto for the first one and for every - other camera. + -1 = use next available + n = use minor number n + You can specify up to 64 cameras this way. + For example: + video_nr=-1,2,-1 would assign minor number 2 to the second + registered camera and use auto for the first one and for every + other camera. Default: -1 ------------------------------------------------------------------------------- Name: force_munmap Type: bool array (min = 0, max = 64) Syntax: <0|1[,...]> Description: Force the application to unmap previously mapped buffer memory - before calling any VIDIOC_S_CROP or VIDIOC_S_FMT ioctl's. Not - all the applications support this feature. This parameter is - specific for each detected camera. - 0 = do not force memory unmapping - 1 = force memory unmapping (save memory) + before calling any VIDIOC_S_CROP or VIDIOC_S_FMT ioctl's. Not + all the applications support this feature. This parameter is + specific for each detected camera. + 0 = do not force memory unmapping + 1 = force memory unmapping (save memory) Default: 0 ------------------------------------------------------------------------------- Name: frame_timeout Type: uint array (min = 0, max = 64) Syntax: <n[,...]> Description: Timeout for a video frame in seconds. This parameter is - specific for each detected camera. This parameter can be - changed at runtime thanks to the /sys filesystem interface. + specific for each detected camera. This parameter can be + changed at runtime thanks to the /sys filesystem interface. Default: 2 ------------------------------------------------------------------------------- Name: debug Type: ushort Syntax: <n> Description: Debugging information level, from 0 to 3: - 0 = none (use carefully) - 1 = critical errors - 2 = significant informations - 3 = more verbose messages - Level 3 is useful for testing only, when only one device - is used at the same time. It also shows some more informations - about the hardware being detected. This module parameter can be - changed at runtime thanks to the /sys filesystem interface. + 0 = none (use carefully) + 1 = critical errors + 2 = significant informations + 3 = more verbose messages + Level 3 is useful for testing only, when only one device + is used at the same time. It also shows some more informations + about the hardware being detected. This module parameter can be + changed at runtime thanks to the /sys filesystem interface. Default: 2 ------------------------------------------------------------------------------- @@ -204,11 +204,25 @@ Vendor ID Product ID 0x041e 0x4017 0x041e 0x401c 0x041e 0x401e +0x041e 0x401f +0x041e 0x4022 0x041e 0x4034 0x041e 0x4035 +0x041e 0x4036 +0x041e 0x403a +0x0458 0x7007 +0x0458 0x700C +0x0458 0x700f +0x046d 0x08ae +0x055f 0xd003 +0x055f 0xd004 0x046d 0x08ae 0x0ac8 0x0301 +0x0ac8 0x301b +0x0ac8 0x303b +0x10fd 0x0128 0x10fd 0x8050 +0x10fd 0x804e The list above does not imply that all those devices work with this driver: up until now only the ones that mount the following image sensors are supported; @@ -217,6 +231,7 @@ kernel messages will always tell you whether this is the case: Model Manufacturer ----- ------------ PAS202BCB PixArt Imaging, Inc. +PB-0330 Photobit Corporation 9. Notes for V4L2 application developers @@ -250,5 +265,6 @@ the fingerprint is: '88E8 F32F 7244 68BA 3958 5D40 99DA 5D2A FCE6 35A4'. been taken from the documentation of the ZC030x Video4Linux1 driver written by Andrew Birkett <andy@nobugs.org>; - The initialization values of the ZC0301 controller connected to the PAS202BCB - image sensor have been taken from the SPCA5XX driver maintained by - Michel Xhaard <mxhaard@magic.fr>. + and PB-0330 image sensors have been taken from the SPCA5XX driver maintained + by Michel Xhaard <mxhaard@magic.fr>; +- Stanislav Lechev donated one camera. diff --git a/Documentation/watchdog/pcwd-watchdog.txt b/Documentation/watchdog/pcwd-watchdog.txt index 12187a33e310..d9ee6336c1d4 100644 --- a/Documentation/watchdog/pcwd-watchdog.txt +++ b/Documentation/watchdog/pcwd-watchdog.txt @@ -22,78 +22,9 @@ to run the program with an "&" to run it in the background!) If you want to write a program to be compatible with the PC Watchdog - driver, simply do the following: - --- Snippet of code -- -/* - * Watchdog Driver Test Program - */ - -#include <stdio.h> -#include <stdlib.h> -#include <string.h> -#include <unistd.h> -#include <fcntl.h> -#include <sys/ioctl.h> -#include <linux/types.h> -#include <linux/watchdog.h> - -int fd; - -/* - * This function simply sends an IOCTL to the driver, which in turn ticks - * the PC Watchdog card to reset its internal timer so it doesn't trigger - * a computer reset. - */ -void keep_alive(void) -{ - int dummy; - - ioctl(fd, WDIOC_KEEPALIVE, &dummy); -} - -/* - * The main program. Run the program with "-d" to disable the card, - * or "-e" to enable the card. - */ -int main(int argc, char *argv[]) -{ - fd = open("/dev/watchdog", O_WRONLY); - - if (fd == -1) { - fprintf(stderr, "Watchdog device not enabled.\n"); - fflush(stderr); - exit(-1); - } - - if (argc > 1) { - if (!strncasecmp(argv[1], "-d", 2)) { - ioctl(fd, WDIOC_SETOPTIONS, WDIOS_DISABLECARD); - fprintf(stderr, "Watchdog card disabled.\n"); - fflush(stderr); - exit(0); - } else if (!strncasecmp(argv[1], "-e", 2)) { - ioctl(fd, WDIOC_SETOPTIONS, WDIOS_ENABLECARD); - fprintf(stderr, "Watchdog card enabled.\n"); - fflush(stderr); - exit(0); - } else { - fprintf(stderr, "-d to disable, -e to enable.\n"); - fprintf(stderr, "run by itself to tick the card.\n"); - fflush(stderr); - exit(0); - } - } else { - fprintf(stderr, "Watchdog Ticking Away!\n"); - fflush(stderr); - } - - while(1) { - keep_alive(); - sleep(1); - } -} --- End snippet -- + driver, simply use of modify the watchdog test program: + Documentation/watchdog/src/watchdog-test.c + Other IOCTL functions include: diff --git a/Documentation/watchdog/src/watchdog-simple.c b/Documentation/watchdog/src/watchdog-simple.c new file mode 100644 index 000000000000..85cf17c48669 --- /dev/null +++ b/Documentation/watchdog/src/watchdog-simple.c @@ -0,0 +1,15 @@ +#include <stdlib.h> +#include <fcntl.h> + +int main(int argc, const char *argv[]) { + int fd = open("/dev/watchdog", O_WRONLY); + if (fd == -1) { + perror("watchdog"); + exit(1); + } + while (1) { + write(fd, "\0", 1); + fsync(fd); + sleep(10); + } +} diff --git a/Documentation/watchdog/src/watchdog-test.c b/Documentation/watchdog/src/watchdog-test.c new file mode 100644 index 000000000000..65f6c19cb865 --- /dev/null +++ b/Documentation/watchdog/src/watchdog-test.c @@ -0,0 +1,68 @@ +/* + * Watchdog Driver Test Program + */ + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/ioctl.h> +#include <linux/types.h> +#include <linux/watchdog.h> + +int fd; + +/* + * This function simply sends an IOCTL to the driver, which in turn ticks + * the PC Watchdog card to reset its internal timer so it doesn't trigger + * a computer reset. + */ +void keep_alive(void) +{ + int dummy; + + ioctl(fd, WDIOC_KEEPALIVE, &dummy); +} + +/* + * The main program. Run the program with "-d" to disable the card, + * or "-e" to enable the card. + */ +int main(int argc, char *argv[]) +{ + fd = open("/dev/watchdog", O_WRONLY); + + if (fd == -1) { + fprintf(stderr, "Watchdog device not enabled.\n"); + fflush(stderr); + exit(-1); + } + + if (argc > 1) { + if (!strncasecmp(argv[1], "-d", 2)) { + ioctl(fd, WDIOC_SETOPTIONS, WDIOS_DISABLECARD); + fprintf(stderr, "Watchdog card disabled.\n"); + fflush(stderr); + exit(0); + } else if (!strncasecmp(argv[1], "-e", 2)) { + ioctl(fd, WDIOC_SETOPTIONS, WDIOS_ENABLECARD); + fprintf(stderr, "Watchdog card enabled.\n"); + fflush(stderr); + exit(0); + } else { + fprintf(stderr, "-d to disable, -e to enable.\n"); + fprintf(stderr, "run by itself to tick the card.\n"); + fflush(stderr); + exit(0); + } + } else { + fprintf(stderr, "Watchdog Ticking Away!\n"); + fflush(stderr); + } + + while(1) { + keep_alive(); + sleep(1); + } +} diff --git a/Documentation/watchdog/watchdog-api.txt b/Documentation/watchdog/watchdog-api.txt index 21ed51173662..958ff3d48be3 100644 --- a/Documentation/watchdog/watchdog-api.txt +++ b/Documentation/watchdog/watchdog-api.txt @@ -34,22 +34,7 @@ activates as soon as /dev/watchdog is opened and will reboot unless the watchdog is pinged within a certain time, this time is called the timeout or margin. The simplest way to ping the watchdog is to write some data to the device. So a very simple watchdog daemon would look -like this: - -#include <stdlib.h> -#include <fcntl.h> - -int main(int argc, const char *argv[]) { - int fd=open("/dev/watchdog",O_WRONLY); - if (fd==-1) { - perror("watchdog"); - exit(1); - } - while(1) { - write(fd, "\0", 1); - sleep(10); - } -} +like this source file: see Documentation/watchdog/src/watchdog-simple.c A more advanced driver could for example check that a HTTP server is still responding before doing the write call to ping the watchdog. @@ -110,7 +95,40 @@ current timeout using the GETTIMEOUT ioctl. ioctl(fd, WDIOC_GETTIMEOUT, &timeout); printf("The timeout was is %d seconds\n", timeout); -Envinronmental monitoring: +Pretimeouts: + +Some watchdog timers can be set to have a trigger go off before the +actual time they will reset the system. This can be done with an NMI, +interrupt, or other mechanism. This allows Linux to record useful +information (like panic information and kernel coredumps) before it +resets. + + pretimeout = 10; + ioctl(fd, WDIOC_SETPRETIMEOUT, &pretimeout); + +Note that the pretimeout is the number of seconds before the time +when the timeout will go off. It is not the number of seconds until +the pretimeout. So, for instance, if you set the timeout to 60 seconds +and the pretimeout to 10 seconds, the pretimout will go of in 50 +seconds. Setting a pretimeout to zero disables it. + +There is also a get function for getting the pretimeout: + + ioctl(fd, WDIOC_GETPRETIMEOUT, &timeout); + printf("The pretimeout was is %d seconds\n", timeout); + +Not all watchdog drivers will support a pretimeout. + +Get the number of seconds before reboot: + +Some watchdog drivers have the ability to report the remaining time +before the system will reboot. The WDIOC_GETTIMELEFT is the ioctl +that returns the number of seconds before reboot. + + ioctl(fd, WDIOC_GETTIMELEFT, &timeleft); + printf("The timeout was is %d seconds\n", timeleft); + +Environmental monitoring: All watchdog drivers are required return more information about the system, some do temperature, fan and power level monitoring, some can tell you @@ -169,6 +187,10 @@ The watchdog saw a keepalive ping since it was last queried. WDIOF_SETTIMEOUT Can set/get the timeout +The watchdog can do pretimeouts. + + WDIOF_PRETIMEOUT Pretimeout (in seconds), get/set + For those drivers that return any bits set in the option field, the GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current diff --git a/Documentation/watchdog/watchdog.txt b/Documentation/watchdog/watchdog.txt index dffda29c8799..4b1ff69cc19a 100644 --- a/Documentation/watchdog/watchdog.txt +++ b/Documentation/watchdog/watchdog.txt @@ -65,28 +65,7 @@ The external event interfaces on the WDT boards are not currently supported. Minor numbers are however allocated for it. -Example Watchdog Driver ------------------------ - -#include <stdio.h> -#include <unistd.h> -#include <fcntl.h> - -int main(int argc, const char *argv[]) -{ - int fd=open("/dev/watchdog",O_WRONLY); - if(fd==-1) - { - perror("watchdog"); - exit(1); - } - while(1) - { - write(fd,"\0",1); - fsync(fd); - sleep(10); - } -} +Example Watchdog Driver: see Documentation/watchdog/src/watchdog-simple.c Contact Information diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt index f2cd6ef53ff3..74b77f9e91bc 100644 --- a/Documentation/x86_64/boot-options.txt +++ b/Documentation/x86_64/boot-options.txt @@ -199,12 +199,38 @@ IOMMU allowed overwrite iommu off workarounds for specific chipsets. soft Use software bounce buffering (default for Intel machines) noaperture Don't touch the aperture for AGP. + allowdac Allow DMA >4GB + When off all DMA over >4GB is forced through an IOMMU or bounce + buffering. + nodac Forbid DMA >4GB + panic Always panic when IOMMU overflows swiotlb=pages[,force] pages Prereserve that many 128K pages for the software IO bounce buffering. force Force all IO through the software TLB. + calgary=[64k,128k,256k,512k,1M,2M,4M,8M] + calgary=[translate_empty_slots] + calgary=[disable=<PCI bus number>] + + 64k,...,8M - Set the size of each PCI slot's translation table + when using the Calgary IOMMU. This is the size of the translation + table itself in main memory. The smallest table, 64k, covers an IO + space of 32MB; the largest, 8MB table, can cover an IO space of + 4GB. Normally the kernel will make the right choice by itself. + + translate_empty_slots - Enable translation even on slots that have + no devices attached to them, in case a device will be hotplugged + in the future. + + disable=<PCI bus number> - Disable translation on a given PHB. For + example, the built-in graphics adapter resides on the first bridge + (PCI bus number 0); if translation (isolation) is enabled on this + bridge, X servers that access the hardware directly from user + space might stop working. Use this option if you have devices that + are accessed from userspace directly on some PCI host bridge. + Debugging oops=panic Always panic on oopses. Default is to just kill the process, @@ -217,6 +243,20 @@ Debugging pagefaulttrace Dump all page faults. Only useful for extreme debugging and will create a lot of output. + call_trace=[old|both|newfallback|new] + old: use old inexact backtracer + new: use new exact dwarf2 unwinder + both: print entries from both + newfallback: use new unwinder but fall back to old if it gets + stuck (default) + + call_trace=[old|both|newfallback|new] + old: use old inexact backtracer + new: use new exact dwarf2 unwinder + both: print entries from both + newfallback: use new unwinder but fall back to old if it gets + stuck (default) + Misc noreplacement Don't replace instructions with more appropriate ones diff --git a/Documentation/x86_64/kernel-stacks b/Documentation/x86_64/kernel-stacks new file mode 100644 index 000000000000..bddfddd466ab --- /dev/null +++ b/Documentation/x86_64/kernel-stacks @@ -0,0 +1,99 @@ +Most of the text from Keith Owens, hacked by AK + +x86_64 page size (PAGE_SIZE) is 4K. + +Like all other architectures, x86_64 has a kernel stack for every +active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. +These stacks contain useful data as long as a thread is alive or a +zombie. While the thread is in user space the kernel stack is empty +except for the thread_info structure at the bottom. + +In addition to the per thread stacks, there are specialized stacks +associated with each cpu. These stacks are only used while the kernel +is in control on that cpu, when a cpu returns to user space the +specialized stacks contain no useful data. The main cpu stacks is + +* Interrupt stack. IRQSTACKSIZE + + Used for external hardware interrupts. If this is the first external + hardware interrupt (i.e. not a nested hardware interrupt) then the + kernel switches from the current task to the interrupt stack. Like + the split thread and interrupt stacks on i386 (with CONFIG_4KSTACKS), + this gives more room for kernel interrupt processing without having + to increase the size of every per thread stack. + + The interrupt stack is also used when processing a softirq. + +Switching to the kernel interrupt stack is done by software based on a +per CPU interrupt nest counter. This is needed because x86-64 "IST" +hardware stacks cannot nest without races. + +x86_64 also has a feature which is not available on i386, the ability +to automatically switch to a new stack for designated events such as +double fault or NMI, which makes it easier to handle these unusual +events on x86_64. This feature is called the Interrupt Stack Table +(IST). There can be up to 7 IST entries per cpu. The IST code is an +index into the Task State Segment (TSS), the IST entries in the TSS +point to dedicated stacks, each stack can be a different size. + +An IST is selected by an non-zero value in the IST field of an +interrupt-gate descriptor. When an interrupt occurs and the hardware +loads such a descriptor, the hardware automatically sets the new stack +pointer based on the IST value, then invokes the interrupt handler. If +software wants to allow nested IST interrupts then the handler must +adjust the IST values on entry to and exit from the interrupt handler. +(this is occasionally done, e.g. for debug exceptions) + +Events with different IST codes (i.e. with different stacks) can be +nested. For example, a debug interrupt can safely be interrupted by an +NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack +pointers on entry to and exit from all IST events, in theory allowing +IST events with the same code to be nested. However in most cases, the +stack size allocated to an IST assumes no nesting for the same code. +If that assumption is ever broken then the stacks will become corrupt. + +The currently assigned IST stacks are :- + +* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 12 - Stack Fault Exception (#SS). + + This allows to recover from invalid stack segments. Rarely + happens. + +* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 8 - Double Fault Exception (#DF). + + Invoked when handling a exception causes another exception. Happens + when the kernel is very confused (e.g. kernel stack pointer corrupt) + Using a separate stack allows to recover from it well enough in many + cases to still output an oops. + +* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for non-maskable interrupts (NMI). + + NMI can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for NMI events avoids making + assumptions about the previous state of the kernel stack. + +* DEBUG_STACK. DEBUG_STKSZ + + Used for hardware debug interrupts (interrupt 1) and for software + debug interrupts (INT3). + + When debugging a kernel, debug interrupts (both hardware and + software) can occur at any time. Using IST for these interrupts + avoids making assumptions about the previous state of the kernel + stack. + +* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 18 - Machine Check Exception (#MC). + + MCE can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for MCE events avoids making + assumptions about the previous state of the kernel stack. + +For more details see the Intel IA32 or AMD AMD64 architecture manuals. |