Diffstat (limited to 'Documentation')
334 files changed, 18127 insertions, 8149 deletions
diff --git a/Documentation/ABI/stable/sysfs-class-infiniband b/Documentation/ABI/stable/sysfs-class-infiniband
index aed21b8916a2..96dfe1926b76 100644
--- a/Documentation/ABI/stable/sysfs-class-infiniband
+++ b/Documentation/ABI/stable/sysfs-class-infiniband
@@ -314,25 +314,6 @@ Description:
  board_id: (RO) Manufacturing board ID
-sysfs interface for Chelsio T3 RDMA Driver (cxgb3)
---------------------------------------------------
-
-What: /sys/class/infiniband/cxgb3_X/hw_rev
-What: /sys/class/infiniband/cxgb3_X/hca_type
-What: /sys/class/infiniband/cxgb3_X/board_id
-Date: Feb, 2007
-KernelVersion: v2.6.21
-Contact: linux-rdma@vger.kernel.org
-Description:
- hw_rev: (RO) Hardware revision number
-
- hca_type: (RO) HCA type. Here it is a driver short name.
- It should normally match the name in its bus
- driver structure (e.g. pci_driver::name).
-
- board_id: (RO) Manufacturing board id
-
-
 sysfs interface for Mellanox ConnectX HCA IB driver (mlx4)
 ----------------------------------------------------------
diff --git a/Documentation/ABI/stable/sysfs-driver-ib_srp b/Documentation/ABI/stable/sysfs-driver-ib_srp
index 7049a2b50359..84972a57caae 100644
--- a/Documentation/ABI/stable/sysfs-driver-ib_srp
+++ b/Documentation/ABI/stable/sysfs-driver-ib_srp
@@ -67,6 +67,8 @@ Description: Interface for making ib_srp connect to a new target.
  initiator is allowed to queue per SCSI host. The default value for this parameter is 62. The lowest supported value is 2.
+ * max_it_iu_size, a decimal number specifying the maximum
+ initiator to target information unit length.
 What: /sys/class/infiniband_srp/srp-<hca>-<port_number>/ibdev
 Date: January 2, 2006
diff --git a/Documentation/ABI/testing/debugfs-hisi-hpre b/Documentation/ABI/testing/debugfs-hisi-hpre
new file mode 100644
index 000000000000..ec4a79e3a807
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-hisi-hpre
@@ -0,0 +1,57 @@
+What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the HPRE cluster.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/cluster_ctrl
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Write the HPRE core selection in the cluster into this file,
+ and then we can read the debug information of the core.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/rdclr_en
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: HPRE cores debug registers read clear control. 1 means enable
+ register read clear, otherwise 0. Writing to this file has no
+ functional effect, only enable or disable counters clear after
+ reading of these registers.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/current_qm
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One HPRE controller has one PF and multiple VFs, each function
+ has a QM. Select the QM which below qm refers to.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the HPRE.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/qm_regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the QM.
+ Available for PF and VF in host. VF in guest currently only
+ has one debug register.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/current_q
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One QM may contain multiple queues. Select specific queue to
+ show its debug registers in above qm_regs.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/clear_enable
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: QM debug registers(qm_regs) read clear control. 1 means enable
+ register read clear, otherwise 0.
+ Writing to this file has no functional effect, only enable or
+ disable counters clear after reading of these registers.
+ Only available for PF.
diff --git a/Documentation/ABI/testing/debugfs-hisi-sec b/Documentation/ABI/testing/debugfs-hisi-sec
new file mode 100644
index 000000000000..06adb899495e
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-hisi-sec
@@ -0,0 +1,43 @@
+What: /sys/kernel/debug/hisi_sec/<bdf>/sec_dfx
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the debug registers of SEC cores.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_sec/<bdf>/clear_enable
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Enabling/disabling of clear action after reading
+ the SEC debug registers.
+ 0: disable, 1: enable.
+ Only available for PF, and take no other effect on SEC.
+
+What: /sys/kernel/debug/hisi_sec/<bdf>/current_qm
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One SEC controller has one PF and multiple VFs, each function
+ has a QM. This file can be used to select the QM which below
+ qm refers to.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_sec/<bdf>/qm/qm_regs
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of QM related debug registers.
+ Available for PF and VF in host. VF in guest currently only
+ has one debug register.
+
+What: /sys/kernel/debug/hisi_sec/<bdf>/qm/current_q
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One QM of SEC may contain multiple queues. Select specific
+ queue to show its debug registers in above 'qm_regs'.
+ Only available for PF.
+
+What: /sys/kernel/debug/hisi_sec/<bdf>/qm/clear_enable
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Enabling/disabling of clear action after reading
+ the SEC's QM debug registers.
+ 0: disable, 1: enable.
+ Only available for PF, and take no other effect on SEC.
diff --git a/Documentation/ABI/testing/procfs-diskstats b/Documentation/ABI/testing/procfs-diskstats
index 2c44b4f1b060..70dcaf2481f4 100644
--- a/Documentation/ABI/testing/procfs-diskstats
+++ b/Documentation/ABI/testing/procfs-diskstats
@@ -29,4 +29,9 @@ Description:
  17 - sectors discarded
  18 - time spent discarding
+ Kernel 5.5+ appends two more fields for flush requests:
+
+ 19 - flush requests completed successfully
+ 20 - time spent flushing
+
 For more details refer to Documentation/admin-guide/iostats.rst
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index f8c7c7126bb1..ed8c14f161ee 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -15,6 +15,12 @@ Description:
  9 - I/Os currently in progress
  10 - time spent doing I/Os (ms)
  11 - weighted time spent doing I/Os (ms)
+ 12 - discards completed
+ 13 - discards merged
+ 14 - sectors discarded
+ 15 - time spent discarding (ms)
+ 16 - flush requests completed
+ 17 - time spent flushing (ms)
 For more details refer Documentation/admin-guide/iostats.rst
diff --git a/Documentation/ABI/testing/sysfs-bus-fsi b/Documentation/ABI/testing/sysfs-bus-fsi
index 57c806350d6c..320697bdf41d 100644
--- a/Documentation/ABI/testing/sysfs-bus-fsi
+++ b/Documentation/ABI/testing/sysfs-bus-fsi
@@ -1,25 +1,25 @@
-What: /sys/bus/platform/devices/fsi-master/rescan
+What: /sys/bus/platform/devices/../fsi-master/fsi0/rescan
 Date: May 2017
 KernelVersion: 4.12
-Contact: cbostic@linux.vnet.ibm.com
+Contact: linux-fsi@lists.ozlabs.org
 Description: Initiates a FSI master scan for all connected slave devices on its links.
-What: /sys/bus/platform/devices/fsi-master/break
+What: /sys/bus/platform/devices/../fsi-master/fsi0/break
 Date: May 2017
 KernelVersion: 4.12
-Contact: cbostic@linux.vnet.ibm.com
+Contact: linux-fsi@lists.ozlabs.org
 Description: Sends an FSI BREAK command on a master's communication link to any connnected slaves. A BREAK resets connected device's logic and preps it to receive further commands from the master.
-What: /sys/bus/platform/devices/fsi-master/slave@00:00/term
+What: /sys/bus/platform/devices/../fsi-master/fsi0/slave@00:00/term
 Date: May 2017
 KernelVersion: 4.12
-Contact: cbostic@linux.vnet.ibm.com
+Contact: linux-fsi@lists.ozlabs.org
 Description: Sends an FSI terminate command from the master to its connected slave. A terminate resets the slave's state machines
@@ -29,10 +29,10 @@ Description:
  ongoing operation in case of an expired 'Master Time Out' timer.
-What: /sys/bus/platform/devices/fsi-master/slave@00:00/raw
+What: /sys/bus/platform/devices/../fsi-master/fsi0/slave@00:00/raw
 Date: May 2017
 KernelVersion: 4.12
-Contact: cbostic@linux.vnet.ibm.com
+Contact: linux-fsi@lists.ozlabs.org
 Description: Provides a means of reading/writing a 32 bit value from/to a specified FSI bus address.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio b/Documentation/ABI/testing/sysfs-bus-iio
index 680451695422..faaa2166d741 100644
--- a/Documentation/ABI/testing/sysfs-bus-iio
+++ b/Documentation/ABI/testing/sysfs-bus-iio
@@ -753,6 +753,8 @@ What: /sys/.../events/in_illuminance0_thresh_falling_value
 what: /sys/.../events/in_illuminance0_thresh_rising_value
 what: /sys/.../events/in_proximity0_thresh_falling_value
 what: /sys/.../events/in_proximity0_thresh_rising_value
+What: /sys/.../events/in_illuminance_thresh_rising_value
+What: /sys/.../events/in_illuminance_thresh_falling_value
 KernelVersion: 2.6.37
 Contact: linux-iio@vger.kernel.org
 Description:
@@ -972,6 +974,7 @@ What: /sys/.../events/in_activity_jogging_thresh_rising_period
 What: /sys/.../events/in_activity_jogging_thresh_falling_period
 What: /sys/.../events/in_activity_running_thresh_rising_period
 What: /sys/.../events/in_activity_running_thresh_falling_period
+What: /sys/.../events/in_illuminance_thresh_either_period
 KernelVersion: 2.6.37
 Contact: linux-iio@vger.kernel.org
 Description:
@@ -1715,3 +1718,11 @@ Description:
  Mass concentration reading of particulate matter in ug / m3. pmX consists of particles with aerodynamic diameter less or equal to X micrometers.
+
+What: /sys/bus/iio/devices/iio:deviceX/events/in_illuminance_period_available
+Date: November 2019
+KernelVersion: 5.4
+Contact: linux-iio@vger.kernel.org
+Description:
+ List of valid periods (in seconds) for which the light intensity
+ must be above the threshold level before interrupt is asserted.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192
new file mode 100644
index 000000000000..7627d3be08f5
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192
@@ -0,0 +1,39 @@
+What: /sys/bus/iio/devices/iio:deviceX/ac_excitation_en
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reading gives the state of AC excitation.
+ Writing '1' enables AC excitation.
+
+What: /sys/bus/iio/devices/iio:deviceX/bridge_switch_en
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ This bridge switch is used to disconnect it when there is a
+ need to minimize the system current consumption.
+ Reading gives the state of the bridge switch.
+ Writing '1' enables the bridge switch.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Initiates the system calibration procedure. This is done on a
+ single channel at a time. Write '1' to start the calibration.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration_mode_available
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reading returns a list with the possible calibration modes.
+ There are two available options:
+ "zero_scale" - calibrate to zero scale
+ "full_scale" - calibrate to full scale
+
+What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration_mode
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Sets up the calibration mode used in the system calibration
+ procedure. Reading returns the current calibration mode.
+ Writing sets the system calibration mode.
diff --git a/Documentation/ABI/testing/sysfs-bus-mei b/Documentation/ABI/testing/sysfs-bus-mei
index 6bd45346ac7e..3d37e2796d5a 100644
--- a/Documentation/ABI/testing/sysfs-bus-mei
+++ b/Documentation/ABI/testing/sysfs-bus-mei
@@ -4,7 +4,7 @@ KernelVersion: 3.10
 Contact: Samuel Ortiz <sameo@linux.intel.com>
 linux-mei@linux.intel.com
 Description: Stores the same MODALIAS value emitted by uevent
- Format: mei:<mei device name>:<device uuid>:
+ Format: mei:<mei device name>:<device uuid>:<protocol version>
 What: /sys/bus/mei/devices/.../name
 Date: May 2015
@@ -26,3 +26,24 @@ KernelVersion: 4.3
 Contact: Tomas Winkler <tomas.winkler@intel.com>
 Description: Stores mei client protocol version
 Format: %d
+
+What: /sys/bus/mei/devices/.../max_conn
+Date: Nov 2019
+KernelVersion: 5.5
+Contact: Tomas Winkler <tomas.winkler@intel.com>
+Description: Stores mei client maximum number of connections
+ Format: %d
+
+What: /sys/bus/mei/devices/.../fixed
+Date: Nov 2019
+KernelVersion: 5.5
+Contact: Tomas Winkler <tomas.winkler@intel.com>
+Description: Stores mei client fixed address, if any
+ Format: %d
+
+What: /sys/bus/mei/devices/.../max_len
+Date: Nov 2019
+KernelVersion: 5.5
+Contact: Tomas Winkler <tomas.winkler@intel.com>
+Description: Stores mei client maximum message length
+ Format: %d
diff --git a/Documentation/ABI/testing/sysfs-bus-thunderbolt b/Documentation/ABI/testing/sysfs-bus-thunderbolt
index b21fba14689b..82e80de78dd0 100644
--- a/Documentation/ABI/testing/sysfs-bus-thunderbolt
+++ b/Documentation/ABI/testing/sysfs-bus-thunderbolt
@@ -80,6 +80,14 @@ Contact: thunderbolt-software@lists.01.org
 Description: This attribute contains 1 if Thunderbolt device was already authorized on boot and 0 otherwise.
+What: /sys/bus/thunderbolt/devices/.../generation
+Date: Jan 2020
+KernelVersion: 5.5
+Contact: Christian Kellner <christian@kellner.me>
+Description: This attribute contains the generation of the Thunderbolt
+ controller associated with the device. It will contain 4
+ for USB4.
+
 What: /sys/bus/thunderbolt/devices/.../key
 Date: Sep 2017
 KernelVersion: 4.13
@@ -104,6 +112,34 @@ Contact: thunderbolt-software@lists.01.org
 Description: This attribute contains name of this device extracted from the device DROM.
+What: /sys/bus/thunderbolt/devices/.../rx_speed
+Date: Jan 2020
+KernelVersion: 5.5
+Contact: Mika Westerberg <mika.westerberg@linux.intel.com>
+Description: This attribute reports the device RX speed per lane.
+ All RX lanes run at the same speed.
+
+What: /sys/bus/thunderbolt/devices/.../rx_lanes
+Date: Jan 2020
+KernelVersion: 5.5
+Contact: Mika Westerberg <mika.westerberg@linux.intel.com>
+Description: This attribute reports number of RX lanes the device is
+ using simultaneusly through its upstream port.
+
+What: /sys/bus/thunderbolt/devices/.../tx_speed
+Date: Jan 2020
+KernelVersion: 5.5
+Contact: Mika Westerberg <mika.westerberg@linux.intel.com>
+Description: This attribute reports the TX speed per lane.
+ All TX lanes run at the same speed.
+
+What: /sys/bus/thunderbolt/devices/.../tx_lanes
+Date: Jan 2020
+KernelVersion: 5.5
+Contact: Mika Westerberg <mika.westerberg@linux.intel.com>
+Description: This attribute reports number of TX lanes the device is
+ using simultaneusly through its upstream port.
+
 What: /sys/bus/thunderbolt/devices/.../vendor
 Date: Sep 2017
 KernelVersion: 4.13
diff --git a/Documentation/ABI/testing/sysfs-class-mei b/Documentation/ABI/testing/sysfs-class-mei
index a92d844f806e..e9dc110650ae 100644
--- a/Documentation/ABI/testing/sysfs-class-mei
+++ b/Documentation/ABI/testing/sysfs-class-mei
@@ -80,3 +80,13 @@ Description: Display the ME device state.
 DISABLED
 POWER_DOWN
 POWER_UP
+
+What: /sys/class/mei/meiN/trc
+Date: Nov 2019
+KernelVersion: 5.5
+Contact: Tomas Winkler <tomas.winkler@intel.com>
+Description: Display trc status register content
+
+ The ME FW writes Glitch Detection HW (TRC)
+ status information into trc status register
+ for BIOS and OS to monitor fw health.
diff --git a/Documentation/ABI/testing/sysfs-class-net-statistics b/Documentation/ABI/testing/sysfs-class-net-statistics
index 397118de7b5e..55db27815361 100644
--- a/Documentation/ABI/testing/sysfs-class-net-statistics
+++ b/Documentation/ABI/testing/sysfs-class-net-statistics
@@ -51,6 +51,14 @@ Description:
  packet processing. See the network driver for the exact meaning of this value.
+What: /sys/class/<iface>/statistics/rx_errors
+Date: April 2005
+KernelVersion: 2.6.12
+Contact: netdev@vger.kernel.org
+Description:
+ Indicates the number of receive errors on this network device.
+ See the network driver for the exact meaning of this value.
+
 What: /sys/class/<iface>/statistics/rx_fifo_errors
 Date: April 2005
 KernelVersion: 2.6.12
@@ -88,6 +96,14 @@ Description:
  due to lack of capacity in the receive side. See the network driver for the exact meaning of this value.
+What: /sys/class/<iface>/statistics/rx_nohandler
+Date: February 2016
+KernelVersion: 4.6
+Contact: netdev@vger.kernel.org
+Description:
+ Indicates the number of received packets that were dropped on
+ an inactive device by the network core.
+
 What: /sys/class/<iface>/statistics/rx_over_errors
 Date: April 2005
 KernelVersion: 2.6.12
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 06d0931119cc..fc20cde63d1e 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -486,6 +486,8 @@ What: /sys/devices/system/cpu/vulnerabilities
 /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
 /sys/devices/system/cpu/vulnerabilities/l1tf
 /sys/devices/system/cpu/vulnerabilities/mds
+ /sys/devices/system/cpu/vulnerabilities/tsx_async_abort
+ /sys/devices/system/cpu/vulnerabilities/itlb_multihit
 Date: January 2018
 Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description: Information about CPU vulnerabilities
diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
index 72634d3ae4f4..3683cb1cdc3d 100644
--- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
+++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
@@ -106,3 +106,135 @@ KernelVersion: 5.4
 Contact: Wu Hao <hao.wu@intel.com>
 Description: Read-only. Read this file to get the second error detected by hardware.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/name
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. Read this file to get the name of hwmon device, it
+ supports values:
+ 'dfl_fme_thermal' - thermal hwmon device name
+ 'dfl_fme_power' - power hwmon device name
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns FPGA device temperature in millidegrees
+ Celsius.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns hardware threshold1 temperature in
+ millidegrees Celsius. If temperature rises at or above this
+ threshold, hardware starts 50% or 90% throttling (see
+ 'temp1_max_policy').
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns hardware threshold2 temperature in
+ millidegrees Celsius. If temperature rises at or above this
+ threshold, hardware starts 100% throttling.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_emergency
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns hardware trip threshold temperature in
+ millidegrees Celsius. If temperature rises at or above this
+ threshold, a fatal event will be triggered to board management
+ controller (BMC) to shutdown FPGA.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max_alarm
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. It returns 1 if temperature is currently at or above
+ hardware threshold1 (see 'temp1_max'), otherwise 0.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit_alarm
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. It returns 1 if temperature is currently at or above
+ hardware threshold2 (see 'temp1_crit'), otherwise 0.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max_policy
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. Read this file to get the policy of hardware threshold1
+ (see 'temp1_max'). It only supports two values (policies):
+ 0 - AP2 state (90% throttling)
+ 1 - AP1 state (50% throttling)
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_input
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns current FPGA power consumption in uW.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Write. Read this file to get current hardware power
+ threshold1 in uW. If power consumption rises at or above
+ this threshold, hardware starts 50% throttling.
+ Write this file to set current hardware power threshold1 in uW.
+ As hardware only accepts values in Watts, so input value will
+ be round down per Watts (< 1 watts part will be discarded) and
+ clamped within the range from 0 to 127 Watts. Write fails with
+ -EINVAL if input parsing fails.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Write. Read this file to get current hardware power
+ threshold2 in uW. If power consumption rises at or above
+ this threshold, hardware starts 90% throttling.
+ Write this file to set current hardware power threshold2 in uW.
+ As hardware only accepts values in Watts, so input value will
+ be round down per Watts (< 1 watts part will be discarded) and
+ clamped within the range from 0 to 127 Watts. Write fails with
+ -EINVAL if input parsing fails.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max_alarm
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. It returns 1 if power consumption is currently at or
+ above hardware threshold1 (see 'power1_max'), otherwise 0.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit_alarm
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. It returns 1 if power consumption is currently at or
+ above hardware threshold2 (see 'power1_crit'), otherwise 0.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_xeon_limit
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns power limit for XEON in uW.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_fpga_limit
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns power limit for FPGA in uW.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_ltr
+Date: October 2019
+KernelVersion: 5.5
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. Read this file to get current Latency Tolerance
+ Reporting (ltr) value. It returns 1 if all Accelerated
+ Function Units (AFUs) can tolerate latency >= 40us for memory
+ access or 0 if any AFU is latency sensitive (< 40us).
diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt
index 8f8d97f65d73..29dcbe8826e8 100644
--- a/Documentation/DMA-attributes.txt
+++ b/Documentation/DMA-attributes.txt
@@ -5,24 +5,6 @@ DMA attributes
 This document describes the semantics of the DMA attributes that are defined in linux/dma-mapping.h.
-DMA_ATTR_WRITE_BARRIER
-----------------------
-
-DMA_ATTR_WRITE_BARRIER is a (write) barrier attribute for DMA. DMA
-to a memory region with the DMA_ATTR_WRITE_BARRIER attribute forces
-all pending DMA writes to complete, and thus provides a mechanism to
-strictly order DMA from a device across all intervening busses and
-bridges. This barrier is not specific to a particular type of
-interconnect, it applies to the system as a whole, and so its
-implementation must account for the idiosyncrasies of the system all
-the way from the DMA device to memory.
-
-As an example of a situation where DMA_ATTR_WRITE_BARRIER would be
-useful, suppose that a device does a DMA write to indicate that data is
-ready and available in memory. The DMA of the "completion indication"
-could race with data DMA. Mapping the memory used for completion
-indications with DMA_ATTR_WRITE_BARRIER would prevent the race.
-
 DMA_ATTR_WEAK_ORDERING
 ----------------------
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
deleted file mode 100644
index c30c1957c7e6..000000000000
--- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html
+++ /dev/null
@@ -1,1391 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
- <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
-
- <p>December 18, 2016</p>
- <p>This article was contributed by Paul E. McKenney</p>
-
-<h3>Introduction</h3>
-
-This document describes RCU's major data structures and their relationship
-to each other.
-
-<ol>
-<li> <a href="#Data-Structure Relationships">
- Data-Structure Relationships</a>
-<li> <a href="#The rcu_state Structure">
- The <tt>rcu_state</tt> Structure</a>
-<li> <a href="#The rcu_node Structure">
- The <tt>rcu_node</tt> Structure</a>
-<li> <a href="#The rcu_segcblist Structure">
- The <tt>rcu_segcblist</tt> Structure</a>
-<li> <a href="#The rcu_data Structure">
- The <tt>rcu_data</tt> Structure</a>
-<li> <a href="#The rcu_head Structure">
- The <tt>rcu_head</tt> Structure</a>
-<li> <a href="#RCU-Specific Fields in the task_struct Structure">
- RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
-<li> <a href="#Accessor Functions">
- Accessor Functions</a>
-</ol>
-
-<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
-
-<p>RCU is for all intents and purposes a large state machine, and its
-data structures maintain the state in such a way as to allow RCU readers
-to execute extremely quickly, while also processing the RCU grace periods
-requested by updaters in an efficient and extremely scalable fashion.
-The efficiency and scalability of RCU updaters is provided primarily
-by a combining tree, as shown below:
-
-</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
-
-</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
-containing a tree of <tt>rcu_node</tt> structures.
-Each leaf node of the <tt>rcu_node</tt> tree has up to 16
-<tt>rcu_data</tt> structures associated with it, so that there
-are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures,
-one for each possible CPU.
-This structure is adjusted at boot time, if needed, to handle the
-common case where <tt>nr_cpu_ids</tt> is much less than
-<tt>NR_CPUs</tt>.
-For example, a number of Linux distributions set <tt>NR_CPUs=4096</tt>,
-which results in a three-level <tt>rcu_node</tt> tree.
-If the actual hardware has only 16 CPUs, RCU will adjust itself
-at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
-
-</p><p>The purpose of this combining tree is to allow per-CPU events
-such as quiescent states, dyntick-idle transitions,
-and CPU hotplug operations to be processed efficiently
-and scalably.
-Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
-and other events are recorded by the leaf-level <tt>rcu_node</tt>
-structures.
-All of these events are combined at each level of the tree until finally
-grace periods are completed at the tree's root <tt>rcu_node</tt>
-structure.
-A grace period can be completed at the root once every CPU
-(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
-has passed through a quiescent state.
-Once a grace period has completed, record of that fact is propagated
-back down the tree.
-
-</p><p>As can be seen from the diagram, on a 64-bit system
-a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
-of 64 at the root and a fanout of 16 at the leaves.
-
-<table>
-<tr><th> </th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- Why isn't the fanout at the leaves also 64?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- Because there are more types of events that affect the leaf-level
- <tt>rcu_node</tt> structures than further up the tree.
- Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
- 64, the contention on these structures' <tt>->structures</tt>
- becomes excessive.
- Experimentation on a wide variety of systems has shown that a fanout
- of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
- </font>
-
- <p><font color="ffffff">Of course, further experience with
- systems having hundreds or thousands of CPUs may demonstrate
- that the fanout for the non-leaf <tt>rcu_node</tt> structures
- must also be reduced.
- Such reduction can be easily carried out when and if it proves
- necessary.
- In the meantime, if you are using such a system and running into
- contention problems on the non-leaf <tt>rcu_node</tt> structures,
- you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
- parameter to reduce the non-leaf fanout as needed.
- </font>
-
- <p><font color="ffffff">Kernels built for systems with
- strong NUMA characteristics might also need to adjust
- <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
- <tt>rcu_node</tt> structures align with hardware boundaries.
- However, there has thus far been no need for this.
-</font></td></tr>
-<tr><td> </td></tr>
-</table>
-
-<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
-a 32-bit system), then RCU will automatically add more levels to the
-tree.
-For example, if you are crazy enough to build a 64-bit system with 65,536
-CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
-
-</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
-
-</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
-accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
-32-bit systems.
-On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and
-<tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result
-in a 16-CPU test using a 4-level tree.
-This can be useful for testing large-system capabilities on small test
-machines.
-
-</p><p>This multi-level combining tree allows us to get most of the
-performance and scalability
-benefits of partitioning, even though RCU grace-period detection is
-inherently a global operation.
-The trick here is that only the last CPU to report a quiescent state
-into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
-structure at the next level up the tree.
-This means that at the leaf-level <tt>rcu_node</tt> structure, only
-one access out of sixteen will progress up the tree.
-For the internal <tt>rcu_node</tt> structures, the situation is even
-more extreme: Only one access out of sixty-four will progress up
-the tree.
-Because the vast majority of the CPUs do not progress up the tree,
-the lock contention remains roughly constant up the tree.
-No matter how many CPUs there are in the system, at most 64 quiescent-state
-reports per grace period will progress all the way to the root
-<tt>rcu_node</tt> structure, thus ensuring that the lock contention
-on that root <tt>rcu_node</tt> structure remains acceptably low.
-
-</p><p>In effect, the combining tree acts like a big shock absorber,
-keeping lock contention under control at all tree levels regardless
-of the level of loading on the system.
- -</p><p>RCU updaters wait for normal grace periods by registering -RCU callbacks, either directly via <tt>call_rcu()</tt> -or indirectly via <tt>synchronize_rcu()</tt> and friends. -RCU callbacks are represented by <tt>rcu_head</tt> structures, -which are queued on <tt>rcu_data</tt> structures while they are -waiting for a grace period to elapse, as shown in the following figure: - -</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%"> - -</p><p>This figure shows how <tt>TREE_RCU</tt>'s and -<tt>PREEMPT_RCU</tt>'s major data structures are related. -Lesser data structures will be introduced with the algorithms that -make use of them. - -</p><p>Note that each of the data structures in the above figure has -its own synchronization: - -<p><ol> -<li> Each <tt>rcu_state</tt> structure has a lock and a mutex, - and some fields are protected by the corresponding root - <tt>rcu_node</tt> structure's lock. -<li> Each <tt>rcu_node</tt> structure has a spinlock. -<li> The fields in <tt>rcu_data</tt> are private to the corresponding - CPU, although a few can be read and written by other CPUs. -</ol> - -<p>It is important to note that different data structures can have -very different ideas about the state of RCU at any given time. -To give but one example, awareness of the start or end of a given RCU -grace period propagates slowly through the data structures. -This slow propagation is absolutely necessary for RCU to have good -read-side performance. -If this balkanized implementation seems foreign to you, one useful -trick is to consider each instance of these data structures to be -a different person, each having the usual slightly different -view of reality.
- -</p><p>The general role of each of these data structures is as -follows: - -</p><ol> -<li> <tt>rcu_state</tt>: - This structure forms the interconnection between the - <tt>rcu_node</tt> and <tt>rcu_data</tt> structures, - tracks grace periods, serves as short-term repository - for callbacks orphaned by CPU-hotplug events, - maintains <tt>rcu_barrier()</tt> state, - tracks expedited grace-period state, - and maintains state used to force quiescent states when - grace periods extend too long, -<li> <tt>rcu_node</tt>: This structure forms the combining - tree that propagates quiescent-state - information from the leaves to the root, and also propagates - grace-period information from the root to the leaves. - It provides local copies of the grace-period state in order - to allow this information to be accessed in a synchronized - manner without suffering the scalability limitations that - would otherwise be imposed by global locking. - In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists - of tasks that have blocked while in their current - RCU read-side critical section. - In <tt>CONFIG_PREEMPT_RCU</tt> with - <tt>CONFIG_RCU_BOOST</tt>, it manages the - per-<tt>rcu_node</tt> priority-boosting - kernel threads (kthreads) and state. - Finally, it records CPU-hotplug state in order to determine - which CPUs should be ignored during a given grace period. -<li> <tt>rcu_data</tt>: This per-CPU structure is the - focus of quiescent-state detection and RCU callback queuing. - It also tracks its relationship to the corresponding leaf - <tt>rcu_node</tt> structure to allow more-efficient - propagation of quiescent states up the <tt>rcu_node</tt> - combining tree. - Like the <tt>rcu_node</tt> structure, it provides a local - copy of the grace-period information to allow for-free - synchronized - access to this information from the corresponding CPU. - Finally, this structure records past dyntick-idle state - for the corresponding CPU and also tracks statistics. 
-<li> <tt>rcu_head</tt>: - This structure represents RCU callbacks, and is the - only structure allocated and managed by RCU users. - The <tt>rcu_head</tt> structure is normally embedded - within the RCU-protected data structure. -</ol> - -<p>If all you wanted from this article was a general notion of how -RCU's data structures are related, you are done. -Otherwise, each of the following sections gives more details on -the <tt>rcu_state</tt>, <tt>rcu_node</tt> and <tt>rcu_data</tt> data -structures. - -<h3><a name="The rcu_state Structure"> -The <tt>rcu_state</tt> Structure</a></h3> - -<p>The <tt>rcu_state</tt> structure is the base structure that -represents the state of RCU in the system. -This structure forms the interconnection between the -<tt>rcu_node</tt> and <tt>rcu_data</tt> structures, -tracks grace periods, contains the lock used to -synchronize with CPU-hotplug events, -and maintains state used to force quiescent states when -grace periods extend too long. - -</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed, -singly and in groups, in the following sections. -The more specialized fields are covered in the discussion of their -use. - -<h5>Relationship to rcu_node and rcu_data Structures</h5> - -This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_node node[NUM_RCU_NODES]; - 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; - 3 struct rcu_data __percpu *rda; -</pre> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Wait a minute! - You said that the <tt>rcu_node</tt> structures formed a tree, - but they are declared as a flat array! - What gives? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - The tree is laid out in the array. - The first node in the array is the head, the next set of nodes in the - array are children of the head node, and so on until the last set of - nodes in the array are the leaves.
- </font> - - <p><font color="ffffff">See the following diagrams to see how - this works. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>The <tt>rcu_node</tt> tree is embedded into the -<tt>->node[]</tt> array as shown in the following figure: - -</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%"> - -</p><p>One interesting consequence of this mapping is that a -breadth-first traversal of the tree is implemented as a simple -linear scan of the array, which is in fact what the -<tt>rcu_for_each_node_breadth_first()</tt> macro does. -This macro is used at the beginning and ends of grace periods. - -</p><p>Each entry of the <tt>->level</tt> array references -the first <tt>rcu_node</tt> structure on the corresponding level -of the tree, for example, as shown below: - -</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%"> - -</p><p>The zero<sup>th</sup> element of the array references the root -<tt>rcu_node</tt> structure, the first element references the -first child of the root <tt>rcu_node</tt>, and finally the second -element references the first leaf <tt>rcu_node</tt> structure. - -</p><p>For whatever it is worth, if you draw the tree to be tree-shaped -rather than array-shaped, it is easy to draw a planar representation: - -</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%"> - -</p><p>Finally, the <tt>->rda</tt> field references a per-CPU -pointer to the corresponding CPU's <tt>rcu_data</tt> structure. - -</p><p>All of these fields are constant once initialization is complete, -and therefore need no protection. - -<h5>Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_seq; -</pre> - -<p>RCU grace periods are numbered, and -the <tt>->gp_seq</tt> field contains the current grace-period -sequence number. 
-The bottom two bits are the state of the current grace period, -which can be zero for not yet started or one for in progress. -In other words, if the bottom two bits of <tt>->gp_seq</tt> are -zero, then RCU is idle. -Any other value in the bottom two bits indicates that something is broken. -This field is protected by the root <tt>rcu_node</tt> structure's -<tt>->lock</tt> field. - -</p><p>There are <tt>->gp_seq</tt> fields -in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures -as well. -The fields in the <tt>rcu_state</tt> structure represent the -most current value, and those of the other structures are compared -in order to detect the beginnings and ends of grace periods in a distributed -fashion. -The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt> -(down the tree from the root to the leaves) to <tt>rcu_data</tt>. - -<h5>Miscellaneous</h5> - -<p>This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_max; - 2 char abbr; - 3 char *name; -</pre> - -<p>The <tt>->gp_max</tt> field tracks the duration of the longest -grace period in jiffies. -It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>. - -<p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish -between preemptible RCU (“rcu_preempt” and “p”) -and non-preemptible RCU (“rcu_sched” and “s”). -These fields are used for diagnostic and tracing purposes. - -<h3><a name="The rcu_node Structure"> -The <tt>rcu_node</tt> Structure</a></h3> - -<p>The <tt>rcu_node</tt> structures form the combining -tree that propagates quiescent-state -information from the leaves to the root and that also propagates -grace-period information from the root down to the leaves. -They provide local copies of the grace-period state in order -to allow this information to be accessed in a synchronized -manner without suffering the scalability limitations that -would otherwise be imposed by global locking.
-In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists -of tasks that have blocked while in their current -RCU read-side critical section. -In <tt>CONFIG_PREEMPT_RCU</tt> with -<tt>CONFIG_RCU_BOOST</tt>, they manage the -per-<tt>rcu_node</tt> priority-boosting -kernel threads (kthreads) and state. -Finally, they record CPU-hotplug state in order to determine -which CPUs should be ignored during a given grace period. - -</p><p>The <tt>rcu_node</tt> structure's fields are discussed, -singly and in groups, in the following sections. - -<h5>Connection to Combining Tree</h5> - -<p>This portion of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_node *parent; - 2 u8 level; - 3 u8 grpnum; - 4 unsigned long grpmask; - 5 int grplo; - 6 int grphi; -</pre> - -<p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt> -one level up in the tree, and is <tt>NULL</tt> for the root -<tt>rcu_node</tt>. -The RCU implementation makes heavy use of this field to push quiescent -states up the tree. -The <tt>->level</tt> field gives the level in the tree, with -the root being at level zero, its children at level one, and so on. -The <tt>->grpnum</tt> field gives this node's position within -the children of its parent, so this number can range between 0 and 31 -on 32-bit systems and between 0 and 63 on 64-bit systems. -The <tt>->level</tt> and <tt>->grpnum</tt> fields are -used only during initialization and for tracing. -The <tt>->grpmask</tt> field is the bitmask counterpart of -<tt>->grpnum</tt>, and therefore always has exactly one bit set. -This mask is used to clear the bit corresponding to this <tt>rcu_node</tt> -structure in its parent's bitmasks, which are described later. -Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields -contain the lowest and highest numbered CPU served by this -<tt>rcu_node</tt> structure, respectively. - -</p><p>All of these fields are constant, and thus do not require any -synchronization. 
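<p>The relationship among these position fields can be sketched in a few lines of C.
This is an illustration under stated assumptions, not the kernel's initialization code: <tt>struct sk_node_pos</tt> and <tt>sk_init_pos()</tt> are hypothetical names, and the sketch assumes that the nodes on a level are numbered left to right and that each covers a fixed-size block of CPUs.

```c
/* Simplified stand-in for the tree-position fields of rcu_node. */
struct sk_node_pos {
    unsigned char grpnum;   /* position among the parent's children */
    unsigned long grpmask;  /* bitmask counterpart of grpnum */
    int grplo;              /* lowest-numbered CPU served */
    int grphi;              /* highest-numbered CPU served */
};

/*
 * Fill in the fields for the j-th node (counting from zero) on a level
 * whose nodes each cover cpus_per_node CPUs and whose parents each have
 * parent_fanout children.  All parameters are assumptions of this sketch.
 */
static void sk_init_pos(struct sk_node_pos *p, int j, int cpus_per_node,
                        int parent_fanout, int nr_cpus)
{
    p->grplo = j * cpus_per_node;
    p->grphi = (j + 1) * cpus_per_node - 1;
    if (p->grphi >= nr_cpus)
        p->grphi = nr_cpus - 1;          /* last node may be partly filled */
    p->grpnum = (unsigned char)(j % parent_fanout);
    p->grpmask = 1UL << p->grpnum;       /* exactly one bit set */
}
```

For example, on a 100-CPU system with 16 CPUs per leaf, the sixth leaf (index 5) would cover CPUs 80 through 95 and own bit 5 in its parent's masks.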
- -<h5>Synchronization</h5> - -<p>This field of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 raw_spinlock_t lock; -</pre> - -<p>This field is used to protect the remaining fields in this structure, -unless otherwise stated. -That said, all of the fields in this structure can be accessed without -locking for tracing purposes. -Yes, this can result in confusing traces, but better some tracing confusion -than to be heisenbugged out of existence. - -<h5>Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_seq; - 2 unsigned long gp_seq_needed; -</pre> - -<p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are -the counterparts of the field of the same name in the <tt>rcu_state</tt> -structure. -They each may lag up to one step behind their <tt>rcu_state</tt> -counterpart. -If the bottom two bits of a given <tt>rcu_node</tt> structure's -<tt>->gp_seq</tt> field are zero, then this <tt>rcu_node</tt> -structure believes that RCU is idle. -</p><p>The <tt>->gp_seq</tt> field of each <tt>rcu_node</tt> -structure is updated at the beginning and the end -of each grace period. - -<p>The <tt>->gp_seq_needed</tt> fields record the -furthest-in-the-future grace period request seen by the corresponding -<tt>rcu_node</tt> structure. The request is considered fulfilled when -the value of the <tt>->gp_seq</tt> field equals or exceeds that of -the <tt>->gp_seq_needed</tt> field. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Suppose that this <tt>rcu_node</tt> structure doesn't see - a request for a very long time. - Won't wrapping of the <tt>->gp_seq</tt> field cause - problems?
-</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No, because if the <tt>->gp_seq_needed</tt> field lags behind the - <tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field - will be updated at the end of the grace period. - Modulo-arithmetic comparisons therefore will always get the - correct answer, even with wrapping. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h5>Quiescent-State Tracking</h5> - -<p>These fields manage the propagation of quiescent states up the -combining tree. - -</p><p>This portion of the <tt>rcu_node</tt> structure has fields -as follows: - -<pre> - 1 unsigned long qsmask; - 2 unsigned long expmask; - 3 unsigned long qsmaskinit; - 4 unsigned long expmaskinit; -</pre> - -<p>The <tt>->qsmask</tt> field tracks which of this -<tt>rcu_node</tt> structure's children still need to report -quiescent states for the current normal grace period. -Such children will have a value of 1 in their corresponding bit. -Note that the leaf <tt>rcu_node</tt> structures should be -thought of as having <tt>rcu_data</tt> structures as their -children. -Similarly, the <tt>->expmask</tt> field tracks which -of this <tt>rcu_node</tt> structure's children still need to report -quiescent states for the current expedited grace period. -An expedited grace period has -the same conceptual properties as a normal grace period, but the -expedited implementation accepts extreme CPU overhead to obtain -much lower grace-period latency, for example, consuming a few -tens of microseconds worth of CPU time to reduce grace-period -duration from milliseconds to tens of microseconds. -The <tt>->qsmaskinit</tt> field tracks which of this -<tt>rcu_node</tt> structure's children cover for at least -one online CPU. -This mask is used to initialize <tt>->qsmask</tt>, -and <tt>->expmaskinit</tt> is used to initialize -<tt>->expmask</tt>, at the beginning of the -normal and expedited grace periods, respectively.
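<p>The way these masks drive propagation can be sketched as follows.
This is a simplified illustration, not the kernel's code: <tt>struct sk_node</tt>, <tt>sk_gp_init()</tt>, and <tt>sk_report_qs()</tt> are hypothetical names, locking is omitted (each step would run under the node's <tt>->lock</tt>), and the sketch assumes that each child reports at most once per grace period.

```c
#include <stddef.h>

/* Simplified stand-in for rcu_node; locking is deliberately omitted. */
struct sk_node {
    struct sk_node *parent;    /* NULL for the root */
    unsigned long grpmask;     /* this node's bit in parent->qsmask */
    unsigned long qsmask;      /* children still owing a quiescent state */
    unsigned long qsmaskinit;  /* children covering an online CPU */
};

/* Start of a normal grace period: every online child must report. */
static void sk_gp_init(struct sk_node *np)
{
    np->qsmask = np->qsmaskinit;
}

/*
 * Report a quiescent state for the child identified by mask, starting
 * at np.  Returns 1 if the report emptied the root's mask, meaning the
 * grace period may end, and 0 otherwise.
 */
static int sk_report_qs(struct sk_node *np, unsigned long mask)
{
    for (;;) {
        np->qsmask &= ~mask;       /* this child has now reported */
        if (np->qsmask != 0)
            return 0;              /* siblings are still pending */
        if (np->parent == NULL)
            return 1;              /* the root is clear */
        mask = np->grpmask;        /* become a report into the parent */
        np = np->parent;
    }
}
```

Only the report that clears the last bit in a node's <tt>qsmask</tt> continues to the parent, which is what keeps lock contention roughly constant up the tree.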
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why are these bitmasks protected by locking? - Come on, haven't you heard of atomic instructions??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Lockless grace-period computation! Such a tantalizing possibility! - </font> - - <p><font color="ffffff">But consider the following sequence of events: - </font> - - <ol> - <li> <font color="ffffff">CPU 0 has been in dyntick-idle - mode for quite some time. - When it wakes up, it notices that the current RCU - grace period needs it to report in, so it sets a - flag where the scheduling clock interrupt will find it. - </font><p> - <li> <font color="ffffff">Meanwhile, CPU 1 is running - <tt>force_quiescent_state()</tt>, - and notices that CPU 0 has been in dyntick idle mode, - which qualifies as an extended quiescent state. - </font><p> - <li> <font color="ffffff">CPU 0's scheduling clock - interrupt fires in the - middle of an RCU read-side critical section, and notices - that the RCU core needs something, so commences RCU softirq - processing. - </font> - <p> - <li> <font color="ffffff">CPU 0's softirq handler - executes and is just about ready - to report its quiescent state up the <tt>rcu_node</tt> - tree. - </font><p> - <li> <font color="ffffff">But CPU 1 beats it to the punch, - completing the current - grace period and starting a new one. - </font><p> - <li> <font color="ffffff">CPU 0 now reports its quiescent - state for the wrong - grace period. - That grace period might now end before the RCU read-side - critical section. - If that happens, disaster will ensue. - </font> - </ol> - - <p><font color="ffffff">So the locking is absolutely required in - order to coordinate clearing of the bits with updating of the - grace-period sequence number in <tt>->gp_seq</tt>. 
-</font></td></tr> -<tr><td> </td></tr> -</table> - -<h5>Blocked-Task Management</h5> - -<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the -midst of their RCU read-side critical sections, and these tasks -must be tracked explicitly. -The details of exactly why and how they are tracked will be covered -in a separate article on RCU read-side processing. -For now, it is enough to know that the <tt>rcu_node</tt> -structure tracks them. - -<pre> - 1 struct list_head blkd_tasks; - 2 struct list_head *gp_tasks; - 3 struct list_head *exp_tasks; - 4 bool wait_blkd_tasks; -</pre> - -<p>The <tt>->blkd_tasks</tt> field is a list header for -the list of blocked and preempted tasks. -As tasks undergo context switches within RCU read-side critical -sections, their <tt>task_struct</tt> structures are enqueued -(via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt> -field) onto the head of the <tt>->blkd_tasks</tt> list for the -leaf <tt>rcu_node</tt> structure corresponding to the CPU -on which the outgoing context switch executed. -As these tasks later exit their RCU read-side critical sections, -they remove themselves from the list. -This list is therefore in reverse time order, so that if one of the tasks -is blocking the current grace period, all subsequent tasks must -also be blocking that same grace period. -Therefore, a single pointer into this list suffices to track -all tasks blocking a given grace period. -That pointer is stored in <tt>->gp_tasks</tt> for normal -grace periods and in <tt>->exp_tasks</tt> for expedited -grace periods. -These last two fields are <tt>NULL</tt> if either there is -no grace period in flight or if there are no blocked tasks -preventing that grace period from completing. 
-If either of these two pointers is referencing a task that -removes itself from the <tt>->blkd_tasks</tt> list, -then that task must advance the pointer to the next task on -the list, or set the pointer to <tt>NULL</tt> if there -are no subsequent tasks on the list. - -</p><p>For example, suppose that tasks T1, T2, and T3 are -all hard-affinitied to the largest-numbered CPU in the system. -Then if task T1 blocked in an RCU read-side -critical section, then an expedited grace period started, -then task T2 blocked in an RCU read-side critical section, -then a normal grace period started, and finally task T3 blocked -in an RCU read-side critical section, then the state of the -last leaf <tt>rcu_node</tt> structure's blocked-task list -would be as shown below: - -</p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%"> - -</p><p>Task T1 is blocking both grace periods, task T2 is -blocking only the normal grace period, and task T3 is blocking -neither grace period. -Note that these tasks will not remove themselves from this list -immediately upon resuming execution. -They will instead remain on the list until they execute the outermost -<tt>rcu_read_unlock()</tt> that ends their RCU read-side critical -section. - -<p> -The <tt>->wait_blkd_tasks</tt> field indicates whether or not -the current grace period is waiting on a blocked task.
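<p>The T1, T2, T3 example can be replayed with a small stand-alone C sketch.
The types and helpers here (<tt>struct sk_task</tt>, <tt>struct sk_leaf</tt>, <tt>sk_block()</tt>, and so on) are hypothetical simplifications: the kernel uses <tt>list_head</tt> and locking, and this sketch assumes, as in the example, that a newly blocked task never blocks a grace period that has already started.

```c
#include <stddef.h>

struct sk_task {
    struct sk_task *prev;   /* toward the head (blocked more recently) */
    struct sk_task *next;   /* toward the tail (blocked earlier) */
};

struct sk_leaf {
    struct sk_task *blkd_head;  /* most recently blocked task */
    struct sk_task *gp_tasks;   /* first task blocking the normal GP */
    struct sk_task *exp_tasks;  /* first task blocking the expedited GP */
};

/* A task blocks inside a read-side critical section: add it at the head. */
static void sk_block(struct sk_leaf *lp, struct sk_task *t)
{
    t->prev = NULL;
    t->next = lp->blkd_head;
    if (lp->blkd_head)
        lp->blkd_head->prev = t;
    lp->blkd_head = t;
}

/* A grace period starts: every task currently queued blocks it. */
static void sk_gp_start(struct sk_leaf *lp, struct sk_task **gp_ptr)
{
    *gp_ptr = lp->blkd_head;
}

/* A task ends its outermost critical section and leaves the list. */
static void sk_unblock(struct sk_leaf *lp, struct sk_task *t)
{
    if (lp->gp_tasks == t)
        lp->gp_tasks = t->next;     /* NULL if no later task remains */
    if (lp->exp_tasks == t)
        lp->exp_tasks = t->next;
    if (t->prev)
        t->prev->next = t->next;
    else
        lp->blkd_head = t->next;
    if (t->next)
        t->next->prev = t->prev;
    t->prev = t->next = NULL;
}

/* A task blocks a grace period if it sits at or after that GP's pointer. */
static int sk_blocks(struct sk_task *gp_ptr, struct sk_task *t)
{
    for (; gp_ptr; gp_ptr = gp_ptr->next)
        if (gp_ptr == t)
            return 1;
    return 0;
}
```

Because the list is in reverse time order, a single pointer per grace period is enough: everything from that pointer to the tail blocks the grace period, and everything before it does not.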
- -<h5>Sizing the <tt>rcu_node</tt> Array</h5> - -<p>The <tt>rcu_node</tt> array is sized via a series of -C-preprocessor expressions as follows: - -<pre> - 1 #ifdef CONFIG_RCU_FANOUT - 2 #define RCU_FANOUT CONFIG_RCU_FANOUT - 3 #else - 4 # ifdef CONFIG_64BIT - 5 # define RCU_FANOUT 64 - 6 # else - 7 # define RCU_FANOUT 32 - 8 # endif - 9 #endif -10 -11 #ifdef CONFIG_RCU_FANOUT_LEAF -12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF -13 #else -14 # ifdef CONFIG_64BIT -15 # define RCU_FANOUT_LEAF 64 -16 # else -17 # define RCU_FANOUT_LEAF 32 -18 # endif -19 #endif -20 -21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) -22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) -23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) -24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) -25 -26 #if NR_CPUS <= RCU_FANOUT_1 -27 # define RCU_NUM_LVLS 1 -28 # define NUM_RCU_LVL_0 1 -29 # define NUM_RCU_NODES NUM_RCU_LVL_0 -30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } -31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } -32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } -33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } -34 #elif NR_CPUS <= RCU_FANOUT_2 -35 # define RCU_NUM_LVLS 2 -36 # define NUM_RCU_LVL_0 1 -37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) -39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } -40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } -41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } -42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } -43 #elif NR_CPUS <= RCU_FANOUT_3 -44 # define RCU_NUM_LVLS 3 -45 # define NUM_RCU_LVL_0 1 -46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) -47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) -49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } -50 # define RCU_NODE_NAME_INIT { 
"rcu_node_0", "rcu_node_1", "rcu_node_2" } -51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } -52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } -53 #elif NR_CPUS <= RCU_FANOUT_4 -54 # define RCU_NUM_LVLS 4 -55 # define NUM_RCU_LVL_0 1 -56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) -57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) -58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) -60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } -61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } -62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } -63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } -64 #else -65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" -66 #endif -</pre> - -<p>The maximum number of levels in the <tt>rcu_node</tt> structure -is currently limited to four, as specified by lines 21-24 -and the structure of the subsequent “if” statement. -For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which -should be sufficient for the next few years at least. -For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which -should see us through the next decade or so. -This four-level tree also allows kernels built with -<tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, -which might be useful in very large systems having eight CPUs per -socket (but please note that no one has yet shown any measurable -performance degradation due to misaligned socket and <tt>rcu_node</tt> -boundaries). -In addition, building kernels with a full four levels of <tt>rcu_node</tt> -tree permits better testing of RCU's combining-tree code. 
- -</p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children -are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. -If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, -it is set based on the word size of the system, which is also -the Kconfig default. - -</p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are -handled by each leaf <tt>rcu_node</tt> structure. -Experience has shown that allowing a given leaf <tt>rcu_node</tt> -structure to handle 64 CPUs, as permitted by the number of bits in -the <tt>->qsmask</tt> field on a 64-bit system, results in -excessive contention for the leaf <tt>rcu_node</tt> structures' -<tt>->lock</tt> fields. -The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore -limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>. -If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value -selected is based on the word size of the system, just as for -<tt>CONFIG_RCU_FANOUT</tt>. -Lines 11-19 perform this computation. - -</p><p>Lines 21-24 compute the maximum number of CPUs supported by -a single-level (which contains a single <tt>rcu_node</tt> structure), -two-level, three-level, and four-level <tt>rcu_node</tt> tree, -respectively, given the fanout specified by <tt>RCU_FANOUT</tt> -and <tt>RCU_FANOUT_LEAF</tt>. -These numbers of CPUs are retained in the -<tt>RCU_FANOUT_1</tt>, -<tt>RCU_FANOUT_2</tt>, -<tt>RCU_FANOUT_3</tt>, and -<tt>RCU_FANOUT_4</tt> -C-preprocessor variables, respectively. - -</p><p>These variables are used to control the C-preprocessor <tt>#if</tt> -statement spanning lines 26-66 that computes the number of -<tt>rcu_node</tt> structures required for each level of the tree, -as well as the number of levels required. -The number of levels is placed in the <tt>RCU_NUM_LVLS</tt> -C-preprocessor variable by lines 27, 35, 44, and 54.
-The number of <tt>rcu_node</tt> structures for the topmost level -of the tree is always exactly one, and this value is unconditionally -placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55. -The rest of the levels (if any) of the <tt>rcu_node</tt> tree -are computed by dividing the maximum number of CPUs by the -fanout supported by the number of levels from the current level down, -rounding up. This computation is performed by lines 37, -46-47, and 56-58. -Lines 31-33, 40-42, 50-52, and 61-63 create initializers -for lockdep lock-class names. -Finally, lines 64-66 produce an error if the maximum number of -CPUs is too large for the specified fanout. - -<h3><a name="The rcu_segcblist Structure"> -The <tt>rcu_segcblist</tt> Structure</a></h3> - -The <tt>rcu_segcblist</tt> structure maintains a segmented list of -callbacks as follows: - -<pre> - 1 #define RCU_DONE_TAIL 0 - 2 #define RCU_WAIT_TAIL 1 - 3 #define RCU_NEXT_READY_TAIL 2 - 4 #define RCU_NEXT_TAIL 3 - 5 #define RCU_CBLIST_NSEGS 4 - 6 - 7 struct rcu_segcblist { - 8 struct rcu_head *head; - 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; -10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; -11 long len; -12 long len_lazy; -13 }; -</pre> - -<p> -The segments are as follows: - -<ol> -<li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed. - These callbacks are ready to be invoked. -<li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the - current grace period. - Note that different CPUs can have different ideas about which - grace period is current, hence the <tt>->gp_seq</tt> field. -<li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next - grace period to start. -<li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been - associated with a grace period. -</ol> - -<p> -The <tt>->head</tt> pointer references the first callback or -is <tt>NULL</tt> if the list contains no callbacks (which is -<i>not</i> the same as being empty).
-Each element of the <tt>->tails[]</tt> array references the -<tt>->next</tt> pointer of the last callback in the corresponding -segment of the list, or the list's <tt>->head</tt> pointer if -that segment and all previous segments are empty. -If the corresponding segment is empty but some previous segment is -not empty, then the array element is identical to its predecessor. -Older callbacks are closer to the head of the list, and new callbacks -are added at the tail. -This relationship between the <tt>->head</tt> pointer, the -<tt>->tails[]</tt> array, and the callbacks is shown in this -diagram: - -</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%"> - -</p><p>In this figure, the <tt>->head</tt> pointer references the -first -RCU callback in the list. -The <tt>->tails[RCU_DONE_TAIL]</tt> array element references -the <tt>->head</tt> pointer itself, indicating that none -of the callbacks is ready to invoke. -The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback -CB 2's <tt>->next</tt> pointer, which indicates that -CB 1 and CB 2 are both waiting on the current grace period, -give or take possible disagreements about exactly which grace period -is the current one. -The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element -references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt> -does, which indicates that there are no callbacks waiting on the next -RCU grace period. -The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references -CB 4's <tt>->next</tt> pointer, indicating that all the -remaining RCU callbacks have not yet been assigned to an RCU grace -period. -Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element -always references the last RCU callback's <tt>->next</tt> pointer -unless the callback list is empty, in which case it references -the <tt>->head</tt> pointer. 
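<p>The tail-pointer conventions described above can be exercised with a minimal C sketch.
The <tt>sk_</tt>-prefixed types and helpers are hypothetical and far simpler than the kernel's <tt>rcu_segcblist</tt> code: they omit the <tt>->gp_seq[]</tt> array, the length counters, and all synchronization, and <tt>sk_assign_all_to_wait()</tt> is an invented shortcut used only to reach the state shown in the figure.

```c
#include <stddef.h>

#define SK_DONE_TAIL        0
#define SK_WAIT_TAIL        1
#define SK_NEXT_READY_TAIL  2
#define SK_NEXT_TAIL        3
#define SK_NSEGS            4

struct sk_head {
    struct sk_head *next;
};

struct sk_segcblist {
    struct sk_head *head;
    struct sk_head **tails[SK_NSEGS];
};

/* An empty, enabled list: every tail references the head pointer. */
static void sk_init(struct sk_segcblist *l)
{
    int i;

    l->head = NULL;
    for (i = 0; i < SK_NSEGS; i++)
        l->tails[i] = &l->head;
}

/* New callbacks are appended to the end of the NEXT segment. */
static void sk_enqueue(struct sk_segcblist *l, struct sk_head *cb)
{
    cb->next = NULL;
    *l->tails[SK_NEXT_TAIL] = cb;
    l->tails[SK_NEXT_TAIL] = &cb->next;
}

/* Invented shortcut: make all unassigned callbacks wait on the current GP. */
static void sk_assign_all_to_wait(struct sk_segcblist *l)
{
    l->tails[SK_WAIT_TAIL] = l->tails[SK_NEXT_TAIL];
    l->tails[SK_NEXT_READY_TAIL] = l->tails[SK_NEXT_TAIL];
}

/* A segment is empty when its tail equals its predecessor's tail. */
static int sk_seg_empty(struct sk_segcblist *l, int seg)
{
    if (seg == SK_DONE_TAIL)
        return l->tails[seg] == &l->head;
    return l->tails[seg] == l->tails[seg - 1];
}
```

Enqueuing CB 1 and CB 2, calling <tt>sk_assign_all_to_wait()</tt>, and then enqueuing CB 3 and CB 4 leaves the tail pointers in the arrangement that the figure describes.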
- -<p> -There is one additional important special case for the -<tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt> -when this list is <i>disabled</i>. -Lists are disabled when the corresponding CPU is offline or when -the corresponding CPU's callbacks are offloaded to a kthread, -both of which are described elsewhere. - -</p><p>CPUs advance their callbacks from the -<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the -<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments -as grace periods advance. - -</p><p>The <tt>->gp_seq[]</tt> array records grace-period -numbers corresponding to the list segments. -This is what allows different CPUs to have different ideas as to -which is the current grace period while still avoiding premature -invocation of their callbacks. -In particular, this allows CPUs that go idle for extended periods -to determine which of their callbacks are ready to be invoked after -reawakening. - -</p><p>The <tt>->len</tt> counter contains the number of -callbacks in <tt>->head</tt>, and the -<tt>->len_lazy</tt> contains the number of those callbacks that -are known to only free memory, and whose invocation can therefore -be safely deferred. - -<p><b>Important note</b>: It is the <tt>->len</tt> field that -determines whether or not there are callbacks associated with -this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt> -pointer. -The reason for this is that all the ready-to-invoke callbacks -(that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted -all at once at callback-invocation time (<tt>rcu_do_batch</tt>), due -to which <tt>->head</tt> may be set to NULL if there are no not-done -callbacks remaining in the <tt>rcu_segcblist</tt>. 
-If callback invocation must be postponed, for example, because a -high-priority process just woke up on this CPU, then the remaining -callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment and -<tt>->head</tt> once again points to the start of the segment. -In short, the head field can briefly be <tt>NULL</tt> even though the -CPU has callbacks present the entire time. -Therefore, it is not appropriate to test the <tt>->head</tt> pointer -for <tt>NULL</tt>. - -<p>In contrast, the <tt>->len</tt> and <tt>->len_lazy</tt> counts -are adjusted only after the corresponding callbacks have been invoked. -This means that the <tt>->len</tt> count is zero only if -the <tt>rcu_segcblist</tt> structure really is devoid of callbacks. -Of course, off-CPU sampling of the <tt>->len</tt> count requires -careful use of appropriate synchronization, for example, memory barriers. -This synchronization can be a bit subtle, particularly in the case -of <tt>rcu_barrier()</tt>. - -<h3><a name="The rcu_data Structure"> -The <tt>rcu_data</tt> Structure</a></h3> - -<p>The <tt>rcu_data</tt> maintains the per-CPU state for the RCU subsystem. -The fields in this structure may be accessed only from the corresponding -CPU (and from tracing) unless otherwise stated. -This structure is the -focus of quiescent-state detection and RCU callback queuing. -It also tracks its relationship to the corresponding leaf -<tt>rcu_node</tt> structure to allow more-efficient -propagation of quiescent states up the <tt>rcu_node</tt> -combining tree. -Like the <tt>rcu_node</tt> structure, it provides a local -copy of the grace-period information to allow for-free -synchronized -access to this information from the corresponding CPU. -Finally, this structure records past dyntick-idle state -for the corresponding CPU and also tracks statistics. - -</p><p>The <tt>rcu_data</tt> structure's fields are discussed, -singly and in groups, in the following sections. 
- -<h5>Connection to Other Data Structures</h5> - -<p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 int cpu; - 2 struct rcu_node *mynode; - 3 unsigned long grpmask; - 4 bool beenonline; -</pre> - -<p>The <tt>->cpu</tt> field contains the number of the -corresponding CPU and the <tt>->mynode</tt> field references the -corresponding <tt>rcu_node</tt> structure. -The <tt>->mynode</tt> is used to propagate quiescent states -up the combining tree. -These two fields are constant and therefore do not require synchronization. - -<p>The <tt>->grpmask</tt> field indicates the bit in -the <tt>->mynode->qsmask</tt> corresponding to this -<tt>rcu_data</tt> structure, and is also used when propagating -quiescent states. -The <tt>->beenonline</tt> flag is set whenever the corresponding -CPU comes online, which means that the debugfs tracing need not dump -out any <tt>rcu_data</tt> structure for which this flag is not set. - -<h5>Quiescent-State and Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_seq; - 2 unsigned long gp_seq_needed; - 3 bool cpu_no_qs; - 4 bool core_needs_qs; - 5 bool gpwrap; -</pre> - -<p>The <tt>->gp_seq</tt> field is the counterpart of the field of the same -name in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures. The -<tt>->gp_seq_needed</tt> field is the counterpart of the field of the same -name in the rcu_node</tt> structure. -They may each lag up to one behind their <tt>rcu_node</tt> -counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and -<tt>CONFIG_NO_HZ_FULL</tt> kernels can lag -arbitrarily far behind for CPUs in dyntick-idle mode (but these counters -will catch up upon exit from dyntick-idle mode). -If the lower two bits of a given <tt>rcu_data</tt> structure's -<tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt> -structure believes that RCU is idle. 
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - All this replication of the grace period numbers can only cause - massive confusion. - Why not just keep a global sequence number and be done with it??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because if there was only a single global sequence - numbers, there would need to be a single global lock to allow - safely accessing and updating it. - And if we are not going to have a single global lock, we need - to carefully manage the numbers on a per-node basis. - Recall from the answer to a previous Quick Quiz that the consequences - of applying a previously sampled quiescent state to the wrong - grace period are quite severe. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>The <tt>->cpu_no_qs</tt> flag indicates that the -CPU has not yet passed through a quiescent state, -while the <tt>->core_needs_qs</tt> flag indicates that the -RCU core needs a quiescent state from the corresponding CPU. -The <tt>->gpwrap</tt> field indicates that the corresponding -CPU has remained idle for so long that the -<tt>gp_seq</tt> counter is in danger of overflow, which -will cause the CPU to disregard the values of its counters on -its next exit from idle. - -<h5>RCU Callback Handling</h5> - -<p>In the absence of CPU-hotplug events, RCU callbacks are invoked by -the same CPU that registered them. -This is strictly a cache-locality optimization: callbacks can and -do get invoked on CPUs other than the one that registered them. -After all, if the CPU that registered a given callback has gone -offline before the callback can be invoked, there really is no other -choice. 
- -</p><p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_segcblist cblist; - 2 long qlen_last_fqs_check; - 3 unsigned long n_cbs_invoked; - 4 unsigned long n_nocbs_invoked; - 5 unsigned long n_cbs_orphaned; - 6 unsigned long n_cbs_adopted; - 7 unsigned long n_force_qs_snap; - 8 long blimit; -</pre> - -<p>The <tt>->cblist</tt> structure is the segmented callback list -described earlier. -The CPU advances the callbacks in its <tt>rcu_data</tt> structure -whenever it notices that another RCU grace period has completed. -The CPU detects the completion of an RCU grace period by noticing -that the value of its <tt>rcu_data</tt> structure's -<tt>->gp_seq</tt> field differs from that of its leaf -<tt>rcu_node</tt> structure. -Recall that each <tt>rcu_node</tt> structure's -<tt>->gp_seq</tt> field is updated at the beginnings and ends of each -grace period. - -<p> -The <tt>->qlen_last_fqs_check</tt> and -<tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent -states from <tt>call_rcu()</tt> and friends when callback -lists grow excessively long. - -</p><p>The <tt>->n_cbs_invoked</tt>, -<tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt> -fields count the number of callbacks invoked, -sent to other CPUs when this CPU goes offline, -and received from other CPUs when those other CPUs go offline. -The <tt>->n_nocbs_invoked</tt> is used when the CPU's callbacks -are offloaded to a kthread. - -<p> -Finally, the <tt>->blimit</tt> counter is the maximum number of -RCU callbacks that may be invoked at a given time. - -<h5>Dyntick-Idle Handling</h5> - -<p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 int dynticks_snap; - 2 unsigned long dynticks_fqs; -</pre> - -The <tt>->dynticks_snap</tt> field is used to take a snapshot -of the corresponding CPU's dyntick-idle state when forcing -quiescent states, and is therefore accessed from other CPUs. 
-Finally, the <tt>->dynticks_fqs</tt> field is used to -count the number of times this CPU is determined to be in -dyntick-idle state, and is used for tracing and debugging purposes. - -<p> -This portion of the rcu_data structure is declared as follows: - -<pre> - 1 long dynticks_nesting; - 2 long dynticks_nmi_nesting; - 3 atomic_t dynticks; - 4 bool rcu_need_heavy_qs; - 5 bool rcu_urgent_qs; -</pre> - -<p>These fields in the rcu_data structure maintain the per-CPU dyntick-idle -state for the corresponding CPU. -The fields may be accessed only from the corresponding CPU (and from tracing) -unless otherwise stated. - -<p>The <tt>->dynticks_nesting</tt> field counts the -nesting depth of process execution, so that in normal circumstances -this counter has value zero or one. -NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt> -field. -Because NMIs cannot be masked, changes to this variable have to be -undertaken carefully using an algorithm provided by Andy Lutomirski. -The initial transition from idle adds one, and nested transitions -add two, so that a nesting level of five is represented by a -<tt>->dynticks_nmi_nesting</tt> value of nine. -This counter can therefore be thought of as counting the number -of reasons why this CPU cannot be permitted to enter dyntick-idle -mode, aside from process-level transitions. - -<p>However, it turns out that when running in non-idle kernel context, -the Linux kernel is fully capable of entering interrupt handlers that -never exit and perhaps also vice versa. -Therefore, whenever the <tt>->dynticks_nesting</tt> field is -incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field -is set to a large positive number, and whenever the -<tt>->dynticks_nesting</tt> field is decremented down to zero, -the the <tt>->dynticks_nmi_nesting</tt> field is set to zero. 
-Assuming that the number of misnested interrupts is not sufficient -to overflow the counter, this approach corrects the -<tt>->dynticks_nmi_nesting</tt> field every time the corresponding -CPU enters the idle loop from process context. - -</p><p>The <tt>->dynticks</tt> field counts the corresponding -CPU's transitions to and from either dyntick-idle or user mode, so -that this counter has an even value when the CPU is in dyntick-idle -mode or user mode and an odd value otherwise. The transitions to/from -user mode need to be counted for user mode adaptive-ticks support -(see timers/NO_HZ.txt). - -</p><p>The <tt>->rcu_need_heavy_qs</tt> field is used -to record the fact that the RCU core code would really like to -see a quiescent state from the corresponding CPU, so much so that -it is willing to call for heavy-weight dyntick-counter operations. -This flag is checked by RCU's context-switch and <tt>cond_resched()</tt> -code, which provide a momentary idle sojourn in response. - -</p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record -the fact that the RCU core code would really like to see a quiescent state from -the corresponding CPU, with the various other fields indicating just how badly -RCU wants this quiescent state. -This flag is checked by RCU's context-switch path -(<tt>rcu_note_context_switch</tt>) and the cond_resched code. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why not simply combine the <tt>->dynticks_nesting</tt> - and <tt>->dynticks_nmi_nesting</tt> counters into a - single counter that just counts the number of reasons that - the corresponding CPU is non-idle? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because this would fail in the presence of interrupts whose - handlers never return and of handlers that manage to return - from a made-up interrupt. 
-</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>Additional fields are present for some special-purpose -builds, and are discussed separately. - -<h3><a name="The rcu_head Structure"> -The <tt>rcu_head</tt> Structure</a></h3> - -<p>Each <tt>rcu_head</tt> structure represents an RCU callback. -These structures are normally embedded within RCU-protected data -structures whose algorithms use asynchronous grace periods. -In contrast, when using algorithms that block waiting for RCU grace periods, -RCU users need not provide <tt>rcu_head</tt> structures. - -</p><p>The <tt>rcu_head</tt> structure has fields as follows: - -<pre> - 1 struct rcu_head *next; - 2 void (*func)(struct rcu_head *head); -</pre> - -<p>The <tt>->next</tt> field is used -to link the <tt>rcu_head</tt> structures together in the -lists within the <tt>rcu_data</tt> structures. -The <tt>->func</tt> field is a pointer to the function -to be called when the callback is ready to be invoked, and -this function is passed a pointer to the <tt>rcu_head</tt> -structure. -However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt> -field to record the offset of the <tt>rcu_head</tt> -structure within the enclosing RCU-protected data structure. - -</p><p>Both of these fields are used internally by RCU. -From the viewpoint of RCU users, this structure is an -opaque “cookie”. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Given that the callback function <tt>->func</tt> - is passed a pointer to the <tt>rcu_head</tt> structure, - how is that function supposed to find the beginning of the - enclosing RCU-protected data structure? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In actual practice, there is a separate callback function per - type of RCU-protected data structure. 
- The callback function can therefore use the <tt>container_of()</tt> - macro in the Linux kernel (or other pointer-manipulation facilities - in other software environments) to find the beginning of the - enclosing structure. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="RCU-Specific Fields in the task_struct Structure"> -RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3> - -<p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some -additional fields in the <tt>task_struct</tt> structure: - -<pre> - 1 #ifdef CONFIG_PREEMPT_RCU - 2 int rcu_read_lock_nesting; - 3 union rcu_special rcu_read_unlock_special; - 4 struct list_head rcu_node_entry; - 5 struct rcu_node *rcu_blocked_node; - 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ - 7 #ifdef CONFIG_TASKS_RCU - 8 unsigned long rcu_tasks_nvcsw; - 9 bool rcu_tasks_holdout; -10 struct list_head rcu_tasks_holdout_list; -11 int rcu_tasks_idle_cpu; -12 #endif /* #ifdef CONFIG_TASKS_RCU */ -</pre> - -<p>The <tt>->rcu_read_lock_nesting</tt> field records the -nesting level for RCU read-side critical sections, and -the <tt>->rcu_read_unlock_special</tt> field is a bitmask -that records special conditions that require <tt>rcu_read_unlock()</tt> -to do additional work. -The <tt>->rcu_node_entry</tt> field is used to form lists of -tasks that have blocked within preemptible-RCU read-side critical -sections and the <tt>->rcu_blocked_node</tt> field references -the <tt>rcu_node</tt> structure whose list this task is a member of, -or <tt>NULL</tt> if it is not blocked within a preemptible-RCU -read-side critical section. 
- -<p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of -voluntary context switches that this task had undergone at the -beginning of the current tasks-RCU grace period, -<tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU -grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt> -is a list element enqueuing this task on the holdout list, -and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this -idle task is running, but only if the task is currently running, -that is, if the CPU is currently idle. - -<h3><a name="Accessor Functions"> -Accessor Functions</a></h3> - -<p>The following listing shows the -<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt> and -<tt>rcu_for_each_leaf_node()</tt> function and macros: - -<pre> - 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) - 2 { - 3 return &rsp->node[0]; - 4 } - 5 - 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ - 7 for ((rnp) = &(rsp)->node[0]; \ - 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) - 9 - 10 #define rcu_for_each_leaf_node(rsp, rnp) \ - 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ - 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) -</pre> - -<p>The <tt>rcu_get_root()</tt> simply returns a pointer to the -first element of the specified <tt>rcu_state</tt> structure's -<tt>->node[]</tt> array, which is the root <tt>rcu_node</tt> -structure. - -</p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt> -macro takes advantage of the layout of the <tt>rcu_node</tt> -structures in the <tt>rcu_state</tt> structure's -<tt>->node[]</tt> array, performing a breadth-first traversal by -simply traversing the array in order. -Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only -the last part of the array, thus traversing only the leaf -<tt>rcu_node</tt> structures. 
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - What does - <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree - contains only a single node? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In the single-node case, - <tt>rcu_for_each_leaf_node()</tt> traverses the single node. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Summary"> -Summary</a></h3> - -So the state of RCU is represented by an <tt>rcu_state</tt> structure, -which contains a combining tree of <tt>rcu_node</tt> and -<tt>rcu_data</tt> structures. -Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle -state is tracked by dynticks-related fields in the <tt>rcu_data</tt> structure. - -If you made it this far, you are well prepared to read the code -walkthroughs in the other articles in this series. - -<h3><a name="Acknowledgments"> -Acknowledgments</a></h3> - -I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul -Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn -for helping me get this document into a more human-readable state. - -<h3><a name="Legal Statement"> -Legal Statement</a></h3> - -<p>This work represents the view of the author and does not necessarily -represent the view of IBM. - -</p><p>Linux is a registered trademark of Linus Torvalds. - -</p><p>Other company, product, and service names may be trademarks or -service marks of others. 
- -</body></html> diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst new file mode 100644 index 000000000000..4a48e20a46f2 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst @@ -0,0 +1,1163 @@ +=================================================== +A Tour Through TREE_RCU's Data Structures [LWN.net] +=================================================== + +December 18, 2016 + +This article was contributed by Paul E. McKenney + +Introduction +============ + +This document describes RCU's major data structures and their relationship +to each other. + +Data-Structure Relationships +============================ + +RCU is for all intents and purposes a large state machine, and its +data structures maintain the state in such a way as to allow RCU readers +to execute extremely quickly, while also processing the RCU grace periods +requested by updaters in an efficient and extremely scalable fashion. +The efficiency and scalability of RCU updaters is provided primarily +by a combining tree, as shown below: + +.. kernel-figure:: BigTreeClassicRCU.svg + +This diagram shows an enclosing ``rcu_state`` structure containing a tree +of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up +to 16 ``rcu_data`` structures associated with it, so that there are +``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU. +This structure is adjusted at boot time, if needed, to handle the common +case where ``nr_cpu_ids`` is much less than ``NR_CPUs``. +For example, a number of Linux distributions set ``NR_CPUs=4096``, +which results in a three-level ``rcu_node`` tree. +If the actual hardware has only 16 CPUs, RCU will adjust itself +at boot time, resulting in an ``rcu_node`` tree with only a single node. 
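The boot-time sizing described above can be sketched as a short calculation. This is an illustration only: ``rcu_tree_levels()`` is a hypothetical helper, not a kernel function, and it assumes a leaf fanout of 16 and an interior fanout of 64 (the 64-bit case).

```c
/*
 * Illustrative sketch only: rcu_tree_levels() is a hypothetical helper,
 * not part of the kernel. It counts how many levels a combining tree
 * needs when each leaf node covers leaf_fanout CPUs and each interior
 * node covers fanout children.
 */
static int rcu_tree_levels(unsigned long ncpus, unsigned long leaf_fanout,
                           unsigned long fanout)
{
    unsigned long capacity = leaf_fanout; /* CPUs covered by a single node */
    int levels = 1;

    while (capacity < ncpus) {
        capacity *= fanout; /* each added level multiplies the coverage */
        levels++;
    }
    return levels;
}
```

With these assumed fanouts, ``rcu_tree_levels(4096, 16, 64)`` gives three levels and ``rcu_tree_levels(16, 16, 64)`` gives one, matching the two cases mentioned in the text.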
+ +The purpose of this combining tree is to allow per-CPU events +such as quiescent states, dyntick-idle transitions, +and CPU hotplug operations to be processed efficiently +and scalably. +Quiescent states are recorded by the per-CPU ``rcu_data`` structures, +and other events are recorded by the leaf-level ``rcu_node`` +structures. +All of these events are combined at each level of the tree until finally +grace periods are completed at the tree's root ``rcu_node`` +structure. +A grace period can be completed at the root once every CPU +(or, in the case of ``CONFIG_PREEMPT_RCU``, task) +has passed through a quiescent state. +Once a grace period has completed, record of that fact is propagated +back down the tree. + +As can be seen from the diagram, on a 64-bit system +a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout +of 64 at the root and a fanout of 16 at the leaves. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why isn't the fanout at the leaves also 64? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because there are more types of events that affect the leaf-level | +| ``rcu_node`` structures than further up the tree. Therefore, if the | +| leaf ``rcu_node`` structures have a fanout of 64, the contention on | +| these structures' ``->lock`` becomes excessive. Experimentation | +| on a wide variety of systems has shown that a fanout of 16 works well | +| for the leaves of the ``rcu_node`` tree. | +| | +| Of course, further experience with systems having hundreds or | +| thousands of CPUs may demonstrate that the fanout for the non-leaf | +| ``rcu_node`` structures must also be reduced. Such reduction can be | +| easily carried out when and if it proves necessary.
In the meantime, | +| if you are using such a system and running into contention problems | +| on the non-leaf ``rcu_node`` structures, you may use the | +| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the | +| non-leaf fanout as needed. | +| | +| Kernels built for systems with strong NUMA characteristics might | +| also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of | +| the ``rcu_node`` structures align with hardware boundaries. | +| However, there has thus far been no need for this. | ++-----------------------------------------------------------------------+ + +If your system has more than 1,024 CPUs (or more than 512 CPUs on a +32-bit system), then RCU will automatically add more levels to the tree. +For example, if you are crazy enough to build a 64-bit system with +65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows: + +.. kernel-figure:: HugeTreeClassicRCU.svg + +RCU currently permits up to a four-level tree, which on a 64-bit system +accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for +32-bit systems. On the other hand, you can set both +``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as +2, which would result in a 16-CPU test using a 4-level tree. This can be +useful for testing large-system capabilities on small test machines. + +This multi-level combining tree allows us to get most of the performance +and scalability benefits of partitioning, even though RCU grace-period +detection is inherently a global operation. The trick here is that only +the last CPU to report a quiescent state into a given ``rcu_node`` +structure need advance to the ``rcu_node`` structure at the next level +up the tree. This means that at the leaf-level ``rcu_node`` structure, +only one access out of sixteen will progress up the tree. For the +internal ``rcu_node`` structures, the situation is even more extreme: +Only one access out of sixty-four will progress up the tree. 
Because the +vast majority of the CPUs do not progress up the tree, the lock +contention remains roughly constant up the tree. No matter how many CPUs +there are in the system, at most 64 quiescent-state reports per grace +period will progress all the way to the root ``rcu_node`` structure, +thus ensuring that the lock contention on that root ``rcu_node`` +structure remains acceptably low. + +In effect, the combining tree acts like a big shock absorber, keeping +lock contention under control at all tree levels regardless of the level +of loading on the system. + +RCU updaters wait for normal grace periods by registering RCU callbacks, +either directly via ``call_rcu()`` or indirectly via +``synchronize_rcu()`` and friends. RCU callbacks are represented by +``rcu_head`` structures, which are queued on ``rcu_data`` structures +while they are waiting for a grace period to elapse, as shown in the +following figure: + +.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg + +This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data +structures are related. Lesser data structures will be introduced with +the algorithms that make use of them. + +Note that each of the data structures in the above figure has its own +synchronization: + +#. Each ``rcu_state`` structure has a lock and a mutex, and some fields + are protected by the corresponding root ``rcu_node`` structure's lock. +#. Each ``rcu_node`` structure has a spinlock. +#. The fields in ``rcu_data`` are private to the corresponding CPU, + although a few can be read and written by other CPUs. + +It is important to note that different data structures can have very +different ideas about the state of RCU at any given time. For but one +example, awareness of the start or end of a given RCU grace period +propagates slowly through the data structures. This slow propagation is +absolutely necessary for RCU to have good read-side performance.
If this +balkanized implementation seems foreign to you, one useful trick is to +consider each instance of these data structures to be a different +person, each having the usual slightly different view of reality. + +The general role of each of these data structures is as follows: + +#. ``rcu_state``: This structure forms the interconnection between the + ``rcu_node`` and ``rcu_data`` structures, tracks grace periods, + serves as short-term repository for callbacks orphaned by CPU-hotplug + events, maintains ``rcu_barrier()`` state, tracks expedited + grace-period state, and maintains state used to force quiescent + states when grace periods extend too long, +#. ``rcu_node``: This structure forms the combining tree that propagates + quiescent-state information from the leaves to the root, and also + propagates grace-period information from the root to the leaves. It + provides local copies of the grace-period state in order to allow + this information to be accessed in a synchronized manner without + suffering the scalability limitations that would otherwise be imposed + by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the + lists of tasks that have blocked while in their current RCU read-side + critical section. In ``CONFIG_PREEMPT_RCU`` with + ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node`` + priority-boosting kernel threads (kthreads) and state. Finally, it + records CPU-hotplug state in order to determine which CPUs should be + ignored during a given grace period. +#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state + detection and RCU callback queuing. It also tracks its relationship + to the corresponding leaf ``rcu_node`` structure to allow + more-efficient propagation of quiescent states up the ``rcu_node`` + combining tree. Like the ``rcu_node`` structure, it provides a local + copy of the grace-period information to allow for-free synchronized + access to this information from the corresponding CPU. 
Finally, this + structure records past dyntick-idle state for the corresponding CPU + and also tracks statistics. +#. ``rcu_head``: This structure represents RCU callbacks, and is the + only structure allocated and managed by RCU users. The ``rcu_head`` + structure is normally embedded within the RCU-protected data + structure. + +If all you wanted from this article was a general notion of how RCU's +data structures are related, you are done. Otherwise, each of the +following sections gives more details on the ``rcu_state``, ``rcu_node`` +and ``rcu_data`` data structures. + +The ``rcu_state`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_state`` structure is the base structure that represents the +state of RCU in the system. This structure forms the interconnection +between the ``rcu_node`` and ``rcu_data`` structures, tracks grace +periods, contains the lock used to synchronize with CPU-hotplug events, +and maintains state used to force quiescent states when grace periods +extend too long. + +A few of the ``rcu_state`` structure's fields are discussed, singly and +in groups, in the following sections. The more specialized fields are +covered in the discussion of their use. + +Relationship to rcu_node and rcu_data Structures +'''''''''''''''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 struct rcu_node node[NUM_RCU_NODES]; + 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; + 3 struct rcu_data __percpu *rda; + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that the ``rcu_node`` structures formed a | +| tree, but they are declared as a flat array! What gives?
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| The tree is laid out in the array. The first node in the array is the | +| head, the next set of nodes in the array are children of the head | +| node, and so on until the last set of nodes in the array are the | +| leaves. | +| See the following diagrams to see how this works. | ++-----------------------------------------------------------------------+ + +The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown +in the following figure: + +.. kernel-figure:: TreeMapping.svg + +One interesting consequence of this mapping is that a breadth-first +traversal of the tree is implemented as a simple linear scan of the +array, which is in fact what the ``rcu_for_each_node_breadth_first()`` +macro does. This macro is used at the beginnings and ends of grace +periods. + +Each entry of the ``->level`` array references the first ``rcu_node`` +structure on the corresponding level of the tree, for example, as shown +below: + +.. kernel-figure:: TreeMappingLevel.svg + +The zero\ :sup:`th` element of the array references the root +``rcu_node`` structure, the first element references the first child of +the root ``rcu_node``, and finally the second element references the +first leaf ``rcu_node`` structure. + +For whatever it is worth, if you draw the tree to be tree-shaped rather +than array-shaped, it is easy to draw a planar representation: + +.. kernel-figure:: TreeLevel.svg + +Finally, the ``->rda`` field references a per-CPU pointer to the +corresponding CPU's ``rcu_data`` structure. + +All of these fields are constant once initialization is complete, and +therefore need no protection.
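The array-embedded layout described above can be sketched with a toy tree. Everything here is hypothetical and much simpler than the kernel's structures: ``toy_state``, ``toy_node``, and the fixed 1/2/4 shape are assumptions chosen only to show why a linear scan is breadth-first and how the per-level pointers locate the leaves.

```c
#define TOY_NUM_LVLS  3
#define TOY_NUM_NODES 7 /* 1 root + 2 interior + 4 leaves (assumed shape) */

struct toy_node {
    int level; /* 0 for the root, increasing toward the leaves */
};

struct toy_state {
    struct toy_node node[TOY_NUM_NODES];   /* tree stored level by level */
    struct toy_node *level[TOY_NUM_LVLS];  /* first node of each level */
};

/* Lay the tree out in the flat array, root first and leaves last. */
static void toy_init(struct toy_state *sp)
{
    static const int per_level[TOY_NUM_LVLS] = { 1, 2, 4 };
    int i = 0;

    for (int l = 0; l < TOY_NUM_LVLS; l++) {
        sp->level[l] = &sp->node[i];
        for (int j = 0; j < per_level[l]; j++)
            sp->node[i++].level = l;
    }
}

/* A linear scan visits levels in nondecreasing order: breadth-first. */
static int toy_scan_is_breadth_first(struct toy_state *sp)
{
    for (int i = 1; i < TOY_NUM_NODES; i++)
        if (sp->node[i].level < sp->node[i - 1].level)
            return 0;
    return 1;
}

/* Leaves are simply the tail of the array, starting at the last level. */
static int toy_count_leaves(struct toy_state *sp)
{
    int n = 0;

    for (struct toy_node *np = sp->level[TOY_NUM_LVLS - 1];
         np < &sp->node[TOY_NUM_NODES]; np++)
        n++;
    return n;
}
```

Because the nodes never move after initialization, both traversals need only pointer arithmetic, which is consistent with the statement that these fields need no protection once set up.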
+ +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + +RCU grace periods are numbered, and the ``->gp_seq`` field contains the +current grace-period sequence number. The bottom two bits are the state +of the current grace period, which can be zero for not yet started or +one for in progress. In other words, if the bottom two bits of +``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom +two bits indicates that something is broken. This field is protected by +the root ``rcu_node`` structure's ``->lock`` field. + +There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data`` +structures as well. The fields in the ``rcu_state`` structure represent +the most current value, and those of the other structures are compared +in order to detect the beginnings and ends of grace periods in a +distributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` +(down the tree from the root to the leaves) to ``rcu_data``. + +Miscellaneous +''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 unsigned long gp_max; + 2 char abbr; + 3 char *name; + +The ``->gp_max`` field tracks the duration of the longest grace period +in jiffies. It is protected by the root ``rcu_node``'s ``->lock``. + +The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU +(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”). +These fields are used for diagnostic and tracing purposes. + +The ``rcu_node`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_node`` structures form the combining tree that propagates +quiescent-state information from the leaves to the root and also that +propagates grace-period information from the root down to the leaves. 
+They provide local copies of the grace-period state in order to allow +this information to be accessed in a synchronized manner without +suffering the scalability limitations that would otherwise be imposed by +global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists +of tasks that have blocked while in their current RCU read-side critical +section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they +manage the per-\ ``rcu_node`` priority-boosting kernel threads +(kthreads) and state. Finally, they record CPU-hotplug state in order to +determine which CPUs should be ignored during a given grace period. + +The ``rcu_node`` structure's fields are discussed, singly and in groups, +in the following sections. + +Connection to Combining Tree +'''''''''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + + 1 struct rcu_node *parent; + 2 u8 level; + 3 u8 grpnum; + 4 unsigned long grpmask; + 5 int grplo; + 6 int grphi; + +The ``->parent`` pointer references the ``rcu_node`` one level up in the +tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation +makes heavy use of this field to push quiescent states up the tree. The +``->level`` field gives the level in the tree, with the root being at +level zero, its children at level one, and so on. The ``->grpnum`` field +gives this node's position within the children of its parent, so this +number can range between 0 and 31 on 32-bit systems and between 0 and 63 +on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only +during initialization and for tracing. The ``->grpmask`` field is the +bitmask counterpart of ``->grpnum``, and therefore always has exactly +one bit set. This mask is used to clear the bit corresponding to this +``rcu_node`` structure in its parent's bitmasks, which are described +later.
Finally, the ``->grplo`` and ``->grphi`` fields contain the +lowest and highest numbered CPU served by this ``rcu_node`` structure, +respectively. + +All of these fields are constant, and thus do not require any +synchronization. + +Synchronization +''''''''''''''' + +This field of the ``rcu_node`` structure is declared as follows: + +:: + + 1 raw_spinlock_t lock; + +This field is used to protect the remaining fields in this structure, +unless otherwise stated. That said, all of the fields in this structure +can be accessed without locking for tracing purposes. Yes, this can +result in confusing traces, but better some tracing confusion than to be +heisenbugged out of existence. + +.. _grace-period-tracking-1: + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + 2 unsigned long gp_seq_needed; + +The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of +the field of the same name in the ``rcu_state`` structure. They each may +lag up to one step behind their ``rcu_state`` counterpart. If the bottom +two bits of a given ``rcu_node`` structure's ``->gp_seq`` field are zero, +then this ``rcu_node`` structure believes that RCU is idle. + +The ``->gp_seq`` field of each ``rcu_node`` structure is updated at the +beginning and the end of each grace period. + +The ``->gp_seq_needed`` fields record the furthest-in-the-future grace +period request seen by the corresponding ``rcu_node`` structure. The +request is considered fulfilled when the value of the ``->gp_seq`` field +equals or exceeds that of the ``->gp_seq_needed`` field. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that this ``rcu_node`` structure doesn't see a request for a | +| very long time. Won't wrapping of the ``->gp_seq`` field cause | +| problems?
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, because if the ``->gp_seq_needed`` field lags behind the | +| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at | +| the end of the grace period. Modulo-arithmetic comparisons therefore | +| will always get the correct answer, even with wrapping. | ++-----------------------------------------------------------------------+ + +Quiescent-State Tracking +'''''''''''''''''''''''' + +These fields manage the propagation of quiescent states up the combining +tree. + +This portion of the ``rcu_node`` structure has fields as follows: + +:: + + 1 unsigned long qsmask; + 2 unsigned long expmask; + 3 unsigned long qsmaskinit; + 4 unsigned long expmaskinit; + +The ``->qsmask`` field tracks which of this ``rcu_node`` structure's +children still need to report quiescent states for the current normal +grace period. Such children will have a value of 1 in their +corresponding bit. Note that the leaf ``rcu_node`` structures should be +thought of as having ``rcu_data`` structures as their children. +Similarly, the ``->expmask`` field tracks which of this ``rcu_node`` +structure's children still need to report quiescent states for the +current expedited grace period. An expedited grace period has the same +conceptual properties as a normal grace period, but the expedited +implementation accepts extreme CPU overhead to obtain much lower +grace-period latency, for example, consuming a few tens of microseconds +worth of CPU time to reduce grace-period duration from milliseconds to +tens of microseconds. The ``->qsmaskinit`` field tracks which of this +``rcu_node`` structure's children cover for at least one online CPU. +This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is +used to initialize ``->expmask``, at the beginning of the normal and +expedited grace periods, respectively.
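The way ``->qsmask`` bits are cleared and pushed up the combining tree can be illustrated with a small user-space sketch. Everything named ``demo_*`` below is hypothetical and exists only for illustration; the real kernel code also acquires each node's ``->lock`` and handles expedited grace periods, both of which are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct rcu_node. */
struct demo_node {
	struct demo_node *parent;  /* NULL for the root */
	unsigned long grpmask;     /* this node's bit in parent->qsmask */
	unsigned long qsmask;      /* children still owing a quiescent state */
	unsigned long qsmaskinit;  /* children covering at least one online CPU */
};

/* Start of a normal grace period: every online child must report. */
static void demo_gp_init(struct demo_node *np)
{
	np->qsmask = np->qsmaskinit;
}

/*
 * Report a quiescent state for the child identified by mask, walking
 * up the tree for as long as each node's last outstanding bit is
 * cleared.  Returns 1 if the root's ->qsmask became zero (the grace
 * period may now end) and 0 otherwise.  Locking is omitted.
 */
static int demo_report_qs(struct demo_node *np, unsigned long mask)
{
	assert(mask != 0);
	for (;;) {
		np->qsmask &= ~mask;
		if (np->qsmask != 0)
			return 0;   /* siblings are still outstanding */
		if (np->parent == NULL)
			return 1;   /* the root is clear */
		mask = np->grpmask; /* now clear our bit in the parent */
		np = np->parent;
	}
}
```

In this sketch a leaf's bits stand for its CPUs, mirroring the note above that leaf ``rcu_node`` structures treat ``rcu_data`` structures as their children.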
+ ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why are these bitmasks protected by locking? Come on, haven't you | +| heard of atomic instructions??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Lockless grace-period computation! Such a tantalizing possibility! | +| But consider the following sequence of events: | +| | +| #. CPU 0 has been in dyntick-idle mode for quite some time. When it | +| wakes up, it notices that the current RCU grace period needs it to | +| report in, so it sets a flag where the scheduling clock interrupt | +| will find it. | +| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and | +| notices that CPU 0 has been in dyntick idle mode, which qualifies | +| as an extended quiescent state. | +| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU | +| read-side critical section, and notices that the RCU core needs | +| something, so commences RCU softirq processing. | +| #. CPU 0's softirq handler executes and is just about ready to report | +| its quiescent state up the ``rcu_node`` tree. | +| #. But CPU 1 beats it to the punch, completing the current grace | +| period and starting a new one. | +| #. CPU 0 now reports its quiescent state for the wrong grace period. | +| That grace period might now end before the RCU read-side critical | +| section. If that happens, disaster will ensue. | +| | +| So the locking is absolutely required in order to coordinate clearing | +| of the bits with updating of the grace-period sequence number in | +| ``->gp_seq``. 
| ++-----------------------------------------------------------------------+ + +Blocked-Task Management +''''''''''''''''''''''' + +``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU +read-side critical sections, and these tasks must be tracked explicitly. +The details of exactly why and how they are tracked will be covered in a +separate article on RCU read-side processing. For now, it is enough to +know that the ``rcu_node`` structure tracks them. + +:: + + 1 struct list_head blkd_tasks; + 2 struct list_head *gp_tasks; + 3 struct list_head *exp_tasks; + 4 bool wait_blkd_tasks; + +The ``->blkd_tasks`` field is a list header for the list of blocked and +preempted tasks. As tasks undergo context switches within RCU read-side +critical sections, their ``task_struct`` structures are enqueued (via +the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the +``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding +to the CPU on which the outgoing context switch executed. As these tasks +later exit their RCU read-side critical sections, they remove themselves +from the list. This list is therefore in reverse time order, so that if +one of the tasks is blocking the current grace period, all subsequent +tasks must also be blocking that same grace period. Therefore, a single +pointer into this list suffices to track all tasks blocking a given +grace period. That pointer is stored in ``->gp_tasks`` for normal grace +periods and in ``->exp_tasks`` for expedited grace periods. These last +two fields are ``NULL`` if either there is no grace period in flight or +if there are no blocked tasks preventing that grace period from +completing. If either of these two pointers is referencing a task that +removes itself from the ``->blkd_tasks`` list, then that task must +advance the pointer to the next task on the list, or set the pointer to +``NULL`` if there are no subsequent tasks on the list. 
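The single-pointer trick described above can be sketched in user-space C. The ``demo_*`` names and the hand-rolled list are assumptions made for illustration, not the kernel's ``list_head`` API, and only the normal grace period's ``->gp_tasks`` pointer is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal circular doubly linked list node. */
struct demo_list {
	struct demo_list *next, *prev;
};

/* Hypothetical stand-in for the blocked-task fields of struct rcu_node. */
struct demo_blkd {
	struct demo_list blkd_tasks;  /* list header */
	struct demo_list *gp_tasks;   /* first task blocking the normal GP */
};

static void demo_blkd_init(struct demo_blkd *b)
{
	b->blkd_tasks.next = b->blkd_tasks.prev = &b->blkd_tasks;
	b->gp_tasks = NULL;
}

/* A task blocks within a read-side critical section: add at the head. */
static void demo_block(struct demo_blkd *b, struct demo_list *t)
{
	t->next = b->blkd_tasks.next;
	t->prev = &b->blkd_tasks;
	b->blkd_tasks.next->prev = t;
	b->blkd_tasks.next = t;
}

/* A grace period starts: every task now on the list blocks it. */
static void demo_gp_start(struct demo_blkd *b)
{
	struct demo_list *first = b->blkd_tasks.next;

	b->gp_tasks = (first == &b->blkd_tasks) ? NULL : first;
}

/* A task ends its critical section: advance ->gp_tasks if it points here. */
static void demo_unblock(struct demo_blkd *b, struct demo_list *t)
{
	assert(t != &b->blkd_tasks);
	if (b->gp_tasks == t)
		b->gp_tasks = (t->next == &b->blkd_tasks) ? NULL : t->next;
	t->prev->next = t->next;
	t->next->prev = t->prev;
	t->next = t->prev = t;
}
```

Because new arrivals go to the head, a task that blocks after ``demo_gp_start()`` sits in front of ``->gp_tasks`` and therefore does not hold up that grace period.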
+ +For example, suppose that tasks T1, T2, and T3 are all hard-affinitied +to the largest-numbered CPU in the system. Then if task T1 blocked in an +RCU read-side critical section, then an expedited grace period started, +then task T2 blocked in an RCU read-side critical section, then a normal +grace period started, and finally task T3 blocked in an RCU read-side +critical section, then the state of the last leaf ``rcu_node`` +structure's blocked-task list would be as shown below: + +.. kernel-figure:: blkd_task.svg + +Task T1 is blocking both grace periods, task T2 is blocking only the +normal grace period, and task T3 is blocking neither grace period. Note +that these tasks will not remove themselves from this list immediately +upon resuming execution. They will instead remain on the list until they +execute the outermost ``rcu_read_unlock()`` that ends their RCU +read-side critical section. + +The ``->wait_blkd_tasks`` field indicates whether or not the current +grace period is waiting on a blocked task.
+ +Sizing the ``rcu_node`` Array +''''''''''''''''''''''''''''' + +The ``rcu_node`` array is sized via a series of C-preprocessor +expressions as follows: + +:: + + 1 #ifdef CONFIG_RCU_FANOUT + 2 #define RCU_FANOUT CONFIG_RCU_FANOUT + 3 #else + 4 # ifdef CONFIG_64BIT + 5 # define RCU_FANOUT 64 + 6 # else + 7 # define RCU_FANOUT 32 + 8 # endif + 9 #endif + 10 + 11 #ifdef CONFIG_RCU_FANOUT_LEAF + 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF + 13 #else + 14 # ifdef CONFIG_64BIT + 15 # define RCU_FANOUT_LEAF 64 + 16 # else + 17 # define RCU_FANOUT_LEAF 32 + 18 # endif + 19 #endif + 20 + 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) + 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) + 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) + 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) + 25 + 26 #if NR_CPUS <= RCU_FANOUT_1 + 27 # define RCU_NUM_LVLS 1 + 28 # define NUM_RCU_LVL_0 1 + 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 + 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } + 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } + 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } + 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } + 34 #elif NR_CPUS <= RCU_FANOUT_2 + 35 # define RCU_NUM_LVLS 2 + 36 # define NUM_RCU_LVL_0 1 + 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) + 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } + 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } + 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } + 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } + 43 #elif NR_CPUS <= RCU_FANOUT_3 + 44 # define RCU_NUM_LVLS 3 + 45 # define NUM_RCU_LVL_0 1 + 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) + 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) + 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, 
NUM_RCU_LVL_2 } + 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } + 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } + 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } + 53 #elif NR_CPUS <= RCU_FANOUT_4 + 54 # define RCU_NUM_LVLS 4 + 55 # define NUM_RCU_LVL_0 1 + 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) + 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) + 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) + 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } + 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } + 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } + 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } + 64 #else + 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" + 66 #endif + +The maximum number of levels in the ``rcu_node`` structure is currently +limited to four, as specified by lines 21-24 and the structure of the +subsequent “if” statement. For 32-bit systems, this allows +16*32*32*32=524,288 CPUs, which should be sufficient for the next few +years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is +allowed, which should see us through the next decade or so. This +four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8`` +to support up to 4096 CPUs, which might be useful in very large systems +having eight CPUs per socket (but please note that no one has yet shown +any measurable performance degradation due to misaligned socket and +``rcu_node`` boundaries). In addition, building kernels with a full four +levels of ``rcu_node`` tree permits better testing of RCU's +combining-tree code. 
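The capacity figures quoted above follow from multiplying one leaf fanout by up to three interior fanouts. The following sketch redoes that arithmetic at run time; the ``demo_*`` helpers are hypothetical and merely mirror what the ``RCU_FANOUT_1`` through ``RCU_FANOUT_4`` macros compute in the preprocessor.

```c
#include <assert.h>

/*
 * Maximum number of CPUs covered by a tree with the given number of
 * levels: one leaf-level factor times (levels - 1) interior factors.
 */
static unsigned long demo_max_cpus(unsigned long fanout_leaf,
				   unsigned long fanout, int levels)
{
	unsigned long n = fanout_leaf;
	int i;

	assert(levels >= 1 && levels <= 4);
	for (i = 1; i < levels; i++)
		n *= fanout;
	return n;
}

/* Smallest number of levels (1 to 4) that covers nr_cpus, or 0 if none. */
static int demo_num_levels(unsigned long nr_cpus,
			   unsigned long fanout_leaf, unsigned long fanout)
{
	int levels;

	for (levels = 1; levels <= 4; levels++)
		if (nr_cpus <= demo_max_cpus(fanout_leaf, fanout, levels))
			return levels;
	return 0;  /* corresponds to the #error branch of the listing */
}
```

For example, ``demo_max_cpus(16, 32, 4)`` and ``demo_max_cpus(16, 64, 4)`` reproduce the 32-bit and 64-bit limits given in the text, and ``demo_max_cpus(8, 8, 4)`` reproduces the 4096-CPU figure for a fanout of eight, assuming the leaf fanout is also eight.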
+ +The ``RCU_FANOUT`` symbol controls how many children are permitted at +each non-leaf level of the ``rcu_node`` tree. If the +``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based +on the word size of the system, which is also the Kconfig default. + +The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by +each leaf ``rcu_node`` structure. Experience has shown that allowing a +given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the +number of bits in the ``->qsmask`` field on a 64-bit system, results in +excessive contention for the leaf ``rcu_node`` structures' ``->lock`` +fields. The number of CPUs per leaf ``rcu_node`` structure is therefore +limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If +``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based +on the word size of the system, just as for ``CONFIG_RCU_FANOUT``. +Lines 11-19 perform this computation. + +Lines 21-24 compute the maximum number of CPUs supported by a +single-level (which contains a single ``rcu_node`` structure), +two-level, three-level, and four-level ``rcu_node`` tree, respectively, +given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``. +These numbers of CPUs are retained in the ``RCU_FANOUT_1``, +``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor +variables, respectively. + +These variables are used to control the C-preprocessor ``#if`` statement +spanning lines 26-66 that computes the number of ``rcu_node`` structures +required for each level of the tree, as well as the number of levels +required. The number of levels is placed in the ``RCU_NUM_LVLS`` +C-preprocessor variable by lines 27, 35, 44, and 54. The number of +``rcu_node`` structures for the topmost level of the tree is always +exactly one, and this value is unconditionally placed into +``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55.
The rest of the levels +(if any) of the ``rcu_node`` tree are computed by dividing the maximum +number of CPUs by the fanout supported by the number of levels from the +current level down, rounding up. This computation is performed by +lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 61-63 create +initializers for lockdep lock-class names. Finally, lines 64-66 produce +an error if the maximum number of CPUs is too large for the specified +fanout. + +The ``rcu_segcblist`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_segcblist`` structure maintains a segmented list of callbacks +as follows: + +:: + + 1 #define RCU_DONE_TAIL 0 + 2 #define RCU_WAIT_TAIL 1 + 3 #define RCU_NEXT_READY_TAIL 2 + 4 #define RCU_NEXT_TAIL 3 + 5 #define RCU_CBLIST_NSEGS 4 + 6 + 7 struct rcu_segcblist { + 8 struct rcu_head *head; + 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; + 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; + 11 long len; + 12 long len_lazy; + 13 }; + +The segments are as follows: + +#. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These + callbacks are ready to be invoked. +#. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace + period. Note that different CPUs can have different ideas about which + grace period is current, hence the ``->gp_seq`` field. +#. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period + to start. +#. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a + grace period. + +The ``->head`` pointer references the first callback or is ``NULL`` if +the list contains no callbacks (which is *not* the same as being empty). +Each element of the ``->tails[]`` array references the ``->next`` +pointer of the last callback in the corresponding segment of the list, +or the list's ``->head`` pointer if that segment and all previous +segments are empty. If the corresponding segment is empty but some +previous segment is not empty, then the array element is identical to +its predecessor.
Older callbacks are closer to the head of the list, and +new callbacks are added at the tail. This relationship between the +``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown +in this diagram: + +.. kernel-figure:: nxtlist.svg + +In this figure, the ``->head`` pointer references the first RCU callback +in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the +``->head`` pointer itself, indicating that none of the callbacks is +ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references +callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2 +are both waiting on the current grace period, give or take possible +disagreements about exactly which grace period is the current one. The +``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU +callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that +there are no callbacks waiting on the next RCU grace period. The +``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next`` +pointer, indicating that all the remaining RCU callbacks have not yet +been assigned to an RCU grace period. Note that the +``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU +callback's ``->next`` pointer unless the callback list is empty, in +which case it references the ``->head`` pointer. + +There is one additional important special case for the +``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this +list is *disabled*. Lists are disabled when the corresponding CPU is +offline or when the corresponding CPU's callbacks are offloaded to a +kthread, both of which are described elsewhere. + +CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the +``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the +``RCU_DONE_TAIL`` list segments as grace periods advance. + +The ``->gp_seq[]`` array records grace-period numbers corresponding to +the list segments. 
This is what allows different CPUs to have different +ideas as to which is the current grace period while still avoiding +premature invocation of their callbacks. In particular, this allows CPUs +that go idle for extended periods to determine which of their callbacks +are ready to be invoked after reawakening. + +The ``->len`` counter contains the number of callbacks in ``->head``, +and the ``->len_lazy`` contains the number of those callbacks that are +known to only free memory, and whose invocation can therefore be safely +deferred. + +.. important:: + + It is the ``->len`` field that determines whether or + not there are callbacks associated with this ``rcu_segcblist`` + structure, *not* the ``->head`` pointer. The reason for this is that all + the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL`` + segment) are extracted all at once at callback-invocation time + (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there + are no not-done callbacks remaining in the ``rcu_segcblist``. If + callback invocation must be postponed, for example, because a + high-priority process just woke up on this CPU, then the remaining + callbacks are placed back on the ``RCU_DONE_TAIL`` segment and + ``->head`` once again points to the start of the segment. In short, the + head field can briefly be ``NULL`` even though the CPU has callbacks + present the entire time. Therefore, it is not appropriate to test the + ``->head`` pointer for ``NULL``. + +In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only +after the corresponding callbacks have been invoked. This means that the +``->len`` count is zero only if the ``rcu_segcblist`` structure really +is devoid of callbacks. Of course, off-CPU sampling of the ``->len`` +count requires careful use of appropriate synchronization, for example, +memory barriers. This synchronization can be a bit subtle, particularly +in the case of ``rcu_barrier()``. 
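To make the ``->head``/``->tails[]`` relationship described above more concrete, here is a minimal user-space sketch. The ``demo_*`` and ``DEMO_*`` names are hypothetical, the ``->gp_seq[]`` and ``->len_lazy`` fields are left out, and ``demo_cblist_next_to_wait()`` is a deliberately crude stand-in for the kernel's grace-period-driven advancement.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_DONE_TAIL        0
#define DEMO_WAIT_TAIL        1
#define DEMO_NEXT_READY_TAIL  2
#define DEMO_NEXT_TAIL        3
#define DEMO_NSEGS            4

/* Hypothetical stand-in for struct rcu_head. */
struct demo_cb {
	struct demo_cb *next;
};

/* Hypothetical, reduced version of struct rcu_segcblist. */
struct demo_segcblist {
	struct demo_cb *head;
	struct demo_cb **tails[DEMO_NSEGS];
	long len;
};

/* An empty list: every tail pointer references ->head. */
static void demo_cblist_init(struct demo_segcblist *l)
{
	int i;

	l->head = NULL;
	for (i = 0; i < DEMO_NSEGS; i++)
		l->tails[i] = &l->head;
	l->len = 0;
}

/* New callbacks always enter at the end of the NEXT segment. */
static void demo_cblist_enqueue(struct demo_segcblist *l, struct demo_cb *cb)
{
	cb->next = NULL;
	*l->tails[DEMO_NEXT_TAIL] = cb;
	l->tails[DEMO_NEXT_TAIL] = &cb->next;
	l->len++;
}

/* A segment is empty when its tail equals its predecessor's tail. */
static int demo_seg_empty(struct demo_segcblist *l, int seg)
{
	assert(seg >= 0 && seg < DEMO_NSEGS);
	if (seg == DEMO_DONE_TAIL)
		return l->tails[seg] == &l->head;
	return l->tails[seg] == l->tails[seg - 1];
}

/* Move all NEXT_READY and NEXT callbacks into the WAIT segment. */
static void demo_cblist_next_to_wait(struct demo_segcblist *l)
{
	l->tails[DEMO_WAIT_TAIL] = l->tails[DEMO_NEXT_TAIL];
	l->tails[DEMO_NEXT_READY_TAIL] = l->tails[DEMO_NEXT_TAIL];
}
```

Note that moving callbacks between segments only rewrites tail pointers; the callbacks themselves stay linked in the same order.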
+ +The ``rcu_data`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The +fields in this structure may be accessed only from the corresponding CPU +(and from tracing) unless otherwise stated. This structure is the focus +of quiescent-state detection and RCU callback queuing. It also tracks +its relationship to the corresponding leaf ``rcu_node`` structure to +allow more-efficient propagation of quiescent states up the ``rcu_node`` +combining tree. Like the ``rcu_node`` structure, it provides a local +copy of the grace-period information to allow for-free synchronized +access to this information from the corresponding CPU. Finally, this +structure records past dyntick-idle state for the corresponding CPU and +also tracks statistics. + +The ``rcu_data`` structure's fields are discussed, singly and in groups, +in the following sections. + +Connection to Other Data Structures +''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 int cpu; + 2 struct rcu_node *mynode; + 3 unsigned long grpmask; + 4 bool beenonline; + +The ``->cpu`` field contains the number of the corresponding CPU and the +``->mynode`` field references the corresponding ``rcu_node`` structure. +The ``->mynode`` is used to propagate quiescent states up the combining +tree. These two fields are constant and therefore do not require +synchronization. + +The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask`` +corresponding to this ``rcu_data`` structure, and is also used when +propagating quiescent states. The ``->beenonline`` flag is set whenever +the corresponding CPU comes online, which means that the debugfs tracing +need not dump out any ``rcu_data`` structure for which this flag is not +set. 
+ +Quiescent-State and Grace-Period Tracking +''''''''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + 2 unsigned long gp_seq_needed; + 3 bool cpu_no_qs; + 4 bool core_needs_qs; + 5 bool gpwrap; + +The ``->gp_seq`` field is the counterpart of the field of the same name +in the ``rcu_state`` and ``rcu_node`` structures. The +``->gp_seq_needed`` field is the counterpart of the field of the same +name in the ``rcu_node`` structure. They may each lag up to one behind their +``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and +``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in +dyntick-idle mode (but these counters will catch up upon exit from +dyntick-idle mode). If the lower two bits of a given ``rcu_data`` +structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure +believes that RCU is idle. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| All this replication of the grace period numbers can only cause | +| massive confusion. Why not just keep a global sequence number and be | +| done with it??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because if there were only a single global sequence number, there | +| would need to be a single global lock to allow safely accessing and | +| updating it. And if we are not going to have a single global lock, we | +| need to carefully manage the numbers on a per-node basis. Recall from | +| the answer to a previous Quick Quiz that the consequences of applying | +| a previously sampled quiescent state to the wrong grace period are | +| quite severe.
| ++-----------------------------------------------------------------------+ + +The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed +through a quiescent state, while the ``->core_needs_qs`` flag indicates +that the RCU core needs a quiescent state from the corresponding CPU. +The ``->gpwrap`` field indicates that the corresponding CPU has remained +idle for so long that the ``gp_seq`` counter is in danger of overflow, +which will cause the CPU to disregard the values of its counters on its +next exit from idle. + +RCU Callback Handling +''''''''''''''''''''' + +In the absence of CPU-hotplug events, RCU callbacks are invoked by the +same CPU that registered them. This is strictly a cache-locality +optimization: callbacks can and do get invoked on CPUs other than the +one that registered them. After all, if the CPU that registered a given +callback has gone offline before the callback can be invoked, there +really is no other choice. + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 struct rcu_segcblist cblist; + 2 long qlen_last_fqs_check; + 3 unsigned long n_cbs_invoked; + 4 unsigned long n_nocbs_invoked; + 5 unsigned long n_cbs_orphaned; + 6 unsigned long n_cbs_adopted; + 7 unsigned long n_force_qs_snap; + 8 long blimit; + +The ``->cblist`` structure is the segmented callback list described +earlier. The CPU advances the callbacks in its ``rcu_data`` structure +whenever it notices that another RCU grace period has completed. The CPU +detects the completion of an RCU grace period by noticing that the value +of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of +its leaf ``rcu_node`` structure. Recall that each ``rcu_node`` +structure's ``->gp_seq`` field is updated at the beginnings and ends of +each grace period. + +The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the +forcing of quiescent states from ``call_rcu()`` and friends when +callback lists grow excessively long. 
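The ``->gp_seq`` handling mentioned above, in which the bottom two bits hold the grace-period state and comparisons must tolerate counter wrap, can be sketched as follows. The ``demo_*`` helpers are hypothetical illustrations; the kernel has its own sequence-number helpers, which are not reproduced here.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_SEQ_CTR_SHIFT   2
#define DEMO_SEQ_STATE_MASK  ((1UL << DEMO_SEQ_CTR_SHIFT) - 1)

/* Bottom two bits: zero means idle, one means a grace period is in progress. */
static unsigned long demo_seq_state(unsigned long s)
{
	return s & DEMO_SEQ_STATE_MASK;
}

/* Remaining upper bits: the grace-period counter proper. */
static unsigned long demo_seq_ctr(unsigned long s)
{
	return s >> DEMO_SEQ_CTR_SHIFT;
}

/*
 * Wrap-tolerant "a >= b": true when a is at most half the counter
 * space ahead of b, using unsigned modulo arithmetic.
 */
static int demo_seq_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* A CPU notices grace-period activity when its copy differs from the leaf's. */
static int demo_gp_changed(unsigned long rdp_gp_seq, unsigned long rnp_gp_seq)
{
	return rdp_gp_seq != rnp_gp_seq;
}
```

With this encoding, a request recorded in ``->gp_seq_needed`` can be treated as fulfilled once ``demo_seq_ge(gp_seq, gp_seq_needed)`` holds, even if the counter wrapped in between.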
+ +The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted`` +fields count the number of callbacks invoked, sent to other CPUs when +this CPU goes offline, and received from other CPUs when those other +CPUs go offline, respectively. The ``->n_nocbs_invoked`` field is used when the CPU's +callbacks are offloaded to a kthread. + +Finally, the ``->blimit`` counter is the maximum number of RCU callbacks +that may be invoked at a given time. + +Dyntick-Idle Handling +''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 int dynticks_snap; + 2 unsigned long dynticks_fqs; + +The ``->dynticks_snap`` field is used to take a snapshot of the +corresponding CPU's dyntick-idle state when forcing quiescent states, +and is therefore accessed from other CPUs. The +``->dynticks_fqs`` field is used to count the number of times this CPU +is determined to be in dyntick-idle state, and is used for tracing and +debugging purposes. + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 long dynticks_nesting; + 2 long dynticks_nmi_nesting; + 3 atomic_t dynticks; + 4 bool rcu_need_heavy_qs; + 5 bool rcu_urgent_qs; + +These fields in the ``rcu_data`` structure maintain the per-CPU dyntick-idle +state for the corresponding CPU. The fields may be accessed only from +the corresponding CPU (and from tracing) unless otherwise stated. + +The ``->dynticks_nesting`` field counts the nesting depth of process +execution, so that in normal circumstances this counter has value zero +or one. NMIs, irqs, and tracers are counted by the +``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes +to this variable have to be undertaken carefully using an algorithm +provided by Andy Lutomirski. The initial transition from idle adds one, +and nested transitions add two, so that a nesting level of five is +represented by a ``->dynticks_nmi_nesting`` value of nine.
This counter +can therefore be thought of as counting the number of reasons why this +CPU cannot be permitted to enter dyntick-idle mode, aside from +process-level transitions. + +However, it turns out that when running in non-idle kernel context, the +Linux kernel is fully capable of entering interrupt handlers that never +exit and perhaps also vice versa. Therefore, whenever the +``->dynticks_nesting`` field is incremented up from zero, the +``->dynticks_nmi_nesting`` field is set to a large positive number, and +whenever the ``->dynticks_nesting`` field is decremented down to zero, +the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that +the number of misnested interrupts is not sufficient to overflow the +counter, this approach corrects the ``->dynticks_nmi_nesting`` field +every time the corresponding CPU enters the idle loop from process +context. + +The ``->dynticks`` field counts the corresponding CPU's transitions to +and from either dyntick-idle or user mode, so that this counter has an +even value when the CPU is in dyntick-idle mode or user mode and an odd +value otherwise. The transitions to/from user mode need to be counted +for user mode adaptive-ticks support (see timers/NO_HZ.txt). + +The ``->rcu_need_heavy_qs`` field is used to record the fact that the +RCU core code would really like to see a quiescent state from the +corresponding CPU, so much so that it is willing to call for +heavy-weight dyntick-counter operations. This flag is checked by RCU's +context-switch and ``cond_resched()`` code, which provide a momentary +idle sojourn in response. + +Finally, the ``->rcu_urgent_qs`` field is used to record the fact that +the RCU core code would really like to see a quiescent state from the +corresponding CPU, with the various other fields indicating just how +badly RCU wants this quiescent state. This flag is checked by RCU's +context-switch path (``rcu_note_context_switch()``) and the ``cond_resched()`` +code.
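The counting conventions described above for ``->dynticks_nmi_nesting`` (add one on the first transition from idle, two on each nested transition) and for ``->dynticks`` (even while in dyntick-idle or user mode) can be checked with a small sketch. The ``demo_*`` functions are hypothetical and ignore the NMI-safety concerns that the real algorithm must handle.

```c
#include <assert.h>

/* Entering an NMI or irq: the first entry from idle adds 1, nested entries add 2. */
static long demo_nmi_nesting_enter(long nesting)
{
	assert(nesting >= 0);
	return nesting == 0 ? 1 : nesting + 2;
}

/* Leaving an NMI or irq: undo whichever increment the matching entry applied. */
static long demo_nmi_nesting_exit(long nesting)
{
	assert(nesting > 0);
	return nesting == 1 ? 0 : nesting - 2;
}

/* An even ->dynticks value means dyntick-idle or user mode. */
static int demo_in_eqs(unsigned int dynticks)
{
	return (dynticks & 1) == 0;
}
```

Five nested entries starting from zero yield 1, 3, 5, 7, and then 9, matching the value of nine quoted in the text for a nesting level of five.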
+ ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why not simply combine the ``->dynticks_nesting`` and | +| ``->dynticks_nmi_nesting`` counters into a single counter that just | +| counts the number of reasons that the corresponding CPU is non-idle? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because this would fail in the presence of interrupts whose handlers | +| never return and of handlers that manage to return from a made-up | +| interrupt. | ++-----------------------------------------------------------------------+ + +Additional fields are present for some special-purpose builds, and are +discussed separately. + +The ``rcu_head`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each ``rcu_head`` structure represents an RCU callback. These structures +are normally embedded within RCU-protected data structures whose +algorithms use asynchronous grace periods. In contrast, when using +algorithms that block waiting for RCU grace periods, RCU users need not +provide ``rcu_head`` structures. + +The ``rcu_head`` structure has fields as follows: + +:: + + 1 struct rcu_head *next; + 2 void (*func)(struct rcu_head *head); + +The ``->next`` field is used to link the ``rcu_head`` structures +together in the lists within the ``rcu_data`` structures. The ``->func`` +field is a pointer to the function to be called when the callback is +ready to be invoked, and this function is passed a pointer to the +``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func`` +field to record the offset of the ``rcu_head`` structure within the +enclosing RCU-protected data structure. + +Both of these fields are used internally by RCU. From the viewpoint of +RCU users, this structure is an opaque “cookie”. 
+ ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that the callback function ``->func`` is passed a pointer to | +| the ``rcu_head`` structure, how is that function supposed to find the | +| beginning of the enclosing RCU-protected data structure? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In actual practice, there is a separate callback function per type of | +| RCU-protected data structure. The callback function can therefore use | +| the ``container_of()`` macro in the Linux kernel (or other | +| pointer-manipulation facilities in other software environments) to | +| find the beginning of the enclosing structure. | ++-----------------------------------------------------------------------+ + +RCU-Specific Fields in the ``task_struct`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in +the ``task_struct`` structure: + +:: + + 1 #ifdef CONFIG_PREEMPT_RCU + 2 int rcu_read_lock_nesting; + 3 union rcu_special rcu_read_unlock_special; + 4 struct list_head rcu_node_entry; + 5 struct rcu_node *rcu_blocked_node; + 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ + 7 #ifdef CONFIG_TASKS_RCU + 8 unsigned long rcu_tasks_nvcsw; + 9 bool rcu_tasks_holdout; + 10 struct list_head rcu_tasks_holdout_list; + 11 int rcu_tasks_idle_cpu; + 12 #endif /* #ifdef CONFIG_TASKS_RCU */ + +The ``->rcu_read_lock_nesting`` field records the nesting level for RCU +read-side critical sections, and the ``->rcu_read_unlock_special`` field +is a bitmask that records special conditions that require +``rcu_read_unlock()`` to do additional work. 
The ``->rcu_node_entry`` +field is used to form lists of tasks that have blocked within +preemptible-RCU read-side critical sections and the +``->rcu_blocked_node`` field references the ``rcu_node`` structure whose +list this task is a member of, or ``NULL`` if it is not blocked within a +preemptible-RCU read-side critical section. + +The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context +switches that this task had undergone at the beginning of the current +tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current +tasks-RCU grace period is waiting on this task, +``->rcu_tasks_holdout_list`` is a list element enqueuing this task on +the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this +idle task is running, but only if the task is currently running, that +is, if the CPU is currently idle. + +Accessor Functions +~~~~~~~~~~~~~~~~~~ + +The following listing shows the ``rcu_get_root()``, +``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()`` +function and macros: + +:: + + 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) + 2 { + 3 return &rsp->node[0]; + 4 } + 5 + 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ + 7 for ((rnp) = &(rsp)->node[0]; \ + 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + 9 + 10 #define rcu_for_each_leaf_node(rsp, rnp) \ + 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ + 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + +The ``rcu_get_root()`` simply returns a pointer to the first element of +the specified ``rcu_state`` structure's ``->node[]`` array, which is the +root ``rcu_node`` structure. + +As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes +advantage of the layout of the ``rcu_node`` structures in the +``rcu_state`` structure's ``->node[]`` array, performing a breadth-first +traversal by simply traversing the array in order. 
Similarly, the +``rcu_for_each_leaf_node()`` macro traverses only the last part of the +array, thus traversing only the leaf ``rcu_node`` structures. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree | +| contains only a single node? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the | +| single node. | ++-----------------------------------------------------------------------+ + +Summary +~~~~~~~ + +So the state of RCU is represented by an ``rcu_state`` structure, which +contains a combining tree of ``rcu_node`` and ``rcu_data`` structures. +Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state +is tracked by dynticks-related fields in the ``rcu_data`` structure. If +you made it this far, you are well prepared to read the code +walkthroughs in the other articles in this series. + +Acknowledgments +~~~~~~~~~~~~~~~ + +I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul +Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for +helping me get this document into a more human-readable state. + +Legal Statement +~~~~~~~~~~~~~~~ + +This work represents the view of the author and does not necessarily +represent the view of IBM. + +Linux is a registered trademark of Linus Torvalds. + +Other company, product, and service names may be trademarks or service +marks of others. 
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html deleted file mode 100644 index 57300db4b5ff..000000000000 --- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html +++ /dev/null @@ -1,668 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> - <html> - <head><title>A Tour Through TREE_RCU's Expedited Grace Periods</title> - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> - -<h2>Introduction</h2> - -This document describes RCU's expedited grace periods. -Unlike RCU's normal grace periods, which accept long latencies to attain -high efficiency and minimal disturbance, expedited grace periods accept -lower efficiency and significant disturbance to attain shorter latencies. - -<p> -There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier -third RCU-bh flavor having been implemented in terms of the other two. -Each of the two implementations is covered in its own section. - -<ol> -<li> <a href="#Expedited Grace Period Design"> - Expedited Grace Period Design</a> -<li> <a href="#RCU-preempt Expedited Grace Periods"> - RCU-preempt Expedited Grace Periods</a> -<li> <a href="#RCU-sched Expedited Grace Periods"> - RCU-sched Expedited Grace Periods</a> -<li> <a href="#Expedited Grace Period and CPU Hotplug"> - Expedited Grace Period and CPU Hotplug</a> -<li> <a href="#Expedited Grace Period Refinements"> - Expedited Grace Period Refinements</a> -</ol> - -<h2><a name="Expedited Grace Period Design"> -Expedited Grace Period Design</a></h2> - -<p> -The expedited RCU grace periods cannot be accused of being subtle, -given that they for all intents and purposes hammer every CPU that -has not yet provided a quiescent state for the current expedited -grace period. 
-The one saving grace is that the hammer has grown a bit smaller -over time: The old call to <tt>try_stop_cpus()</tt> has been -replaced with a set of calls to <tt>smp_call_function_single()</tt>, -each of which results in an IPI to the target CPU. -The corresponding handler function checks the CPU's state, motivating -a faster quiescent state where possible, and triggering a report -of that quiescent state. -As always for RCU, once everything has spent some time in a quiescent -state, the expedited grace period has completed. - -<p> -The details of the <tt>smp_call_function_single()</tt> handler's -operation depend on the RCU flavor, as described in the following -sections. - -<h2><a name="RCU-preempt Expedited Grace Periods"> -RCU-preempt Expedited Grace Periods</a></h2> - -<p> -<tt>CONFIG_PREEMPT=y</tt> kernels implement RCU-preempt. -The overall flow of the handling of a given CPU by an RCU-preempt -expedited grace period is shown in the following diagram: - -<p><img src="ExpRCUFlow.svg" alt="ExpRCUFlow.svg" width="55%"> - -<p> -The solid arrows denote direct action, for example, a function call. -The dotted arrows denote indirect action, for example, an IPI -or a state that is reached after some time. - -<p> -If a given CPU is offline or idle, <tt>synchronize_rcu_expedited()</tt> -will ignore it because idle and offline CPUs are already residing -in quiescent states. -Otherwise, the expedited grace period will use -<tt>smp_call_function_single()</tt> to send the CPU an IPI, which -is handled by <tt>rcu_exp_handler()</tt>. - -<p> -However, because this is preemptible RCU, <tt>rcu_exp_handler()</tt> -can check to see if the CPU is currently running in an RCU read-side -critical section. -If not, the handler can immediately report a quiescent state. -Otherwise, it sets flags so that the outermost <tt>rcu_read_unlock()</tt> -invocation will provide the needed quiescent-state report. 
-This flag-setting avoids the previous forced preemption of all -CPUs that might have RCU read-side critical sections. -In addition, this flag-setting is done so as to avoid increasing -the overhead of the common-case fastpath through the scheduler. - -<p> -Again because this is preemptible RCU, an RCU read-side critical section -can be preempted. -When that happens, RCU will enqueue the task, which will the continue to -block the current expedited grace period until it resumes and finds its -outermost <tt>rcu_read_unlock()</tt>. -The CPU will report a quiescent state just after enqueuing the task because -the CPU is no longer blocking the grace period. -It is instead the preempted task doing the blocking. -The list of blocked tasks is managed by <tt>rcu_preempt_ctxt_queue()</tt>, -which is called from <tt>rcu_preempt_note_context_switch()</tt>, which -in turn is called from <tt>rcu_note_context_switch()</tt>, which in -turn is called from the scheduler. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why not just have the expedited grace period check the - state of all the CPUs? - After all, that would avoid all those real-time-unfriendly IPIs. -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because we want the RCU read-side critical sections to run fast, - which means no memory barriers. - Therefore, it is not possible to safely check the state from some - other CPU. - And even if it was possible to safely check the state, it would - still be necessary to IPI the CPU to safely interact with the - upcoming <tt>rcu_read_unlock()</tt> invocation, which means that - the remote state testing would not help the worst-case - latency that real-time applications care about. - - <p><font color="ffffff">One way to prevent your real-time - application from getting hit with these IPIs is to - build your kernel with <tt>CONFIG_NO_HZ_FULL=y</tt>. 
- RCU would then perceive the CPU running your application - as being idle, and it would be able to safely detect that - state without needing to IPI the CPU. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -Please note that this is just the overall flow: -Additional complications can arise due to races with CPUs going idle -or offline, among other things. - -<h2><a name="RCU-sched Expedited Grace Periods"> -RCU-sched Expedited Grace Periods</a></h2> - -<p> -<tt>CONFIG_PREEMPT=n</tt> kernels implement RCU-sched. -The overall flow of the handling of a given CPU by an RCU-sched -expedited grace period is shown in the following diagram: - -<p><img src="ExpSchedFlow.svg" alt="ExpSchedFlow.svg" width="55%"> - -<p> -As with RCU-preempt, RCU-sched's -<tt>synchronize_rcu_expedited()</tt> ignores offline and -idle CPUs, again because they are in remotely detectable -quiescent states. -However, because the -<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> -leave no trace of their invocation, in general it is not possible to tell -whether or not the current CPU is in an RCU read-side critical section. -The best that RCU-sched's <tt>rcu_exp_handler()</tt> can do is to check -for idle, on the off-chance that the CPU went idle while the IPI -was in flight. -If the CPU is idle, then <tt>rcu_exp_handler()</tt> reports -the quiescent state. - -<p> Otherwise, the handler forces a future context switch by setting the -NEED_RESCHED flag of the current task's thread flag and the CPU preempt -counter. -At the time of the context switch, the CPU reports the quiescent state. -Should the CPU go offline first, it will report the quiescent state -at that time. - -<h2><a name="Expedited Grace Period and CPU Hotplug"> -Expedited Grace Period and CPU Hotplug</a></h2> - -<p> -The expedited nature of expedited grace periods require a much tighter -interaction with CPU hotplug operations than is required for normal -grace periods. 
-In addition, attempting to IPI offline CPUs will result in splats, but -failing to IPI online CPUs can result in too-short grace periods. -Neither option is acceptable in production kernels. - -<p> -The interaction between expedited grace periods and CPU hotplug operations -is carried out at several levels: - -<ol> -<li> The number of CPUs that have ever been online is tracked - by the <tt>rcu_state</tt> structure's <tt>->ncpus</tt> - field. - The <tt>rcu_state</tt> structure's <tt>->ncpus_snap</tt> - field tracks the number of CPUs that have ever been online - at the beginning of an RCU expedited grace period. - Note that this number never decreases, at least in the absence - of a time machine. -<li> The identities of the CPUs that have ever been online is - tracked by the <tt>rcu_node</tt> structure's - <tt>->expmaskinitnext</tt> field. - The <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> - field tracks the identities of the CPUs that were online - at least once at the beginning of the most recent RCU - expedited grace period. - The <tt>rcu_state</tt> structure's <tt>->ncpus</tt> and - <tt>->ncpus_snap</tt> fields are used to detect when - new CPUs have come online for the first time, that is, - when the <tt>rcu_node</tt> structure's <tt>->expmaskinitnext</tt> - field has changed since the beginning of the last RCU - expedited grace period, which triggers an update of each - <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> - field from its <tt>->expmaskinitnext</tt> field. -<li> Each <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> - field is used to initialize that structure's - <tt>->expmask</tt> at the beginning of each RCU - expedited grace period. - This means that only those CPUs that have been online at least - once will be considered for a given grace period. -<li> Any CPU that goes offline will clear its bit in its leaf - <tt>rcu_node</tt> structure's <tt>->qsmaskinitnext</tt> - field, so any CPU with that bit clear can safely be ignored. 
- However, it is possible for a CPU coming online or going offline - to have this bit set for some time while <tt>cpu_online</tt> - returns <tt>false</tt>. -<li> For each non-idle CPU that RCU believes is currently online, the grace - period invokes <tt>smp_call_function_single()</tt>. - If this succeeds, the CPU was fully online. - Failure indicates that the CPU is in the process of coming online - or going offline, in which case it is necessary to wait for a - short time period and try again. - The purpose of this wait (or series of waits, as the case may be) - is to permit a concurrent CPU-hotplug operation to complete. -<li> In the case of RCU-sched, one of the last acts of an outgoing CPU - is to invoke <tt>rcu_report_dead()</tt>, which - reports a quiescent state for that CPU. - However, this is likely paranoia-induced redundancy. <!-- @@@ --> -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why all the dancing around with multiple counters and masks - tracking CPUs that were once online? - Why not just have a single set of masks tracking the currently - online CPUs and be done with it? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Maintaining single set of masks tracking the online CPUs <i>sounds</i> - easier, at least until you try working out all the race conditions - between grace-period initialization and CPU-hotplug operations. - For example, suppose initialization is progressing down the - tree while a CPU-offline operation is progressing up the tree. - This situation can result in bits set at the top of the tree - that have no counterparts at the bottom of the tree. - Those bits will never be cleared, which will result in - grace-period hangs. - In short, that way lies madness, to say nothing of a great many - bugs, hangs, and deadlocks. 
- - <p><font color="ffffff"> - In contrast, the current multi-mask multi-counter scheme ensures - that grace-period initialization will always see consistent masks - up and down the tree, which brings significant simplifications - over the single-mask method. - - <p><font color="ffffff"> - This is an instance of - <a href="http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1992/cucs-039-92.ps.gz"><font color="ffffff"> - deferring work in order to avoid synchronization</a>. - Lazily recording CPU-hotplug events at the beginning of the next - grace period greatly simplifies maintenance of the CPU-tracking - bitmasks in the <tt>rcu_node</tt> tree. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h2><a name="Expedited Grace Period Refinements"> -Expedited Grace Period Refinements</a></h2> - -<ol> -<li> <a href="#Idle-CPU Checks">Idle-CPU checks</a>. -<li> <a href="#Batching via Sequence Counter"> - Batching via sequence counter</a>. -<li> <a href="#Funnel Locking and Wait/Wakeup"> - Funnel locking and wait/wakeup</a>. -<li> <a href="#Use of Workqueues">Use of Workqueues</a>. -<li> <a href="#Stall Warnings">Stall warnings</a>. -<li> <a href="#Mid-Boot Operation">Mid-boot operation</a>. -</ol> - -<h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3> - -<p> -Each expedited grace period checks for idle CPUs when initially forming -the mask of CPUs to be IPIed and again just before IPIing a CPU -(both checks are carried out by <tt>sync_rcu_exp_select_cpus()</tt>). -If the CPU is idle at any time between those two times, the CPU will -not be IPIed. -Instead, the task pushing the grace period forward will include the -idle CPUs in the mask passed to <tt>rcu_report_exp_cpu_mult()</tt>. - -<p> -For RCU-sched, there is an additional check: -If the IPI has interrupted the idle loop, then -<tt>rcu_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt> -to report the corresponding quiescent state. 
- -<p> -For RCU-preempt, there is no specific check for idle in the -IPI handler (<tt>rcu_exp_handler()</tt>), but because -RCU read-side critical sections are not permitted within the -idle loop, if <tt>rcu_exp_handler()</tt> sees that the CPU is within -RCU read-side critical section, the CPU cannot possibly be idle. -Otherwise, <tt>rcu_exp_handler()</tt> invokes -<tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent -state, regardless of whether or not that quiescent state was due to -the CPU being idle. - -<p> -In summary, RCU expedited grace periods check for idle when building -the bitmask of CPUs that must be IPIed, just before sending each IPI, -and (either explicitly or implicitly) within the IPI handler. - -<h3><a name="Batching via Sequence Counter"> -Batching via Sequence Counter</a></h3> - -<p> -If each grace-period request was carried out separately, expedited -grace periods would have abysmal scalability and -problematic high-load characteristics. -Because each grace-period operation can serve an unlimited number of -updates, it is important to <i>batch</i> requests, so that a single -expedited grace-period operation will cover all requests in the -corresponding batch. - -<p> -This batching is controlled by a sequence counter named -<tt>->expedited_sequence</tt> in the <tt>rcu_state</tt> structure. -This counter has an odd value when there is an expedited grace period -in progress and an even value otherwise, so that dividing the counter -value by two gives the number of completed grace periods. -During any given update request, the counter must transition from -even to odd and then back to even, thus indicating that a grace -period has elapsed. -Therefore, if the initial value of the counter is <tt>s</tt>, -the updater must wait until the counter reaches at least the -value <tt>(s+3)&~0x1</tt>. 
-This counter is managed by the following access functions: - -<ol> -<li> <tt>rcu_exp_gp_seq_start()</tt>, which marks the start of - an expedited grace period. -<li> <tt>rcu_exp_gp_seq_end()</tt>, which marks the end of an - expedited grace period. -<li> <tt>rcu_exp_gp_seq_snap()</tt>, which obtains a snapshot of - the counter. -<li> <tt>rcu_exp_gp_seq_done()</tt>, which returns <tt>true</tt> - if a full expedited grace period has elapsed since the - corresponding call to <tt>rcu_exp_gp_seq_snap()</tt>. -</ol> - -<p> -Again, only one request in a given batch need actually carry out -a grace-period operation, which means there must be an efficient -way to identify which of many concurrent requests will initiate -the grace period, and that there be an efficient way for the -remaining requests to wait for that grace period to complete. -However, that is the topic of the next section. - -<h3><a name="Funnel Locking and Wait/Wakeup"> -Funnel Locking and Wait/Wakeup</a></h3> - -<p> -The natural way to sort out which of a batch of updaters will initiate -the expedited grace period is to use the <tt>rcu_node</tt> combining -tree, as implemented by the <tt>exp_funnel_lock()</tt> function. -The first updater corresponding to a given grace period arriving -at a given <tt>rcu_node</tt> structure records its desired grace-period -sequence number in the <tt>->exp_seq_rq</tt> field and moves up -to the next level in the tree. -Otherwise, if the <tt>->exp_seq_rq</tt> field already contains -the sequence number for the desired grace period or some later one, -the updater blocks on one of four wait queues in the -<tt>->exp_wq[]</tt> array, using the second-from-bottom -and third-from-bottom bits as an index. -An <tt>->exp_lock</tt> field in the <tt>rcu_node</tt> structure -synchronizes access to these fields.
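The counter arithmetic from the batching discussion, together with the wait-queue indexing just described, can be sketched in user-space C. This is a minimal model under stated assumptions: the ``model_*()`` helpers are hypothetical and only mirror the roles of the access functions listed earlier, and the wrap-tolerant comparison is one common idiom rather than a statement about the kernel's exact code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Mark the start of a grace period: counter goes from even to odd. */
static void model_seq_start(unsigned long *sp)
{
	assert(!(*sp & 0x1UL));
	(*sp)++;
}

/* Mark the end of a grace period: counter goes from odd to even. */
static void model_seq_end(unsigned long *sp)
{
	assert(*sp & 0x1UL);
	(*sp)++;
}

/* Value the counter must reach before a full grace period has elapsed. */
static unsigned long model_seq_snap(const unsigned long *sp)
{
	return (*sp + 3) & ~0x1UL;
}

/* True once the counter has reached snapshot s, tolerating wraparound. */
static bool model_seq_done(const unsigned long *sp, unsigned long s)
{
	return ULONG_MAX / 2 >= *sp - s;
}

/* Index into the four wait queues: second- and third-from-bottom bits. */
static unsigned int model_wq_index(unsigned long s)
{
	return (unsigned int)((s >> 1) & 0x3UL);
}
```

Starting from a counter value of zero, the snapshot is two and the wait-queue index is one. A snapshot taken while that grace period is in progress is four, so such a request has to wait for the following grace period.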
- -<p> -An empty <tt>rcu_node</tt> tree is shown in the following diagram, -with the white cells representing the <tt>->exp_seq_rq</tt> field -and the red cells representing the elements of the -<tt>->exp_wq[]</tt> array. - -<p><img src="Funnel0.svg" alt="Funnel0.svg" width="75%"> - -<p> -The next diagram shows the situation after the arrival of Task A -and Task B at the leftmost and rightmost leaf <tt>rcu_node</tt> -structures, respectively. -The current value of the <tt>rcu_state</tt> structure's -<tt>->expedited_sequence</tt> field is zero, so adding three and -clearing the bottom bit results in the value two, which both tasks -record in the <tt>->exp_seq_rq</tt> field of their respective -<tt>rcu_node</tt> structures: - -<p><img src="Funnel1.svg" alt="Funnel1.svg" width="75%"> - -<p> -Each of Tasks A and B will move up to the root -<tt>rcu_node</tt> structure. -Suppose that Task A wins, recording its desired grace-period sequence -number and resulting in the state shown below: - -<p><img src="Funnel2.svg" alt="Funnel2.svg" width="75%"> - -<p> -Task A now advances to initiate a new grace period, while Task B -moves up to the root <tt>rcu_node</tt> structure, and, seeing that -its desired sequence number is already recorded, blocks on -<tt>->exp_wq[1]</tt>. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why <tt>->exp_wq[1]</tt>? - Given that the value of these tasks' desired sequence number is - two, so shouldn't they instead block on <tt>->exp_wq[2]</tt>? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No. - - <p><font color="ffffff"> - Recall that the bottom bit of the desired sequence number indicates - whether or not a grace period is currently in progress. - It is therefore necessary to shift the sequence number right one - bit position to obtain the number of the grace period. - This results in <tt>->exp_wq[1]</tt>. 
-</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -If Tasks C and D also arrive at this point, they will compute the -same desired grace-period sequence number, and see that both leaf -<tt>rcu_node</tt> structures already have that value recorded. -They will therefore block on their respective <tt>rcu_node</tt> -structures' <tt>->exp_wq[1]</tt> fields, as shown below: - -<p><img src="Funnel3.svg" alt="Funnel3.svg" width="75%"> - -<p> -Task A now acquires the <tt>rcu_state</tt> structure's -<tt>->exp_mutex</tt> and initiates the grace period, which -increments <tt>->expedited_sequence</tt>. -Therefore, if Tasks E and F arrive, they will compute -a desired sequence number of 4 and will record this value as -shown below: - -<p><img src="Funnel4.svg" alt="Funnel4.svg" width="75%"> - -<p> -Tasks E and F will propagate up the <tt>rcu_node</tt> -combining tree, with Task F blocking on the root <tt>rcu_node</tt> -structure and Task E waiting for Task A to finish so that -it can start the next grace period. -The resulting state is as shown below: - -<p><img src="Funnel5.svg" alt="Funnel5.svg" width="75%"> - -<p> -Once the grace period completes, Task A -starts waking up the tasks waiting for this grace period to complete, -increments the <tt>->expedited_sequence</tt>, -acquires the <tt>->exp_wake_mutex</tt> and then releases the -<tt>->exp_mutex</tt>. -This results in the following state: - -<p><img src="Funnel6.svg" alt="Funnel6.svg" width="75%"> - -<p> -Task E can then acquire <tt>->exp_mutex</tt> and increment -<tt>->expedited_sequence</tt> to the value three. -If new tasks G and H arrive and move up the combining tree at the -same time, the state will be as follows: - -<p><img src="Funnel7.svg" alt="Funnel7.svg" width="75%"> - -<p> -Note that three of the root <tt>rcu_node</tt> structure's -waitqueues are now occupied.
-However, at some point, Task A will wake up the -tasks blocked on the <tt>->exp_wq</tt> waitqueues, resulting -in the following state: - -<p><img src="Funnel8.svg" alt="Funnel8.svg" width="75%"> - -<p> -Execution will continue with Tasks E and H completing -their grace periods and carrying out their wakeups. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - What happens if Task A takes so long to do its wakeups - that Task E's grace period completes? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Then Task E will block on the <tt>->exp_wake_mutex</tt>, - which will also prevent it from releasing <tt>->exp_mutex</tt>, - which in turn will prevent the next grace period from starting. - This last is important in preventing overflow of the - <tt>->exp_wq[]</tt> array. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Use of Workqueues">Use of Workqueues</a></h3> - -<p> -In earlier implementations, the task requesting the expedited -grace period also drove it to completion. -This straightforward approach had the disadvantage of needing to -account for POSIX signals sent to user tasks, -so more recent implementations use the Linux kernel's -<a href="https://www.kernel.org/doc/Documentation/core-api/workqueue.rst">workqueues</a>. - -<p> -The requesting task still does counter snapshotting and funnel-lock -processing, but the task reaching the top of the funnel lock -does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>) -so that a workqueue kthread does the actual grace-period processing. -Because workqueue kthreads do not accept POSIX signals, grace-period-wait -processing need not allow for POSIX signals. - -In addition, this approach allows wakeups for the previous expedited -grace period to be overlapped with processing for the next expedited -grace period.
-Because there are only four sets of waitqueues, it is necessary to -ensure that the previous grace period's wakeups complete before the -next grace period's wakeups start. -This is handled by having the <tt>->exp_mutex</tt> -guard expedited grace-period processing and the -<tt>->exp_wake_mutex</tt> guard wakeups. -The key point is that the <tt>->exp_mutex</tt> is not released -until the first wakeup is complete, which means that the -<tt>->exp_wake_mutex</tt> has already been acquired at that point. -This approach ensures that the previous grace period's wakeups can -be carried out while the current grace period is in process, but -that these wakeups will complete before the next grace period starts. -This means that only three waitqueues are required, guaranteeing that -the four that are provided are sufficient. - -<h3><a name="Stall Warnings">Stall Warnings</a></h3> - -<p> -Expediting grace periods does nothing to speed things up when RCU -readers take too long, and therefore expedited grace periods check -for stalls just as normal grace periods do. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But why not just let the normal grace-period machinery - detect the stalls, given that a given reader must block - both normal and expedited grace periods? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because it is quite possible that at a given time there - is no normal grace period in progress, in which case the - normal grace period cannot emit a stall warning. -</font></td></tr> -<tr><td> </td></tr> -</table> - -The <tt>synchronize_sched_expedited_wait()</tt> function loops waiting -for the expedited grace period to end, but with a timeout set to the -current RCU CPU stall-warning time. -If this time is exceeded, any CPUs or <tt>rcu_node</tt> structures -blocking the current grace period are printed. 
-Each stall warning results in another pass through the loop, but the -second and subsequent passes use longer stall times. - -<h3><a name="Mid-Boot Operation">Mid-boot operation</a></h3> - -<p> -The use of workqueues has the advantage that the expedited -grace-period code need not worry about POSIX signals. -Unfortunately, it has the -corresponding disadvantage that workqueues cannot be used until -they are initialized, which does not happen until some time after -the scheduler spawns the first task. -Given that there are parts of the kernel that really do want to -execute grace periods during this mid-boot “dead zone”, -expedited grace periods must do something else during this time. - -<p> -What they do is to fall back to the old practice of requiring that the -requesting task drive the expedited grace period, as was the case -before the use of workqueues. -However, the requesting task is only required to drive the grace period -during the mid-boot dead zone. -Before mid-boot, a synchronous grace period is a no-op. -Some time after mid-boot, workqueues are used. - -<p> -Non-expedited non-SRCU synchronous grace periods must also operate -normally during mid-boot. -This is handled by causing non-expedited grace periods to take the -expedited code path during mid-boot. - -<p> -The current code assumes that there are no POSIX signals during -the mid-boot dead zone. -However, if an overwhelming need for POSIX signals somehow arises, -appropriate adjustments can be made to the expedited stall-warning code. -One such adjustment would reinstate the pre-workqueue stall-warning -checks, but only during the mid-boot dead zone. - -<p> -With this refinement, synchronous grace periods can now be used from -task context pretty much any time during the life of the kernel. -That is, aside from some points in the suspend, hibernate, or shutdown -code path.
- -<h3><a name="Summary"> -Summary</a></h3> - -<p> -Expedited grace periods use a sequence-number approach to promote -batching, so that a single grace-period operation can serve numerous -requests. -A funnel lock is used to efficiently identify the one task out of -a concurrent group that will request the grace period. -All members of the group will block on waitqueues provided in -the <tt>rcu_node</tt> structure. -The actual grace-period processing is carried out by a workqueue. - -<p> -CPU-hotplug operations are noted lazily in order to prevent the need -for tight synchronization between expedited grace periods and -CPU-hotplug operations. -The dyntick-idle counters are used to avoid sending IPIs to idle CPUs, -at least in the common case. -RCU-preempt and RCU-sched use different IPI handlers and different -code to respond to the state changes carried out by those handlers, -but otherwise use common code. - -<p> -Quiescent states are tracked using the <tt>rcu_node</tt> tree, -and once all necessary quiescent states have been reported, -all tasks waiting on this expedited grace period are awakened. -A pair of mutexes are used to allow one grace period's wakeups -to proceed concurrently with the next grace period's processing. - -<p> -This combination of mechanisms allows expedited grace periods to -run reasonably efficiently. -However, for non-time-critical tasks, normal grace periods should be -used instead because their longer duration permits much higher -degrees of batching, and thus much lower per-request overheads. 
- -</body></html> diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst new file mode 100644 index 000000000000..72f0f6fbd53c --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst @@ -0,0 +1,521 @@ +================================================= +A Tour Through TREE_RCU's Expedited Grace Periods +================================================= + +Introduction +============ + +This document describes RCU's expedited grace periods. +Unlike RCU's normal grace periods, which accept long latencies to attain +high efficiency and minimal disturbance, expedited grace periods accept +lower efficiency and significant disturbance to attain shorter latencies. + +There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier +third RCU-bh flavor having been implemented in terms of the other two. +Each of the two implementations is covered in its own section. + +Expedited Grace Period Design +============================= + +The expedited RCU grace periods cannot be accused of being subtle, +given that they for all intents and purposes hammer every CPU that +has not yet provided a quiescent state for the current expedited +grace period. +The one saving grace is that the hammer has grown a bit smaller +over time: The old call to ``try_stop_cpus()`` has been +replaced with a set of calls to ``smp_call_function_single()``, +each of which results in an IPI to the target CPU. +The corresponding handler function checks the CPU's state, motivating +a faster quiescent state where possible, and triggering a report +of that quiescent state. +As always for RCU, once everything has spent some time in a quiescent +state, the expedited grace period has completed. + +The details of the ``smp_call_function_single()`` handler's +operation depend on the RCU flavor, as described in the following +sections. 
+ +RCU-preempt Expedited Grace Periods +=================================== + +``CONFIG_PREEMPT=y`` kernels implement RCU-preempt. +The overall flow of the handling of a given CPU by an RCU-preempt +expedited grace period is shown in the following diagram: + +.. kernel-figure:: ExpRCUFlow.svg + +The solid arrows denote direct action, for example, a function call. +The dotted arrows denote indirect action, for example, an IPI +or a state that is reached after some time. + +If a given CPU is offline or idle, ``synchronize_rcu_expedited()`` +will ignore it because idle and offline CPUs are already residing +in quiescent states. +Otherwise, the expedited grace period will use +``smp_call_function_single()`` to send the CPU an IPI, which +is handled by ``rcu_exp_handler()``. + +However, because this is preemptible RCU, ``rcu_exp_handler()`` +can check to see if the CPU is currently running in an RCU read-side +critical section. +If not, the handler can immediately report a quiescent state. +Otherwise, it sets flags so that the outermost ``rcu_read_unlock()`` +invocation will provide the needed quiescent-state report. +This flag-setting avoids the previous forced preemption of all +CPUs that might have RCU read-side critical sections. +In addition, this flag-setting is done so as to avoid increasing +the overhead of the common-case fastpath through the scheduler. + +Again because this is preemptible RCU, an RCU read-side critical section +can be preempted. +When that happens, RCU will enqueue the task, which will then continue to +block the current expedited grace period until it resumes and finds its +outermost ``rcu_read_unlock()``. +The CPU will report a quiescent state just after enqueuing the task because +the CPU is no longer blocking the grace period. +It is instead the preempted task doing the blocking.
+The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``, +which is called from ``rcu_preempt_note_context_switch()``, which +in turn is called from ``rcu_note_context_switch()``, which in +turn is called from the scheduler. + + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why not just have the expedited grace period check the state of all | +| the CPUs? After all, that would avoid all those real-time-unfriendly | +| IPIs. | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because we want the RCU read-side critical sections to run fast, | +| which means no memory barriers. Therefore, it is not possible to | +| safely check the state from some other CPU. And even if it was | +| possible to safely check the state, it would still be necessary to | +| IPI the CPU to safely interact with the upcoming | +| ``rcu_read_unlock()`` invocation, which means that the remote state | +| testing would not help the worst-case latency that real-time | +| applications care about. | +| | +| One way to prevent your real-time application from getting hit with | +| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU | +| would then perceive the CPU running your application as being idle, | +| and it would be able to safely detect that state without needing to | +| IPI the CPU. | ++-----------------------------------------------------------------------+ + +Please note that this is just the overall flow: Additional complications +can arise due to races with CPUs going idle or offline, among other +things. + +RCU-sched Expedited Grace Periods +--------------------------------- + +``CONFIG_PREEMPT=n`` kernels implement RCU-sched. 
The overall flow of +the handling of a given CPU by an RCU-sched expedited grace period is +shown in the following diagram: + +.. kernel-figure:: ExpSchedFlow.svg + +As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores +offline and idle CPUs, again because they are in remotely detectable +quiescent states. However, because the ``rcu_read_lock_sched()`` and +``rcu_read_unlock_sched()`` leave no trace of their invocation, in +general it is not possible to tell whether or not the current CPU is in +an RCU read-side critical section. The best that RCU-sched's +``rcu_exp_handler()`` can do is to check for idle, on the off-chance +that the CPU went idle while the IPI was in flight. If the CPU is idle, +then ``rcu_exp_handler()`` reports the quiescent state. + +Otherwise, the handler forces a future context switch by setting the +NEED_RESCHED flag of the current task's thread flag and the CPU preempt +counter. At the time of the context switch, the CPU reports the +quiescent state. Should the CPU go offline first, it will report the +quiescent state at that time. + +Expedited Grace Period and CPU Hotplug +-------------------------------------- + +The expedited nature of expedited grace periods requires a much tighter +interaction with CPU hotplug operations than is required for normal +grace periods. In addition, attempting to IPI offline CPUs will result +in splats, but failing to IPI online CPUs can result in too-short grace +periods. Neither option is acceptable in production kernels. + +The interaction between expedited grace periods and CPU hotplug +operations is carried out at several levels: + +#. The number of CPUs that have ever been online is tracked by the + ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state`` + structure's ``->ncpus_snap`` field tracks the number of CPUs that + have ever been online at the beginning of an RCU expedited grace + period.
Note that this number never decreases, at least in the + absence of a time machine. +#. The identities of the CPUs that have ever been online are tracked by + the ``rcu_node`` structure's ``->expmaskinitnext`` field. The + ``rcu_node`` structure's ``->expmaskinit`` field tracks the + identities of the CPUs that were online at least once at the + beginning of the most recent RCU expedited grace period. The + ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are + used to detect when new CPUs have come online for the first time, + that is, when the ``rcu_node`` structure's ``->expmaskinitnext`` + field has changed since the beginning of the last RCU expedited grace + period, which triggers an update of each ``rcu_node`` structure's + ``->expmaskinit`` field from its ``->expmaskinitnext`` field. +#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to + initialize that structure's ``->expmask`` at the beginning of each + RCU expedited grace period. This means that only those CPUs that have + been online at least once will be considered for a given grace + period. +#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node`` + structure's ``->qsmaskinitnext`` field, so any CPU with that bit + clear can safely be ignored. However, it is possible for a CPU coming + online or going offline to have this bit set for some time while + ``cpu_online`` returns ``false``. +#. For each non-idle CPU that RCU believes is currently online, the + grace period invokes ``smp_call_function_single()``. If this + succeeds, the CPU was fully online. Failure indicates that the CPU is + in the process of coming online or going offline, in which case it is + necessary to wait for a short time period and try again. The purpose + of this wait (or series of waits, as the case may be) is to permit a + concurrent CPU-hotplug operation to complete. +#.
In the case of RCU-sched, one of the last acts of an outgoing CPU is + to invoke ``rcu_report_dead()``, which reports a quiescent state for + that CPU. However, this is likely paranoia-induced redundancy. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why all the dancing around with multiple counters and masks tracking | +| CPUs that were once online? Why not just have a single set of masks | +| tracking the currently online CPUs and be done with it? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Maintaining a single set of masks tracking the online CPUs *sounds* | +| easier, at least until you try working out all the race conditions | +| between grace-period initialization and CPU-hotplug operations. For | +| example, suppose initialization is progressing down the tree while a | +| CPU-offline operation is progressing up the tree. This situation can | +| result in bits set at the top of the tree that have no counterparts | +| at the bottom of the tree. Those bits will never be cleared, which | +| will result in grace-period hangs. In short, that way lies madness, | +| to say nothing of a great many bugs, hangs, and deadlocks. | +| In contrast, the current multi-mask multi-counter scheme ensures that | +| grace-period initialization will always see consistent masks up and | +| down the tree, which brings significant simplifications over the | +| single-mask method. | +| | +| This is an instance of `deferring work in order to avoid | +| synchronization <http://www.cs.columbia.edu/~library/TR-repository/re | +| ports/reports-1992/cucs-039-92.ps.gz>`__.
| +| Lazily recording CPU-hotplug events at the beginning of the next | +| grace period greatly simplifies maintenance of the CPU-tracking | +| bitmasks in the ``rcu_node`` tree. | ++-----------------------------------------------------------------------+ + +Expedited Grace Period Refinements +---------------------------------- + +Idle-CPU Checks +~~~~~~~~~~~~~~~ + +Each expedited grace period checks for idle CPUs when initially forming +the mask of CPUs to be IPIed and again just before IPIing a CPU (both +checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is +idle at any time between those two times, the CPU will not be IPIed. +Instead, the task pushing the grace period forward will include the idle +CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``. + +For RCU-sched, there is an additional check: If the IPI has interrupted +the idle loop, then ``rcu_exp_handler()`` invokes +``rcu_report_exp_rdp()`` to report the corresponding quiescent state. + +For RCU-preempt, there is no specific check for idle in the IPI handler +(``rcu_exp_handler()``), but because RCU read-side critical sections are +not permitted within the idle loop, if ``rcu_exp_handler()`` sees that +the CPU is within RCU read-side critical section, the CPU cannot +possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes +``rcu_report_exp_rdp()`` to report the corresponding quiescent state, +regardless of whether or not that quiescent state was due to the CPU +being idle. + +In summary, RCU expedited grace periods check for idle when building the +bitmask of CPUs that must be IPIed, just before sending each IPI, and +(either explicitly or implicitly) within the IPI handler. + +Batching via Sequence Counter +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If each grace-period request was carried out separately, expedited grace +periods would have abysmal scalability and problematic high-load +characteristics. 
Because each grace-period operation can serve an +unlimited number of updates, it is important to *batch* requests, so +that a single expedited grace-period operation will cover all requests +in the corresponding batch. + +This batching is controlled by a sequence counter named +``->expedited_sequence`` in the ``rcu_state`` structure. This counter +has an odd value when there is an expedited grace period in progress and +an even value otherwise, so that dividing the counter value by two gives +the number of completed grace periods. During any given update request, +the counter must transition from even to odd and then back to even, thus +indicating that a grace period has elapsed. Therefore, if the initial +value of the counter is ``s``, the updater must wait until the counter +reaches at least the value ``(s+3)&~0x1``. This counter is managed by +the following access functions: + +#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited + grace period. +#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace + period. +#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter. +#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited + grace period has elapsed since the corresponding call to + ``rcu_exp_gp_seq_snap()``. + +Again, only one request in a given batch need actually carry out a +grace-period operation, which means there must be an efficient way to +identify which of many concurrent requests will initiate the grace +period, and that there be an efficient way for the remaining requests to +wait for that grace period to complete. However, that is the topic of +the next section. + +Funnel Locking and Wait/Wakeup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The natural way to sort out which of a batch of updaters will initiate +the expedited grace period is to use the ``rcu_node`` combining tree, as +implemented by the ``exp_funnel_lock()`` function.
The first updater +corresponding to a given grace period arriving at a given ``rcu_node`` +structure records its desired grace-period sequence number in the +``->exp_seq_rq`` field and moves up to the next level in the tree. +Otherwise, if the ``->exp_seq_rq`` field already contains the sequence +number for the desired grace period or some later one, the updater +blocks on one of four wait queues in the ``->exp_wq[]`` array, using the +second-from-bottom and third-from-bottom bits as an index. An +``->exp_lock`` field in the ``rcu_node`` structure synchronizes access +to these fields. + +An empty ``rcu_node`` tree is shown in the following diagram, with the +white cells representing the ``->exp_seq_rq`` field and the red cells +representing the elements of the ``->exp_wq[]`` array. + +.. kernel-figure:: Funnel0.svg + +The next diagram shows the situation after the arrival of Task A and +Task B at the leftmost and rightmost leaf ``rcu_node`` structures, +respectively. The current value of the ``rcu_state`` structure's +``->expedited_sequence`` field is zero, so adding three and clearing the +bottom bit results in the value two, which both tasks record in the +``->exp_seq_rq`` field of their respective ``rcu_node`` structures: + +.. kernel-figure:: Funnel1.svg + +Each of Tasks A and B will move up to the root ``rcu_node`` structure. +Suppose that Task A wins, recording its desired grace-period sequence +number and resulting in the state shown below: + +.. kernel-figure:: Funnel2.svg + +Task A now advances to initiate a new grace period, while Task B moves +up to the root ``rcu_node`` structure, and, seeing that its desired +sequence number is already recorded, blocks on ``->exp_wq[1]``. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why ``->exp_wq[1]``?
Given that the value of these tasks' desired | +| sequence number is two, shouldn't they instead block on | +| ``->exp_wq[2]``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No. | +| Recall that the bottom bit of the desired sequence number indicates | +| whether or not a grace period is currently in progress. It is | +| therefore necessary to shift the sequence number right one bit | +| position to obtain the number of the grace period. This results in | +| ``->exp_wq[1]``. | ++-----------------------------------------------------------------------+ + +If Tasks C and D also arrive at this point, they will compute the same +desired grace-period sequence number, and see that both leaf +``rcu_node`` structures already have that value recorded. They will +therefore block on their respective ``rcu_node`` structures' +``->exp_wq[1]`` fields, as shown below: + +.. kernel-figure:: Funnel3.svg + +Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and +initiates the grace period, which increments ``->expedited_sequence``. +Therefore, if Tasks E and F arrive, they will compute a desired sequence +number of 4 and will record this value as shown below: + +.. kernel-figure:: Funnel4.svg + +Tasks E and F will propagate up the ``rcu_node`` combining tree, with +Task F blocking on the root ``rcu_node`` structure and Task E waiting for +Task A to finish so that it can start the next grace period. The +resulting state is as shown below: + +.. kernel-figure:: Funnel5.svg + +Once the grace period completes, Task A starts waking up the tasks +waiting for this grace period to complete, increments the +``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then +releases the ``->exp_mutex``. This results in the following state: + +..
kernel-figure:: Funnel6.svg + +Task E can then acquire ``->exp_mutex`` and increment +``->expedited_sequence`` to the value three. If new tasks G and H arrive +and move up the combining tree at the same time, the state will be as +follows: + +.. kernel-figure:: Funnel7.svg + +Note that three of the root ``rcu_node`` structure's waitqueues are now +occupied. However, at some point, Task A will wake up the tasks blocked +on the ``->exp_wq`` waitqueues, resulting in the following state: + +.. kernel-figure:: Funnel8.svg + +Execution will continue with Tasks E and H completing their grace +periods and carrying out their wakeups. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What happens if Task A takes so long to do its wakeups that Task E's | +| grace period completes? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Then Task E will block on the ``->exp_wake_mutex``, which will also | +| prevent it from releasing ``->exp_mutex``, which in turn will prevent | +| the next grace period from starting. This last is important in | +| preventing overflow of the ``->exp_wq[]`` array. | ++-----------------------------------------------------------------------+ + +Use of Workqueues +~~~~~~~~~~~~~~~~~ + +In earlier implementations, the task requesting the expedited grace +period also drove it to completion. This straightforward approach had +the disadvantage of needing to account for POSIX signals sent to user +tasks, so more recent implementations use the Linux kernel's +`workqueues <https://www.kernel.org/doc/Documentation/core-api/workqueue.rst>`__.
+ +The requesting task still does counter snapshotting and funnel-lock +processing, but the task reaching the top of the funnel lock does a +``schedule_work()`` (from ``_synchronize_rcu_expedited()``) so that a +workqueue kthread does the actual grace-period processing. Because +workqueue kthreads do not accept POSIX signals, grace-period-wait +processing need not allow for POSIX signals. In addition, this approach +allows wakeups for the previous expedited grace period to be overlapped +with processing for the next expedited grace period. Because there are +only four sets of waitqueues, it is necessary to ensure that the +previous grace period's wakeups complete before the next grace period's +wakeups start. This is handled by having the ``->exp_mutex`` guard +expedited grace-period processing and the ``->exp_wake_mutex`` guard +wakeups. The key point is that the ``->exp_mutex`` is not released until +the first wakeup is complete, which means that the ``->exp_wake_mutex`` +has already been acquired at that point. This approach ensures that the +previous grace period's wakeups can be carried out while the current +grace period is in process, but that these wakeups will complete before +the next grace period starts. This means that only three waitqueues are +required, guaranteeing that the four that are provided are sufficient. + +Stall Warnings +~~~~~~~~~~~~~~ + +Expediting grace periods does nothing to speed things up when RCU +readers take too long, and therefore expedited grace periods check for +stalls just as normal grace periods do. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But why not just let the normal grace-period machinery detect the | +| stalls, given that a given reader must block both normal and | +| expedited grace periods?
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because it is quite possible that at a given time there is no normal | +| grace period in progress, in which case the normal grace period | +| cannot emit a stall warning. | ++-----------------------------------------------------------------------+ + +The ``synchronize_sched_expedited_wait()`` function loops waiting for +the expedited grace period to end, but with a timeout set to the current +RCU CPU stall-warning time. If this time is exceeded, any CPUs or +``rcu_node`` structures blocking the current grace period are printed. +Each stall warning results in another pass through the loop, but the +second and subsequent passes use longer stall times. + +Mid-boot operation +~~~~~~~~~~~~~~~~~~ + +The use of workqueues has the advantage that the expedited grace-period +code need not worry about POSIX signals. Unfortunately, it has the +corresponding disadvantage that workqueues cannot be used until they are +initialized, which does not happen until some time after the scheduler +spawns the first task. Given that there are parts of the kernel that +really do want to execute grace periods during this mid-boot “dead +zone”, expedited grace periods must do something else during this time. + +What they do is to fall back to the old practice of requiring that the +requesting task drive the expedited grace period, as was the case before +the use of workqueues. However, the requesting task is only required to +drive the grace period during the mid-boot dead zone. Before mid-boot, a +synchronous grace period is a no-op. Some time after mid-boot, +workqueues are used. + +Non-expedited non-SRCU synchronous grace periods must also operate +normally during mid-boot. This is handled by causing non-expedited grace +periods to take the expedited code path during mid-boot.
+ +The current code assumes that there are no POSIX signals during the +mid-boot dead zone. However, if an overwhelming need for POSIX signals +somehow arises, appropriate adjustments can be made to the expedited +stall-warning code. One such adjustment would reinstate the +pre-workqueue stall-warning checks, but only during the mid-boot dead +zone. + +With this refinement, synchronous grace periods can now be used from +task context pretty much any time during the life of the kernel. That +is, aside from some points in the suspend, hibernate, or shutdown code +path. + +Summary +~~~~~~~ + +Expedited grace periods use a sequence-number approach to promote +batching, so that a single grace-period operation can serve numerous +requests. A funnel lock is used to efficiently identify the one task out +of a concurrent group that will request the grace period. All members of +the group will block on waitqueues provided in the ``rcu_node`` +structure. The actual grace-period processing is carried out by a +workqueue. + +CPU-hotplug operations are noted lazily in order to prevent the need for +tight synchronization between expedited grace periods and CPU-hotplug +operations. The dyntick-idle counters are used to avoid sending IPIs to +idle CPUs, at least in the common case. RCU-preempt and RCU-sched use +different IPI handlers and different code to respond to the state +changes carried out by those handlers, but otherwise use common code. + +Quiescent states are tracked using the ``rcu_node`` tree, and once all +necessary quiescent states have been reported, all tasks waiting on this +expedited grace period are awakened. A pair of mutexes are used to allow +one grace period's wakeups to proceed concurrently with the next grace +period's processing. + +This combination of mechanisms allows expedited grace periods to run +reasonably efficiently. 
However, for non-time-critical tasks, normal +grace periods should be used instead because their longer duration +permits much higher degrees of batching, and thus much lower per-request +overheads. diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html deleted file mode 100644 index e5b42a798ff3..000000000000 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html +++ /dev/null @@ -1,9 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> - <html> - <head><title>A Diagram of TREE_RCU's Grace-Period Memory Ordering</title> - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> - -<p><img src="TreeRCU-gp.svg" alt="TreeRCU-gp.svg"> - -</body></html> diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html deleted file mode 100644 index c64f8d26609f..000000000000 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html +++ /dev/null @@ -1,704 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> - <html> - <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title> - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> - - <p>August 8, 2017</p> - <p>This article was contributed by Paul E. McKenney</p> - -<h3>Introduction</h3> - -<p>This document gives a rough visual overview of how Tree RCU's -grace-period memory ordering guarantee is provided. 
- -<ol> -<li> <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> - What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a> -<li> <a href="#Tree RCU Grace Period Memory Ordering Building Blocks"> - Tree RCU Grace Period Memory Ordering Building Blocks</a> -<li> <a href="#Tree RCU Grace Period Memory Ordering Components"> - Tree RCU Grace Period Memory Ordering Components</a> -<li> <a href="#Putting It All Together">Putting It All Together</a> -</ol> - -<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> -What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3> - -<p>RCU grace periods provide extremely strong memory-ordering guarantees -for non-idle non-offline code. -Any code that happens after the end of a given RCU grace period is guaranteed -to see the effects of all accesses prior to the beginning of that grace -period that are within RCU read-side critical sections. -Similarly, any code that happens before the beginning of a given RCU grace -period is guaranteed to see the effects of all accesses following the end -of that grace period that are within RCU read-side critical sections. - -<p>Note well that RCU-sched read-side critical sections include any region -of code for which preemption is disabled. -Given that each individual machine instruction can be thought of as -an extremely small region of preemption-disabled code, one can think of -<tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids. - -<p>RCU updaters use this guarantee by splitting their updates into -two phases, one of which is executed before the grace period and -the other of which is executed after the grace period. -In the most common use case, phase one removes an element from -a linked RCU-protected data structure, and phase two frees that element. 
-For this to work, any readers that have witnessed state prior to the -phase-one update (in the common case, removal) must not witness state -following the phase-two update (in the common case, freeing). - -<p>The RCU implementation provides this guarantee using a network -of lock-based critical sections, memory barriers, and per-CPU -processing, as is described in the following sections. - -<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks"> -Tree RCU Grace Period Memory Ordering Building Blocks</a></h3> - -<p>The workhorse for RCU's grace-period memory ordering is the -critical section for the <tt>rcu_node</tt> structure's -<tt>->lock</tt>. -These critical sections use helper functions for lock acquisition, including -<tt>raw_spin_lock_rcu_node()</tt>, -<tt>raw_spin_lock_irq_rcu_node()</tt>, and -<tt>raw_spin_lock_irqsave_rcu_node()</tt>. -Their lock-release counterparts are -<tt>raw_spin_unlock_rcu_node()</tt>, -<tt>raw_spin_unlock_irq_rcu_node()</tt>, and -<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>, -respectively. -For completeness, a -<tt>raw_spin_trylock_rcu_node()</tt> -is also provided. -The key point is that the lock-acquisition functions, including -<tt>raw_spin_trylock_rcu_node()</tt>, all invoke -<tt>smp_mb__after_unlock_lock()</tt> immediately after successful -acquisition of the lock. - -<p>Therefore, for any given <tt>rcu_node</tt> structure, any access -happening before one of the above lock-release functions will be seen -by all CPUs as happening before any access happening after a later -one of the above lock-acquisition functions. -Furthermore, any access happening before one of the -above lock-release function on any given CPU will be seen by all -CPUs as happening before any access happening after a later one -of the above lock-acquisition functions executing on that same CPU, -even if the lock-release and lock-acquisition functions are operating -on different <tt>rcu_node</tt> structures. 
-Tree RCU uses these two ordering guarantees to form an ordering -network among all CPUs that were in any way involved in the grace -period, including any CPUs that came online or went offline during -the grace period in question. - -<p>The following litmus test exhibits the ordering effects of these -lock-acquisition and lock-release functions: - -<pre> - 1 int x, y, z; - 2 - 3 void task0(void) - 4 { - 5 raw_spin_lock_rcu_node(rnp); - 6 WRITE_ONCE(x, 1); - 7 r1 = READ_ONCE(y); - 8 raw_spin_unlock_rcu_node(rnp); - 9 } -10 -11 void task1(void) -12 { -13 raw_spin_lock_rcu_node(rnp); -14 WRITE_ONCE(y, 1); -15 r2 = READ_ONCE(z); -16 raw_spin_unlock_rcu_node(rnp); -17 } -18 -19 void task2(void) -20 { -21 WRITE_ONCE(z, 1); -22 smp_mb(); -23 r3 = READ_ONCE(x); -24 } -25 -26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); -</pre> - -<p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”, -after all changes have propagated throughout the system. -Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the -acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example -on PowerPC. -The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this -<tt>WARN_ON()</tt> from triggering. - -<p>This approach must be extended to include idle CPUs, which need -RCU's grace-period memory ordering guarantee to extend to any -RCU read-side critical sections preceding and following the current -idle sojourn. -This case is handled by calls to the strongly ordered -<tt>atomic_add_return()</tt> read-modify-write atomic operation that -is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry -time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time. -The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and -<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke -an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But what about CPUs that remain offline for the entire - grace period? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Such CPUs will be offline at the beginning of the grace period, - so the grace period won't expect quiescent states from them. - Races between grace-period start and CPU-hotplug operations - are mediated by the CPU's leaf <tt>rcu_node</tt> structure's - <tt>->lock</tt> as described above. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>The approach must be extended to handle one final case, that -of waking a task blocked in <tt>synchronize_rcu()</tt>. -This task might be affinitied to a CPU that is not yet aware that -the grace period has ended, and thus might not yet be subject to -the grace period's memory ordering. -Therefore, there is an <tt>smp_mb()</tt> after the return from -<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt> -code path. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - What? Where??? - I don't see any <tt>smp_mb()</tt> after the return from - <tt>wait_for_completion()</tt>!!! -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - That would be because I spotted the need for that - <tt>smp_mb()</tt> during the creation of this documentation, - and it is therefore unlikely to hit mainline before v4.14. - Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and - Jonathan Cameron for asking questions that sensitized me - to the rather elaborate sequence of events that demonstrate - the need for this memory barrier. 
-</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>Tree RCU's grace--period memory-ordering guarantees rely most -heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt> -field, so much so that it is necessary to abbreviate this pattern -in the diagrams in the next section. -For example, consider the <tt>rcu_prepare_for_idle()</tt> function -shown below, which is one of several functions that enforce ordering -of newly arrived RCU callbacks against future grace periods: - -<pre> - 1 static void rcu_prepare_for_idle(void) - 2 { - 3 bool needwake; - 4 struct rcu_data *rdp; - 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); - 6 struct rcu_node *rnp; - 7 struct rcu_state *rsp; - 8 int tne; - 9 -10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || -11 rcu_is_nocb_cpu(smp_processor_id())) -12 return; -13 tne = READ_ONCE(tick_nohz_active); -14 if (tne != rdtp->tick_nohz_enabled_snap) { -15 if (rcu_cpu_has_callbacks(NULL)) -16 invoke_rcu_core(); -17 rdtp->tick_nohz_enabled_snap = tne; -18 return; -19 } -20 if (!tne) -21 return; -22 if (rdtp->all_lazy && -23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { -24 rdtp->all_lazy = false; -25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; -26 invoke_rcu_core(); -27 return; -28 } -29 if (rdtp->last_accelerate == jiffies) -30 return; -31 rdtp->last_accelerate = jiffies; -32 for_each_rcu_flavor(rsp) { -33 rdp = this_cpu_ptr(rsp->rda); -34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) -35 continue; -36 rnp = rdp->mynode; -37 raw_spin_lock_rcu_node(rnp); -38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); -39 raw_spin_unlock_rcu_node(rnp); -40 if (needwake) -41 rcu_gp_kthread_wake(rsp); -42 } -43 } -</pre> - -<p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really -matters for this discussion are lines 37–39.
-We will therefore abbreviate this function as follows: - -</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg"> - -<p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt> -critical section, with the double line on top representing the additional -<tt>smp_mb__after_unlock_lock()</tt>. - -<h3><a name="Tree RCU Grace Period Memory Ordering Components"> -Tree RCU Grace Period Memory Ordering Components</a></h3> - -<p>Tree RCU's grace-period memory-ordering guarantee is provided by -a number of RCU components: - -<ol> -<li> <a href="#Callback Registry">Callback Registry</a> -<li> <a href="#Grace-Period Initialization">Grace-Period Initialization</a> -<li> <a href="#Self-Reported Quiescent States"> - Self-Reported Quiescent States</a> -<li> <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a> -<li> <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a> -<li> <a href="Forcing Quiescent States">Forcing Quiescent States</a> -<li> <a href="Grace-Period Cleanup">Grace-Period Cleanup</a> -<li> <a href="Callback Invocation">Callback Invocation</a> -</ol> - -<p>Each of the following section looks at the corresponding component -in detail. - -<h4><a name="Callback Registry">Callback Registry</a></h4> - -<p>If RCU's grace-period guarantee is to mean anything at all, any -access that happens before a given invocation of <tt>call_rcu()</tt> -must also happen before the corresponding grace period. -The implementation of this portion of RCU's grace period guarantee -is shown in the following figure: - -</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg"> - -<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state, -it provides no ordering guarantees, either for itself or for -phase one of the update (which again will usually be removal of -an element from an RCU-protected data structure). 
-It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list, -which cannot become associated with a grace period until a later -call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above. - -<p>One set of code paths shown on the left invokes -<tt>rcu_accelerate_cbs()</tt> via -<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if -the current CPU is inundated with queued <tt>rcu_head</tt> structures) -or more likely from an <tt>RCU_SOFTIRQ</tt> handler. -Another code path in the middle is taken only in kernels built with -<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes -<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>. -The final code path on the right is taken only in kernels built with -<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes -<tt>rcu_accelerate_cbs()</tt> via -<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>, -<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>, -which in turn is invoked on a surviving CPU after the outgoing -CPU has been completely offlined. - -<p>There are a few other code paths within grace-period processing -that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>. -However, either way, all of the CPU's recently queued <tt>rcu_head</tt> -structures are associated with a future grace-period number under -the protection of the CPU's lead <tt>rcu_node</tt> structure's -<tt>->lock</tt>. -In all cases, there is full ordering against any prior critical section -for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and -also full ordering against any of the current task's or CPU's prior critical -sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>. - -<p>The next section will show how this ordering ensures that any -accesses prior to the <tt>call_rcu()</tt> (particularly including phase -one of the update) -happen before the start of the corresponding grace period. 
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But what about <tt>synchronize_rcu()</tt>? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt> - to <tt>wait_rcu_gp()</tt>, which invokes it. - So either way, it eventually comes down to <tt>call_rcu()</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4> - -<p>Grace-period initialization is carried out by -the grace-period kernel thread, which makes several passes over the -<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function. -This means that showing the full flow of ordering through the -grace-period computation will require duplicating this tree. -If you find this confusing, please note that the state of the -<tt>rcu_node</tt> changes over time, just like Heraclitus's river. -However, to keep the <tt>rcu_node</tt> river tractable, the -grace-period kernel thread's traversals are presented in multiple -parts, starting in this section with the various phases of -grace-period initialization. - -<p>The first ordering-related grace-period initialization action is to -advance the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> -grace-period-number counter, as shown below: - -</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> - -<p>The actual increment is carried out using <tt>smp_store_release()</tt>, -which helps reject false-positive RCU CPU stall detection. -Note that only the root <tt>rcu_node</tt> structure is touched. - -<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks -based on CPUs having come online or gone offline since the start of -the previous grace period. 
-In the common case where the number of online CPUs for this <tt>rcu_node</tt> -structure has not transitioned to or from zero, -this pass will scan only the leaf <tt>rcu_node</tt> structures. -However, if the number of online CPUs for a given leaf <tt>rcu_node</tt> -structure has transitioned from zero, -<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU. -Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt> -structure has transitioned to zero, -<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU. -The diagram below shows the path of ordering if the leftmost -<tt>rcu_node</tt> structure onlines its first CPU and if the next -<tt>rcu_node</tt> structure has no online CPUs -(or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines -its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs). - -</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> - -<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt> -tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's -<tt>->gp_seq</tt> field to the newly advanced value from the -<tt>rcu_state</tt> structure, as shown in the following diagram. - -</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> - -<p>This change will also cause each CPU's next call to -<tt>__note_gp_changes()</tt> -to notice that a new grace period has started, as described in the next -section. -But because the grace-period kthread started the grace period at the -root (with the advancing of the <tt>rcu_state</tt> structure's -<tt>->gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt> -structure's <tt>->gp_seq</tt> field, each CPU's observation of -the start of the grace period will happen after the actual start -of the grace period. 
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But what about the CPU that started the grace period? - Why wouldn't it see the start of the grace period right when - it started that grace period? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In some deep philosophical and overly anthromorphized - sense, yes, the CPU starting the grace period is immediately - aware of having done so. - However, if we instead assume that RCU is not self-aware, - then even the CPU starting the grace period does not really - become aware of the start of this grace period until its - first call to <tt>__note_gp_changes()</tt>. - On the other hand, this CPU potentially gets early notification - because it invokes <tt>__note_gp_changes()</tt> during its - last <tt>rcu_gp_init()</tt> pass through its leaf - <tt>rcu_node</tt> structure. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h4><a name="Self-Reported Quiescent States"> -Self-Reported Quiescent States</a></h4> - -<p>When all entities that might block the grace period have reported -quiescent states (or as described in a later section, had quiescent -states reported on their behalf), the grace period can end. -Online non-idle CPUs report their own quiescent states, as shown -in the following diagram: - -</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%"> - -<p>This is for the last CPU to report a quiescent state, which signals -the end of the grace period. -Earlier quiescent states would push up the <tt>rcu_node</tt> tree -only until they encountered an <tt>rcu_node</tt> structure that -is waiting for additional quiescent states. -However, ordering is nevertheless preserved because some later quiescent -state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>. 
- -<p>Any number of events can lead up to a CPU invoking -<tt>note_gp_changes</tt> (or alternatively, directly invoking -<tt>__note_gp_changes()</tt>), at which point that CPU will notice -the start of a new grace period while holding its leaf -<tt>rcu_node</tt> lock. -Therefore, all execution shown in this diagram happens after the -start of the grace period. -In addition, this CPU will consider any RCU read-side critical -section that started before the invocation of <tt>__note_gp_changes()</tt> -to have started before the grace period, and thus a critical -section that the grace period must wait on. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But a RCU read-side critical section might have started - after the beginning of the grace period - (the advancing of <tt>->gp_seq</tt> from earlier), so why should - the grace period wait on such a critical section? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - It is indeed not necessary for the grace period to wait on such - a critical section. - However, it is permissible to wait on it. - And it is furthermore important to wait on it, as this - lazy approach is far more scalable than a “big bang” - all-at-once grace-period start could possibly be. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>If the CPU does a context switch, a quiescent state will be -noted by <tt>rcu_node_context_switch()</tt> on the left. -On the other hand, if the CPU takes a scheduler-clock interrupt -while executing in usermode, a quiescent state will be noted by -<tt>rcu_sched_clock_irq()</tt> on the right. -Either way, the passage through a quiescent state will be noted -in a per-CPU variable.
- -<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on -this CPU (for example, after the next scheduler-clock -interrupt), <tt>rcu_core()</tt> will invoke -<tt>rcu_check_quiescent_state()</tt>, which will notice the -recorded quiescent state, and invoke -<tt>rcu_report_qs_rdp()</tt>. -If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state -really does apply to the current grace period, it invokes -<tt>rcu_report_rnp()</tt> which traverses up the <tt>rcu_node</tt> -tree as shown at the bottom of the diagram, clearing bits from -each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field, -and propagating up the tree when the result is zero. - -<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt> -structure only if the current CPU is reporting the last quiescent -state for the subtree headed by that <tt>rcu_node</tt> structure. -A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt> -structure, then there will be a later traversal by another CPU -(or perhaps the same one) that proceeds upwards -from that point, and the <tt>rcu_node</tt> <tt>->lock</tt> -guarantees that the first CPU's quiescent state happens before the -remainder of the second CPU's traversal. -Applying this line of thought repeatedly shows that all CPUs' -quiescent states happen before the last CPU traverses through -the root <tt>rcu_node</tt> structure, the “last CPU” -being the one that clears the last bit in the root <tt>rcu_node</tt> -structure's <tt>->qsmask</tt> field. - -<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4> - -<p>Due to energy-efficiency considerations, RCU is forbidden from -disturbing idle CPUs. -CPUs are therefore required to notify RCU when entering or leaving idle -state, which they do via fully ordered value-returning atomic operations -on a per-CPU variable.
-The ordering effects are as shown below: - -</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%"> - -<p>The RCU grace-period kernel thread samples the per-CPU idleness -variable while holding the corresponding CPU's leaf <tt>rcu_node</tt> -structure's <tt>->lock</tt>. -This means that any RCU read-side critical sections that precede the -idle period (the oval near the top of the diagram above) will happen -before the end of the current grace period. -Similarly, the beginning of the current grace period will happen before -any RCU read-side critical sections that follow the -idle period (the oval near the bottom of the diagram above). - -<p>Plumbing this into the full grace-period execution is described -<a href="#Forcing Quiescent States">below</a>. - -<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4> - -<p>RCU is also forbidden from disturbing offline CPUs, which might well -be powered off and removed from the system completely. -CPUs are therefore required to notify RCU of their comings and goings -as part of the corresponding CPU hotplug operations. -The ordering effects are shown below: - -</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%"> - -<p>Because CPU hotplug operations are much less frequent than idle transitions, -they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt> -structure's <tt>->lock</tt> and update this structure's -<tt>->qsmaskinitnext</tt>. -The RCU grace-period kernel thread samples this mask to detect CPUs -having gone offline since the beginning of this grace period. - -<p>Plumbing this into the full grace-period execution is described -<a href="#Forcing Quiescent States">below</a>. - -<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4> - -<p>As noted above, idle and offline CPUs cannot report their own -quiescent states, and therefore the grace-period kernel thread -must do the reporting on their behalf. 
-This process is called “forcing quiescent states”, it is -repeated every few jiffies, and its ordering effects are shown below: - -</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%"> - -<p>Each pass of quiescent state forcing is guaranteed to traverse the -leaf <tt>rcu_node</tt> structures, and if there are no new quiescent -states due to recently idled and/or offlined CPUs, then only the -leaves are traversed. -However, if there is a newly offlined CPU as illustrated on the left -or a newly idled CPU as illustrated on the right, the corresponding -quiescent state will be driven up towards the root. -As with self-reported quiescent states, the upwards driving stops -once it reaches an <tt>rcu_node</tt> structure that has quiescent -states outstanding from other CPUs. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - The leftmost drive to root stopped before it reached - the root <tt>rcu_node</tt> structure, which means that - there are still CPUs subordinate to that structure on - which the current grace period is waiting. - Given that, how is it possible that the rightmost drive - to root ended the grace period? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Good analysis! - It is in fact impossible in the absence of bugs in RCU. - But this diagram is complex enough as it is, so simplicity - overrode accuracy. - You can think of it as poetic license, or you can think of - it as misdirection that is resolved in the - <a href="#Putting It All Together">stitched-together diagram</a>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4> - -<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree -breadth-first advancing all the <tt>->gp_seq</tt> fields, then it -advances the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> field.
-The ordering effects are shown below: - -</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%"> - -<p>As indicated by the oval at the bottom of the diagram, once -grace-period cleanup is complete, the next grace period can begin. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But when precisely does the grace period end? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - There is no useful single point at which the grace period - can be said to end. - The earliest reasonable candidate is as soon as the last - CPU has reported its quiescent state, but it may be some - milliseconds before RCU becomes aware of this. - The latest reasonable candidate is once the <tt>rcu_state</tt> - structure's <tt>->gp_seq</tt> field has been updated, - but it is quite possible that some CPUs have already completed - phase two of their updates by that time. - In short, if you are going to work with RCU, you need to - learn to embrace uncertainty. -</font></td></tr> -<tr><td> </td></tr> -</table> - - -<h4><a name="Callback Invocation">Callback Invocation</a></h4> - -<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's -<tt>->gp_seq</tt> field has been updated, that CPU can begin -invoking its RCU callbacks that were waiting for this grace period -to end. -These callbacks are identified by <tt>rcu_advance_cbs()</tt>, -which is usually invoked by <tt>__note_gp_changes()</tt>. -As shown in the diagram below, this invocation can be triggered by -the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on -the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on -the right, but only for kernels build with -<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>). 
-Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in -<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn -allows those callbacks to carry out (either directly or indirectly -via wakeup) the needed phase-two processing for each update. - -</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%"> - -<p>Please note that callback invocation can also be prompted by any -number of corner-case code paths, for example, when a CPU notes that -it has excessive numbers of callbacks queued. -In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's -<tt>->lock</tt> before invoking callbacks, which preserves the -required ordering against the newly completed grace period. - -<p>However, if the callback function communicates to other CPUs, -for example, doing a wakeup, then it is that function's responsibility -to maintain ordering. -For example, if the callback function wakes up a task that runs on -some other CPU, proper ordering must in place in both the callback -function and the task being awakened. -To see why this is important, consider the top half of the -<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram. -The callback might be running on a CPU corresponding to the leftmost -leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on -a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure, -and the grace-period kernel thread might not yet have reached the -rightmost leaf. -In this case, the grace period's memory ordering might not yet have -reached that CPU, so again the callback function and the awakened -task must supply proper ordering. - -<h3><a name="Putting It All Together">Putting It All Together</a></h3> - -<p>A stitched-together diagram is -<a href="Tree-RCU-Diagram.html">here</a>. - -<h3><a name="Legal Statement"> -Legal Statement</a></h3> - -<p>This work represents the view of the author and does not necessarily -represent the view of IBM. 
- -</p><p>Linux is a registered trademark of Linus Torvalds. - -</p><p>Other company, product, and service names may be trademarks or -service marks of others. - -</body></html> diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst new file mode 100644 index 000000000000..1a8b129cfc04 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst @@ -0,0 +1,624 @@ +====================================================== +A Tour Through TREE_RCU's Grace-Period Memory Ordering +====================================================== + +August 8, 2017 + +This article was contributed by Paul E. McKenney + +Introduction +============ + +This document gives a rough visual overview of how Tree RCU's +grace-period memory ordering guarantee is provided. + +What Is Tree RCU's Grace Period Memory Ordering Guarantee? +========================================================== + +RCU grace periods provide extremely strong memory-ordering guarantees +for non-idle non-offline code. +Any code that happens after the end of a given RCU grace period is guaranteed +to see the effects of all accesses prior to the beginning of that grace +period that are within RCU read-side critical sections. +Similarly, any code that happens before the beginning of a given RCU grace +period is guaranteed to see the effects of all accesses following the end +of that grace period that are within RCU read-side critical sections. + +Note well that RCU-sched read-side critical sections include any region +of code for which preemption is disabled. +Given that each individual machine instruction can be thought of as +an extremely small region of preemption-disabled code, one can think of +``synchronize_rcu()`` as ``smp_mb()`` on steroids. 
+ +RCU updaters use this guarantee by splitting their updates into +two phases, one of which is executed before the grace period and +the other of which is executed after the grace period. +In the most common use case, phase one removes an element from +a linked RCU-protected data structure, and phase two frees that element. +For this to work, any readers that have witnessed state prior to the +phase-one update (in the common case, removal) must not witness state +following the phase-two update (in the common case, freeing). + +The RCU implementation provides this guarantee using a network +of lock-based critical sections, memory barriers, and per-CPU +processing, as is described in the following sections. + +Tree RCU Grace Period Memory Ordering Building Blocks +===================================================== + +The workhorse for RCU's grace-period memory ordering is the +critical section for the ``rcu_node`` structure's +``->lock``. These critical sections use helper functions for lock +acquisition, including ``raw_spin_lock_rcu_node()``, +``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``. +Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``, +``raw_spin_unlock_irq_rcu_node()``, and +``raw_spin_unlock_irqrestore_rcu_node()``, respectively. +For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided. +The key point is that the lock-acquisition functions, including +``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()`` +immediately after successful acquisition of the lock. + +Therefore, for any given ``rcu_node`` structure, any access +happening before one of the above lock-release functions will be seen +by all CPUs as happening before any access happening after a later +one of the above lock-acquisition functions. 
+Furthermore, any access happening before one of the +above lock-release function on any given CPU will be seen by all +CPUs as happening before any access happening after a later one +of the above lock-acquisition functions executing on that same CPU, +even if the lock-release and lock-acquisition functions are operating +on different ``rcu_node`` structures. +Tree RCU uses these two ordering guarantees to form an ordering +network among all CPUs that were in any way involved in the grace +period, including any CPUs that came online or went offline during +the grace period in question. + +The following litmus test exhibits the ordering effects of these +lock-acquisition and lock-release functions:: + + 1 int x, y, z; + 2 + 3 void task0(void) + 4 { + 5 raw_spin_lock_rcu_node(rnp); + 6 WRITE_ONCE(x, 1); + 7 r1 = READ_ONCE(y); + 8 raw_spin_unlock_rcu_node(rnp); + 9 } + 10 + 11 void task1(void) + 12 { + 13 raw_spin_lock_rcu_node(rnp); + 14 WRITE_ONCE(y, 1); + 15 r2 = READ_ONCE(z); + 16 raw_spin_unlock_rcu_node(rnp); + 17 } + 18 + 19 void task2(void) + 20 { + 21 WRITE_ONCE(z, 1); + 22 smp_mb(); + 23 r3 = READ_ONCE(x); + 24 } + 25 + 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); + +The ``WARN_ON()`` is evaluated at “the end of time”, +after all changes have propagated throughout the system. +Without the ``smp_mb__after_unlock_lock()`` provided by the +acquisition functions, this ``WARN_ON()`` could trigger, for example +on PowerPC. +The ``smp_mb__after_unlock_lock()`` invocations prevent this +``WARN_ON()`` from triggering. + +This approach must be extended to include idle CPUs, which need +RCU's grace-period memory ordering guarantee to extend to any +RCU read-side critical sections preceding and following the current +idle sojourn.
+This case is handled by calls to the strongly ordered
+``atomic_add_return()`` read-modify-write atomic operation that
+is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
+time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
+The grace-period kthread invokes ``rcu_dynticks_snap()`` and
+``rcu_dynticks_in_eqs_since()`` (both of which invoke
+an ``atomic_add_return()`` of zero) to detect idle CPUs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But what about CPUs that remain offline for the entire grace period?  |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| Such CPUs will be offline at the beginning of the grace period, so    |
+| the grace period won't expect quiescent states from them. Races       |
+| between grace-period start and CPU-hotplug operations are mediated    |
+| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described    |
+| above.                                                                |
++-----------------------------------------------------------------------+
+
+The approach must be extended to handle one final case, that of waking a
+task blocked in ``synchronize_rcu()``. This task might be affinitied to
+a CPU that is not yet aware that the grace period has ended, and thus
+might not yet be subject to the grace period's memory ordering.
+Therefore, there is an ``smp_mb()`` after the return from
+``wait_for_completion()`` in the ``synchronize_rcu()`` code path.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| What? Where??? I don't see any ``smp_mb()`` after the return from     |
+| ``wait_for_completion()``!!!                                          |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| That would be because I spotted the need for that ``smp_mb()`` during |
+| the creation of this documentation, and it is therefore unlikely to   |
+| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter     |
+| Zijlstra, and Jonathan Cameron for asking questions that sensitized   |
+| me to the rather elaborate sequence of events that demonstrate the    |
+| need for this memory barrier.                                         |
++-----------------------------------------------------------------------+
+
+Tree RCU's grace-period memory-ordering guarantees rely most heavily on
+the ``rcu_node`` structure's ``->lock`` field, so much so that it is
+necessary to abbreviate this pattern in the diagrams in the next
+section. For example, consider the ``rcu_prepare_for_idle()`` function
+shown below, which is one of several functions that enforce ordering of
+newly arrived RCU callbacks against future grace periods:
+
+::
+
+    1 static void rcu_prepare_for_idle(void)
+    2 {
+    3   bool needwake;
+    4   struct rcu_data *rdp;
+    5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+    6   struct rcu_node *rnp;
+    7   struct rcu_state *rsp;
+    8   int tne;
+    9
+   10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
+   11       rcu_is_nocb_cpu(smp_processor_id()))
+   12     return;
+   13   tne = READ_ONCE(tick_nohz_active);
+   14   if (tne != rdtp->tick_nohz_enabled_snap) {
+   15     if (rcu_cpu_has_callbacks(NULL))
+   16       invoke_rcu_core();
+   17     rdtp->tick_nohz_enabled_snap = tne;
+   18     return;
+   19   }
+   20   if (!tne)
+   21     return;
+   22   if (rdtp->all_lazy &&
+   23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
+   24     rdtp->all_lazy = false;
+   25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
+   26     invoke_rcu_core();
+   27     return;
+   28   }
+   29   if (rdtp->last_accelerate == jiffies)
+   30     return;
+   31   rdtp->last_accelerate = jiffies;
+   32   for_each_rcu_flavor(rsp) {
+   33     rdp = this_cpu_ptr(rsp->rda);
+   34     if (rcu_segcblist_pend_cbs(&rdp->cblist))
+   35       continue;
+   36     rnp = rdp->mynode;
+   37     raw_spin_lock_rcu_node(rnp);
+   38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+   39     raw_spin_unlock_rcu_node(rnp);
+   40     if (needwake)
+   41       rcu_gp_kthread_wake(rsp);
+   42   }
+   43 }
+
+But the only part of ``rcu_prepare_for_idle()`` that really matters for
+this discussion are lines 37–39. We will therefore abbreviate this
+function as follows:
+
+.. kernel-figure:: rcu_node-lock.svg
+
+The box represents the ``rcu_node`` structure's ``->lock`` critical
+section, with the double line on top representing the additional
+``smp_mb__after_unlock_lock()``.
+
+Tree RCU Grace Period Memory Ordering Components
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tree RCU's grace-period memory-ordering guarantee is provided by a
+number of RCU components:
+
+#. `Callback Registry`_
+#. `Grace-Period Initialization`_
+#. `Self-Reported Quiescent States`_
+#. `Dynamic Tick Interface`_
+#. `CPU-Hotplug Interface`_
+#. `Forcing Quiescent States`_
+#. `Grace-Period Cleanup`_
+#. `Callback Invocation`_
+
+Each of the following sections looks at the corresponding component in
+detail.
+
+Callback Registry
+^^^^^^^^^^^^^^^^^
+
+If RCU's grace-period guarantee is to mean anything at all, any access
+that happens before a given invocation of ``call_rcu()`` must also
+happen before the corresponding grace period. The implementation of this
+portion of RCU's grace period guarantee is shown in the following
+figure:
+
+.. kernel-figure:: TreeRCU-callback-registry.svg
+
+Because ``call_rcu()`` normally acts only on CPU-local state, it
+provides no ordering guarantees, either for itself or for phase one of
+the update (which again will usually be removal of an element from an
+RCU-protected data structure).
+It simply enqueues the ``rcu_head``
+structure on a per-CPU list, which cannot become associated with a grace
+period until a later call to ``rcu_accelerate_cbs()``, as shown in the
+diagram above.
+
+One set of code paths shown on the left invokes ``rcu_accelerate_cbs()``
+via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the
+current CPU is inundated with queued ``rcu_head`` structures) or more
+likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle
+is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which
+invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The
+final code path on the right is taken only in kernels built with
+``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via
+``rcu_advance_cbs()``, ``rcu_migrate_callbacks``,
+``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn
+is invoked on a surviving CPU after the outgoing CPU has been completely
+offlined.
+
+There are a few other code paths within grace-period processing that
+opportunistically invoke ``rcu_accelerate_cbs()``. However, either way,
+all of the CPU's recently queued ``rcu_head`` structures are associated
+with a future grace-period number under the protection of the CPU's leaf
+``rcu_node`` structure's ``->lock``. In all cases, there is full
+ordering against any prior critical section for that same ``rcu_node``
+structure's ``->lock``, and also full ordering against any of the
+current task's or CPU's prior critical sections for any ``rcu_node``
+structure's ``->lock``.
+
+The next section will show how this ordering ensures that any accesses
+prior to the ``call_rcu()`` (particularly including phase one of the
+update) happen before the start of the corresponding grace period.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But what about ``synchronize_rcu()``?                                 |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, |
+| which invokes it. So either way, it eventually comes down to          |
+| ``call_rcu()``.                                                       |
++-----------------------------------------------------------------------+
+
+Grace-Period Initialization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Grace-period initialization is carried out by the grace-period kernel
+thread, which makes several passes over the ``rcu_node`` tree within the
+``rcu_gp_init()`` function. This means that showing the full flow of
+ordering through the grace-period computation will require duplicating
+this tree. If you find this confusing, please note that the state of the
+``rcu_node`` changes over time, just like Heraclitus's river. However,
+to keep the ``rcu_node`` river tractable, the grace-period kernel
+thread's traversals are presented in multiple parts, starting in this
+section with the various phases of grace-period initialization.
+
+The first ordering-related grace-period initialization action is to
+advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number
+counter, as shown below:
+
+.. kernel-figure:: TreeRCU-gp-init-1.svg
+
+The actual increment is carried out using ``smp_store_release()``, which
+helps reject false-positive RCU CPU stall detection. Note that only the
+root ``rcu_node`` structure is touched.
+
+The first pass through the ``rcu_node`` tree updates bitmasks based on
+CPUs having come online or gone offline since the start of the previous
+grace period. In the common case where the number of online CPUs for
+this ``rcu_node`` structure has not transitioned to or from zero, this
+pass will scan only the leaf ``rcu_node`` structures.
+However, if the
+number of online CPUs for a given leaf ``rcu_node`` structure has
+transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the
+first incoming CPU. Similarly, if the number of online CPUs for a given
+leaf ``rcu_node`` structure has transitioned to zero,
+``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU.
+The diagram below shows the path of ordering if the leftmost
+``rcu_node`` structure onlines its first CPU and if the next
+``rcu_node`` structure has no online CPUs (or, alternatively if the
+leftmost ``rcu_node`` structure offlines its last CPU and if the next
+``rcu_node`` structure has no online CPUs).
+
+.. kernel-figure:: TreeRCU-gp-init-1.svg
+
+The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
+breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
+to the newly advanced value from the ``rcu_state`` structure, as shown
+in the following diagram.
+
+.. kernel-figure:: TreeRCU-gp-init-1.svg
+
+This change will also cause each CPU's next call to
+``__note_gp_changes()`` to notice that a new grace period has started,
+as described in the next section. But because the grace-period kthread
+started the grace period at the root (with the advancing of the
+``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf
+``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of
+the start of the grace period will happen after the actual start of the
+grace period.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But what about the CPU that started the grace period? Why wouldn't it |
+| see the start of the grace period right when it started that grace    |
+| period?                                                               |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| In some deep philosophical and overly anthropomorphized sense, yes,   |
+| the CPU starting the grace period is immediately aware of having done |
+| so. However, if we instead assume that RCU is not self-aware, then    |
+| even the CPU starting the grace period does not really become aware   |
+| of the start of this grace period until its first call to             |
+| ``__note_gp_changes()``. On the other hand, this CPU potentially gets |
+| early notification because it invokes ``__note_gp_changes()`` during  |
+| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node``         |
+| structure.                                                            |
++-----------------------------------------------------------------------+
+
+Self-Reported Quiescent States
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When all entities that might block the grace period have reported
+quiescent states (or as described in a later section, had quiescent
+states reported on their behalf), the grace period can end. Online
+non-idle CPUs report their own quiescent states, as shown in the
+following diagram:
+
+.. kernel-figure:: TreeRCU-qs.svg
+
+This is for the last CPU to report a quiescent state, which signals the
+end of the grace period. Earlier quiescent states would push up the
+``rcu_node`` tree only until they encountered an ``rcu_node`` structure
+that is waiting for additional quiescent states. However, ordering is
+nevertheless preserved because some later quiescent state will acquire
+that ``rcu_node`` structure's ``->lock``.
+
+Any number of events can lead up to a CPU invoking ``note_gp_changes``
+(or alternatively, directly invoking ``__note_gp_changes()``), at which
+point that CPU will notice the start of a new grace period while holding
+its leaf ``rcu_node`` lock. Therefore, all execution shown in this
+diagram happens after the start of the grace period.
+In addition, this
+CPU will consider any RCU read-side critical section that started before
+the invocation of ``__note_gp_changes()`` to have started before the
+grace period, and thus a critical section that the grace period must
+wait on.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But an RCU read-side critical section might have started after the    |
+| beginning of the grace period (the advancing of ``->gp_seq`` from     |
+| earlier), so why should the grace period wait on such a critical      |
+| section?                                                              |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| It is indeed not necessary for the grace period to wait on such a     |
+| critical section. However, it is permissible to wait on it. And it is |
+| furthermore important to wait on it, as this lazy approach is far     |
+| more scalable than a “big bang” all-at-once grace-period start could  |
+| possibly be.                                                          |
++-----------------------------------------------------------------------+
+
+If the CPU does a context switch, a quiescent state will be noted by
+``rcu_note_context_switch()`` on the left. On the other hand, if the CPU
+takes a scheduler-clock interrupt while executing in usermode, a
+quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right.
+Either way, the passage through a quiescent state will be noted in a
+per-CPU variable.
+
+The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for
+example, after the next scheduler-clock interrupt), ``rcu_core()`` will
+invoke ``rcu_check_quiescent_state()``, which will notice the recorded
+quiescent state, and invoke ``rcu_report_qs_rdp()``.
+If ``rcu_report_qs_rdp()`` verifies that the quiescent state really does
+apply to the current grace period, it invokes ``rcu_report_rnp()`` which
+traverses up the ``rcu_node`` tree as shown at the bottom of the
+diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask``
+field, and propagating up the tree when the result is zero.
+
+Note that traversal passes upwards out of a given ``rcu_node`` structure
+only if the current CPU is reporting the last quiescent state for the
+subtree headed by that ``rcu_node`` structure. A key point is that if a
+CPU's traversal stops at a given ``rcu_node`` structure, then there will
+be a later traversal by another CPU (or perhaps the same one) that
+proceeds upwards from that point, and the ``rcu_node`` ``->lock``
+guarantees that the first CPU's quiescent state happens before the
+remainder of the second CPU's traversal. Applying this line of thought
+repeatedly shows that all CPUs' quiescent states happen before the last
+CPU traverses through the root ``rcu_node`` structure, the “last CPU”
+being the one that clears the last bit in the root ``rcu_node``
+structure's ``->qsmask`` field.
+
+Dynamic Tick Interface
+^^^^^^^^^^^^^^^^^^^^^^
+
+Due to energy-efficiency considerations, RCU is forbidden from
+disturbing idle CPUs. CPUs are therefore required to notify RCU when
+entering or leaving idle state, which they do via fully ordered
+value-returning atomic operations on a per-CPU variable. The ordering
+effects are as shown below:
+
+.. kernel-figure:: TreeRCU-dyntick.svg
+
+The RCU grace-period kernel thread samples the per-CPU idleness variable
+while holding the corresponding CPU's leaf ``rcu_node`` structure's
+``->lock``. This means that any RCU read-side critical sections that
+precede the idle period (the oval near the top of the diagram above)
+will happen before the end of the current grace period.
+Similarly, the
+beginning of the current grace period will happen before any RCU
+read-side critical sections that follow the idle period (the oval near
+the bottom of the diagram above).
+
+Plumbing this into the full grace-period execution is described
+`below <#Forcing%20Quiescent%20States>`__.
+
+CPU-Hotplug Interface
+^^^^^^^^^^^^^^^^^^^^^
+
+RCU is also forbidden from disturbing offline CPUs, which might well be
+powered off and removed from the system completely. CPUs are therefore
+required to notify RCU of their comings and goings as part of the
+corresponding CPU hotplug operations. The ordering effects are shown
+below:
+
+.. kernel-figure:: TreeRCU-hotplug.svg
+
+Because CPU hotplug operations are much less frequent than idle
+transitions, they are heavier weight, and thus acquire the CPU's leaf
+``rcu_node`` structure's ``->lock`` and update this structure's
+``->qsmaskinitnext``. The RCU grace-period kernel thread samples this
+mask to detect CPUs having gone offline since the beginning of this
+grace period.
+
+Plumbing this into the full grace-period execution is described
+`below <#Forcing%20Quiescent%20States>`__.
+
+Forcing Quiescent States
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above, idle and offline CPUs cannot report their own quiescent
+states, and therefore the grace-period kernel thread must do the
+reporting on their behalf. This process is called “forcing quiescent
+states”, it is repeated every few jiffies, and its ordering effects are
+shown below:
+
+.. kernel-figure:: TreeRCU-gp-fqs.svg
+
+Each pass of quiescent state forcing is guaranteed to traverse the leaf
+``rcu_node`` structures, and if there are no new quiescent states due to
+recently idled and/or offlined CPUs, then only the leaves are traversed.
+However, if there is a newly offlined CPU as illustrated on the left or
+a newly idled CPU as illustrated on the right, the corresponding
+quiescent state will be driven up towards the root.
+As with
+self-reported quiescent states, the upwards driving stops once it
+reaches an ``rcu_node`` structure that has quiescent states outstanding
+from other CPUs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| The leftmost drive to root stopped before it reached the root         |
+| ``rcu_node`` structure, which means that there are still CPUs         |
+| subordinate to that structure on which the current grace period is    |
+| waiting. Given that, how is it possible that the rightmost drive to   |
+| root ended the grace period?                                          |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| Good analysis! It is in fact impossible in the absence of bugs in     |
+| RCU. But this diagram is complex enough as it is, so simplicity       |
+| overrode accuracy. You can think of it as poetic license, or you can  |
+| think of it as misdirection that is resolved in the                   |
+| `stitched-together diagram <#Putting%20It%20All%20Together>`__.       |
++-----------------------------------------------------------------------+
+
+Grace-Period Cleanup
+^^^^^^^^^^^^^^^^^^^^
+
+Grace-period cleanup first scans the ``rcu_node`` tree breadth-first
+advancing all the ``->gp_seq`` fields, then it advances the
+``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are
+shown below:
+
+.. kernel-figure:: TreeRCU-gp-cleanup.svg
+
+As indicated by the oval at the bottom of the diagram, once grace-period
+cleanup is complete, the next grace period can begin.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But when precisely does the grace period end?                         |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| There is no useful single point at which the grace period can be said |
+| to end. The earliest reasonable candidate is as soon as the last CPU  |
+| has reported its quiescent state, but it may be some milliseconds     |
+| before RCU becomes aware of this. The latest reasonable candidate is  |
+| once the ``rcu_state`` structure's ``->gp_seq`` field has been        |
+| updated, but it is quite possible that some CPUs have already         |
+| completed phase two of their updates by that time. In short, if you   |
+| are going to work with RCU, you need to learn to embrace uncertainty. |
++-----------------------------------------------------------------------+
+
+Callback Invocation
+^^^^^^^^^^^^^^^^^^^
+
+Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has
+been updated, that CPU can begin invoking its RCU callbacks that were
+waiting for this grace period to end. These callbacks are identified by
+``rcu_advance_cbs()``, which is usually invoked by
+``__note_gp_changes()``. As shown in the diagram below, this invocation
+can be triggered by the scheduling-clock interrupt
+(``rcu_sched_clock_irq()`` on the left) or by idle entry
+(``rcu_cleanup_after_idle()`` on the right, but only for kernels built
+with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is
+raised, which results in ``rcu_do_batch()`` invoking the callbacks,
+which in turn allows those callbacks to carry out (either directly or
+indirectly via wakeup) the needed phase-two processing for each update.
+
+.. kernel-figure:: TreeRCU-callback-invocation.svg
+
+Please note that callback invocation can also be prompted by any number
+of corner-case code paths, for example, when a CPU notes that it has
+excessive numbers of callbacks queued.
+In all cases, the CPU acquires
+its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks,
+which preserves the required ordering against the newly completed grace
+period.
+
+However, if the callback function communicates to other CPUs, for
+example, doing a wakeup, then it is that function's responsibility to
+maintain ordering. For example, if the callback function wakes up a task
+that runs on some other CPU, proper ordering must be in place in both the
+callback function and the task being awakened. To see why this is
+important, consider the top half of the `grace-period
+cleanup <#Grace-Period%20Cleanup>`__ diagram. The callback might be
+running on a CPU corresponding to the leftmost leaf ``rcu_node``
+structure, and awaken a task that is to run on a CPU corresponding to
+the rightmost leaf ``rcu_node`` structure, and the grace-period kernel
+thread might not yet have reached the rightmost leaf. In this case, the
+grace period's memory ordering might not yet have reached that CPU, so
+again the callback function and the awakened task must supply proper
+ordering.
+
+Putting It All Together
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A stitched-together diagram is here:
+
+.. kernel-figure:: TreeRCU-gp.svg
+
+Legal Statement
+~~~~~~~~~~~~~~~
+
+This work represents the view of the author and does not necessarily
+represent the view of IBM.
+
+Linux is a registered trademark of Linus Torvalds.
+
+Other company, product, and service names may be trademarks or service
+marks of others.
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
index 2bcd742d6e49..069f6f8371c2 100644
--- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
@@ -3880,7 +3880,7 @@
 font-style="normal"
 y="-4418.6582"
 x="3745.7725"
- xml:space="preserve">rcu_node_context_switch()</text>
+ xml:space="preserve">rcu_note_context_switch()</text>
 </g>
 <g
 transform="translate(1881.1886,54048.57)"
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
index 779c9ac31a52..7d6c5f7e505c 100644
--- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
@@ -753,7 +753,7 @@
 font-style="normal"
 y="-4418.6582"
 x="3745.7725"
- xml:space="preserve">rcu_node_context_switch()</text>
+ xml:space="preserve">rcu_note_context_switch()</text>
 </g>
 <g
 transform="translate(3131.2648,-585.6713)"
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
deleted file mode 100644
index 467251f7fef6..000000000000
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ /dev/null
@@ -1,3401 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <head><title>A Tour Through RCU's Requirements [LWN.net]</title>
- <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
-
-<h1>A Tour Through RCU's Requirements</h1>
-
-<p>Copyright IBM Corporation, 2015</p>
-<p>Author: Paul E. McKenney</p>
-<p><i>The initial version of this document appeared in the
-<a href="https://lwn.net/">LWN</a> articles
-<a href="https://lwn.net/Articles/652156/">here</a>,
-<a href="https://lwn.net/Articles/652677/">here</a>, and
-<a href="https://lwn.net/Articles/653326/">here</a>.</i></p>
-
-<h2>Introduction</h2>
-
-<p>
-Read-copy update (RCU) is a synchronization mechanism that is often
-used as a replacement for reader-writer locking.
-RCU is unusual in that updaters do not block readers,
-which means that RCU's read-side primitives can be exceedingly fast
-and scalable.
-In addition, updaters can make useful forward progress concurrently
-with readers.
-However, all this concurrency between RCU readers and updaters does raise
-the question of exactly what RCU readers are doing, which in turn
-raises the question of exactly what RCU's requirements are.
-
-<p>
-This document therefore summarizes RCU's requirements, and can be thought
-of as an informal, high-level specification for RCU.
-It is important to understand that RCU's specification is primarily
-empirical in nature;
-in fact, I learned about many of these requirements the hard way.
-This situation might cause some consternation, however, not only
-has this learning process been a lot of fun, but it has also been
-a great privilege to work with so many people willing to apply
-technologies in interesting new ways.
-
-<p>
-All that aside, here are the categories of currently known RCU requirements:
-</p>
-
-<ol>
-<li> <a href="#Fundamental Requirements">
- Fundamental Requirements</a>
-<li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a>
-<li> <a href="#Parallelism Facts of Life">
- Parallelism Facts of Life</a>
-<li> <a href="#Quality-of-Implementation Requirements">
- Quality-of-Implementation Requirements</a>
-<li> <a href="#Linux Kernel Complications">
- Linux Kernel Complications</a>
-<li> <a href="#Software-Engineering Requirements">
- Software-Engineering Requirements</a>
-<li> <a href="#Other RCU Flavors">
- Other RCU Flavors</a>
-<li> <a href="#Possible Future Changes">
- Possible Future Changes</a>
-</ol>
-
-<p>
-This is followed by a <a href="#Summary">summary</a>,
-however, the answers to each quick quiz immediately follows the quiz.
-Select the big white space with your mouse to see the answer.
-
-<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2>
-
-<p>
-RCU's fundamental requirements are the closest thing RCU has to hard
-mathematical requirements.
-These are:
-
-<ol>
-<li> <a href="#Grace-Period Guarantee">
- Grace-Period Guarantee</a>
-<li> <a href="#Publish-Subscribe Guarantee">
- Publish-Subscribe Guarantee</a>
-<li> <a href="#Memory-Barrier Guarantees">
- Memory-Barrier Guarantees</a>
-<li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally">
- RCU Primitives Guaranteed to Execute Unconditionally</a>
-<li> <a href="#Guaranteed Read-to-Write Upgrade">
- Guaranteed Read-to-Write Upgrade</a>
-</ol>
-
-<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3>
-
-<p>
-RCU's grace-period guarantee is unusual in being premeditated:
-Jack Slingwine and I had this guarantee firmly in mind when we started
-work on RCU (then called “rclock”) in the early 1990s.
-That said, the past two decades of experience with RCU have produced
-a much more detailed understanding of this guarantee.
-
-<p>
-RCU's grace-period guarantee allows updaters to wait for the completion
-of all pre-existing RCU read-side critical sections.
-An RCU read-side critical section
-begins with the marker <tt>rcu_read_lock()</tt> and ends with
-the marker <tt>rcu_read_unlock()</tt>.
-These markers may be nested, and RCU treats a nested set as one
-big RCU read-side critical section.
-Production-quality implementations of <tt>rcu_read_lock()</tt> and
-<tt>rcu_read_unlock()</tt> are extremely lightweight, and in
-fact have exactly zero overhead in Linux kernels built for production
-use with <tt>CONFIG_PREEMPT=n</tt>.
-
-<p>
-This guarantee allows ordering to be enforced with extremely low
-overhead to readers, for example:
-
-<blockquote>
-<pre>
- 1 int x, y;
- 2
- 3 void thread0(void)
- 4 {
- 5   rcu_read_lock();
- 6   r1 = READ_ONCE(x);
- 7   r2 = READ_ONCE(y);
- 8   rcu_read_unlock();
- 9 }
-10
-11 void thread1(void)
-12 {
-13   WRITE_ONCE(x, 1);
-14   synchronize_rcu();
-15   WRITE_ONCE(y, 1);
-16 }
-</pre>
-</blockquote>
-
-<p>
-Because the <tt>synchronize_rcu()</tt> on line 14 waits for
-all pre-existing readers, any instance of <tt>thread0()</tt> that
-loads a value of zero from <tt>x</tt> must complete before
-<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must
-also load a value of zero from <tt>y</tt>.
-Similarly, any instance of <tt>thread0()</tt> that loads a value of
-one from <tt>y</tt> must have started after the
-<tt>synchronize_rcu()</tt> started, and must therefore also load
-a value of one from <tt>x</tt>.
-Therefore, the outcome:
-<blockquote>
-<pre>
-(r1 == 0 && r2 == 1)
-</pre>
-</blockquote>
-cannot happen.
-
-<table>
-<tr><th> </th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- Wait a minute!
- You said that updaters can make useful forward progress concurrently
- with readers, but pre-existing readers will block
- <tt>synchronize_rcu()</tt>!!!
- Just who are you trying to fool???
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- First, if updaters do not wish to be blocked by readers, they can use
- <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will
- be discussed later.
- Second, even when using <tt>synchronize_rcu()</tt>, the other
- update-side code does run concurrently with readers, whether
- pre-existing or not.
-</font></td></tr>
-<tr><td> </td></tr>
-</table>
-
-<p>
-This scenario resembles one of the first uses of RCU in
-<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>,
-which managed a distributed lock manager's transition into
-a state suitable for handling recovery from node failure,
-more or less as follows:
-
-<blockquote>
-<pre>
- 1 #define STATE_NORMAL        0
- 2 #define STATE_WANT_RECOVERY 1
- 3 #define STATE_RECOVERING    2
- 4 #define STATE_WANT_NORMAL   3
- 5
- 6 int state = STATE_NORMAL;
- 7
- 8 void do_something_dlm(void)
- 9 {
-10   int state_snap;
-11
-12   rcu_read_lock();
-13   state_snap = READ_ONCE(state);
-14   if (state_snap == STATE_NORMAL)
-15     do_something();
-16   else
-17     do_something_carefully();
-18   rcu_read_unlock();
-19 }
-20
-21 void start_recovery(void)
-22 {
-23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
-24   synchronize_rcu();
-25   WRITE_ONCE(state, STATE_RECOVERING);
-26   recovery();
-27   WRITE_ONCE(state, STATE_WANT_NORMAL);
-28   synchronize_rcu();
-29   WRITE_ONCE(state, STATE_NORMAL);
-30 }
-</pre>
-</blockquote>
-
-<p>
-The RCU read-side critical section in <tt>do_something_dlm()</tt>
-works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt>
-to guarantee that <tt>do_something()</tt> never runs concurrently
-with <tt>recovery()</tt>, but with little or no synchronization
-overhead in <tt>do_something_dlm()</tt>.
-
-<table>
-<tr><th> </th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- Why is the <tt>synchronize_rcu()</tt> on line 28 needed?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- Without that extra grace period, memory reordering could result in
- <tt>do_something_dlm()</tt> executing <tt>do_something()</tt>
- concurrently with the last bits of <tt>recovery()</tt>.
-</font></td></tr>
-<tr><td> </td></tr>
-</table>
-
-<p>
-In order to avoid fatal problems such as deadlocks,
-an RCU read-side critical section must not contain calls to
-<tt>synchronize_rcu()</tt>.
-Similarly, an RCU read-side critical section must not
-contain anything that waits, directly or indirectly, on completion of
-an invocation of <tt>synchronize_rcu()</tt>.
-
-<p>
-Although RCU's grace-period guarantee is useful in and of itself, with
-<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>,
-it would be good to be able to use RCU to coordinate read-side
-access to linked data structures.
-For this, the grace-period guarantee is not sufficient, as can
-be seen in function <tt>add_gp_buggy()</tt> below.
-We will look at the reader's code later, but in the meantime, just think of
-the reader as locklessly picking up the <tt>gp</tt> pointer,
-and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the
-<tt>->a</tt> and <tt>->b</tt> fields.
- -<blockquote> -<pre> - 1 bool add_gp_buggy(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 p->a = a; -12 p->b = a; -13 gp = p; /* ORDERING BUG */ -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -The problem is that both the compiler and weakly ordered CPUs are within -their rights to reorder this code as follows: - -<blockquote> -<pre> - 1 bool add_gp_buggy_optimized(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -<b>11 gp = p; /* ORDERING BUG */ -12 p->a = a; -13 p->b = a;</b> -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -If an RCU reader fetches <tt>gp</tt> just after -<tt>add_gp_buggy_optimized</tt> executes line 11, -it will see garbage in the <tt>->a</tt> and <tt>->b</tt> -fields. -And this is but one of many ways in which compiler and hardware optimizations -could cause trouble. -Therefore, we clearly need some way to prevent the compiler and the CPU from -reordering in this manner, which brings us to the publish-subscribe -guarantee discussed in the next section. - -<h3><a name="Publish-Subscribe Guarantee">Publish/Subscribe Guarantee</a></h3> - -<p> -RCU's publish-subscribe guarantee allows data to be inserted -into a linked data structure without disrupting RCU readers. -The updater uses <tt>rcu_assign_pointer()</tt> to insert the -new data, and readers use <tt>rcu_dereference()</tt> to -access data, whether new or old. 
-The following shows an example of insertion: - -<blockquote> -<pre> - 1 bool add_gp(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 p->a = a; -12 p->b = a; -13 rcu_assign_pointer(gp, p); -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually -equivalent to a simple assignment statement, but also guarantees -that its assignment will -happen after the two assignments in lines 11 and 12, -similar to the C11 <tt>memory_order_release</tt> store operation. -It also prevents any number of “interesting” compiler -optimizations, for example, the use of <tt>gp</tt> as a scratch -location immediately preceding the assignment. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But <tt>rcu_assign_pointer()</tt> does nothing to prevent the - two assignments to <tt>p->a</tt> and <tt>p->b</tt> - from being reordered. - Can't that also cause problems? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No, it cannot. - The readers cannot see either of these two fields until - the assignment to <tt>gp</tt>, by which time both fields are - fully initialized. - So reordering the assignments - to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly - cause any problems. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -It is tempting to assume that the reader need not do anything special -to control its accesses to the RCU-protected data, -as shown in <tt>do_something_gp_buggy()</tt> below: - -<blockquote> -<pre> - 1 bool do_something_gp_buggy(void) - 2 { - 3 rcu_read_lock(); - 4 p = gp; /* OPTIMIZATIONS GALORE!!! 
*/ - 5 if (p) { - 6 do_something(p->a, p->b); - 7 rcu_read_unlock(); - 8 return true; - 9 } -10 rcu_read_unlock(); -11 return false; -12 } -</pre> -</blockquote> - -<p> -However, this temptation must be resisted because there are a -surprisingly large number of ways that the compiler -(to say nothing of -<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>) -can trip this code up. -For but one example, if the compiler were short of registers, it -might choose to refetch from <tt>gp</tt> rather than keeping -a separate copy in <tt>p</tt> as follows: - -<blockquote> -<pre> - 1 bool do_something_gp_buggy_optimized(void) - 2 { - 3 rcu_read_lock(); - 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ -<b> 5 do_something(gp->a, gp->b);</b> - 6 rcu_read_unlock(); - 7 return true; - 8 } - 9 rcu_read_unlock(); -10 return false; -11 } -</pre> -</blockquote> - -<p> -If this function ran concurrently with a series of updates that -replaced the current structure with a new one, -the fetches of <tt>gp->a</tt> -and <tt>gp->b</tt> might well come from two different structures, -which could cause serious confusion. -To prevent this (and much else besides), <tt>do_something_gp()</tt> uses -<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>: - -<blockquote> -<pre> - 1 bool do_something_gp(void) - 2 { - 3 rcu_read_lock(); - 4 p = rcu_dereference(gp); - 5 if (p) { - 6 do_something(p->a, p->b); - 7 rcu_read_unlock(); - 8 return true; - 9 } -10 rcu_read_unlock(); -11 return false; -12 } -</pre> -</blockquote> - -<p> -The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha) -memory barriers in the Linux kernel. -Should a -<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a> -ever appear, then <tt>rcu_dereference()</tt> could be implemented -as a <tt>memory_order_consume</tt> load. 
-Regardless of the exact implementation, a pointer fetched by -<tt>rcu_dereference()</tt> may not be used outside of the -outermost RCU read-side critical section containing that -<tt>rcu_dereference()</tt>, unless protection of -the corresponding data element has been passed from RCU to some -other synchronization mechanism, most commonly locking or -<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>. - -<p> -In short, updaters use <tt>rcu_assign_pointer()</tt> and readers -use <tt>rcu_dereference()</tt>, and these two RCU API elements -work together to ensure that readers have a consistent view of -newly added data elements. - -<p> -Of course, it is also necessary to remove elements from RCU-protected -data structures, for example, using the following process: - -<ol> -<li> Remove the data element from the enclosing structure. -<li> Wait for all pre-existing RCU read-side critical sections - to complete (because only pre-existing readers can possibly have - a reference to the newly removed data element). -<li> At this point, only the updater has a reference to the - newly removed data element, so it can safely reclaim - the data element, for example, by passing it to <tt>kfree()</tt>. -</ol> - -This process is implemented by <tt>remove_gp_synchronous()</tt>: - -<blockquote> -<pre> - 1 bool remove_gp_synchronous(void) - 2 { - 3 struct foo *p; - 4 - 5 spin_lock(&gp_lock); - 6 p = rcu_access_pointer(gp); - 7 if (!p) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 rcu_assign_pointer(gp, NULL); -12 spin_unlock(&gp_lock); -13 synchronize_rcu(); -14 kfree(p); -15 return true; -16 } -</pre> -</blockquote> - -<p> -This function is straightforward, with line 13 waiting for a grace -period before line 14 frees the old data element. -This waiting ensures that readers will reach line 7 of -<tt>do_something_gp()</tt> before the data element referenced by -<tt>p</tt> is freed. 
-The <tt>rcu_access_pointer()</tt> on line 6 is similar to -<tt>rcu_dereference()</tt>, except that: - -<ol> -<li> The value returned by <tt>rcu_access_pointer()</tt> - cannot be dereferenced. - If you want to access the value pointed to as well as - the pointer itself, use <tt>rcu_dereference()</tt> - instead of <tt>rcu_access_pointer()</tt>. -<li> The call to <tt>rcu_access_pointer()</tt> need not be - protected. - In contrast, <tt>rcu_dereference()</tt> must either be - within an RCU read-side critical section or in a code - segment where the pointer cannot change, for example, in - code protected by the corresponding update-side lock. -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Without the <tt>rcu_dereference()</tt> or the - <tt>rcu_access_pointer()</tt>, what destructive optimizations - might the compiler make use of? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Let's start with what happens to <tt>do_something_gp()</tt> - if it fails to use <tt>rcu_dereference()</tt>. - It could reuse a value formerly fetched from this same pointer. - It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time - manner, resulting in <i>load tearing</i>, in turn resulting a bytewise - mash-up of two distinct pointer values. - It might even use value-speculation optimizations, where it makes - a wrong guess, but by the time it gets around to checking the - value, an update has changed the pointer to match the wrong guess. - Too bad about any dereferences that returned pre-initialization garbage - in the meantime! - </font> - - <p><font color="ffffff"> - For <tt>remove_gp_synchronous()</tt>, as long as all modifications - to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>, - the above optimizations are harmless. 
- However, <tt>sparse</tt> will complain if you - define <tt>gp</tt> with <tt>__rcu</tt> and then - access it without using - either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -In short, RCU's publish-subscribe guarantee is provided by the combination -of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>. -This guarantee allows data elements to be safely added to RCU-protected -linked data structures without disrupting RCU readers. -This guarantee can be used in combination with the grace-period -guarantee to also allow data elements to be removed from RCU-protected -linked data structures, again without disrupting RCU readers. - -<p> -This guarantee was only partially premeditated. -DYNIX/ptx used an explicit memory barrier for publication, but had nothing -resembling <tt>rcu_dereference()</tt> for subscription, nor did it -have anything resembling the <tt>smp_read_barrier_depends()</tt> -that was later subsumed into <tt>rcu_dereference()</tt> and later -still into <tt>READ_ONCE()</tt>. -The need for these operations made itself known quite suddenly at a -late-1990s meeting with the DEC Alpha architects, back in the days when -DEC was still a free-standing company. -It took the Alpha architects a good hour to convince me that any sort -of barrier would ever be needed, and it then took me a good <i>two</i> hours -to convince them that their documentation did not make this point clear. -More recent work with the C and C++ standards committees have provided -much education on tricks and traps from the compiler. -In short, compilers were much less tricky in the early 1990s, but in -2015, don't even think about omitting <tt>rcu_dereference()</tt>! 
- -<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3> - -<p> -The previous section's simple linked-data-structure scenario clearly -demonstrates the need for RCU's stringent memory-ordering guarantees on -systems with more than one CPU: - -<ol> -<li> Each CPU that has an RCU read-side critical section that - begins before <tt>synchronize_rcu()</tt> starts is - guaranteed to execute a full memory barrier between the time - that the RCU read-side critical section ends and the time that - <tt>synchronize_rcu()</tt> returns. - Without this guarantee, a pre-existing RCU read-side critical section - might hold a reference to the newly removed <tt>struct foo</tt> - after the <tt>kfree()</tt> on line 14 of - <tt>remove_gp_synchronous()</tt>. -<li> Each CPU that has an RCU read-side critical section that ends - after <tt>synchronize_rcu()</tt> returns is guaranteed - to execute a full memory barrier between the time that - <tt>synchronize_rcu()</tt> begins and the time that the RCU - read-side critical section begins. - Without this guarantee, a later RCU read-side critical section - running after the <tt>kfree()</tt> on line 14 of - <tt>remove_gp_synchronous()</tt> might - later run <tt>do_something_gp()</tt> and find the - newly deleted <tt>struct foo</tt>. -<li> If the task invoking <tt>synchronize_rcu()</tt> remains - on a given CPU, then that CPU is guaranteed to execute a full - memory barrier sometime during the execution of - <tt>synchronize_rcu()</tt>. - This guarantee ensures that the <tt>kfree()</tt> on - line 14 of <tt>remove_gp_synchronous()</tt> really does - execute after the removal on line 11. -<li> If the task invoking <tt>synchronize_rcu()</tt> migrates - among a group of CPUs during that invocation, then each of the - CPUs in that group is guaranteed to execute a full memory barrier - sometime during the execution of <tt>synchronize_rcu()</tt>. 
- This guarantee also ensures that the <tt>kfree()</tt> on - line 14 of <tt>remove_gp_synchronous()</tt> really does - execute after the removal on - line 11, but also in the case where the thread executing the - <tt>synchronize_rcu()</tt> migrates in the meantime. -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Given that multiple CPUs can start RCU read-side critical sections - at any time without any ordering whatsoever, how can RCU possibly - tell whether or not a given RCU read-side critical section starts - before a given instance of <tt>synchronize_rcu()</tt>? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - If RCU cannot tell whether or not a given - RCU read-side critical section starts before a - given instance of <tt>synchronize_rcu()</tt>, - then it must assume that the RCU read-side critical section - started first. - In other words, a given instance of <tt>synchronize_rcu()</tt> - can avoid waiting on a given RCU read-side critical section only - if it can prove that <tt>synchronize_rcu()</tt> started first. - </font> - - <p><font color="ffffff"> - A related question is “When <tt>rcu_read_lock()</tt> - doesn't generate any code, why does it matter how it relates - to a grace period?” - The answer is that it is not the relationship of - <tt>rcu_read_lock()</tt> itself that is important, but rather - the relationship of the code within the enclosed RCU read-side - critical section to the code preceding and following the - grace period. - If we take this viewpoint, then a given RCU read-side critical - section begins before a given grace period when some access - preceding the grace period observes the effect of some access - within the critical section, in which case none of the accesses - within the critical section may observe the effects of any - access following the grace period. 
- </font> - - <p><font color="ffffff"> - As of late 2016, mathematical models of RCU take this - viewpoint, for example, see slides 62 and 63 - of the - <a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a> - presentation. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - The first and second guarantees require unbelievably strict ordering! - Are all these memory barriers <i> really</i> required? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Yes, they really are required. - To see why the first guarantee is required, consider the following - sequence of events: - </font> - - <ol> - <li> <font color="ffffff"> - CPU 1: <tt>rcu_read_lock()</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>q = rcu_dereference(gp); - /* Very likely to return p. */</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>list_del_rcu(p);</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>synchronize_rcu()</tt> starts. - </font> - <li> <font color="ffffff"> - CPU 1: <tt>do_something_with(q->a); - /* No smp_mb(), so might happen after kfree(). */</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>rcu_read_unlock()</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>synchronize_rcu()</tt> returns. - </font> - <li> <font color="ffffff"> - CPU 0: <tt>kfree(p);</tt> - </font> - </ol> - - <p><font color="ffffff"> - Therefore, there absolutely must be a full memory barrier between the - end of the RCU read-side critical section and the end of the - grace period. - </font> - - <p><font color="ffffff"> - The sequence of events demonstrating the necessity of the second rule - is roughly similar: - </font> - - <ol> - <li> <font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt> - </font> - <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts. 
- </font> - <li> <font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt> - </font> - <li> <font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp); - /* Might return p if no memory barrier. */</tt> - </font> - <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns. - </font> - <li> <font color="ffffff">CPU 0: <tt>kfree(p);</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt> - </font> - <li> <font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt> - </font> - </ol> - - <p><font color="ffffff"> - And similarly, without a memory barrier between the beginning of the - grace period and the beginning of the RCU read-side critical section, - CPU 1 might end up accessing the freelist. - </font> - - <p><font color="ffffff"> - The “as if” rule of course applies, so that any - implementation that acts as if the appropriate memory barriers - were in place is a correct implementation. - That said, it is much easier to fool yourself into believing - that you have adhered to the as-if rule than it is to actually - adhere to it! -</font></td></tr> -<tr><td> </td></tr> -</table> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> - generate absolutely no code in some kernel builds. - This means that the compiler might arbitrarily rearrange consecutive - RCU read-side critical sections. - Given such rearrangement, if a given RCU read-side critical section - is done, how can you be sure that all prior RCU read-side critical - sections are done? - Won't the compiler rearrangements make that impossible to determine? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> - generate absolutely no code, RCU infers quiescent states only at - special locations, for example, within the scheduler. 
- Because calls to <tt>schedule()</tt> had better prevent calling-code - accesses to shared variables from being rearranged across the call to - <tt>schedule()</tt>, if RCU detects the end of a given RCU read-side - critical section, it will necessarily detect the end of all prior - RCU read-side critical sections, no matter how aggressively the - compiler scrambles the code. - </font> - - <p><font color="ffffff"> - Again, this all assumes that the compiler cannot scramble code across - calls to the scheduler, out of interrupt handlers, into the idle loop, - into user-mode code, and so on. - But if your kernel build allows that sort of scrambling, you have broken - far more than just RCU! -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -Note that these memory-barrier requirements do not replace the fundamental -RCU requirement that a grace period wait for all pre-existing readers. -On the contrary, the memory barriers called out in this section must operate in -such a way as to <i>enforce</i> this fundamental requirement. -Of course, different implementations enforce this requirement in different -ways, but enforce it they must. - -<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3> - -<p> -The common-case RCU primitives are unconditional. -They are invoked, they do their job, and they return, with no possibility -of error, and no need to retry. -This is a key RCU design philosophy. - -<p> -However, this philosophy is pragmatic rather than pigheaded. -If someone comes up with a good justification for a particular conditional -RCU primitive, it might well be implemented and added. -After all, this guarantee was reverse-engineered, not premeditated. -The unconditional nature of the RCU primitives was initially an -accident of implementation, and later experience with synchronization -primitives with conditional primitives caused me to elevate this -accident to a guarantee. 
-Therefore, the justification for adding a conditional primitive to -RCU would need to be based on detailed and compelling use cases. - -<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3> - -<p> -As far as RCU is concerned, it is always possible to carry out an -update within an RCU read-side critical section. -For example, that RCU read-side critical section might search for -a given data element, and then might acquire the update-side -spinlock in order to update that element, all while remaining -in that RCU read-side critical section. -Of course, it is necessary to exit the RCU read-side critical section -before invoking <tt>synchronize_rcu()</tt>, however, this -inconvenience can be avoided through use of the -<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members -described later in this document. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But how does the upgrade-to-write operation exclude other readers? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - It doesn't, just like normal RCU updates, which also do not exclude - RCU readers. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -This guarantee allows lookup code to be shared between read-side -and update-side code, and was premeditated, appearing in the earliest -DYNIX/ptx RCU documentation. - -<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2> - -<p> -RCU provides extremely lightweight readers, and its read-side guarantees, -though quite useful, are correspondingly lightweight. -It is therefore all too easy to assume that RCU is guaranteeing more -than it really is. -Of course, the list of things that RCU does not guarantee is infinitely -long, however, the following sections list a few non-guarantees that -have caused confusion. -Except where otherwise noted, these non-guarantees were premeditated. 
- -<ol> -<li> <a href="#Readers Impose Minimal Ordering"> - Readers Impose Minimal Ordering</a> -<li> <a href="#Readers Do Not Exclude Updaters"> - Readers Do Not Exclude Updaters</a> -<li> <a href="#Updaters Only Wait For Old Readers"> - Updaters Only Wait For Old Readers</a> -<li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections"> - Grace Periods Don't Partition Read-Side Critical Sections</a> -<li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods"> - Read-Side Critical Sections Don't Partition Grace Periods</a> -</ol> - -<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3> - -<p> -Reader-side markers such as <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees -except through their interaction with the grace-period APIs such as -<tt>synchronize_rcu()</tt>. -To see this, consider the following pair of threads: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 WRITE_ONCE(x, 1); - 5 rcu_read_unlock(); - 6 rcu_read_lock(); - 7 WRITE_ONCE(y, 1); - 8 rcu_read_unlock(); - 9 } -10 -11 void thread1(void) -12 { -13 rcu_read_lock(); -14 r1 = READ_ONCE(y); -15 rcu_read_unlock(); -16 rcu_read_lock(); -17 r2 = READ_ONCE(x); -18 rcu_read_unlock(); -19 } -</pre> -</blockquote> - -<p> -After <tt>thread0()</tt> and <tt>thread1()</tt> execute -concurrently, it is quite possible to have - -<blockquote> -<pre> -(r1 == 1 && r2 == 0) -</pre> -</blockquote> - -(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>), -which would not be possible if <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> had much in the way of ordering -properties. -But they do not, so the CPU is within its rights -to do significant reordering. -This is by design: Any significant ordering constraints would slow down -these fast-path APIs. 
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Can't the compiler also reorder this code? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No, the volatile casts in <tt>READ_ONCE()</tt> and - <tt>WRITE_ONCE()</tt> prevent the compiler from reordering in - this particular case. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3> - -<p> -Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt> -exclude updates. -All they do is to prevent grace periods from ending. -The following example illustrates this: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 r1 = READ_ONCE(y); - 5 if (r1) { - 6 do_something_with_nonzero_x(); - 7 r2 = READ_ONCE(x); - 8 WARN_ON(!r2); /* BUG!!! */ - 9 } -10 rcu_read_unlock(); -11 } -12 -13 void thread1(void) -14 { -15 spin_lock(&my_lock); -16 WRITE_ONCE(x, 1); -17 WRITE_ONCE(y, 1); -18 spin_unlock(&my_lock); -19 } -</pre> -</blockquote> - -<p> -If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt> -excluded the <tt>thread1()</tt> function's update, -the <tt>WARN_ON()</tt> could never fire. -But the fact is that <tt>rcu_read_lock()</tt> does not exclude -much of anything aside from subsequent grace periods, of which -<tt>thread1()</tt> has none, so the -<tt>WARN_ON()</tt> can and does fire. - -<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3> - -<p> -It might be tempting to assume that after <tt>synchronize_rcu()</tt> -completes, there are no readers executing. -This temptation must be avoided because -new readers can start immediately after <tt>synchronize_rcu()</tt> -starts, and <tt>synchronize_rcu()</tt> is under no -obligation to wait for these new readers. 
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Suppose that synchronize_rcu() did wait until <i>all</i> - readers had completed instead of waiting only on - pre-existing readers. - For how long would the updater be able to rely on there - being no readers? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - For no time at all. - Even if <tt>synchronize_rcu()</tt> were to wait until - all readers had completed, a new reader might start immediately after - <tt>synchronize_rcu()</tt> completed. - Therefore, the code following - <tt>synchronize_rcu()</tt> can <i>never</i> rely on there being - no readers. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections"> -Grace Periods Don't Partition Read-Side Critical Sections</a></h3> - -<p> -It is tempting to assume that if any part of one RCU read-side critical -section precedes a given grace period, and if any part of another RCU -read-side critical section follows that same grace period, then all of -the first RCU read-side critical section must precede all of the second. -However, this just isn't the case: A single grace period does not -partition the set of RCU read-side critical sections. -An example of this situation can be illustrated as follows, where -<tt>x</tt>, <tt>y</tt>, and <tt>z</tt> are initially all zero: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 WRITE_ONCE(a, 1); - 5 WRITE_ONCE(b, 1); - 6 rcu_read_unlock(); - 7 } - 8 - 9 void thread1(void) -10 { -11 r1 = READ_ONCE(a); -12 synchronize_rcu(); -13 WRITE_ONCE(c, 1); -14 } -15 -16 void thread2(void) -17 { -18 rcu_read_lock(); -19 r2 = READ_ONCE(b); -20 r3 = READ_ONCE(c); -21 rcu_read_unlock(); -22 } -</pre> -</blockquote> - -<p> -It turns out that the outcome: - -<blockquote> -<pre> -(r1 == 1 && r2 == 0 && r3 == 1) -</pre> -</blockquote> - -is entirely possible. 
-The following figure show how this can happen, with each circled -<tt>QS</tt> indicating the point at which RCU recorded a -<i>quiescent state</i> for each thread, that is, a state in which -RCU knows that the thread cannot be in the midst of an RCU read-side -critical section that started before the current grace period: - -<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p> - -<p> -If it is necessary to partition RCU read-side critical sections in this -manner, it is necessary to use two grace periods, where the first -grace period is known to end before the second grace period starts: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 WRITE_ONCE(a, 1); - 5 WRITE_ONCE(b, 1); - 6 rcu_read_unlock(); - 7 } - 8 - 9 void thread1(void) -10 { -11 r1 = READ_ONCE(a); -12 synchronize_rcu(); -13 WRITE_ONCE(c, 1); -14 } -15 -16 void thread2(void) -17 { -18 r2 = READ_ONCE(c); -19 synchronize_rcu(); -20 WRITE_ONCE(d, 1); -21 } -22 -23 void thread3(void) -24 { -25 rcu_read_lock(); -26 r3 = READ_ONCE(b); -27 r4 = READ_ONCE(d); -28 rcu_read_unlock(); -29 } -</pre> -</blockquote> - -<p> -Here, if <tt>(r1 == 1)</tt>, then -<tt>thread0()</tt>'s write to <tt>b</tt> must happen -before the end of <tt>thread1()</tt>'s grace period. -If in addition <tt>(r4 == 1)</tt>, then -<tt>thread3()</tt>'s read from <tt>b</tt> must happen -after the beginning of <tt>thread2()</tt>'s grace period. -If it is also the case that <tt>(r2 == 1)</tt>, then the -end of <tt>thread1()</tt>'s grace period must precede the -beginning of <tt>thread2()</tt>'s grace period. -This mean that the two RCU read-side critical sections cannot overlap, -guaranteeing that <tt>(r3 == 1)</tt>. -As a result, the outcome: - -<blockquote> -<pre> -(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) -</pre> -</blockquote> - -cannot happen. - -<p> -This non-requirement was also non-premeditated, but became apparent -when studying RCU's interaction with memory ordering. 
- -<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods"> -Read-Side Critical Sections Don't Partition Grace Periods</a></h3> - -<p> -It is also tempting to assume that if an RCU read-side critical section -happens between a pair of grace periods, then those grace periods cannot -overlap. -However, this temptation leads nowhere good, as can be illustrated by -the following, with all variables initially zero: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 WRITE_ONCE(a, 1); - 5 WRITE_ONCE(b, 1); - 6 rcu_read_unlock(); - 7 } - 8 - 9 void thread1(void) -10 { -11 r1 = READ_ONCE(a); -12 synchronize_rcu(); -13 WRITE_ONCE(c, 1); -14 } -15 -16 void thread2(void) -17 { -18 rcu_read_lock(); -19 WRITE_ONCE(d, 1); -20 r2 = READ_ONCE(c); -21 rcu_read_unlock(); -22 } -23 -24 void thread3(void) -25 { -26 r3 = READ_ONCE(d); -27 synchronize_rcu(); -28 WRITE_ONCE(e, 1); -29 } -30 -31 void thread4(void) -32 { -33 rcu_read_lock(); -34 r4 = READ_ONCE(b); -35 r5 = READ_ONCE(e); -36 rcu_read_unlock(); -37 } -</pre> -</blockquote> - -<p> -In this case, the outcome: - -<blockquote> -<pre> -(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) -</pre> -</blockquote> - -is entirely possible, as illustrated below: - -<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p> - -<p> -Again, an RCU read-side critical section can overlap almost all of a -given grace period, just so long as it does not overlap the entire -grace period. -As a result, an RCU read-side critical section cannot partition a pair -of RCU grace periods. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - How long a sequence of grace periods, each separated by an RCU - read-side critical section, would be required to partition the RCU - read-side critical sections at the beginning and end of the chain? 
-</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In theory, an infinite number. - In practice, an unknown number that is sensitive to both implementation - details and timing considerations. - Therefore, even in practice, RCU users must abide by the - theoretical rather than the practical answer. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2> - -<p> -These parallelism facts of life are by no means specific to RCU, but -the RCU implementation must abide by them. -They therefore bear repeating: - -<ol> -<li> Any CPU or task may be delayed at any time, - and any attempts to avoid these delays by disabling - preemption, interrupts, or whatever are completely futile. - This is most obvious in preemptible user-level - environments and in virtualized environments (where - a given guest OS's VCPUs can be preempted at any time by - the underlying hypervisor), but can also happen in bare-metal - environments due to ECC errors, NMIs, and other hardware - events. - Although a delay of more than about 20 seconds can result - in splats, the RCU implementation is obligated to use - algorithms that can tolerate extremely long delays, but where - “extremely long” is not long enough to allow - wrap-around when incrementing a 64-bit counter. -<li> Both the compiler and the CPU can reorder memory accesses. - Where it matters, RCU must use compiler directives and - memory-barrier instructions to preserve ordering. -<li> Conflicting writes to memory locations in any given cache line - will result in expensive cache misses. - Greater numbers of concurrent writes and more-frequent - concurrent writes will result in more dramatic slowdowns. - RCU is therefore obligated to use algorithms that have - sufficient locality to avoid significant performance and - scalability problems.
-<li> As a rough rule of thumb, only one CPU's worth of processing - may be carried out under the protection of any given exclusive - lock. - RCU must therefore use scalable locking designs. -<li> Counters are finite, especially on 32-bit systems. - RCU's use of counters must therefore tolerate counter wrap, - or be designed such that counter wrap would take way more - time than a single system is likely to run. - An uptime of ten years is quite possible, a runtime - of a century much less so. - As an example of the latter, RCU's dyntick-idle nesting counter - allows 54 bits for interrupt nesting level (this counter - is 64 bits even on a 32-bit system). - Overflowing this counter requires 2<sup>54</sup> - half-interrupts on a given CPU without that CPU ever going idle. - If a half-interrupt happened every microsecond, it would take - 570 years of runtime to overflow this counter, which is currently - believed to be an acceptably long time. -<li> Linux systems can have thousands of CPUs running a single - Linux kernel in a single shared-memory environment. - RCU must therefore pay close attention to high-end scalability. -</ol> - -<p> -This last parallelism fact of life means that RCU must pay special -attention to the preceding facts of life. -The idea that Linux might scale to systems with thousands of CPUs would -have been met with some skepticism in the 1990s, but these requirements -would otherwise have been unsurprising, even in the early 1990s. - -<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2> - -<p> -These sections list quality-of-implementation requirements. -Although an RCU implementation that ignores these requirements could -still be used, it would likely be subject to limitations that would -make it inappropriate for industrial-strength production use.
-Classes of quality-of-implementation requirements are as follows: - -<ol> -<li> <a href="#Specialization">Specialization</a> -<li> <a href="#Performance and Scalability">Performance and Scalability</a> -<li> <a href="#Forward Progress">Forward Progress</a> -<li> <a href="#Composability">Composability</a> -<li> <a href="#Corner Cases">Corner Cases</a> -</ol> - -<p> -Each of these classes is covered in the following sections. - -<h3><a name="Specialization">Specialization</a></h3> - -<p> -RCU is and always has been intended primarily for read-mostly situations, -which means that RCU's read-side primitives are optimized, often at the -expense of its update-side primitives. -Experience thus far is captured by the following list of situations: - -<ol> -<li> Read-mostly data, where stale and inconsistent data is not - a problem: RCU works great! -<li> Read-mostly data, where data must be consistent: - RCU works well. -<li> Read-write data, where data must be consistent: - RCU <i>might</i> work OK. - Or not. -<li> Write-mostly data, where data must be consistent: - RCU is very unlikely to be the right tool for the job, - with the following exceptions, where RCU can provide: - <ol type=a> - <li> Existence guarantees for update-friendly mechanisms. - <li> Wait-free read-side primitives for real-time use. - </ol> -</ol> - -<p> -This focus on read-mostly situations means that RCU must interoperate -with other synchronization primitives. -For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt> -examples discussed earlier use RCU to protect readers and locking to -coordinate updaters. -However, the need extends much farther, requiring that a variety of -synchronization primitives be legal within RCU read-side critical sections, -including spinlocks, sequence locks, atomic operations, reference -counters, and memory barriers. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - What about sleeping locks?
-</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - These are forbidden within Linux-kernel RCU read-side critical - sections because it is not legal to place a quiescent state - (in this case, voluntary context switch) within an RCU read-side - critical section. - However, sleeping locks may be used within userspace RCU read-side - critical sections, and also within Linux-kernel sleepable RCU - <a href="#Sleepable RCU"><font color="ffffff">(SRCU)</font></a> - read-side critical sections. - In addition, the -rt patchset turns spinlocks into - sleeping locks so that the corresponding critical sections - can be preempted, which also means that these sleeplockified - spinlocks (but not other sleeping locks!) may be acquired within - -rt-Linux-kernel RCU read-side critical sections. - </font> - - <p><font color="ffffff"> - Note that it <i>is</i> legal for a normal RCU read-side - critical section to conditionally acquire a sleeping lock - (as in <tt>mutex_trylock()</tt>), but only as long as it does - not loop indefinitely attempting to conditionally acquire that - sleeping lock. - The key point is that things like <tt>mutex_trylock()</tt> - either return with the mutex held, or return an error indication if - the mutex was not immediately available. - Either way, <tt>mutex_trylock()</tt> returns immediately without - sleeping. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -It often comes as a surprise that many algorithms do not require a -consistent view of data, but many can function in that mode, -with network routing being the poster child. -Internet routing algorithms take significant time to propagate -updates, so that by the time an update arrives at a given system, -that system has been sending network traffic the wrong way for -a considerable length of time.
-Having a few threads continue to send traffic the wrong way for a -few more milliseconds is clearly not a problem: In the worst case, -TCP retransmissions will eventually get the data where it needs to go. -In general, when tracking the state of the universe outside of the -computer, some level of inconsistency must be tolerated due to -speed-of-light delays if nothing else. - -<p> -Furthermore, uncertainty about external state is inherent in many cases. -For example, a pair of veterinarians might use heartbeat to determine -whether or not a given cat was alive. -But how long should they wait after the last heartbeat to decide that -the cat is in fact dead? -Waiting less than 400 milliseconds makes no sense because this would -mean that a relaxed cat would be considered to cycle between death -and life more than 100 times per minute. -Moreover, just as with human beings, a cat's heart might stop for -some period of time, so the exact wait period is a judgment call. -One of our pair of veterinarians might wait 30 seconds before pronouncing -the cat dead, while the other might insist on waiting a full minute. -The two veterinarians would then disagree on the state of the cat during -the final 30 seconds of the minute following the last heartbeat. - -<p> -Interestingly enough, this same situation applies to hardware. -When push comes to shove, how do we tell whether or not some -external server has failed? -We send messages to it periodically, and declare it failed if we -don't receive a response within a given period of time. -Policy decisions can usually tolerate short -periods of inconsistency. -The policy was decided some time ago, and is only now being put into -effect, so a few milliseconds of delay is normally inconsequential. - -<p> -However, there are algorithms that absolutely must see consistent data. 
-For example, the translation between a user-level SystemV semaphore -ID to the corresponding in-kernel data structure is protected by RCU, -but it is absolutely forbidden to update a semaphore that has just been -removed. -In the Linux kernel, this need for consistency is accommodated by acquiring -spinlocks located in the in-kernel data structure from within -the RCU read-side critical section, and this is indicated by the -green box in the figure above. -Many other techniques may be used, and are in fact used within the -Linux kernel. - -<p> -In short, RCU is not required to maintain consistency, and other -mechanisms may be used in concert with RCU when consistency is required. -RCU's specialization allows it to do its job extremely well, and its -ability to interoperate with other synchronization mechanisms allows -the right mix of synchronization tools to be used for a given job. - -<h3><a name="Performance and Scalability">Performance and Scalability</a></h3> - -<p> -Energy efficiency is a critical component of performance today, -and Linux-kernel RCU implementations must therefore avoid unnecessarily -awakening idle CPUs. -I cannot claim that this requirement was premeditated. -In fact, I learned of it during a telephone conversation in which I -was given “frank and open” feedback on the importance -of energy efficiency in battery-powered systems and on specific -energy-efficiency shortcomings of the Linux-kernel RCU implementation. -In my experience, the battery-powered embedded community will consider -any unnecessary wakeups to be extremely unfriendly acts. -So much so that mere Linux-kernel-mailing-list posts are -insufficient to vent their ire. - -<p> -Memory consumption is not particularly important in most -situations, and has become decreasingly -so as memory sizes have expanded and memory -costs have plummeted.
-However, as I learned from Matt Mackall's -<a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a> -efforts, memory footprint is critically important on single-CPU systems with -non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus -<a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a> -was born. -Josh Triplett has since taken over the small-memory banner with his -<a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a> -project, which resulted in -<a href="#Sleepable RCU">SRCU</a> -becoming optional for those kernels not needing it. - -<p> -The remaining performance requirements are, for the most part, -unsurprising. -For example, in keeping with RCU's read-side specialization, -<tt>rcu_dereference()</tt> should have negligible overhead (for -example, suppression of a few minor compiler optimizations). -Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> should have exactly zero overhead. - -<p> -In preemptible environments, in the case where the RCU read-side -critical section was not preempted (as will be the case for the -highest-priority real-time process), <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> should have minimal overhead. -In particular, they should not contain atomic read-modify-write -operations, memory-barrier instructions, preemption disabling, -interrupt disabling, or backwards branches. -However, in the case where the RCU read-side critical section was preempted, -<tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts. -This is why it is better to nest an RCU read-side critical section -within a preempt-disable region than vice versa, at least in cases -where that critical section is short enough to avoid unduly degrading -real-time latencies. - -<p> -The <tt>synchronize_rcu()</tt> grace-period-wait primitive is -optimized for throughput. 
-It may therefore incur several milliseconds of latency in addition to -the duration of the longest RCU read-side critical section. -On the other hand, multiple concurrent invocations of -<tt>synchronize_rcu()</tt> are required to use batching optimizations -so that they can be satisfied by a single underlying grace-period-wait -operation. -For example, in the Linux kernel, it is not unusual for a single -grace-period-wait operation to serve more than -<a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a> -of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation -overhead down to nearly zero. -However, the grace-period optimization is also required to avoid -measurable degradation of real-time scheduling and interrupt latencies. - -<p> -In some cases, the multi-millisecond <tt>synchronize_rcu()</tt> -latencies are unacceptable. -In these cases, <tt>synchronize_rcu_expedited()</tt> may be used -instead, reducing the grace-period latency down to a few tens of -microseconds on small systems, at least in cases where the RCU read-side -critical sections are short. -There are currently no special latency requirements for -<tt>synchronize_rcu_expedited()</tt> on large systems, but, -consistent with the empirical nature of the RCU specification, -that is subject to change. -However, there most definitely are scalability requirements: -A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096 -CPUs should at least make reasonable forward progress. -In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt> -is permitted to impose modest degradation of real-time latency -on non-idle online CPUs. -Here, “modest” means roughly the same latency -degradation as a scheduling-clock interrupt. - -<p> -There are a number of situations where even -<tt>synchronize_rcu_expedited()</tt>'s reduced grace-period -latency is unacceptable.
-In these situations, the asynchronous <tt>call_rcu()</tt> can be -used in place of <tt>synchronize_rcu()</tt> as follows: - -<blockquote> -<pre> - 1 struct foo { - 2 int a; - 3 int b; - 4 struct rcu_head rh; - 5 }; - 6 - 7 static void remove_gp_cb(struct rcu_head *rhp) - 8 { - 9 struct foo *p = container_of(rhp, struct foo, rh); -10 -11 kfree(p); -12 } -13 -14 bool remove_gp_asynchronous(void) -15 { -16 struct foo *p; -17 -18 spin_lock(&gp_lock); -19 p = rcu_access_pointer(gp); -20 if (!p) { -21 spin_unlock(&gp_lock); -22 return false; -23 } -24 rcu_assign_pointer(gp, NULL); -25 call_rcu(&p->rh, remove_gp_cb); -26 spin_unlock(&gp_lock); -27 return true; -28 } -</pre> -</blockquote> - -<p> -A definition of <tt>struct foo</tt> is finally needed, and appears -on lines 1-5. -The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt> -on line 25, and will be invoked after the end of a subsequent -grace period. -This gets the same effect as <tt>remove_gp_synchronous()</tt>, -but without forcing the updater to wait for a grace period to elapse. -The <tt>call_rcu()</tt> function may be used in a number of -situations where neither <tt>synchronize_rcu()</tt> nor -<tt>synchronize_rcu_expedited()</tt> would be legal, -including within preempt-disable code, <tt>local_bh_disable()</tt> code, -interrupt-disable code, and interrupt handlers. -However, even <tt>call_rcu()</tt> is illegal within NMI handlers -and from idle and offline CPUs. -The callback function (<tt>remove_gp_cb()</tt> in this case) will be -executed within softirq (software interrupt) environment within the -Linux kernel, -either within a real softirq handler or under the protection -of <tt>local_bh_disable()</tt>. -In both the Linux kernel and in userspace, it is bad practice to -write an RCU callback function that takes too long. -Long-running operations should be relegated to separate threads or -(in the Linux kernel) workqueues. 
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why does line 19 use <tt>rcu_access_pointer()</tt>? - After all, <tt>call_rcu()</tt> on line 25 stores into the - structure, which would interact badly with concurrent insertions. - Doesn't this mean that <tt>rcu_dereference()</tt> is required? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Presumably the <tt>->gp_lock</tt> acquired on line 18 excludes - any changes, including any insertions that <tt>rcu_dereference()</tt> - would protect against. - Therefore, any insertions will be delayed until after - <tt>->gp_lock</tt> - is released on line 26, which in turn means that - <tt>rcu_access_pointer()</tt> suffices. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -However, all that <tt>remove_gp_cb()</tt> is doing is -invoking <tt>kfree()</tt> on the data element. -This is a common idiom, and is supported by <tt>kfree_rcu()</tt>, -which allows “fire and forget” operation as shown below: - -<blockquote> -<pre> - 1 struct foo { - 2 int a; - 3 int b; - 4 struct rcu_head rh; - 5 }; - 6 - 7 bool remove_gp_faf(void) - 8 { - 9 struct foo *p; -10 -11 spin_lock(&gp_lock); -12 p = rcu_dereference(gp); -13 if (!p) { -14 spin_unlock(&gp_lock); -15 return false; -16 } -17 rcu_assign_pointer(gp, NULL); -18 kfree_rcu(p, rh); -19 spin_unlock(&gp_lock); -20 return true; -21 } -</pre> -</blockquote> - -<p> -Note that <tt>remove_gp_faf()</tt> simply invokes -<tt>kfree_rcu()</tt> and proceeds, without any need to pay any -further attention to the subsequent grace period and <tt>kfree()</tt>. -It is permissible to invoke <tt>kfree_rcu()</tt> from the same -environments as for <tt>call_rcu()</tt>. -Interestingly enough, DYNIX/ptx had the equivalents of -<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not -<tt>synchronize_rcu()</tt>.
-This was due to the fact that RCU was not heavily used within DYNIX/ptx, -so the very few places that needed something like -<tt>synchronize_rcu()</tt> simply open-coded it. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Earlier it was claimed that <tt>call_rcu()</tt> and - <tt>kfree_rcu()</tt> allowed updaters to avoid being blocked - by readers. - But how can that be correct, given that the invocation of the callback - and the freeing of the memory (respectively) must still wait for - a grace period to elapse? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - We could define things this way, but keep in mind that this sort of - definition would say that updates in garbage-collected languages - cannot complete until the next time the garbage collector runs, - which does not seem at all reasonable. - The key point is that in most cases, an updater using either - <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the - next update as soon as it has invoked <tt>call_rcu()</tt> or - <tt>kfree_rcu()</tt>, without having to wait for a subsequent - grace period. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -But what if the updater must wait for the completion of code to be -executed after the end of the grace period, but has other tasks -that can be carried out in the meantime? 
-The polling-style <tt>get_state_synchronize_rcu()</tt> and -<tt>cond_synchronize_rcu()</tt> functions may be used for this -purpose, as shown below: - -<blockquote> -<pre> - 1 bool remove_gp_poll(void) - 2 { - 3 struct foo *p; - 4 unsigned long s; - 5 - 6 spin_lock(&gp_lock); - 7 p = rcu_access_pointer(gp); - 8 if (!p) { - 9 spin_unlock(&gp_lock); -10 return false; -11 } -12 rcu_assign_pointer(gp, NULL); -13 spin_unlock(&gp_lock); -14 s = get_state_synchronize_rcu(); -15 do_something_while_waiting(); -16 cond_synchronize_rcu(s); -17 kfree(p); -18 return true; -19 } -</pre> -</blockquote> - -<p> -On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a -“cookie” from RCU, -then line 15 carries out other tasks, -and finally, line 16 returns immediately if a grace period has -elapsed in the meantime, but otherwise waits as required. -The need for <tt>get_state_synchronize_rcu()</tt> and -<tt>cond_synchronize_rcu()</tt> has appeared quite recently, -so it is too early to tell whether they will stand the test of time. - -<p> -RCU thus provides a range of tools to allow updaters to strike the -required tradeoff between latency, flexibility and CPU overhead. - -<h3><a name="Forward Progress">Forward Progress</a></h3> - -<p> -In theory, delaying grace-period completion and callback invocation -is harmless. -In practice, not only are memory sizes finite but also callbacks sometimes -do wakeups, and sufficiently deferred wakeups can be difficult -to distinguish from system hangs. -Therefore, RCU must provide a number of mechanisms to promote forward -progress. - -<p> -These mechanisms are not foolproof, nor can they be. -For one simple example, an infinite loop in an RCU read-side critical -section must by definition prevent later grace periods from ever completing.
-For a more involved example, consider a 64-CPU system built with -<tt>CONFIG_RCU_NOCB_CPU=y</tt> and booted with <tt>rcu_nocbs=1-63</tt>, -where CPUs 1 through 63 spin in tight loops that invoke -<tt>call_rcu()</tt>. -Even if these tight loops also contain calls to <tt>cond_resched()</tt> -(thus allowing grace periods to complete), CPU 0 simply will -not be able to invoke callbacks as fast as the other 63 CPUs can -register them, at least not until the system runs out of memory. -In both of these examples, the Spiderman principle applies: With great -power comes great responsibility. -However, short of this level of abuse, RCU is required to -ensure timely completion of grace periods and timely invocation of -callbacks. - -<p> -RCU takes the following steps to encourage timely completion of -grace periods: - -<ol> -<li> If a grace period fails to complete within 100 milliseconds, - RCU causes future invocations of <tt>cond_resched()</tt> on - the holdout CPUs to provide an RCU quiescent state. - RCU also causes those CPUs' <tt>need_resched()</tt> invocations - to return <tt>true</tt>, but only after the corresponding CPU's - next scheduling-clock interrupt. -<li> CPUs mentioned in the <tt>nohz_full</tt> kernel boot parameter - can run indefinitely in the kernel without scheduling-clock - interrupts, which defeats the above <tt>need_resched()</tt> - stratagem. - RCU will therefore invoke <tt>resched_cpu()</tt> on any - <tt>nohz_full</tt> CPUs still holding out after - 109 milliseconds. -<li> In kernels built with <tt>CONFIG_RCU_BOOST=y</tt>, if a given - task that has been preempted within an RCU read-side critical - section is holding out for more than 500 milliseconds, - RCU will resort to priority boosting. -<li> If a CPU is still holding out 10 seconds into the grace - period, RCU will invoke <tt>resched_cpu()</tt> on it regardless - of its <tt>nohz_full</tt> state. -</ol> - -<p> -The above values are defaults for systems running with <tt>HZ=1000</tt>.
-They will vary as the value of <tt>HZ</tt> varies, and can also be -changed using the relevant Kconfig options and kernel boot parameters. -RCU currently does not do much sanity checking of these -parameters, so please use caution when changing them. -Note that these forward-progress measures are provided only for RCU, -not for -<a href="#Sleepable RCU">SRCU</a> or -<a href="#Tasks RCU">Tasks RCU</a>. - -<p> -RCU takes the following steps in <tt>call_rcu()</tt> to encourage timely -invocation of callbacks when any given non-<tt>rcu_nocbs</tt> CPU has -10,000 callbacks, or has 10,000 more callbacks than it had the last time -encouragement was provided: - -<ol> -<li> Starts a grace period, if one is not already in progress. -<li> Forces immediate checking for quiescent states, rather than - waiting for three milliseconds to have elapsed since the - beginning of the grace period. -<li> Immediately tags the CPU's callbacks with their grace period - completion numbers, rather than waiting for the <tt>RCU_SOFTIRQ</tt> - handler to get around to it. -<li> Lifts callback-execution batch limits, which speeds up callback - invocation at the expense of degrading realtime response. -</ol> - -<p> -Again, these are default values when running at <tt>HZ=1000</tt>, -and can be overridden. -Again, these forward-progress measures are provided only for RCU, -not for -<a href="#Sleepable RCU">SRCU</a> or -<a href="#Tasks RCU">Tasks RCU</a>. -Even for RCU, callback-invocation forward progress for <tt>rcu_nocbs</tt> -CPUs is much less well-developed, in part because workloads benefiting -from <tt>rcu_nocbs</tt> CPUs tend to invoke <tt>call_rcu()</tt> -relatively infrequently. -If workloads emerge that need both <tt>rcu_nocbs</tt> CPUs and high -<tt>call_rcu()</tt> invocation rates, then additional forward-progress -work will be required. 
- -<h3><a name="Composability">Composability</a></h3> - -<p> -Composability has received much attention in recent years, perhaps in part -due to the collision of multicore hardware with object-oriented techniques -designed in single-threaded environments for single-threaded use. -And in theory, RCU read-side critical sections may be composed, and in -fact may be nested arbitrarily deeply. -In practice, as with all real-world implementations of composable -constructs, there are limitations. - -<p> -Implementations of RCU for which <tt>rcu_read_lock()</tt> -and <tt>rcu_read_unlock()</tt> generate no code, such as -Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be -nested arbitrarily deeply. -After all, there is no overhead. -Except that if all these instances of <tt>rcu_read_lock()</tt> -and <tt>rcu_read_unlock()</tt> are visible to the compiler, -compilation will eventually fail due to exhausting memory, -mass storage, or user patience, whichever comes first. -If the nesting is not visible to the compiler, as is the case with -mutually recursive functions each in its own translation unit, -stack overflow will result. -If the nesting takes the form of loops, perhaps in the guise of tail -recursion, either the control variable -will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. -Nevertheless, this class of RCU implementations is one -of the most composable constructs in existence. - -<p> -RCU implementations that explicitly track nesting depth -are limited by the nesting-depth counter. -For example, the Linux kernel's preemptible RCU limits nesting to -<tt>INT_MAX</tt>. -This should suffice for almost all practical purposes. -That said, a consecutive pair of RCU read-side critical sections -between which there is an operation that waits for a grace period -cannot be enclosed in another RCU read-side critical section. 
-This is because it is not legal to wait for a grace period within -an RCU read-side critical section: To do so would result either -in deadlock or -in RCU implicitly splitting the enclosing RCU read-side critical -section, neither of which is conducive to a long-lived and prosperous -kernel. - -<p> -It is worth noting that RCU is not alone in limiting composability. -For example, many transactional-memory implementations prohibit -composing a pair of transactions separated by an irrevocable -operation (for example, a network receive operation). -For another example, lock-based critical sections can be composed -surprisingly freely, but only if deadlock is avoided. - -<p> -In short, although RCU read-side critical sections are highly composable, -care is required in some situations, just as is the case for any other -composable synchronization mechanism. - -<h3><a name="Corner Cases">Corner Cases</a></h3> - -<p> -A given RCU workload might have an endless and intense stream of -RCU read-side critical sections, perhaps even so intense that there -was never a point in time during which there was not at least one -RCU read-side critical section in flight. -RCU cannot allow this situation to block grace periods: As long as -all the RCU read-side critical sections are finite, grace periods -must also be finite. - -<p> -That said, preemptible RCU implementations could potentially result -in RCU read-side critical sections being preempted for long durations, -which has the effect of creating a long-duration RCU read-side -critical section. -This situation can arise only in heavily loaded systems, but systems using -real-time priorities are of course more vulnerable. -Therefore, RCU priority boosting is provided to help deal with this -case. -That said, the exact requirements on RCU priority boosting will likely -evolve as more experience accumulates. - -<p> -Other workloads might have very high update rates. 
-Although one can argue that such workloads should instead use -something other than RCU, the fact remains that RCU must -handle such workloads gracefully. -This requirement is another factor driving batching of grace periods, -but it is also the driving force behind the checks for large numbers -of queued RCU callbacks in the <tt>call_rcu()</tt> code path. -Finally, high update rates should not delay RCU read-side critical -sections, although some small read-side delays can occur when using -<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use -of <tt>smp_call_function_single()</tt>. - -<p> -Although all three of these corner cases were understood in the early -1990s, a simple user-level test consisting of <tt>close(open(path))</tt> -in a tight loop -in the early 2000s suddenly provided a much deeper appreciation of the -high-update-rate corner case. -This test also motivated addition of some RCU code to react to high update -rates. For example, if a given CPU finds itself with more than 10,000 -RCU callbacks queued, it will cause RCU to take evasive action by -more aggressively starting grace periods and more aggressively forcing -completion of grace-period processing. -This evasive action causes the grace period to complete more quickly, -but at the cost of restricting RCU's batching optimizations, thus -increasing the CPU overhead incurred by that grace period. - -<h2><a name="Software-Engineering Requirements"> -Software-Engineering Requirements</a></h2> - -<p> -Between Murphy's Law and “To err is human”, it is necessary to -guard against mishaps and misuse: - -<ol> -<li> It is all too easy to forget to use <tt>rcu_read_lock()</tt> - everywhere that it is needed, so kernels built with - <tt>CONFIG_PROVE_RCU=y</tt> will splat if - <tt>rcu_dereference()</tt> is used outside of an - RCU read-side critical section.
- Update-side code can use <tt>rcu_dereference_protected()</tt>, - which takes a - <a href="https://lwn.net/Articles/371986/">lockdep expression</a> - to indicate what is providing the protection. - If the indicated protection is not provided, a lockdep splat - is emitted. - - <p> - Code shared between readers and updaters can use - <tt>rcu_dereference_check()</tt>, which also takes a - lockdep expression, and emits a lockdep splat if neither - <tt>rcu_read_lock()</tt> nor the indicated protection - is in place. - In addition, <tt>rcu_dereference_raw()</tt> is used in those - (hopefully rare) cases where the required protection cannot - be easily described. - Finally, <tt>rcu_read_lock_held()</tt> is provided to - allow a function to verify that it has been invoked within - an RCU read-side critical section. - I was made aware of this set of requirements shortly after Thomas - Gleixner audited a number of RCU uses. -<li> A given function might wish to check for RCU-related preconditions - upon entry, before using any other RCU API. - The <tt>rcu_lockdep_assert()</tt> does this job, - asserting the expression in kernels having lockdep enabled - and doing nothing otherwise. -<li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt> - and <tt>rcu_dereference()</tt>, perhaps (incorrectly) - substituting a simple assignment. - To catch this sort of error, a given RCU-protected pointer may be - tagged with <tt>__rcu</tt>, after which sparse - will complain about simple-assignment accesses to that pointer. - Arnd Bergmann made me aware of this requirement, and also - supplied the needed - <a href="https://lwn.net/Articles/376011/">patch series</a>. -<li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt> - will splat if a data element is passed to <tt>call_rcu()</tt> - twice in a row, without a grace period in between. - (This error is similar to a double free.) 
- The corresponding <tt>rcu_head</tt> structures that are - dynamically allocated are automatically tracked, but - <tt>rcu_head</tt> structures allocated on the stack - must be initialized with <tt>init_rcu_head_on_stack()</tt> - and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>. - Similarly, statically allocated non-stack <tt>rcu_head</tt> - structures must be initialized with <tt>init_rcu_head()</tt> - and cleaned up with <tt>destroy_rcu_head()</tt>. - Mathieu Desnoyers made me aware of this requirement, and also - supplied the needed - <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>. -<li> An infinite loop in an RCU read-side critical section will - eventually trigger an RCU CPU stall warning splat, with - the duration of “eventually” being controlled by the - <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or, - alternatively, by the - <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs - parameter. - However, RCU is not obligated to produce this splat - unless there is a grace period waiting on that particular - RCU read-side critical section. - <p> - Some extreme workloads might intentionally delay - RCU grace periods, and systems running those workloads can - be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt> - to suppress the splats. - This kernel parameter may also be set via <tt>sysfs</tt>. - Furthermore, RCU CPU stall warnings are counter-productive - during sysrq dumps and during panics. - RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and - <tt>rcu_sysrq_end()</tt> API members to be called before - and after long sysrq dumps. - RCU also supplies the <tt>rcu_panic()</tt> notifier that is - automatically invoked at the beginning of a panic to suppress - further RCU CPU stall warnings. - - <p> - This requirement made itself known in the early 1990s, pretty - much the first time that it was necessary to debug a CPU stall. 
- That said, the initial implementation in DYNIX/ptx was quite - generic in comparison with that of Linux. -<li> Although it would be very good to detect pointers leaking out - of RCU read-side critical sections, there is currently no - good way of doing this. - One complication is the need to distinguish between pointers - leaking and pointers that have been handed off from RCU to - some other synchronization mechanism, for example, reference - counting. -<li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related - information is provided via event tracing. -<li> Open-coded use of <tt>rcu_assign_pointer()</tt> and - <tt>rcu_dereference()</tt> to create typical linked - data structures can be surprisingly error-prone. - Therefore, RCU-protected - <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a> - and, more recently, RCU-protected - <a href="https://lwn.net/Articles/612100/">hash tables</a> - are available. - Many other special-purpose RCU-protected data structures are - available in the Linux kernel and the userspace RCU library. -<li> Some linked structures are created at compile time, but still - require <tt>__rcu</tt> checking. - The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this - purpose. -<li> It is not necessary to use <tt>rcu_assign_pointer()</tt> - when creating linked structures that are to be published via - a single external pointer. - The <tt>RCU_INIT_POINTER()</tt> macro is provided for - this task and also for assigning <tt>NULL</tt> pointers - at runtime. -</ol> - -<p> -This is not a hard-and-fast list: RCU's diagnostic capabilities will -continue to be guided by the number and type of usage bugs found -in real-world RCU usage. - -<h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2> - -<p> -The Linux kernel provides an interesting environment for all kinds of -software, including RCU.
-Some of the relevant points of interest are as follows: - -<ol> -<li> <a href="#Configuration">Configuration</a>. -<li> <a href="#Firmware Interface">Firmware Interface</a>. -<li> <a href="#Early Boot">Early Boot</a>. -<li> <a href="#Interrupts and NMIs"> - Interrupts and non-maskable interrupts (NMIs)</a>. -<li> <a href="#Loadable Modules">Loadable Modules</a>. -<li> <a href="#Hotplug CPU">Hotplug CPU</a>. -<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. -<li> <a href="#Tracing and RCU">Tracing and RCU</a>. -<li> <a href="#Accesses to User Memory and RCU"> -Accesses to User Memory and RCU</a>. -<li> <a href="#Energy Efficiency">Energy Efficiency</a>. -<li> <a href="#Scheduling-Clock Interrupts and RCU"> - Scheduling-Clock Interrupts and RCU</a>. -<li> <a href="#Memory Efficiency">Memory Efficiency</a>. -<li> <a href="#Performance, Scalability, Response Time, and Reliability"> - Performance, Scalability, Response Time, and Reliability</a>. -</ol> - -<p> -This list is probably incomplete, but it does give a feel for the -most notable Linux-kernel complications. -Each of the following sections covers one of the above topics. - -<h3><a name="Configuration">Configuration</a></h3> - -<p> -RCU's goal is automatic configuration, so that almost nobody -needs to worry about RCU's <tt>Kconfig</tt> options. -And for almost all users, RCU does in fact work well -“out of the box.” - -<p> -However, there are specialized use cases that are handled by -kernel boot parameters and <tt>Kconfig</tt> options. -Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users -about new <tt>Kconfig</tt> options, which requires almost all of them -be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option. - -<p> -This all should be quite obvious, but the fact remains that -Linus Torvalds recently had to -<a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a> -me of this requirement. 
- -<h3><a name="Firmware Interface">Firmware Interface</a></h3> - -<p> -In many cases, the kernel obtains information about the system from the -firmware, and sometimes things are lost in translation. -Or the translation is accurate, but the original message is bogus. - -<p> -For example, some systems' firmware overreports the number of CPUs, -sometimes by a large factor. -If RCU naively believed the firmware, as it used to do, -it would create too many per-CPU kthreads. -Although the resulting system will still run correctly, the extra -kthreads needlessly consume memory and can cause confusion -when they show up in <tt>ps</tt> listings. - -<p> -RCU must therefore wait for a given CPU to actually come online before -it can allow itself to believe that the CPU actually exists. -The resulting “ghost CPUs” (which are never going to -come online) cause a number of -<a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>. - -<h3><a name="Early Boot">Early Boot</a></h3> - -<p> -The Linux kernel's boot sequence is an interesting process, -and RCU is used early, even before <tt>rcu_init()</tt> -is invoked. -In fact, a number of RCU's primitives can be used as soon as the -initial task's <tt>task_struct</tt> is available and the -boot CPU's per-CPU variables are set up. -The read-side primitives (<tt>rcu_read_lock()</tt>, -<tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, -and <tt>rcu_access_pointer()</tt>) will operate normally very early on, -as will <tt>rcu_assign_pointer()</tt>. - -<p> -Although <tt>call_rcu()</tt> may be invoked at any -time during boot, callbacks are not guaranteed to be invoked until after -all of RCU's kthreads have been spawned, which occurs at -<tt>early_initcall()</tt> time.
-This delay in callback invocation is due to the fact that RCU does not -invoke callbacks until it is fully initialized, and this full initialization -cannot occur until after the scheduler has initialized itself to the -point where RCU can spawn and run its kthreads. -In theory, it would be possible to invoke callbacks earlier, -however, this is not a panacea because there would be severe restrictions -on what operations those callbacks could invoke. - -<p> -Perhaps surprisingly, <tt>synchronize_rcu()</tt> and -<tt>synchronize_rcu_expedited()</tt>, -will operate normally -during very early boot, the reason being that there is only one CPU -and preemption is disabled. -This means that the call to <tt>synchronize_rcu()</tt> (or friends) -itself is a quiescent -state and thus a grace period, so the early-boot implementation can -be a no-op. - -<p> -However, once the scheduler has spawned its first kthread, this early -boot trick fails for <tt>synchronize_rcu()</tt> (as well as for -<tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt> -kernels. -The reason is that an RCU read-side critical section might be preempted, -which means that a subsequent <tt>synchronize_rcu()</tt> really does have -to wait for something, as opposed to simply returning immediately. -Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of -its kthreads are spawned, which doesn't happen until some time during -<tt>early_initcall()</tt> time. -But this is no excuse: RCU is nevertheless required to correctly handle -synchronous grace periods during this time period. -Once all of its kthreads are up and running, RCU starts running -normally. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - How can RCU possibly handle grace periods before all of its - kthreads have been spawned??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Very carefully!
- </font> - - <p><font color="ffffff"> - During the “dead zone” between the time that the - scheduler spawns the first task and the time that all of RCU's - kthreads have been spawned, all synchronous grace periods are - handled by the expedited grace-period mechanism. - At runtime, this expedited mechanism relies on workqueues, but - during the dead zone the requesting task itself drives the - desired expedited grace period. - Because dead-zone execution takes place within task context, - everything works. - Once the dead zone ends, expedited grace periods go back to - using workqueues, as is required to avoid problems that would - otherwise occur when a user task received a POSIX signal while - driving an expedited grace period. - </font> - - <p><font color="ffffff"> - And yes, this does mean that it is unhelpful to send POSIX - signals to random tasks between the time that the scheduler - spawns its first kthread and the time that RCU's kthreads - have all been spawned. - If there ever turns out to be a good reason for sending POSIX - signals during that time, appropriate adjustments will be made. - (If it turns out that POSIX signals are sent during this time for - no good reason, other adjustments will be made, appropriate - or otherwise.) -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -I learned of these boot-time requirements as a result of a series of -system hangs. - -<h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3> - -<p> -The Linux kernel has interrupts, and RCU read-side critical sections are -legal within interrupt handlers and within interrupt-disabled regions -of code, as are invocations of <tt>call_rcu()</tt>. - -<p> -Some Linux-kernel architectures can enter an interrupt handler from -non-idle process context, and then just never leave it, instead stealthily -transitioning back to process context. -This trick is sometimes used to invoke system calls from inside the kernel. 
-These “half-interrupts” mean that RCU has to be very careful -about how it counts interrupt nesting levels. -I learned of this requirement the hard way during a rewrite -of RCU's dyntick-idle code. - -<p> -The Linux kernel has non-maskable interrupts (NMIs), and -RCU read-side critical sections are legal within NMI handlers. -Thankfully, RCU update-side primitives, including -<tt>call_rcu()</tt>, are prohibited within NMI handlers. - -<p> -The name notwithstanding, some Linux-kernel architectures -can have nested NMIs, which RCU must handle correctly. -Andy Lutomirski -<a href="https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> -with this requirement; -he also kindly surprised me with -<a href="https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> -that meets this requirement. - -<p> -Furthermore, NMI handlers can be interrupted by what appear to RCU -to be normal interrupts. -One way that this can happen is for code that directly invokes -<tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> to be called -from an NMI handler. -This astonishing fact of life prompted the current code structure, -which has <tt>rcu_irq_enter()</tt> invoking <tt>rcu_nmi_enter()</tt> -and <tt>rcu_irq_exit()</tt> invoking <tt>rcu_nmi_exit()</tt>. -And yes, I also learned of this requirement the hard way. - -<h3><a name="Loadable Modules">Loadable Modules</a></h3> - -<p> -The Linux kernel has loadable modules, and these modules can -also be unloaded. -After a given module has been unloaded, any attempt to call -one of its functions results in a segmentation fault. -The module-unload functions must therefore cancel any -delayed calls to loadable-module functions, for example, -any outstanding <tt>mod_timer()</tt> must be dealt with -via <tt>del_timer_sync()</tt> or similar. 
- -<p> -Unfortunately, there is no way to cancel an RCU callback; -once you invoke <tt>call_rcu()</tt>, the callback function is -eventually going to be invoked, unless the system goes down first. -Because it is normally considered socially irresponsible to crash the system -in response to a module unload request, we need some other way -to deal with in-flight RCU callbacks. - -<p> -RCU therefore provides -<tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>, -which waits until all in-flight RCU callbacks have been invoked. -If a module uses <tt>call_rcu()</tt>, its exit function should therefore -prevent any future invocation of <tt>call_rcu()</tt>, then invoke -<tt>rcu_barrier()</tt>. -In theory, the underlying module-unload code could invoke -<tt>rcu_barrier()</tt> unconditionally, but in practice this would -incur unacceptable latencies. - -<p> -Nikita Danilov noted this requirement for an analogous filesystem-unmount -situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU. -The need for <tt>rcu_barrier()</tt> for module unloading became -apparent later. - -<p> -<b>Important note</b>: The <tt>rcu_barrier()</tt> function is not, -repeat, <i>not</i>, obligated to wait for a grace period. -It is instead only required to wait for RCU callbacks that have -already been posted. -Therefore, if there are no RCU callbacks posted anywhere in the system, -<tt>rcu_barrier()</tt> is within its rights to return immediately. -Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not -necessarily need to wait for a grace period. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Wait a minute! - Each RCU callback must wait for a grace period to complete, - and <tt>rcu_barrier()</tt> must wait for each pre-existing - callback to be invoked. - Doesn't <tt>rcu_barrier()</tt> therefore need to wait for - a full grace period if there is even one callback posted anywhere - in the system?
-</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Absolutely not!!! - </font> - - <p><font color="ffffff"> - Yes, each RCU callback must wait for a grace period to complete, - but it might well be partly (or even completely) finished waiting - by the time <tt>rcu_barrier()</tt> is invoked. - In that case, <tt>rcu_barrier()</tt> need only wait for the - remaining portion of the grace period to elapse. - So even if there are quite a few callbacks posted, - <tt>rcu_barrier()</tt> might well return quite quickly. - </font> - - <p><font color="ffffff"> - So if you need to wait for a grace period as well as for all - pre-existing callbacks, you will need to invoke both - <tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>. - If latency is a concern, you can always use workqueues - to invoke them concurrently. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Hotplug CPU">Hotplug CPU</a></h3> - -<p> -The Linux kernel supports CPU hotplug, which means that CPUs -can come and go. -It is of course illegal to use any RCU API member from an offline CPU, -with the exception of <a href="#Sleepable RCU">SRCU</a> read-side -critical sections. -This requirement was present from day one in DYNIX/ptx, but -on the other hand, the Linux kernel's CPU-hotplug implementation -is “interesting.” - -<p> -The Linux-kernel CPU-hotplug implementation has notifiers that -are used to allow the various kernel subsystems (including RCU) -to respond appropriately to a given CPU-hotplug operation. -Most RCU operations may be invoked from CPU-hotplug notifiers, -including even synchronous grace-period operations such as -<tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>.
- -<p> -However, all-callback-wait operations such as -<tt>rcu_barrier()</tt> are also not supported, due to the -fact that there are phases of CPU-hotplug operations where -the outgoing CPU's callbacks will not be invoked until after -the CPU-hotplug operation ends, which could also result in deadlock. -Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations -during its execution, which results in another type of deadlock -when invoked from a CPU-hotplug notifier. - -<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3> - -<p> -RCU depends on the scheduler, and the scheduler uses RCU to -protect some of its data structures. -The preemptible-RCU <tt>rcu_read_unlock()</tt> -implementation must therefore be written carefully to avoid deadlocks -involving the scheduler's runqueue and priority-inheritance locks. -In particular, <tt>rcu_read_unlock()</tt> must tolerate an -interrupt where the interrupt handler invokes both -<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. -This possibility requires <tt>rcu_read_unlock()</tt> to use -negative nesting levels to avoid destructive recursion via -interrupt handler's use of RCU. - -<p> -This scheduler-RCU requirement came as a -<a href="https://lwn.net/Articles/453002/">complete surprise</a>. - -<p> -As noted above, RCU makes use of kthreads, and it is necessary to -avoid excessive CPU-time accumulation by these kthreads. -This requirement was no surprise, but RCU's violation of it -when running context-switch-heavy workloads when built with -<tt>CONFIG_NO_HZ_FULL=y</tt> -<a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. -RCU has made good progress towards meeting this requirement, even -for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads, -but there is room for further improvement. 
- -<p> -It is forbidden to hold any of scheduler's runqueue or priority-inheritance -spinlocks across an <tt>rcu_read_unlock()</tt> unless interrupts have been -disabled across the entire RCU read-side critical section, that is, -up to and including the matching <tt>rcu_read_lock()</tt>. -Violating this restriction can result in deadlocks involving these -scheduler spinlocks. -There was hope that this restriction might be lifted when interrupt-disabled -calls to <tt>rcu_read_unlock()</tt> started deferring the reporting of -the resulting RCU-preempt quiescent state until the end of the corresponding -interrupts-disabled region. -Unfortunately, timely reporting of the corresponding quiescent state -to expedited grace periods requires a call to <tt>raise_softirq()</tt>, -which can acquire these scheduler spinlocks. -In addition, real-time systems using RCU priority boosting -need this restriction to remain in effect because deferred -quiescent-state reporting would also defer deboosting, which in turn -would degrade real-time latencies. - -<p> -In theory, if a given RCU read-side critical section could be -guaranteed to be less than one second in duration, holding a scheduler -spinlock across that critical section's <tt>rcu_read_unlock()</tt> -would require only that preemption be disabled across the entire -RCU read-side critical section, not interrupts. -Unfortunately, given the possibility of vCPU preemption, long-running -interrupts, and so on, it is not possible in practice to guarantee -that a given RCU read-side critical section will complete in less than -one second. -Therefore, as noted above, if scheduler spinlocks are held across -a given call to <tt>rcu_read_unlock()</tt>, interrupts must be -disabled across the entire RCU read-side critical section. - -<h3><a name="Tracing and RCU">Tracing and RCU</a></h3> - -<p> -It is possible to use tracing on RCU code, but tracing itself -uses RCU. 
-For this reason, <tt>rcu_dereference_raw_check()</tt> -is provided for use by tracing, which avoids the destructive -recursion that could otherwise ensue. -This API is also used by virtualization in some architectures, -where RCU readers execute in environments in which tracing -cannot be used. -The tracing folks both located the requirement and provided the -needed fix, so this surprise requirement was relatively painless. - -<h3><a name="Accesses to User Memory and RCU"> -Accesses to User Memory and RCU</a></h3> - -<p> -The kernel needs to access user-space memory, for example, to access -data referenced by system-call parameters. -The <tt>get_user()</tt> macro does this job. - -<p> -However, user-space memory might well be paged out, which means -that <tt>get_user()</tt> might well page-fault and thus block while -waiting for the resulting I/O to complete. -It would be a very bad thing for the compiler to reorder -a <tt>get_user()</tt> invocation into an RCU read-side critical -section. -For example, suppose that the source code looked like this: - -<blockquote> -<pre> - 1 rcu_read_lock(); - 2 p = rcu_dereference(gp); - 3 v = p->value; - 4 rcu_read_unlock(); - 5 get_user(user_v, user_p); - 6 do_something_with(v, user_v); -</pre> -</blockquote> - -<p> -The compiler must not be permitted to transform this source code into -the following: - -<blockquote> -<pre> - 1 rcu_read_lock(); - 2 p = rcu_dereference(gp); - 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!! - 4 v = p->value; - 5 rcu_read_unlock(); - 6 do_something_with(v, user_v); -</pre> -</blockquote> - -<p> -If the compiler did make this transformation in a -<tt>CONFIG_PREEMPT=n</tt> kernel build, and if <tt>get_user()</tt> did -page fault, the result would be a quiescent state in the middle -of an RCU read-side critical section. -This misplaced quiescent state could result in line 4 being -a use-after-free access, which could be bad for your kernel's -actuarial statistics. 
-Similar examples can be constructed with the call to <tt>get_user()</tt> -preceding the <tt>rcu_read_lock()</tt>. - -<p> -Unfortunately, <tt>get_user()</tt> doesn't have any particular -ordering properties, and in some architectures the underlying <tt>asm</tt> -isn't even marked <tt>volatile</tt>. -And even if it was marked <tt>volatile</tt>, the above access to -<tt>p->value</tt> is not volatile, so the compiler would not have any -reason to keep those two accesses in order. - -<p> -Therefore, the Linux-kernel definitions of <tt>rcu_read_lock()</tt> -and <tt>rcu_read_unlock()</tt> must act as compiler barriers, -at least for outermost instances of <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> within a nested set of RCU read-side critical -sections. - -<h3><a name="Energy Efficiency">Energy Efficiency</a></h3> - -<p> -Interrupting idle CPUs is considered socially unacceptable, -especially by people with battery-powered embedded systems. -RCU therefore conserves energy by detecting which CPUs are -idle, including tracking CPUs that have been interrupted from idle. -This is a large part of the energy-efficiency requirement, -so I learned of this via an irate phone call. - -<p> -Because RCU avoids interrupting idle CPUs, it is illegal to -execute an RCU read-side critical section on an idle CPU. -(Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat -if you try it.) -The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt> -event tracing are provided to work around this restriction. -In addition, <tt>rcu_is_watching()</tt> may be used to -test whether or not it is currently legal to run RCU read-side -critical sections on this CPU. -I learned of the need for diagnostics on the one hand -and <tt>RCU_NONIDLE()</tt> on the other while inspecting -idle-loop code. -Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, -which is used quite heavily in the idle loop.
-However, there are some restrictions on the code placed within -<tt>RCU_NONIDLE()</tt>: - -<ol> -<li> Blocking is prohibited. - In practice, this is not a serious restriction given that idle - tasks are prohibited from blocking to begin with. -<li> Although nesting <tt>RCU_NONIDLE()</tt> is permitted, they cannot - nest indefinitely deeply. - However, given that they can be nested on the order of a million - deep, even on 32-bit systems, this should not be a serious - restriction. - This nesting limit would probably be reached long after the - compiler OOMed or the stack overflowed. -<li> Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence - out of that same <tt>RCU_NONIDLE()</tt>. - For example, the following is grossly illegal: - - <blockquote> - <pre> - 1 RCU_NONIDLE({ - 2 do_something(); - 3 goto bad_idea; /* BUG!!! */ - 4 do_something_else();}); - 5 bad_idea: - </pre> - </blockquote> - - <p> - It is just as illegal to transfer control into the middle of - <tt>RCU_NONIDLE()</tt>'s argument. - Yes, in theory, you could transfer in as long as you also - transferred out, but in practice you could also expect to get sharply - worded review comments. -</ol> - -<p> -It is similarly socially unacceptable to interrupt an -<tt>nohz_full</tt> CPU running in userspace. -RCU must therefore track <tt>nohz_full</tt> userspace -execution. -RCU must therefore be able to sample state at two points in -time, and be able to determine whether or not some other CPU spent -any time idle and/or executing in userspace. - -<p> -These energy-efficiency requirements have proven quite difficult to -understand and to meet, for example, there have been more than five -clean-sheet rewrites of RCU's energy-efficiency code, the last of -which was finally able to demonstrate -<a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>. 
-As noted earlier, -I learned of many of these requirements via angry phone calls: -Flaming me on the Linux-kernel mailing list was apparently not -sufficient to fully vent their ire at RCU's energy-efficiency bugs! - -<h3><a name="Scheduling-Clock Interrupts and RCU"> -Scheduling-Clock Interrupts and RCU</a></h3> - -<p> -The kernel transitions between in-kernel non-idle execution, userspace -execution, and the idle loop. -Depending on kernel configuration, RCU handles these states differently: - -<table border=3> -<tr><th><tt>HZ</tt> Kconfig</th> - <th>In-Kernel</th> - <th>Usermode</th> - <th>Idle</th></tr> -<tr><th align="left"><tt>HZ_PERIODIC</tt></th> - <td>Can rely on scheduling-clock interrupt.</td> - <td>Can rely on scheduling-clock interrupt and its - detection of interrupt from usermode.</td> - <td>Can rely on RCU's dyntick-idle detection.</td></tr> -<tr><th align="left"><tt>NO_HZ_IDLE</tt></th> - <td>Can rely on scheduling-clock interrupt.</td> - <td>Can rely on scheduling-clock interrupt and its - detection of interrupt from usermode.</td> - <td>Can rely on RCU's dyntick-idle detection.</td></tr> -<tr><th align="left"><tt>NO_HZ_FULL</tt></th> - <td>Can only sometimes rely on scheduling-clock interrupt. - In other cases, it is necessary to bound kernel execution - times and/or use IPIs.</td> - <td>Can rely on RCU's dyntick-idle detection.</td> - <td>Can rely on RCU's dyntick-idle detection.</td></tr> -</table> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the - scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt> - and <tt>NO_HZ_IDLE</tt> do? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because, as a performance optimization, <tt>NO_HZ_FULL</tt> - does not necessarily re-enable the scheduling-clock interrupt - on entry to each and every system call. 
-</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -However, RCU must be reliably informed as to whether any given -CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>, -also whether that CPU is executing in usermode, as discussed -<a href="#Energy Efficiency">earlier</a>. -It also requires that the scheduling-clock interrupt be enabled when -RCU needs it to be: - -<ol> -<li> If a CPU is either idle or executing in usermode, and RCU believes - it is non-idle, the scheduling-clock tick had better be running. - Otherwise, you will get RCU CPU stall warnings. Or at best, - very long (11-second) grace periods, with a pointless IPI waking - the CPU from time to time. -<li> If a CPU is in a portion of the kernel that executes RCU read-side - critical sections, and RCU believes this CPU to be idle, you will get - random memory corruption. <b>DON'T DO THIS!!!</b> - - <br>This is one reason to test with lockdep, which will complain - about this sort of thing. -<li> If a CPU is in a portion of the kernel that is absolutely - positively no-joking guaranteed to never execute any RCU read-side - critical sections, and RCU believes this CPU to be idle, - no problem. This sort of thing is used by some architectures - for light-weight exception handlers, which can then avoid the - overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> - at exception entry and exit, respectively. - Some go further and avoid the entireties of <tt>irq_enter()</tt> - and <tt>irq_exit()</tt>. - - <br>Just make very sure you are running some of your tests with - <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths - was in fact joking about not doing RCU read-side critical sections. -<li> If a CPU is executing in the kernel with the scheduling-clock - interrupt disabled and RCU believes this CPU to be non-idle, - and if the CPU goes idle (from an RCU perspective) every few - jiffies, no problem.
It is usually OK for there to be the - occasional gap between idle periods of up to a second or so. - - <br>If the gap grows too long, you get RCU CPU stall warnings. -<li> If a CPU is either idle or executing in usermode, and RCU believes - it to be idle, of course no problem. -<li> If a CPU is executing in the kernel, the kernel code - path is passing through quiescent states at a reasonable - frequency (preferably about once per few jiffies, but the - occasional excursion to a second or so is usually OK) and the - scheduling-clock interrupt is enabled, of course no problem. - - <br>If the gap between a successive pair of quiescent states grows - too long, you get RCU CPU stall warnings. -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But what if my driver has a hardware interrupt handler - that can run for many seconds? - I cannot invoke <tt>schedule()</tt> from a hardware - interrupt handler, after all! -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt> - every so often. - But given that long-running interrupt handlers can cause - other problems, not least for response time, shouldn't you - work to keep your interrupt handler's runtime within reasonable - bounds? -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -But as long as RCU is properly informed of kernel state transitions between -in-kernel execution, usermode execution, and idle, and as long as the -scheduling-clock interrupt is enabled when RCU needs it to be, you -can rest assured that the bugs you encounter will be in some other -part of RCU or some other part of the kernel! - -<h3><a name="Memory Efficiency">Memory Efficiency</a></h3> - -<p> -Although small-memory non-realtime systems can simply use Tiny RCU, -code size is only one aspect of memory efficiency.
-Another aspect is the size of the <tt>rcu_head</tt> structure -used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>. -Although this structure contains nothing more than a pair of pointers, -it does appear in many RCU-protected data structures, including -some that are size critical. -The <tt>page</tt> structure is a case in point, as evidenced by -the many occurrences of the <tt>union</tt> keyword within that structure. - -<p> -This need for memory efficiency is one reason that RCU uses hand-crafted -singly linked lists to track the <tt>rcu_head</tt> structures that -are waiting for a grace period to elapse. -It is also the reason why <tt>rcu_head</tt> structures do not contain -debug information, such as fields tracking the file and line of the -<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them. -Although this information might appear in debug-only kernel builds at some -point, in the meantime, the <tt>->func</tt> field will often provide -the needed debug information. - -<p> -However, in some cases, the need for memory efficiency leads to even -more extreme measures. -Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field -shares storage with a great many other structures that are used at -various points in the corresponding page's lifetime. -In order to correctly resolve certain -<a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>, -the Linux kernel's memory-management subsystem needs a particular bit -to remain zero during all phases of grace-period processing, -and that bit happens to map to the bottom bit of the -<tt>rcu_head</tt> structure's <tt>->next</tt> field. -RCU makes this guarantee as long as <tt>call_rcu()</tt> -is used to post the callback, as opposed to <tt>kfree_rcu()</tt> -or some future “lazy” -variant of <tt>call_rcu()</tt> that might one day be created for -energy-efficiency purposes. - -<p> -That said, there are limits. 
-RCU requires that the <tt>rcu_head</tt> structure be aligned to a -two-byte boundary, and passing a misaligned <tt>rcu_head</tt> -structure to one of the <tt>call_rcu()</tt> family of functions -will result in a splat. -It is therefore necessary to exercise caution when packing -structures containing fields of type <tt>rcu_head</tt>. -Why not a four-byte or even eight-byte alignment requirement? -Because the m68k architecture provides only two-byte alignment, -and thus acts as alignment's least common denominator. - -<p> -The reason for reserving the bottom bit of pointers to -<tt>rcu_head</tt> structures is to leave the door open to -“lazy” callbacks whose invocations can safely be deferred. -Deferring invocation could potentially have energy-efficiency -benefits, but only if the rate of non-lazy callbacks decreases -significantly for some important workload. -In the meantime, reserving the bottom bit keeps this option open -in case it one day becomes useful. - -<h3><a name="Performance, Scalability, Response Time, and Reliability"> -Performance, Scalability, Response Time, and Reliability</a></h3> - -<p> -Expanding on the -<a href="#Performance and Scalability">earlier discussion</a>, -RCU is used heavily by hot code paths in performance-critical -portions of the Linux kernel's networking, security, virtualization, -and scheduling code paths. -RCU must therefore use efficient implementations, especially in its -read-side primitives. -To that end, it would be good if preemptible RCU's implementation -of <tt>rcu_read_lock()</tt> could be inlined, however, doing -this requires resolving <tt>#include</tt> issues with the -<tt>task_struct</tt> structure. - -<p> -The Linux kernel supports hardware configurations with up to -4096 CPUs, which means that RCU must be extremely scalable. -Algorithms that involve frequent acquisitions of global locks or -frequent atomic operations on global variables simply cannot be -tolerated within the RCU implementation. 
-RCU therefore makes heavy use of a combining tree based on the -<tt>rcu_node</tt> structure. -RCU is required to tolerate all CPUs continuously invoking any -combination of RCU's runtime primitives with minimal per-operation -overhead. -In fact, in many cases, increasing load must <i>decrease</i> the -per-operation overhead, witness the batching optimizations for -<tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, -<tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>. -As a general rule, RCU must cheerfully accept whatever the -rest of the Linux kernel decides to throw at it. - -<p> -The Linux kernel is used for real-time workloads, especially -in conjunction with the -<a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>. -The real-time-latency response requirements are such that the -traditional approach of disabling preemption across RCU -read-side critical sections is inappropriate. -Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore -use an RCU implementation that allows RCU read-side critical -sections to be preempted. -This requirement made its presence known after users made it -clear that an earlier -<a href="https://lwn.net/Articles/107930/">real-time patch</a> -did not meet their needs, in conjunction with some -<a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a> -encountered by a very early version of the -rt patchset. - -<p> -In addition, RCU must make do with a sub-100-microsecond real-time latency -budget. -In fact, on smaller systems with the -rt patchset, the Linux kernel -provides sub-20-microsecond real-time latencies for the whole kernel, -including RCU. -RCU's scalability and latency must therefore be sufficient for -these sorts of configurations. 
-To my surprise, the sub-100-microsecond real-time latency budget -<a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf"> -applies to even the largest systems [PDF]</a>, -up to and including systems with 4096 CPUs. -This real-time requirement motivated the grace-period kthread, which -also simplified handling of a number of race conditions. - -<p> -RCU must avoid degrading real-time response for CPU-bound threads, whether -executing in usermode (which is one use case for -<tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel. -That said, CPU-bound loops in the kernel must execute -<tt>cond_resched()</tt> at least once per few tens of milliseconds -in order to avoid receiving an IPI from RCU. - -<p> -Finally, RCU's status as a synchronization primitive means that -any RCU failure can result in arbitrary memory corruption that can be -extremely difficult to debug. -This means that RCU must be extremely reliable, which in -practice also means that RCU must have an aggressive stress-test -suite. -This stress-test suite is called <tt>rcutorture</tt>. - -<p> -Although the need for <tt>rcutorture</tt> was no surprise, -the current immense popularity of the Linux kernel is posing -interesting—and perhaps unprecedented—validation -challenges. -To see this, keep in mind that there are well over one billion -instances of the Linux kernel running today, given Android -smartphones, Linux-powered televisions, and servers. -This number can be expected to increase sharply with the advent of -the celebrated Internet of Things. - -<p> -Suppose that RCU contains a race condition that manifests on average -once per million years of runtime. -This bug will be occurring about three times per <i>day</i> across -the installed base. -RCU could simply hide behind hardware error rates, given that no one -should really expect their smartphone to last for a million years. 
-However, anyone taking too much comfort from this thought should -consider the fact that in most jurisdictions, a successful multi-year -test of a given mechanism, which might include a Linux kernel, -suffices for a number of types of safety-critical certifications. -In fact, rumor has it that the Linux kernel is already being used -in production for safety-critical applications. -I don't know about you, but I would feel quite bad if a bug in RCU -killed someone. -Which might explain my recent focus on validation and verification. - -<h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2> - -<p> -One of the more surprising things about RCU is that there are now -no fewer than five <i>flavors</i>, or API families. -In addition, the primary flavor that has been the sole focus up to -this point has two different implementations, non-preemptible and -preemptible. -The other four flavors are listed below, with requirements for each -described in a separate section. - -<ol> -<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a> -<li> <a href="#Sched Flavor">Sched Flavor (Historical)</a> -<li> <a href="#Sleepable RCU">Sleepable RCU</a> -<li> <a href="#Tasks RCU">Tasks RCU</a> -</ol> - -<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a></h3> - -<p> -The RCU-bh flavor of RCU has since been expressed in terms of -the other RCU flavors as part of a consolidation of the three -flavors into a single flavor. -The read-side API remains, and continues to disable softirq and to -be accounted for by lockdep. -Much of the material in this section is therefore strictly historical -in nature. - -<p> -The softirq-disable (AKA “bottom-half”, -hence the “_bh” abbreviations) -flavor of RCU, or <i>RCU-bh</i>, was developed by -Dipankar Sarma to provide a flavor of RCU that could withstand the -network-based denial-of-service attacks researched by Robert -Olsson. 
-These attacks placed so much networking load on the system -that some of the CPUs never exited softirq execution, -which in turn prevented those CPUs from ever executing a context switch, -which, in the RCU implementation of that time, prevented grace periods -from ever ending. -The result was an out-of-memory condition and a system hang. - -<p> -The solution was the creation of RCU-bh, which does -<tt>local_bh_disable()</tt> -across its read-side critical sections, and which uses the transition -from one type of softirq processing to another as a quiescent state -in addition to context switch, idle, user mode, and offline. -This means that RCU-bh grace periods can complete even when some of -the CPUs execute in softirq indefinitely, thus allowing algorithms -based on RCU-bh to withstand network-based denial-of-service attacks. - -<p> -Because -<tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt> -disable and re-enable softirq handlers, any attempt to start a softirq -handlers during the -RCU-bh read-side critical section will be deferred. -In this case, <tt>rcu_read_unlock_bh()</tt> -will invoke softirq processing, which can take considerable time. -One can of course argue that this softirq overhead should be associated -with the code following the RCU-bh read-side critical section rather -than <tt>rcu_read_unlock_bh()</tt>, but the fact -is that most profiling tools cannot be expected to make this sort -of fine distinction. -For example, suppose that a three-millisecond-long RCU-bh read-side -critical section executes during a time of heavy networking load. -There will very likely be an attempt to invoke at least one softirq -handler during that three milliseconds, but any such invocation will -be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>. -This can of course make it appear at first glance as if -<tt>rcu_read_unlock_bh()</tt> was executing very slowly. 
- -<p> -The -<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a> -includes -<tt>rcu_read_lock_bh()</tt>, -<tt>rcu_read_unlock_bh()</tt>, -<tt>rcu_dereference_bh()</tt>, -<tt>rcu_dereference_bh_check()</tt>, -<tt>synchronize_rcu_bh()</tt>, -<tt>synchronize_rcu_bh_expedited()</tt>, -<tt>call_rcu_bh()</tt>, -<tt>rcu_barrier_bh()</tt>, and -<tt>rcu_read_lock_bh_held()</tt>. -However, the update-side APIs are now simple wrappers for other RCU -flavors, namely RCU-sched in CONFIG_PREEMPT=n kernels and RCU-preempt -otherwise. - -<h3><a name="Sched Flavor">Sched Flavor (Historical)</a></h3> - -<p> -The RCU-sched flavor of RCU has since been expressed in terms of -the other RCU flavors as part of a consolidation of the three -flavors into a single flavor. -The read-side API remains, and continues to disable preemption and to -be accounted for by lockdep. -Much of the material in this section is therefore strictly historical -in nature. - -<p> -Before preemptible RCU, waiting for an RCU grace period had the -side effect of also waiting for all pre-existing interrupt -and NMI handlers. -However, there are legitimate preemptible-RCU implementations that -do not have this property, given that any point in the code outside -of an RCU read-side critical section can be a quiescent state. -Therefore, <i>RCU-sched</i> was created, which follows “classic” -RCU in that an RCU-sched grace period waits for for pre-existing -interrupt and NMI handlers. -In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched -APIs have identical implementations, while kernels built with -<tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each. - -<p> -Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels, -<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> -disable and re-enable preemption, respectively. 
-This means that if there was a preemption attempt during the -RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt> -will enter the scheduler, with all the latency and overhead entailed. -Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look -as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly. -However, the highest-priority task won't be preempted, so that task -will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations. - -<p> -The -<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a> -includes -<tt>rcu_read_lock_sched()</tt>, -<tt>rcu_read_unlock_sched()</tt>, -<tt>rcu_read_lock_sched_notrace()</tt>, -<tt>rcu_read_unlock_sched_notrace()</tt>, -<tt>rcu_dereference_sched()</tt>, -<tt>rcu_dereference_sched_check()</tt>, -<tt>synchronize_sched()</tt>, -<tt>synchronize_rcu_sched_expedited()</tt>, -<tt>call_rcu_sched()</tt>, -<tt>rcu_barrier_sched()</tt>, and -<tt>rcu_read_lock_sched_held()</tt>. -However, anything that disables preemption also marks an RCU-sched -read-side critical section, including -<tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>, -<tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>, -and so on. - -<h3><a name="Sleepable RCU">Sleepable RCU</a></h3> - -<p> -For well over a decade, someone saying “I need to block within -an RCU read-side critical section” was a reliable indication -that this someone did not understand RCU. -After all, if you are always blocking in an RCU read-side critical -section, you can probably afford to use a higher-overhead synchronization -mechanism. -However, that changed with the advent of the Linux kernel's notifiers, -whose RCU read-side critical -sections almost never sleep, but sometimes need to. -This resulted in the introduction of -<a href="https://lwn.net/Articles/202847/">sleepable RCU</a>, -or <i>SRCU</i>. 
- -<p> -SRCU allows different domains to be defined, with each such domain -defined by an instance of an <tt>srcu_struct</tt> structure. -A pointer to this structure must be passed in to each SRCU function, -for example, <tt>synchronize_srcu(&ss)</tt>, where -<tt>ss</tt> is the <tt>srcu_struct</tt> structure. -The key benefit of these domains is that a slow SRCU reader in one -domain does not delay an SRCU grace period in some other domain. -That said, one consequence of these domains is that read-side code -must pass a “cookie” from <tt>srcu_read_lock()</tt> -to <tt>srcu_read_unlock()</tt>, for example, as follows: - -<blockquote> -<pre> - 1 int idx; - 2 - 3 idx = srcu_read_lock(&ss); - 4 do_something(); - 5 srcu_read_unlock(&ss, idx); -</pre> -</blockquote> - -<p> -As noted above, it is legal to block within SRCU read-side critical sections, -however, with great power comes great responsibility. -If you block forever in one of a given domain's SRCU read-side critical -sections, then that domain's grace periods will also be blocked forever. -Of course, one good way to block forever is to deadlock, which can -happen if any operation in a given domain's SRCU read-side critical -section can wait, either directly or indirectly, for that domain's -grace period to elapse. -For example, this results in a self-deadlock: - -<blockquote> -<pre> - 1 int idx; - 2 - 3 idx = srcu_read_lock(&ss); - 4 do_something(); - 5 synchronize_srcu(&ss); - 6 srcu_read_unlock(&ss, idx); -</pre> -</blockquote> - -<p> -However, if line 5 acquired a mutex that was held across -a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>, -deadlock would still be possible. -Furthermore, if line 5 acquired a mutex that was held across -a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>, -and if an <tt>ss1</tt>-domain SRCU read-side critical section -acquired another mutex that was held across as <tt>ss</tt>-domain -<tt>synchronize_srcu()</tt>, -deadlock would again be possible. 
-Such a deadlock cycle could extend across an arbitrarily large number -of different SRCU domains. -Again, with great power comes great responsibility. - -<p> -Unlike the other RCU flavors, SRCU read-side critical sections can -run on idle and even offline CPUs. -This ability requires that <tt>srcu_read_lock()</tt> and -<tt>srcu_read_unlock()</tt> contain memory barriers, which means -that SRCU readers will run a bit slower than would RCU readers. -It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt> -API, which, in combination with <tt>srcu_read_unlock()</tt>, -guarantees a full memory barrier. - -<p> -Also unlike other RCU flavors, <tt>synchronize_srcu()</tt> may <b>not</b> -be invoked from CPU-hotplug notifiers, due to the fact that SRCU grace -periods make use of timers and the possibility of timers being temporarily -“stranded” on the outgoing CPU. -This stranding of timers means that timers posted to the outgoing CPU -will not fire until late in the CPU-hotplug process. -The problem is that if a notifier is waiting on an SRCU grace period, -that grace period is waiting on a timer, and that timer is stranded on the -outgoing CPU, then the notifier will never be awakened, in other words, -deadlock has occurred. -This same situation of course also prohibits <tt>srcu_barrier()</tt> -from being invoked from CPU-hotplug notifiers. - -<p> -SRCU also differs from other RCU flavors in that SRCU's expedited and -non-expedited grace periods are implemented by the same mechanism. -This means that in the current SRCU implementation, expediting a -future grace period has the side effect of expediting all prior -grace periods that have not yet completed. -(But please note that this is a property of the current implementation, -not necessarily of future implementations.) 
-In addition, if SRCU has been idle for longer than the interval -specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter -(25 microseconds by default), -and if a <tt>synchronize_srcu()</tt> invocation ends this idle period, -that invocation will be automatically expedited. - -<p> -As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating -a locking bottleneck present in prior kernel versions. -Although this will allow users to put much heavier stress on -<tt>call_srcu()</tt>, it is important to note that SRCU does not -yet take any special steps to deal with callback flooding. -So if you are posting (say) 10,000 SRCU callbacks per second per CPU, -you are probably totally OK, but if you intend to post (say) 1,000,000 -SRCU callbacks per second per CPU, please run some tests first. -SRCU just might need a few adjustment to deal with that sort of load. -Of course, your mileage may vary based on the speed of your CPUs and -the size of your memory. - -<p> -The -<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a> -includes -<tt>srcu_read_lock()</tt>, -<tt>srcu_read_unlock()</tt>, -<tt>srcu_dereference()</tt>, -<tt>srcu_dereference_check()</tt>, -<tt>synchronize_srcu()</tt>, -<tt>synchronize_srcu_expedited()</tt>, -<tt>call_srcu()</tt>, -<tt>srcu_barrier()</tt>, and -<tt>srcu_read_lock_held()</tt>. -It also includes -<tt>DEFINE_SRCU()</tt>, -<tt>DEFINE_STATIC_SRCU()</tt>, and -<tt>init_srcu_struct()</tt> -APIs for defining and initializing <tt>srcu_struct</tt> structures. - -<h3><a name="Tasks RCU">Tasks RCU</a></h3> - -<p> -Some forms of tracing use “trampolines” to handle the -binary rewriting required to install different types of probes. -It would be good to be able to free old trampolines, which sounds -like a job for some form of RCU. 
-However, because it is necessary to be able to install a trace -anywhere in the code, it is not possible to use read-side markers -such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. -In addition, it does not work to have these markers in the trampoline -itself, because there would need to be instructions following -<tt>rcu_read_unlock()</tt>. -Although <tt>synchronize_rcu()</tt> would guarantee that execution -reached the <tt>rcu_read_unlock()</tt>, it would not be able to -guarantee that execution had completely left the trampoline. - -<p> -The solution, in the form of -<a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>, -is to have implicit -read-side critical sections that are delimited by voluntary context -switches, that is, calls to <tt>schedule()</tt>, -<tt>cond_resched()</tt>, and -<tt>synchronize_rcu_tasks()</tt>. -In addition, transitions to and from userspace execution also delimit -tasks-RCU read-side critical sections. - -<p> -The tasks-RCU API is quite compact, consisting only of -<tt>call_rcu_tasks()</tt>, -<tt>synchronize_rcu_tasks()</tt>, and -<tt>rcu_barrier_tasks()</tt>. -In <tt>CONFIG_PREEMPT=n</tt> kernels, trampolines cannot be preempted, -so these APIs map to -<tt>call_rcu()</tt>, -<tt>synchronize_rcu()</tt>, and -<tt>rcu_barrier()</tt>, respectively. -In <tt>CONFIG_PREEMPT=y</tt> kernels, trampolines can be preempted, -and these three APIs are therefore implemented by separate functions -that check for voluntary context switches. - -<h2><a name="Possible Future Changes">Possible Future Changes</a></h2> - -<p> -One of the tricks that RCU uses to attain update-side scalability is -to increase grace-period latency with increasing numbers of CPUs. -If this becomes a serious problem, it will be necessary to rework the -grace-period state machine so as to avoid the need for the additional -latency. - -<p> -RCU disables CPU hotplug in a few places, perhaps most notably in the -<tt>rcu_barrier()</tt> operations. 
-If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug -notifiers, it will be necessary to avoid disabling CPU hotplug. -This would introduce some complexity, so there had better be a <i>very</i> -good reason. - -<p> -The tradeoff between grace-period latency on the one hand and interruptions -of other CPUs on the other hand may need to be re-examined. -The desire is of course for zero grace-period latency as well as zero -interprocessor interrupts undertaken during an expedited grace period -operation. -While this ideal is unlikely to be achievable, it is quite possible that -further improvements can be made. - -<p> -The multiprocessor implementations of RCU use a combining tree that -groups CPUs so as to reduce lock contention and increase cache locality. -However, this combining tree does not spread its memory across NUMA -nodes nor does it align the CPU groups with hardware features such -as sockets or cores. -Such spreading and alignment is currently believed to be unnecessary -because the hotpath read-side primitives do not access the combining -tree, nor does <tt>call_rcu()</tt> in the common case. -If you believe that your architecture needs such spreading and alignment, -then your architecture should also benefit from the -<tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set -to the number of CPUs in a socket, NUMA node, or whatever. -If the number of CPUs is too large, use a fraction of the number of -CPUs. -If the number of CPUs is a large prime number, well, that certainly -is an “interesting” architectural choice! -More flexible arrangements might be considered, but only if -<tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only -if the inadequacy has been demonstrated by a carefully run and -realistic system-level workload. - -<p> -Please note that arrangements that require RCU to remap CPU numbers will -require extremely good demonstration of need and full exploration of -alternatives. 
- -<p> -RCU's various kthreads are reasonably recent additions. -It is quite likely that adjustments will be required to more gracefully -handle extreme loads. -It might also be necessary to be able to relate CPU utilization by -RCU's kthreads and softirq handlers to the code that instigated this -CPU utilization. -For example, RCU callback overhead might be charged back to the -originating <tt>call_rcu()</tt> instance, though probably not -in production kernels. - -<p> -Additional work may be required to provide reasonable forward-progress -guarantees under heavy load for grace periods and for callback -invocation. - -<h2><a name="Summary">Summary</a></h2> - -<p> -This document has presented more than two decade's worth of RCU -requirements. -Given that the requirements keep changing, this will not be the last -word on this subject, but at least it serves to get an important -subset of the requirements set forth. - -<h2><a name="Acknowledgments">Acknowledgments</a></h2> - -I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, -Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and -Andy Lutomirski for their help in rendering -this article human readable, and to Michelle Rankin for her support -of this effort. -Other contributions are acknowledged in the Linux kernel's git archive. - -</body></html> diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst new file mode 100644 index 000000000000..fd5e2cbc4935 --- /dev/null +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -0,0 +1,2704 @@ +================================= +A Tour Through RCU's Requirements +================================= + +Copyright IBM Corporation, 2015 + +Author: Paul E. 
McKenney + +The initial version of this document appeared on +`LWN <https://lwn.net/>`_ in these articles: +`part 1 <https://lwn.net/Articles/652156/>`_, +`part 2 <https://lwn.net/Articles/652677/>`_, and +`part 3 <https://lwn.net/Articles/653326/>`_. + +Introduction +------------ + +Read-copy update (RCU) is a synchronization mechanism that is often used +as a replacement for reader-writer locking. RCU is unusual in that +updaters do not block readers, which means that RCU's read-side +primitives can be exceedingly fast and scalable. In addition, updaters +can make useful forward progress concurrently with readers. However, all +this concurrency between RCU readers and updaters does raise the +question of exactly what RCU readers are doing, which in turn raises the +question of exactly what RCU's requirements are. + +This document therefore summarizes RCU's requirements, and can be +thought of as an informal, high-level specification for RCU. It is +important to understand that RCU's specification is primarily empirical +in nature; in fact, I learned about many of these requirements the hard +way. This situation might cause some consternation; however, not only +has this learning process been a lot of fun, but it has also been a +great privilege to work with so many people willing to apply +technologies in interesting new ways. + +All that aside, here are the categories of currently known RCU +requirements: + +#. `Fundamental Requirements`_ +#. `Fundamental Non-Requirements`_ +#. `Parallelism Facts of Life`_ +#. `Quality-of-Implementation Requirements`_ +#. `Linux Kernel Complications`_ +#. `Software-Engineering Requirements`_ +#. `Other RCU Flavors`_ +#. `Possible Future Changes`_ + +This is followed by a `summary <#Summary>`__. The answer to +each quick quiz immediately follows the quiz.
+ +Fundamental Requirements +------------------------ + +RCU's fundamental requirements are the closest thing RCU has to hard +mathematical requirements. These are: + +#. `Grace-Period Guarantee`_ +#. `Publish/Subscribe Guarantee`_ +#. `Memory-Barrier Guarantees`_ +#. `RCU Primitives Guaranteed to Execute Unconditionally`_ +#. `Guaranteed Read-to-Write Upgrade`_ + +Grace-Period Guarantee +~~~~~~~~~~~~~~~~~~~~~~ + +RCU's grace-period guarantee is unusual in being premeditated: Jack +Slingwine and I had this guarantee firmly in mind when we started work +on RCU (then called “rclock”) in the early 1990s. That said, the past +two decades of experience with RCU have produced a much more detailed +understanding of this guarantee. + +RCU's grace-period guarantee allows updaters to wait for the completion +of all pre-existing RCU read-side critical sections. An RCU read-side +critical section begins with the marker ``rcu_read_lock()`` and ends +with the marker ``rcu_read_unlock()``. These markers may be nested, and +RCU treats a nested set as one big RCU read-side critical section. +Production-quality implementations of ``rcu_read_lock()`` and +``rcu_read_unlock()`` are extremely lightweight, and in fact have +exactly zero overhead in Linux kernels built for production use with +``CONFIG_PREEMPT=n``. + +This guarantee allows ordering to be enforced with extremely low +overhead to readers, for example: + + :: + + 1 int x, y; + 2 + 3 void thread0(void) + 4 { + 5 rcu_read_lock(); + 6 r1 = READ_ONCE(x); + 7 r2 = READ_ONCE(y); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 WRITE_ONCE(x, 1); + 14 synchronize_rcu(); + 15 WRITE_ONCE(y, 1); + 16 } + +Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing +readers, any instance of ``thread0()`` that loads a value of zero from +``x`` must complete before ``thread1()`` stores to ``y``, so that +instance must also load a value of zero from ``y``. 
Similarly, any +instance of ``thread0()`` that loads a value of one from ``y`` must have +started after the ``synchronize_rcu()`` started, and must therefore also +load a value of one from ``x``. Therefore, the outcome: + + :: + + (r1 == 0 && r2 == 1) + +cannot happen. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that updaters can make useful forward | +| progress concurrently with readers, but pre-existing readers will | +| block ``synchronize_rcu()``!!! | +| Just who are you trying to fool??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| First, if updaters do not wish to be blocked by readers, they can use | +| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. | +| Second, even when using ``synchronize_rcu()``, the other update-side | +| code does run concurrently with readers, whether pre-existing or not. 
| ++-----------------------------------------------------------------------+ + +This scenario resembles one of the first uses of RCU in +`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a +distributed lock manager's transition into a state suitable for handling +recovery from node failure, more or less as follows: + + :: + + 1 #define STATE_NORMAL 0 + 2 #define STATE_WANT_RECOVERY 1 + 3 #define STATE_RECOVERING 2 + 4 #define STATE_WANT_NORMAL 3 + 5 + 6 int state = STATE_NORMAL; + 7 + 8 void do_something_dlm(void) + 9 { + 10 int state_snap; + 11 + 12 rcu_read_lock(); + 13 state_snap = READ_ONCE(state); + 14 if (state_snap == STATE_NORMAL) + 15 do_something(); + 16 else + 17 do_something_carefully(); + 18 rcu_read_unlock(); + 19 } + 20 + 21 void start_recovery(void) + 22 { + 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); + 24 synchronize_rcu(); + 25 WRITE_ONCE(state, STATE_RECOVERING); + 26 recovery(); + 27 WRITE_ONCE(state, STATE_WANT_NORMAL); + 28 synchronize_rcu(); + 29 WRITE_ONCE(state, STATE_NORMAL); + 30 } + +The RCU read-side critical section in ``do_something_dlm()`` works with +the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that +``do_something()`` never runs concurrently with ``recovery()``, but with +little or no synchronization overhead in ``do_something_dlm()``. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why is the ``synchronize_rcu()`` on line 28 needed? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Without that extra grace period, memory reordering could result in | +| ``do_something_dlm()`` executing ``do_something()`` concurrently with | +| the last bits of ``recovery()``. 
| ++-----------------------------------------------------------------------+ + +In order to avoid fatal problems such as deadlocks, an RCU read-side +critical section must not contain calls to ``synchronize_rcu()``. +Similarly, an RCU read-side critical section must not contain anything +that waits, directly or indirectly, on completion of an invocation of +``synchronize_rcu()``. + +Although RCU's grace-period guarantee is useful in and of itself, with +`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would +be good to be able to use RCU to coordinate read-side access to linked +data structures. For this, the grace-period guarantee is not sufficient, +as can be seen in function ``add_gp_buggy()`` below. We will look at the +reader's code later, but in the meantime, just think of the reader as +locklessly picking up the ``gp`` pointer, and, if the value loaded is +non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields. + + :: + + 1 bool add_gp_buggy(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return false; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = b; + 13 gp = p; /* ORDERING BUG */ + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The problem is that both the compiler and weakly ordered CPUs are within +their rights to reorder this code as follows: + + :: + + 1 bool add_gp_buggy_optimized(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return false; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 gp = p; /* ORDERING BUG */ + 12 p->a = a; + 13 p->b = b; + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized`` +executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
+And this is but one of many ways in which compiler and hardware +optimizations could cause trouble. Therefore, we clearly need some way +to prevent the compiler and the CPU from reordering in this manner, +which brings us to the publish-subscribe guarantee discussed in the next +section. + +Publish/Subscribe Guarantee +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +RCU's publish-subscribe guarantee allows data to be inserted into a +linked data structure without disrupting RCU readers. The updater uses +``rcu_assign_pointer()`` to insert the new data, and readers use +``rcu_dereference()`` to access data, whether new or old. The following +shows an example of insertion: + + :: + + 1 bool add_gp(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return false; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = b; + 13 rcu_assign_pointer(gp, p); + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a +simple assignment statement, but also guarantees that its assignment +will happen after the two assignments in lines 11 and 12, similar to the +C11 ``memory_order_release`` store operation. It also prevents any +number of “interesting” compiler optimizations, for example, the use of +``gp`` as a scratch location immediately preceding the assignment. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But ``rcu_assign_pointer()`` does nothing to prevent the two | +| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that | +| also cause problems? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, it cannot.
The readers cannot see either of these two fields | +| until the assignment to ``gp``, by which time both fields are fully | +| initialized. So reordering the assignments to ``p->a`` and ``p->b`` | +| cannot possibly cause any problems. | ++-----------------------------------------------------------------------+ + +It is tempting to assume that the reader need not do anything special to +control its accesses to the RCU-protected data, as shown in +``do_something_gp_buggy()`` below: + + :: + + 1 bool do_something_gp_buggy(void) + 2 { + 3 rcu_read_lock(); + 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } + 10 rcu_read_unlock(); + 11 return false; + 12 } + +However, this temptation must be resisted because there are a +surprisingly large number of ways that the compiler (to say nothing of +`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__) +can trip this code up. For but one example, if the compiler were short +of registers, it might choose to refetch from ``gp`` rather than keeping +a separate copy in ``p`` as follows: + + :: + + 1 bool do_something_gp_buggy_optimized(void) + 2 { + 3 rcu_read_lock(); + 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ + 5 do_something(gp->a, gp->b); + 6 rcu_read_unlock(); + 7 return true; + 8 } + 9 rcu_read_unlock(); + 10 return false; + 11 } + +If this function ran concurrently with a series of updates that replaced +the current structure with a new one, the fetches of ``gp->a`` and +``gp->b`` might well come from two different structures, which could +cause serious confusion. 
To prevent this (and much else besides), +``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``: + + :: + + 1 bool do_something_gp(void) + 2 { + 3 rcu_read_lock(); + 4 p = rcu_dereference(gp); + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } + 10 rcu_read_unlock(); + 11 return false; + 12 } + +The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory +barriers in the Linux kernel. Should a `high-quality implementation of +C11 ``memory_order_consume`` +[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__ +ever appear, then ``rcu_dereference()`` could be implemented as a +``memory_order_consume`` load. Regardless of the exact implementation, a +pointer fetched by ``rcu_dereference()`` may not be used outside of the +outermost RCU read-side critical section containing that +``rcu_dereference()``, unless protection of the corresponding data +element has been passed from RCU to some other synchronization +mechanism, most commonly locking or `reference +counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__. + +In short, updaters use ``rcu_assign_pointer()`` and readers use +``rcu_dereference()``, and these two RCU API elements work together to +ensure that readers have a consistent view of newly added data elements. + +Of course, it is also necessary to remove elements from RCU-protected +data structures, for example, using the following process: + +#. Remove the data element from the enclosing structure. +#. Wait for all pre-existing RCU read-side critical sections to complete + (because only pre-existing readers can possibly have a reference to + the newly removed data element). +#. At this point, only the updater has a reference to the newly removed + data element, so it can safely reclaim the data element, for example, + by passing it to ``kfree()``. 
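The publish, subscribe, and three-step removal pattern described above can be tried out in user space. The following is only a rough analogy written with C11 atomics, not kernel code: a release store stands in for ``rcu_assign_pointer()``, an atomic load stands in for ``rcu_dereference()``, and a hypothetical reader counter (``nreaders``) stands in for the grace period. The names ``read_gp()``, ``remove_gp()`` and ``nreaders`` are assumptions made for this sketch, a single updater thread is assumed in place of ``gp_lock``, and the counter may wait longer than a real grace period would because it also waits for readers that started late.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct foo {
	int a;
	int b;
};

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
static _Atomic(struct foo *) gp;	/* published pointer, initially NULL */
static atomic_int nreaders;		/* crude substitute for reader tracking */

/* Publish: initialize the fields first, then release-store the pointer. */
bool add_gp(int a, int b)
{
	struct foo *p = malloc(sizeof(*p));

	if (!p)
		return false;
	if (atomic_load_explicit(&gp, memory_order_relaxed)) {
		free(p);	/* something is already published */
		return false;
	}
	p->a = a;
	p->b = b;
	atomic_store_explicit(&gp, p, memory_order_release);
	return true;
}

/* Subscribe: fetch the pointer exactly once and use only that snapshot. */
bool read_gp(int *a, int *b)
{
	struct foo *p;
	bool found = false;

	assert(a && b);
	atomic_fetch_add(&nreaders, 1);	/* plays the role of rcu_read_lock() */
	p = atomic_load(&gp);		/* sequentially consistent, implies acquire */
	if (p) {
		*a = p->a;
		*b = p->b;
		found = true;
	}
	atomic_fetch_sub(&nreaders, 1);	/* plays the role of rcu_read_unlock() */
	return found;
}

/* Remove, wait for readers, then reclaim. */
bool remove_gp(void)
{
	/* Step 1: unpublish, so that new readers see NULL. */
	struct foo *p = atomic_exchange(&gp, (struct foo *)NULL);

	if (!p)
		return false;
	/* Step 2: wait until no reader can still hold the old pointer. */
	while (atomic_load(&nreaders) != 0)
		;
	/* Step 3: only the updater can reach p now, so free it. */
	free(p);
	return true;
}
```

In this sketch, calling ``add_gp(1, 2)`` followed by ``read_gp()`` yields both fields of the same structure, and a second ``add_gp()`` fails until ``remove_gp()`` has run. The shared counter makes readers contend on one cache line, which is precisely the overhead that real RCU readers avoid.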
+ +This process is implemented by ``remove_gp_synchronous()``: + + :: + + 1 bool remove_gp_synchronous(void) + 2 { + 3 struct foo *p; + 4 + 5 spin_lock(&gp_lock); + 6 p = rcu_access_pointer(gp); + 7 if (!p) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 rcu_assign_pointer(gp, NULL); + 12 spin_unlock(&gp_lock); + 13 synchronize_rcu(); + 14 kfree(p); + 15 return true; + 16 } + +This function is straightforward, with line 13 waiting for a grace +period before line 14 frees the old data element. This waiting ensures +that readers will reach line 7 of ``do_something_gp()`` before the data +element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on +line 6 is similar to ``rcu_dereference()``, except that: + +#. The value returned by ``rcu_access_pointer()`` cannot be + dereferenced. If you want to access the value pointed to as well as + the pointer itself, use ``rcu_dereference()`` instead of + ``rcu_access_pointer()``. +#. The call to ``rcu_access_pointer()`` need not be protected. In + contrast, ``rcu_dereference()`` must either be within an RCU + read-side critical section or in a code segment where the pointer + cannot change, for example, in code protected by the corresponding + update-side lock. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, | +| what destructive optimizations might the compiler make use of? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Let's start with what happens to ``do_something_gp()`` if it fails to | +| use ``rcu_dereference()``. It could reuse a value formerly fetched | +| from this same pointer. 
It could also fetch the pointer from ``gp`` | +| in a byte-at-a-time manner, resulting in *load tearing*, in turn | +| resulting in a bytewise mash-up of two distinct pointer values. It might | +| even use value-speculation optimizations, where it makes a wrong | +| guess, but by the time it gets around to checking the value, an | +| update has changed the pointer to match the wrong guess. Too bad | +| about any dereferences that returned pre-initialization garbage in | +| the meantime! | +| For ``remove_gp_synchronous()``, as long as all modifications to | +| ``gp`` are carried out while holding ``gp_lock``, the above | +| optimizations are harmless. However, ``sparse`` will complain if you | +| define ``gp`` with ``__rcu`` and then access it without using either | +| ``rcu_access_pointer()`` or ``rcu_dereference()``. | ++-----------------------------------------------------------------------+ + +In short, RCU's publish-subscribe guarantee is provided by the +combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This +guarantee allows data elements to be safely added to RCU-protected +linked data structures without disrupting RCU readers. This guarantee +can be used in combination with the grace-period guarantee to also allow +data elements to be removed from RCU-protected linked data structures, +again without disrupting RCU readers. + +This guarantee was only partially premeditated. DYNIX/ptx used an +explicit memory barrier for publication, but had nothing resembling +``rcu_dereference()`` for subscription, nor did it have anything +resembling the ``smp_read_barrier_depends()`` that was later subsumed +into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The +need for these operations made itself known quite suddenly at a +late-1990s meeting with the DEC Alpha architects, back in the days when +DEC was still a free-standing company.
It took the Alpha architects a +good hour to convince me that any sort of barrier would ever be needed, +and it then took me a good *two* hours to convince them that their +documentation did not make this point clear. More recent work with the C +and C++ standards committees has provided much education on tricks and +traps from the compiler. In short, compilers were much less tricky in +the early 1990s, but in 2015, don't even think about omitting +``rcu_dereference()``! + +Memory-Barrier Guarantees +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The previous section's simple linked-data-structure scenario clearly +demonstrates the need for RCU's stringent memory-ordering guarantees on +systems with more than one CPU: + +#. Each CPU that has an RCU read-side critical section that begins + before ``synchronize_rcu()`` starts is guaranteed to execute a full + memory barrier between the time that the RCU read-side critical + section ends and the time that ``synchronize_rcu()`` returns. Without + this guarantee, a pre-existing RCU read-side critical section might + hold a reference to the newly removed ``struct foo`` after the + ``kfree()`` on line 14 of ``remove_gp_synchronous()``. +#. Each CPU that has an RCU read-side critical section that ends after + ``synchronize_rcu()`` returns is guaranteed to execute a full memory + barrier between the time that ``synchronize_rcu()`` begins and the + time that the RCU read-side critical section begins. Without this + guarantee, a later RCU read-side critical section running after the + ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run + ``do_something_gp()`` and find the newly deleted ``struct foo``. +#. If the task invoking ``synchronize_rcu()`` remains on a given CPU, + then that CPU is guaranteed to execute a full memory barrier sometime + during the execution of ``synchronize_rcu()``.
This guarantee ensures + that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really + does execute after the removal on line 11. +#. If the task invoking ``synchronize_rcu()`` migrates among a group of + CPUs during that invocation, then each of the CPUs in that group is + guaranteed to execute a full memory barrier sometime during the + execution of ``synchronize_rcu()``. This guarantee also ensures that + the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does + execute after the removal on line 11, but also in the case where the + thread executing the ``synchronize_rcu()`` migrates in the meantime. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that multiple CPUs can start RCU read-side critical sections at | +| any time without any ordering whatsoever, how can RCU possibly tell | +| whether or not a given RCU read-side critical section starts before a | +| given instance of ``synchronize_rcu()``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| If RCU cannot tell whether or not a given RCU read-side critical | +| section starts before a given instance of ``synchronize_rcu()``, then | +| it must assume that the RCU read-side critical section started first. | +| In other words, a given instance of ``synchronize_rcu()`` can avoid | +| waiting on a given RCU read-side critical section only if it can | +| prove that ``synchronize_rcu()`` started first. 
| +| A related question is “When ``rcu_read_lock()`` doesn't generate any | +| code, why does it matter how it relates to a grace period?” The | +| answer is that it is not the relationship of ``rcu_read_lock()`` | +| itself that is important, but rather the relationship of the code | +| within the enclosed RCU read-side critical section to the code | +| preceding and following the grace period. If we take this viewpoint, | +| then a given RCU read-side critical section begins before a given | +| grace period when some access preceding the grace period observes the | +| effect of some access within the critical section, in which case none | +| of the accesses within the critical section may observe the effects | +| of any access following the grace period. | +| | +| As of late 2016, mathematical models of RCU take this viewpoint, for | +| example, see slides 62 and 63 of the `2016 LinuxCon | +| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 | +| 6.10.04c.LCE.pdf>`__ | +| presentation. | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| The first and second guarantees require unbelievably strict ordering! | +| Are all these memory barriers *really* required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Yes, they really are required. To see why the first guarantee is | +| required, consider the following sequence of events: | +| | +| #. CPU 1: ``rcu_read_lock()`` | +| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: ``synchronize_rcu()`` starts. | +| #. CPU 1: ``do_something_with(q->a);`` | +| ``/* No smp_mb(), so might happen after kfree(). 
*/`` | +| #. CPU 1: ``rcu_read_unlock()`` | +| #. CPU 0: ``synchronize_rcu()`` returns. | +| #. CPU 0: ``kfree(p);`` | +| | +| Therefore, there absolutely must be a full memory barrier between the | +| end of the RCU read-side critical section and the end of the grace | +| period. | +| | +| The sequence of events demonstrating the necessity of the second rule | +| is roughly similar: | +| | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: ``synchronize_rcu()`` starts. | +| #. CPU 1: ``rcu_read_lock()`` | +| #. CPU 1: ``q = rcu_dereference(gp);`` | +| ``/* Might return p if no memory barrier. */`` | +| #. CPU 0: ``synchronize_rcu()`` returns. | +| #. CPU 0: ``kfree(p);`` | +| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` | +| #. CPU 1: ``rcu_read_unlock()`` | +| | +| And similarly, without a memory barrier between the beginning of the | +| grace period and the beginning of the RCU read-side critical section, | +| CPU 1 might end up accessing the freelist. | +| | +| The “as if” rule of course applies, so that any implementation that | +| acts as if the appropriate memory barriers were in place is a correct | +| implementation. That said, it is much easier to fool yourself into | +| believing that you have adhered to the as-if rule than it is to | +| actually adhere to it! | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | +| absolutely no code in some kernel builds. This means that the | +| compiler might arbitrarily rearrange consecutive RCU read-side | +| critical sections. Given such rearrangement, if a given RCU read-side | +| critical section is done, how can you be sure that all prior RCU | +| read-side critical sections are done? 
Won't the compiler | +| rearrangements make that impossible to determine? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | +| absolutely no code, RCU infers quiescent states only at special | +| locations, for example, within the scheduler. Because calls to | +| ``schedule()`` had better prevent calling-code accesses to shared | +| variables from being rearranged across the call to ``schedule()``, if | +| RCU detects the end of a given RCU read-side critical section, it | +| will necessarily detect the end of all prior RCU read-side critical | +| sections, no matter how aggressively the compiler scrambles the code. | +| Again, this all assumes that the compiler cannot scramble code across | +| calls to the scheduler, out of interrupt handlers, into the idle | +| loop, into user-mode code, and so on. But if your kernel build allows | +| that sort of scrambling, you have broken far more than just RCU! | ++-----------------------------------------------------------------------+ + +Note that these memory-barrier requirements do not replace the +fundamental RCU requirement that a grace period wait for all +pre-existing readers. On the contrary, the memory barriers called out in +this section must operate in such a way as to *enforce* this fundamental +requirement. Of course, different implementations enforce this +requirement in different ways, but enforce it they must. + +RCU Primitives Guaranteed to Execute Unconditionally +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The common-case RCU primitives are unconditional. They are invoked, they +do their job, and they return, with no possibility of error, and no need +to retry. This is a key RCU design philosophy. + +However, this philosophy is pragmatic rather than pigheaded. 
If someone +comes up with a good justification for a particular conditional RCU +primitive, it might well be implemented and added. After all, this +guarantee was reverse-engineered, not premeditated. The unconditional +nature of the RCU primitives was initially an accident of +implementation, and later experience with synchronization mechanisms +that have conditional primitives caused me to elevate this accident to a +guarantee. Therefore, the justification for adding a conditional +primitive to RCU would need to be based on detailed and compelling use +cases. + +Guaranteed Read-to-Write Upgrade +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As far as RCU is concerned, it is always possible to carry out an update +within an RCU read-side critical section. For example, that RCU +read-side critical section might search for a given data element, and +then might acquire the update-side spinlock in order to update that +element, all while remaining in that RCU read-side critical section. Of +course, it is necessary to exit the RCU read-side critical section +before invoking ``synchronize_rcu()``, however, this inconvenience can +be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API +members described later in this document. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But how does the upgrade-to-write operation exclude other readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| It doesn't, just like normal RCU updates, which also do not exclude | +| RCU readers. | ++-----------------------------------------------------------------------+ + +This guarantee allows lookup code to be shared between read-side and +update-side code, and was premeditated, appearing in the earliest +DYNIX/ptx RCU documentation.
+ +Fundamental Non-Requirements +---------------------------- + +RCU provides extremely lightweight readers, and its read-side +guarantees, though quite useful, are correspondingly lightweight. It is +therefore all too easy to assume that RCU is guaranteeing more than it +really is. Of course, the list of things that RCU does not guarantee is +infinitely long, however, the following sections list a few +non-guarantees that have caused confusion. Except where otherwise noted, +these non-guarantees were premeditated. + +#. `Readers Impose Minimal Ordering`_ +#. `Readers Do Not Exclude Updaters`_ +#. `Updaters Only Wait For Old Readers`_ +#. `Grace Periods Don't Partition Read-Side Critical Sections`_ +#. `Read-Side Critical Sections Don't Partition Grace Periods`_ + +Readers Impose Minimal Ordering +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Reader-side markers such as ``rcu_read_lock()`` and +``rcu_read_unlock()`` provide absolutely no ordering guarantees except +through their interaction with the grace-period APIs such as +``synchronize_rcu()``. To see this, consider the following pair of +threads: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(x, 1); + 5 rcu_read_unlock(); + 6 rcu_read_lock(); + 7 WRITE_ONCE(y, 1); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 rcu_read_lock(); + 14 r1 = READ_ONCE(y); + 15 rcu_read_unlock(); + 16 rcu_read_lock(); + 17 r2 = READ_ONCE(x); + 18 rcu_read_unlock(); + 19 } + +After ``thread0()`` and ``thread1()`` execute concurrently, it is quite +possible to have + + :: + + (r1 == 1 && r2 == 0) + +(that is, ``y`` appears to have been assigned before ``x``), which would +not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had +much in the way of ordering properties. But they do not, so the CPU is +within its rights to do significant reordering. This is by design: Any +significant ordering constraints would slow down these fast-path APIs. 
+ ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Can't the compiler also reorder this code? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` | +| prevent the compiler from reordering in this particular case. | ++-----------------------------------------------------------------------+ + +Readers Do Not Exclude Updaters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates. +All they do is to prevent grace periods from ending. The following +example illustrates this: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 r1 = READ_ONCE(y); + 5 if (r1) { + 6 do_something_with_nonzero_x(); + 7 r2 = READ_ONCE(x); + 8 WARN_ON(!r2); /* BUG!!! */ + 9 } + 10 rcu_read_unlock(); + 11 } + 12 + 13 void thread1(void) + 14 { + 15 spin_lock(&my_lock); + 16 WRITE_ONCE(x, 1); + 17 WRITE_ONCE(y, 1); + 18 spin_unlock(&my_lock); + 19 } + +If the ``thread0()`` function's ``rcu_read_lock()`` excluded the +``thread1()`` function's update, the ``WARN_ON()`` could never fire. But +the fact is that ``rcu_read_lock()`` does not exclude much of anything +aside from subsequent grace periods, of which ``thread1()`` has none, so +the ``WARN_ON()`` can and does fire. + +Updaters Only Wait For Old Readers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It might be tempting to assume that after ``synchronize_rcu()`` +completes, there are no readers executing. This temptation must be +avoided because new readers can start immediately after +``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no +obligation to wait for these new readers. 
+ ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that synchronize_rcu() did wait until *all* readers had | +| completed instead of waiting only on pre-existing readers. For how | +| long would the updater be able to rely on there being no readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| For no time at all. Even if ``synchronize_rcu()`` were to wait until | +| all readers had completed, a new reader might start immediately after | +| ``synchronize_rcu()`` completed. Therefore, the code following | +| ``synchronize_rcu()`` can *never* rely on there being no readers. | ++-----------------------------------------------------------------------+ + +Grace Periods Don't Partition Read-Side Critical Sections +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is tempting to assume that if any part of one RCU read-side critical +section precedes a given grace period, and if any part of another RCU +read-side critical section follows that same grace period, then all of +the first RCU read-side critical section must precede all of the second. +However, this just isn't the case: A single grace period does not +partition the set of RCU read-side critical sections. 
An example of this +situation can be illustrated as follows, where ``a``, ``b``, and ``c`` +are initially all zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 r2 = READ_ONCE(b); + 20 r3 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + +It turns out that the outcome: + + :: + + (r1 == 1 && r2 == 0 && r3 == 1) + +is entirely possible. The following figure shows how this can happen, +with each circled ``QS`` indicating the point at which RCU recorded a +*quiescent state* for each thread, that is, a state in which RCU knows +that the thread cannot be in the midst of an RCU read-side critical +section that started before the current grace period: + +.. kernel-figure:: GPpartitionReaders1.svg + +If it is necessary to partition RCU read-side critical sections in this +manner, it is necessary to use two grace periods, where the first grace +period is known to end before the second grace period starts: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 r2 = READ_ONCE(c); + 19 synchronize_rcu(); + 20 WRITE_ONCE(d, 1); + 21 } + 22 + 23 void thread3(void) + 24 { + 25 rcu_read_lock(); + 26 r3 = READ_ONCE(b); + 27 r4 = READ_ONCE(d); + 28 rcu_read_unlock(); + 29 } + +Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen +before the end of ``thread1()``'s grace period. If in addition +``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after +the beginning of ``thread2()``'s grace period.
If it is also the case +that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must +precede the beginning of ``thread2()``'s grace period. This means that +the two RCU read-side critical sections cannot overlap, guaranteeing +that ``(r3 == 1)``. As a result, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) + +cannot happen. + +This non-requirement was also non-premeditated, but became apparent when +studying RCU's interaction with memory ordering. + +Read-Side Critical Sections Don't Partition Grace Periods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is also tempting to assume that if an RCU read-side critical section +happens between a pair of grace periods, then those grace periods cannot +overlap. However, this temptation leads nowhere good, as can be +illustrated by the following, with all variables initially zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 WRITE_ONCE(d, 1); + 20 r2 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + 23 + 24 void thread3(void) + 25 { + 26 r3 = READ_ONCE(d); + 27 synchronize_rcu(); + 28 WRITE_ONCE(e, 1); + 29 } + 30 + 31 void thread4(void) + 32 { + 33 rcu_read_lock(); + 34 r4 = READ_ONCE(b); + 35 r5 = READ_ONCE(e); + 36 rcu_read_unlock(); + 37 } + +In this case, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) + +is entirely possible, as illustrated below: + +.. kernel-figure:: ReadersPartitionGP1.svg + +Again, an RCU read-side critical section can overlap almost all of a +given grace period, just so long as it does not overlap the entire grace +period. As a result, an RCU read-side critical section cannot partition +a pair of RCU grace periods.
+ ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How long a sequence of grace periods, each separated by an RCU | +| read-side critical section, would be required to partition the RCU | +| read-side critical sections at the beginning and end of the chain? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In theory, an infinite number. In practice, an unknown number that is | +| sensitive to both implementation details and timing considerations. | +| Therefore, even in practice, RCU users must abide by the theoretical | +| rather than the practical answer. | ++-----------------------------------------------------------------------+ + +Parallelism Facts of Life +------------------------- + +These parallelism facts of life are by no means specific to RCU, but the +RCU implementation must abide by them. They therefore bear repeating: + +#. Any CPU or task may be delayed at any time, and any attempts to avoid + these delays by disabling preemption, interrupts, or whatever are + completely futile. This is most obvious in preemptible user-level + environments and in virtualized environments (where a given guest + OS's VCPUs can be preempted at any time by the underlying + hypervisor), but can also happen in bare-metal environments due to + ECC errors, NMIs, and other hardware events. Although a delay of more + than about 20 seconds can result in splats, the RCU implementation is + obligated to use algorithms that can tolerate extremely long delays, + but where “extremely long” is not long enough to allow wrap-around + when incrementing a 64-bit counter. +#. Both the compiler and the CPU can reorder memory accesses. Where it + matters, RCU must use compiler directives and memory-barrier + instructions to preserve ordering. +#. 
Conflicting writes to memory locations in any given cache line will + result in expensive cache misses. Greater numbers of concurrent + writes and more-frequent concurrent writes will result in more + dramatic slowdowns. RCU is therefore obligated to use algorithms that + have sufficient locality to avoid significant performance and + scalability problems. +#. As a rough rule of thumb, only one CPU's worth of processing may be + carried out under the protection of any given exclusive lock. RCU + must therefore use scalable locking designs. +#. Counters are finite, especially on 32-bit systems. RCU's use of + counters must therefore tolerate counter wrap, or be designed such + that counter wrap would take way more time than a single system is + likely to run. An uptime of ten years is quite possible, a runtime of + a century much less so. As an example of the latter, RCU's + dyntick-idle nesting counter allows 54 bits for interrupt nesting + level (this counter is 64 bits even on a 32-bit system). Overflowing + this counter requires 2\ :sup:`54` half-interrupts on a given CPU + without that CPU ever going idle. If a half-interrupt happened every + microsecond, it would take 570 years of runtime to overflow this + counter, which is currently believed to be an acceptably long time. +#. Linux systems can have thousands of CPUs running a single Linux + kernel in a single shared-memory environment. RCU must therefore pay + close attention to high-end scalability. + +This last parallelism fact of life means that RCU must pay special +attention to the preceding facts of life. The idea that Linux might +scale to systems with thousands of CPUs would have been met with some +skepticism in the 1990s, but these requirements would otherwise +have been unsurprising, even in the early 1990s. + +Quality-of-Implementation Requirements +-------------------------------------- + +These sections list quality-of-implementation requirements.
Although an +RCU implementation that ignores these requirements could still be used, +it would likely be subject to limitations that would make it +inappropriate for industrial-strength production use. Classes of +quality-of-implementation requirements are as follows: + +#. `Specialization`_ +#. `Performance and Scalability`_ +#. `Forward Progress`_ +#. `Composability`_ +#. `Corner Cases`_ + +These classes are covered in the following sections. + +Specialization +~~~~~~~~~~~~~~ + +RCU is and always has been intended primarily for read-mostly +situations, which means that RCU's read-side primitives are optimized, +often at the expense of its update-side primitives. Experience thus far +is captured by the following list of situations: + +#. Read-mostly data, where stale and inconsistent data is not a problem: + RCU works great! +#. Read-mostly data, where data must be consistent: RCU works well. +#. Read-write data, where data must be consistent: RCU *might* work OK. + Or not. +#. Write-mostly data, where data must be consistent: RCU is very + unlikely to be the right tool for the job, with the following + exceptions, where RCU can provide: + + a. Existence guarantees for update-friendly mechanisms. + b. Wait-free read-side primitives for real-time use. + +This focus on read-mostly situations means that RCU must interoperate +with other synchronization primitives. For example, the ``add_gp()`` and +``remove_gp_synchronous()`` examples discussed earlier use RCU to +protect readers and locking to coordinate updaters. However, the need +extends much farther, requiring that a variety of synchronization +primitives be legal within RCU read-side critical sections, including +spinlocks, sequence locks, atomic operations, reference counters, and +memory barriers. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What about sleeping locks?
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| These are forbidden within Linux-kernel RCU read-side critical | +| sections because it is not legal to place a quiescent state (in this | +| case, voluntary context switch) within an RCU read-side critical | +| section. However, sleeping locks may be used within userspace RCU | +| read-side critical sections, and also within Linux-kernel sleepable | +| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In | +| addition, the -rt patchset turns spinlocks into sleeping locks so | +| that the corresponding critical sections can be preempted, which also | +| means that these sleeplockified spinlocks (but not other sleeping | +| locks!) may be acquired within -rt-Linux-kernel RCU read-side critical | +| sections. | +| Note that it *is* legal for a normal RCU read-side critical section | +| to conditionally acquire a sleeping lock (as in | +| ``mutex_trylock()``), but only as long as it does not loop | +| indefinitely attempting to conditionally acquire that sleeping lock. | +| The key point is that things like ``mutex_trylock()`` either return | +| with the mutex held, or return an error indication if the mutex was | +| not immediately available. Either way, ``mutex_trylock()`` returns | +| immediately without sleeping. | ++-----------------------------------------------------------------------+ + +It often comes as a surprise that many algorithms do not require a +consistent view of data, but many can function in that mode, with +network routing being the poster child. Internet routing algorithms take +significant time to propagate updates, so that by the time an update +arrives at a given system, that system has been sending network traffic +the wrong way for a considerable length of time.
Having a few threads +continue to send traffic the wrong way for a few more milliseconds is +clearly not a problem: In the worst case, TCP retransmissions will +eventually get the data where it needs to go. In general, when tracking +the state of the universe outside of the computer, some level of +inconsistency must be tolerated due to speed-of-light delays if nothing +else. + +Furthermore, uncertainty about external state is inherent in many cases. +For example, a pair of veterinarians might use heartbeat to determine +whether or not a given cat was alive. But how long should they wait +after the last heartbeat to decide that the cat is in fact dead? Waiting +less than 400 milliseconds makes no sense because this would mean that a +relaxed cat would be considered to cycle between death and life more +than 100 times per minute. Moreover, just as with human beings, a cat's +heart might stop for some period of time, so the exact wait period is a +judgment call. One of our pair of veterinarians might wait 30 seconds +before pronouncing the cat dead, while the other might insist on waiting +a full minute. The two veterinarians would then disagree on the state of +the cat during the final 30 seconds of the minute following the last +heartbeat. + +Interestingly enough, this same situation applies to hardware. When push +comes to shove, how do we tell whether or not some external server has +failed? We send messages to it periodically, and declare it failed if we +don't receive a response within a given period of time. Policy decisions +can usually tolerate short periods of inconsistency. The policy was +decided some time ago, and is only now being put into effect, so a few +milliseconds of delay is normally inconsequential. + +However, there are algorithms that absolutely must see consistent data. 
+For example, the translation from a user-level SystemV semaphore ID +to the corresponding in-kernel data structure is protected by RCU, but +it is absolutely forbidden to update a semaphore that has just been +removed. In the Linux kernel, this need for consistency is accommodated +by acquiring spinlocks located in the in-kernel data structure from +within the RCU read-side critical section, and this is indicated by the +green box in the figure above. Many other techniques may be used, and +are in fact used within the Linux kernel. + +In short, RCU is not required to maintain consistency, and other +mechanisms may be used in concert with RCU when consistency is required. +RCU's specialization allows it to do its job extremely well, and its +ability to interoperate with other synchronization mechanisms allows the +right mix of synchronization tools to be used for a given job. + +Performance and Scalability +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Energy efficiency is a critical component of performance today, and +Linux-kernel RCU implementations must therefore avoid unnecessarily +awakening idle CPUs. I cannot claim that this requirement was +premeditated. In fact, I learned of it during a telephone conversation +in which I was given “frank and open” feedback on the importance of +energy efficiency in battery-powered systems and on specific +energy-efficiency shortcomings of the Linux-kernel RCU implementation. +In my experience, the battery-powered embedded community will consider +any unnecessary wakeups to be extremely unfriendly acts. So much so that +mere Linux-kernel-mailing-list posts are insufficient to vent their ire. + +Memory consumption is not particularly important in most situations, +and has become decreasingly so as memory sizes have expanded and memory +costs have plummeted.
However, as I learned from Matt Mackall's +`bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory +footprint is critically important on single-CPU systems with +non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny +RCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__ +was born. Josh Triplett has since taken over the small-memory banner +with his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__ +project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional +for those kernels not needing it. + +The remaining performance requirements are, for the most part, +unsurprising. For example, in keeping with RCU's read-side +specialization, ``rcu_dereference()`` should have negligible overhead +(for example, suppression of a few minor compiler optimizations). +Similarly, in non-preemptible environments, ``rcu_read_lock()`` and +``rcu_read_unlock()`` should have exactly zero overhead. + +In preemptible environments, in the case where the RCU read-side +critical section was not preempted (as will be the case for the +highest-priority real-time process), ``rcu_read_lock()`` and +``rcu_read_unlock()`` should have minimal overhead. In particular, they +should not contain atomic read-modify-write operations, memory-barrier +instructions, preemption disabling, interrupt disabling, or backwards +branches. However, in the case where the RCU read-side critical section +was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable +interrupts. This is why it is better to nest an RCU read-side critical +section within a preempt-disable region than vice versa, at least in +cases where that critical section is short enough to avoid unduly +degrading real-time latencies. + +The ``synchronize_rcu()`` grace-period-wait primitive is optimized for +throughput. It may therefore incur several milliseconds of latency in +addition to the duration of the longest RCU read-side critical section. 
+On the other hand, multiple concurrent invocations of +``synchronize_rcu()`` are required to use batching optimizations so that +they can be satisfied by a single underlying grace-period-wait +operation. For example, in the Linux kernel, it is not unusual for a +single grace-period-wait operation to serve more than `1,000 separate +invocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__ +of ``synchronize_rcu()``, thus amortizing the per-invocation overhead +down to nearly zero. However, the grace-period optimization is also +required to avoid measurable degradation of real-time scheduling and +interrupt latencies. + +In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are +unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be +used instead, reducing the grace-period latency down to a few tens of +microseconds on small systems, at least in cases where the RCU read-side +critical sections are short. There are currently no special latency +requirements for ``synchronize_rcu_expedited()`` on large systems, but, +consistent with the empirical nature of the RCU specification, that is +subject to change. However, there most definitely are scalability +requirements: A storm of ``synchronize_rcu_expedited()`` invocations on +4096 CPUs should at least make reasonable forward progress. In return +for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted +to impose modest degradation of real-time latency on non-idle online +CPUs. Here, “modest” means roughly the same latency degradation as a +scheduling-clock interrupt. + +There are a number of situations where even +``synchronize_rcu_expedited()``'s reduced grace-period latency is +unacceptable. 
In these situations, the asynchronous ``call_rcu()`` can +be used in place of ``synchronize_rcu()`` as follows: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 static void remove_gp_cb(struct rcu_head *rhp) + 8 { + 9 struct foo *p = container_of(rhp, struct foo, rh); + 10 + 11 kfree(p); + 12 } + 13 + 14 bool remove_gp_asynchronous(void) + 15 { + 16 struct foo *p; + 17 + 18 spin_lock(&gp_lock); + 19 p = rcu_access_pointer(gp); + 20 if (!p) { + 21 spin_unlock(&gp_lock); + 22 return false; + 23 } + 24 rcu_assign_pointer(gp, NULL); + 25 call_rcu(&p->rh, remove_gp_cb); + 26 spin_unlock(&gp_lock); + 27 return true; + 28 } + +A definition of ``struct foo`` is finally needed, and appears on +lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()`` +on line 25, and will be invoked after the end of a subsequent grace +period. This gets the same effect as ``remove_gp_synchronous()``, but +without forcing the updater to wait for a grace period to elapse. The +``call_rcu()`` function may be used in a number of situations where +neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would +be legal, including within preempt-disable code, ``local_bh_disable()`` +code, interrupt-disable code, and interrupt handlers. However, even +``call_rcu()`` is illegal within NMI handlers and from idle and offline +CPUs. The callback function (``remove_gp_cb()`` in this case) will be +executed within softirq (software interrupt) environment within the +Linux kernel, either within a real softirq handler or under the +protection of ``local_bh_disable()``. In both the Linux kernel and in +userspace, it is bad practice to write an RCU callback function that +takes too long. Long-running operations should be relegated to separate +threads or (in the Linux kernel) workqueues. 
+ ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why does line 19 use ``rcu_access_pointer()``? After all, | +| ``call_rcu()`` on line 25 stores into the structure, which would | +| interact badly with concurrent insertions. Doesn't this mean that | +| ``rcu_dereference()`` is required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Presumably the ``->gp_lock`` acquired on line 18 excludes any | +| changes, including any insertions that ``rcu_dereference()`` would | +| protect against. Therefore, any insertions will be delayed until | +| after ``->gp_lock`` is released on line 26, which in turn means that | +| ``rcu_access_pointer()`` suffices. | ++-----------------------------------------------------------------------+ + +However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on +the data element. This is a common idiom, and is supported by +``kfree_rcu()``, which allows “fire and forget” operation as shown +below: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 bool remove_gp_faf(void) + 8 { + 9 struct foo *p; + 10 + 11 spin_lock(&gp_lock); + 12 p = rcu_dereference(gp); + 13 if (!p) { + 14 spin_unlock(&gp_lock); + 15 return false; + 16 } + 17 rcu_assign_pointer(gp, NULL); + 18 kfree_rcu(p, rh); + 19 spin_unlock(&gp_lock); + 20 return true; + 21 } + +Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and +proceeds, without any need to pay any further attention to the +subsequent grace period and ``kfree()``. It is permissible to invoke +``kfree_rcu()`` from the same environments as for ``call_rcu()``. +Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()`` +and ``kfree_rcu()``, but not ``synchronize_rcu()``.
This was due to the +fact that RCU was not heavily used within DYNIX/ptx, so the very few +places that needed something like ``synchronize_rcu()`` simply +open-coded it. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()`` | +| allowed updaters to avoid being blocked by readers. But how can that | +| be correct, given that the invocation of the callback and the freeing | +| of the memory (respectively) must still wait for a grace period to | +| elapse? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| We could define things this way, but keep in mind that this sort of | +| definition would say that updates in garbage-collected languages | +| cannot complete until the next time the garbage collector runs, which | +| does not seem at all reasonable. The key point is that in most cases, | +| an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed | +| to the next update as soon as it has invoked ``call_rcu()`` or | +| ``kfree_rcu()``, without having to wait for a subsequent grace | +| period. | ++-----------------------------------------------------------------------+ + +But what if the updater must wait for the completion of code to be +executed after the end of the grace period, but has other tasks that can +be carried out in the meantime? 
The polling-style +``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions +may be used for this purpose, as shown below: + + :: + + 1 bool remove_gp_poll(void) + 2 { + 3 struct foo *p; + 4 unsigned long s; + 5 + 6 spin_lock(&gp_lock); + 7 p = rcu_access_pointer(gp); + 8 if (!p) { + 9 spin_unlock(&gp_lock); + 10 return false; + 11 } + 12 rcu_assign_pointer(gp, NULL); + 13 spin_unlock(&gp_lock); + 14 s = get_state_synchronize_rcu(); + 15 do_something_while_waiting(); + 16 cond_synchronize_rcu(s); + 17 kfree(p); + 18 return true; + 19 } + +On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU, +then line 15 carries out other tasks, and finally, line 16 returns +immediately if a grace period has elapsed in the meantime, but otherwise +waits as required. The need for ``get_state_synchronize_rcu()`` and +``cond_synchronize_rcu()`` has appeared quite recently, so it is too +early to tell whether they will stand the test of time. + +RCU thus provides a range of tools to allow updaters to strike the +required tradeoff between latency, flexibility and CPU overhead. + +Forward Progress +~~~~~~~~~~~~~~~~ + +In theory, delaying grace-period completion and callback invocation is +harmless. In practice, not only are memory sizes finite but also +callbacks sometimes do wakeups, and sufficiently deferred wakeups can be +difficult to distinguish from system hangs. Therefore, RCU must provide +a number of mechanisms to promote forward progress. + +These mechanisms are not foolproof, nor can they be. For one simple +example, an infinite loop in an RCU read-side critical section must by +definition prevent later grace periods from ever completing. For a more +involved example, consider a 64-CPU system built with +``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where +CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``.
Even +if these tight loops also contain calls to ``cond_resched()`` (thus +allowing grace periods to complete), CPU 0 simply will not be able to +invoke callbacks as fast as the other 63 CPUs can register them, at +least not until the system runs out of memory. In both of these +examples, the Spiderman principle applies: With great power comes great +responsibility. However, short of this level of abuse, RCU is required +to ensure timely completion of grace periods and timely invocation of +callbacks. + +RCU takes the following steps to encourage timely completion of grace +periods: + +#. If a grace period fails to complete within 100 milliseconds, RCU + causes future invocations of ``cond_resched()`` on the holdout CPUs + to provide an RCU quiescent state. RCU also causes those CPUs' + ``need_resched()`` invocations to return ``true``, but only after the + corresponding CPU's next scheduling-clock interrupt. +#. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run + indefinitely in the kernel without scheduling-clock interrupts, which + defeats the above ``need_resched()`` stratagem. RCU will therefore + invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out + after 109 milliseconds. +#. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that + has been preempted within an RCU read-side critical section is + holding out for more than 500 milliseconds, RCU will resort to + priority boosting. +#. If a CPU is still holding out 10 seconds into the grace period, RCU + will invoke ``resched_cpu()`` on it regardless of its ``nohz_full`` + state. + +The above values are defaults for systems running with ``HZ=1000``. They +will vary as the value of ``HZ`` varies, and can also be changed using +the relevant Kconfig options and kernel boot parameters. RCU currently +does not do much sanity checking of these parameters, so please use +caution when changing them.
Note that these forward-progress measures +are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks +RCU <#Tasks%20RCU>`__. + +RCU takes the following steps in ``call_rcu()`` to encourage timely +invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has +10,000 callbacks, or has 10,000 more callbacks than it had the last time +encouragement was provided: + +#. Starts a grace period, if one is not already in progress. +#. Forces immediate checking for quiescent states, rather than waiting + for three milliseconds to have elapsed since the beginning of the + grace period. +#. Immediately tags the CPU's callbacks with their grace period + completion numbers, rather than waiting for the ``RCU_SOFTIRQ`` + handler to get around to it. +#. Lifts callback-execution batch limits, which speeds up callback + invocation at the expense of degrading realtime response. + +Again, these are default values when running at ``HZ=1000``, and can be +overridden. Again, these forward-progress measures are provided only for +RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks +RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward +progress for ``rcu_nocbs`` CPUs is much less well-developed, in part +because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke +``call_rcu()`` relatively infrequently. If workloads emerge that need +both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then +additional forward-progress work will be required. + +Composability +~~~~~~~~~~~~~ + +Composability has received much attention in recent years, perhaps in +part due to the collision of multicore hardware with object-oriented +techniques designed in single-threaded environments for single-threaded +use. And in theory, RCU read-side critical sections may be composed, and +in fact may be nested arbitrarily deeply. In practice, as with all +real-world implementations of composable constructs, there are +limitations. 
+ +Implementations of RCU for which ``rcu_read_lock()`` and +``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when +``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there +is no overhead. Except that if all these instances of +``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the +compiler, compilation will eventually fail due to exhausting memory, +mass storage, or user patience, whichever comes first. If the nesting is +not visible to the compiler, as is the case with mutually recursive +functions each in its own translation unit, stack overflow will result. +If the nesting takes the form of loops, perhaps in the guise of tail +recursion, either the control variable will overflow or (in the Linux +kernel) you will get an RCU CPU stall warning. Nevertheless, this class +of RCU implementations is one of the most composable constructs in +existence. + +RCU implementations that explicitly track nesting depth are limited by +the nesting-depth counter. For example, the Linux kernel's preemptible +RCU limits nesting to ``INT_MAX``. This should suffice for almost all +practical purposes. That said, a consecutive pair of RCU read-side +critical sections between which there is an operation that waits for a +grace period cannot be enclosed in another RCU read-side critical +section. This is because it is not legal to wait for a grace period +within an RCU read-side critical section: To do so would result either +in deadlock or in RCU implicitly splitting the enclosing RCU read-side +critical section, neither of which is conducive to a long-lived and +prosperous kernel. + +It is worth noting that RCU is not alone in limiting composability. For +example, many transactional-memory implementations prohibit composing a +pair of transactions separated by an irrevocable operation (for example, +a network receive operation). 
For another example, lock-based critical +sections can be composed surprisingly freely, but only if deadlock is +avoided. + +In short, although RCU read-side critical sections are highly +composable, care is required in some situations, just as is the case for +any other composable synchronization mechanism. + +Corner Cases +~~~~~~~~~~~~ + +A given RCU workload might have an endless and intense stream of RCU +read-side critical sections, perhaps even so intense that there was +never a point in time during which there was not at least one RCU +read-side critical section in flight. RCU cannot allow this situation to +block grace periods: As long as all the RCU read-side critical sections +are finite, grace periods must also be finite. + +That said, preemptible RCU implementations could potentially result in +RCU read-side critical sections being preempted for long durations, +which has the effect of creating a long-duration RCU read-side critical +section. This situation can arise only in heavily loaded systems, but +systems using real-time priorities are of course more vulnerable. +Therefore, RCU priority boosting is provided to help deal with this +case. That said, the exact requirements on RCU priority boosting will +likely evolve as more experience accumulates. + +Other workloads might have very high update rates. Although one can +argue that such workloads should instead use something other than RCU, +the fact remains that RCU must handle such workloads gracefully. This +requirement is another factor driving batching of grace periods, but it +is also the driving force behind the checks for large numbers of queued +RCU callbacks in the ``call_rcu()`` code path. Finally, high update +rates should not delay RCU read-side critical sections, although some +small read-side delays can occur when using +``synchronize_rcu_expedited()``, courtesy of this function's use of +``smp_call_function_single()``. 
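The queue-length check mentioned above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel's implementation: the structure, the function names, and the ``CB_HIGH_WATER`` macro are all hypothetical, and the rule for resetting the baseline when the queue empties is an assumption made for illustration. The only figure taken from this document is the default threshold of 10,000 callbacks.

```c
#include <assert.h>
#include <stdbool.h>

/* Default threshold quoted in the text; the macro name is hypothetical. */
#define CB_HIGH_WATER 10000L

/* Hypothetical per-CPU bookkeeping, not a kernel structure. */
struct cb_queue_model {
	long qlen;               /* callbacks currently queued */
	long qlen_at_last_nudge; /* queue length when evasive action was last taken */
};

/*
 * Model of queuing one callback.  Returns true when the queue has grown
 * by at least CB_HIGH_WATER callbacks since evasive action was last taken,
 * and records the new baseline in that case.
 */
static bool model_enqueue(struct cb_queue_model *q)
{
	q->qlen++;
	if (q->qlen - q->qlen_at_last_nudge < CB_HIGH_WATER)
		return false;
	q->qlen_at_last_nudge = q->qlen;
	return true;
}

/* Model of invoking n queued callbacks after a grace period has ended. */
static void model_invoke(struct cb_queue_model *q, long n)
{
	assert(n >= 0 && n <= q->qlen);
	q->qlen -= n;
	/* Assumption: an empty queue resets the baseline to zero. */
	if (q->qlen == 0)
		q->qlen_at_last_nudge = 0;
}
```

In this model, a CPU that queues callbacks without ever invoking any would be nudged once per 10,000 callbacks queued, which is the sense in which the check bounds how far callback invocation can fall behind before RCU starts trading batching efficiency for faster grace periods.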
+ +Although all three of these corner cases were understood in the early +1990s, a simple user-level test consisting of ``close(open(path))`` in a +tight loop in the early 2000s suddenly provided a much deeper +appreciation of the high-update-rate corner case. This test also +motivated addition of some RCU code to react to high update rates, for +example, if a given CPU finds itself with more than 10,000 RCU callbacks +queued, it will cause RCU to take evasive action by more aggressively +starting grace periods and more aggressively forcing completion of +grace-period processing. This evasive action causes the grace period to +complete more quickly, but at the cost of restricting RCU's batching +optimizations, thus increasing the CPU overhead incurred by that grace +period. + +Software-Engineering Requirements +--------------------------------- + +Between Murphy's Law and “To err is human”, it is necessary to guard +against mishaps and misuse: + +#. It is all too easy to forget to use ``rcu_read_lock()`` everywhere + that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will + splat if ``rcu_dereference()`` is used outside of an RCU read-side + critical section. Update-side code can use + ``rcu_dereference_protected()``, which takes a `lockdep + expression <https://lwn.net/Articles/371986/>`__ to indicate what is + providing the protection. If the indicated protection is not + provided, a lockdep splat is emitted. + Code shared between readers and updaters can use + ``rcu_dereference_check()``, which also takes a lockdep expression, + and emits a lockdep splat if neither ``rcu_read_lock()`` nor the + indicated protection is in place. In addition, + ``rcu_dereference_raw()`` is used in those (hopefully rare) cases + where the required protection cannot be easily described. Finally, + ``rcu_read_lock_held()`` is provided to allow a function to verify + that it has been invoked within an RCU read-side critical section. 
I + was made aware of this set of requirements shortly after Thomas + Gleixner audited a number of RCU uses. +#. A given function might wish to check for RCU-related preconditions + upon entry, before using any other RCU API. The + ``rcu_lockdep_assert()`` does this job, asserting the expression in + kernels having lockdep enabled and doing nothing otherwise. +#. It is also easy to forget to use ``rcu_assign_pointer()`` and + ``rcu_dereference()``, perhaps (incorrectly) substituting a simple + assignment. To catch this sort of error, a given RCU-protected + pointer may be tagged with ``__rcu``, after which sparse will + complain about simple-assignment accesses to that pointer. Arnd + Bergmann made me aware of this requirement, and also supplied the + needed `patch series <https://lwn.net/Articles/376011/>`__. +#. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if + a data element is passed to ``call_rcu()`` twice in a row, without a + grace period in between. (This error is similar to a double free.) + The corresponding ``rcu_head`` structures that are dynamically + allocated are automatically tracked, but ``rcu_head`` structures + allocated on the stack must be initialized with + ``init_rcu_head_on_stack()`` and cleaned up with + ``destroy_rcu_head_on_stack()``. Similarly, statically allocated + non-stack ``rcu_head`` structures must be initialized with + ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``. + Mathieu Desnoyers made me aware of this requirement, and also + supplied the needed + `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__. +#. An infinite loop in an RCU read-side critical section will eventually + trigger an RCU CPU stall warning splat, with the duration of + “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT`` + ``Kconfig`` option, or, alternatively, by the + ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. 
However, RCU + is not obligated to produce this splat unless there is a grace period + waiting on that particular RCU read-side critical section. + + Some extreme workloads might intentionally delay RCU grace periods, + and systems running those workloads can be booted with + ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This + kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU + stall warnings are counter-productive during sysrq dumps and during + panics. RCU therefore supplies the ``rcu_sysrq_start()`` and + ``rcu_sysrq_end()`` API members to be called before and after long + sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is + automatically invoked at the beginning of a panic to suppress further + RCU CPU stall warnings. + + This requirement made itself known in the early 1990s, pretty much + the first time that it was necessary to debug a CPU stall. That said, + the initial implementation in DYNIX/ptx was quite generic in + comparison with that of Linux. + +#. Although it would be very good to detect pointers leaking out of RCU + read-side critical sections, there is currently no good way of doing + this. One complication is the need to distinguish between pointers + leaking and pointers that have been handed off from RCU to some other + synchronization mechanism, for example, reference counting. +#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information + is provided via event tracing. +#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()`` + to create typical linked data structures can be surprisingly + error-prone. Therefore, RCU-protected `linked + lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and, + more recently, RCU-protected `hash + tables <https://lwn.net/Articles/612100/>`__ are available. Many + other special-purpose RCU-protected data structures are available in + the Linux kernel and the userspace RCU library. +#. 
Some linked structures are created at compile time, but still require + ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves + this purpose. +#. It is not necessary to use ``rcu_assign_pointer()`` when creating + linked structures that are to be published via a single external + pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task + and also for assigning ``NULL`` pointers at runtime. + +This is not a hard-and-fast list: RCU's diagnostic capabilities will +continue to be guided by the number and type of usage bugs found in +real-world RCU usage. + +Linux Kernel Complications +-------------------------- + +The Linux kernel provides an interesting environment for all kinds of +software, including RCU. Some of the relevant points of interest are as +follows: + +#. `Configuration`_ +#. `Firmware Interface`_ +#. `Early Boot`_ +#. `Interrupts and NMIs`_ +#. `Loadable Modules`_ +#. `Hotplug CPU`_ +#. `Scheduler and RCU`_ +#. `Tracing and RCU`_ +#. `Accesses to User Memory and RCU`_ +#. `Energy Efficiency`_ +#. `Scheduling-Clock Interrupts and RCU`_ +#. `Memory Efficiency`_ +#. `Performance, Scalability, Response Time, and Reliability`_ + +This list is probably incomplete, but it does give a feel for the most +notable Linux-kernel complications. Each of the following sections +covers one of the above topics. + +Configuration +~~~~~~~~~~~~~ + +RCU's goal is automatic configuration, so that almost nobody needs to +worry about RCU's ``Kconfig`` options. And for almost all users, RCU +does in fact work well “out of the box.” + +However, there are specialized use cases that are handled by kernel boot +parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig`` +system will explicitly ask users about new ``Kconfig`` options, which +requires that almost all of them be hidden behind a ``CONFIG_RCU_EXPERT`` +``Kconfig`` option.
+ +This all should be quite obvious, but the fact remains that Linus +Torvalds recently had to +`remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__ +me of this requirement. + +Firmware Interface +~~~~~~~~~~~~~~~~~~ + +In many cases, the kernel obtains information about the system from the +firmware, and sometimes things are lost in translation. Or the +translation is accurate, but the original message is bogus. + +For example, some systems' firmware overreports the number of CPUs, +sometimes by a large factor. If RCU naively believed the firmware, as it +used to do, it would create too many per-CPU kthreads. Although the +resulting system will still run correctly, the extra kthreads needlessly +consume memory and can cause confusion when they show up in ``ps`` +listings. + +RCU must therefore wait for a given CPU to actually come online before +it can allow itself to believe that the CPU actually exists. The +resulting “ghost CPUs” (which are never going to come online) cause a +number of `interesting +complications <https://paulmck.livejournal.com/37494.html>`__. + +Early Boot +~~~~~~~~~~ + +The Linux kernel's boot sequence is an interesting process, and RCU is +used early, even before ``rcu_init()`` is invoked. In fact, a number of +RCU's primitives can be used as soon as the initial task's +``task_struct`` is available and the boot CPU's per-CPU variables are +set up. The read-side primitives (``rcu_read_lock()``, +``rcu_read_unlock()``, ``rcu_dereference()``, and +``rcu_access_pointer()``) will operate normally very early on, as will +``rcu_assign_pointer()``. + +Although ``call_rcu()`` may be invoked at any time during boot, +callbacks are not guaranteed to be invoked until after all of RCU's +kthreads have been spawned, which occurs at ``early_initcall()`` time.
+This delay in callback invocation is due to the fact that RCU does not +invoke callbacks until it is fully initialized, and this full +initialization cannot occur until after the scheduler has initialized +itself to the point where RCU can spawn and run its kthreads. In theory, +it would be possible to invoke callbacks earlier. However, this is not a +panacea because there would be severe restrictions on what operations +those callbacks could invoke. + +Perhaps surprisingly, ``synchronize_rcu()`` and +``synchronize_rcu_expedited()`` will operate normally during very early +boot, the reason being that there is only one CPU and preemption is +disabled. This means that the call to ``synchronize_rcu()`` (or friends) +itself is a quiescent state and thus a grace period, so the early-boot +implementation can be a no-op. + +However, once the scheduler has spawned its first kthread, this early +boot trick fails for ``synchronize_rcu()`` (as well as for +``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The +reason is that an RCU read-side critical section might be preempted, +which means that a subsequent ``synchronize_rcu()`` really does have to +wait for something, as opposed to simply returning immediately. +Unfortunately, ``synchronize_rcu()`` can't do this until all of its +kthreads are spawned, which doesn't happen until some point during +``early_initcall()`` time. But this is no excuse: RCU is nevertheless +required to correctly handle synchronous grace periods during this time +period. Once all of its kthreads are up and running, RCU starts running +normally. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How can RCU possibly handle grace periods before all of its kthreads | +| have been spawned???
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Very carefully! | +| During the “dead zone” between the time that the scheduler spawns the | +| first task and the time that all of RCU's kthreads have been spawned, | +| all synchronous grace periods are handled by the expedited | +| grace-period mechanism. At runtime, this expedited mechanism relies | +| on workqueues, but during the dead zone the requesting task itself | +| drives the desired expedited grace period. Because dead-zone | +| execution takes place within task context, everything works. Once the | +| dead zone ends, expedited grace periods go back to using workqueues, | +| as is required to avoid problems that would otherwise occur when a | +| user task received a POSIX signal while driving an expedited grace | +| period. | +| | +| And yes, this does mean that it is unhelpful to send POSIX signals to | +| random tasks between the time that the scheduler spawns its first | +| kthread and the time that RCU's kthreads have all been spawned. If | +| there ever turns out to be a good reason for sending POSIX signals | +| during that time, appropriate adjustments will be made. (If it turns | +| out that POSIX signals are sent during this time for no good reason, | +| other adjustments will be made, appropriate or otherwise.) | ++-----------------------------------------------------------------------+ + +I learned of these boot-time requirements as a result of a series of +system hangs. + +Interrupts and NMIs +~~~~~~~~~~~~~~~~~~~ + +The Linux kernel has interrupts, and RCU read-side critical sections are +legal within interrupt handlers and within interrupt-disabled regions of +code, as are invocations of ``call_rcu()``. 
+ +Some Linux-kernel architectures can enter an interrupt handler from +non-idle process context, and then just never leave it, instead +stealthily transitioning back to process context. This trick is +sometimes used to invoke system calls from inside the kernel. These +“half-interrupts” mean that RCU has to be very careful about how it +counts interrupt nesting levels. I learned of this requirement the hard +way during a rewrite of RCU's dyntick-idle code. + +The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side +critical sections are legal within NMI handlers. Thankfully, RCU +update-side primitives, including ``call_rcu()``, are prohibited within +NMI handlers. + +The name notwithstanding, some Linux-kernel architectures can have +nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised +me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__ +with this requirement; he also kindly surprised me with `an +algorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__ +that meets this requirement. + +Furthermore, NMI handlers can be interrupted by what appear to RCU to be +normal interrupts. One way that this can happen is for code that +directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called +from an NMI handler. This astonishing fact of life prompted the current +code structure, which has ``rcu_irq_enter()`` invoking +``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``. +And yes, I also learned of this requirement the hard way. + +Loadable Modules +~~~~~~~~~~~~~~~~ + +The Linux kernel has loadable modules, and these modules can also be +unloaded. After a given module has been unloaded, any attempt to call +one of its functions results in a segmentation fault. 
The module-unload +functions must therefore cancel any delayed calls to loadable-module +functions, for example, any outstanding ``mod_timer()`` must be dealt +with via ``del_timer_sync()`` or similar. + +Unfortunately, there is no way to cancel an RCU callback; once you +invoke ``call_rcu()``, the callback function is eventually going to be +invoked, unless the system goes down first. Because it is normally +considered socially irresponsible to crash the system in response to a +module unload request, we need some other way to deal with in-flight RCU +callbacks. + +RCU therefore provides ``rcu_barrier()``, which waits until all +in-flight RCU callbacks have been invoked. If a module uses +``call_rcu()``, its exit function should therefore prevent any future +invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory, +the underlying module-unload code could invoke ``rcu_barrier()`` +unconditionally, but in practice this would incur unacceptable +latencies. + +Nikita Danilov noted this requirement for an analogous +filesystem-unmount situation, and Dipankar Sarma incorporated +``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module +unloading became apparent later. + +.. important:: + + The ``rcu_barrier()`` function is not, repeat, + *not*, obligated to wait for a grace period. It is instead only required + to wait for RCU callbacks that have already been posted. Therefore, if + there are no RCU callbacks posted anywhere in the system, + ``rcu_barrier()`` is within its rights to return immediately. Even if + there are callbacks posted, ``rcu_barrier()`` does not necessarily need + to wait for a grace period. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! 
Each RCU callback must wait for a grace period to | +| complete, and ``rcu_barrier()`` must wait for each pre-existing | +| callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to | +| wait for a full grace period if there is even one callback posted | +| anywhere in the system? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Absolutely not!!! | +| Yes, each RCU callback must wait for a grace period to complete, but | +| it might well be partly (or even completely) finished waiting by the | +| time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()`` | +| need only wait for the remaining portion of the grace period to | +| elapse. So even if there are quite a few callbacks posted, | +| ``rcu_barrier()`` might well return quite quickly. | +| | +| So if you need to wait for a grace period as well as for all | +| pre-existing callbacks, you will need to invoke both | +| ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, | +| you can always use workqueues to invoke them concurrently. | ++-----------------------------------------------------------------------+ + +Hotplug CPU +~~~~~~~~~~~ + +The Linux kernel supports CPU hotplug, which means that CPUs can come +and go. It is of course illegal to use any RCU API member from an +offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side +critical sections. This requirement was present from day one in +DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug +implementation is “interesting.” + +The Linux-kernel CPU-hotplug implementation has notifiers that are used +to allow the various kernel subsystems (including RCU) to respond +appropriately to a given CPU-hotplug operation.
Most RCU operations may +be invoked from CPU-hotplug notifiers, including even synchronous +grace-period operations such as ``synchronize_rcu()`` and +``synchronize_rcu_expedited()``. + +However, all-callback-wait operations such as ``rcu_barrier()`` are +not supported, due to the fact that there are phases of CPU-hotplug +operations where the outgoing CPU's callbacks will not be invoked until +after the CPU-hotplug operation ends, which could result in +deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations +during its execution, which results in another type of deadlock when +invoked from a CPU-hotplug notifier. + +Scheduler and RCU +~~~~~~~~~~~~~~~~~ + +RCU depends on the scheduler, and the scheduler uses RCU to protect some +of its data structures. The preemptible-RCU ``rcu_read_unlock()`` +implementation must therefore be written carefully to avoid deadlocks +involving the scheduler's runqueue and priority-inheritance locks. In +particular, ``rcu_read_unlock()`` must tolerate an interrupt where the +interrupt handler invokes both ``rcu_read_lock()`` and +``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()`` +to use negative nesting levels to avoid destructive recursion via +the interrupt handler's use of RCU. + +This scheduler-RCU requirement came as a `complete +surprise <https://lwn.net/Articles/453002/>`__. + +As noted above, RCU makes use of kthreads, and it is necessary to avoid +excessive CPU-time accumulation by these kthreads. This requirement was +no surprise, but RCU's violation of it when running context-switch-heavy +workloads when built with ``CONFIG_NO_HZ_FULL=y`` `did come as a +surprise +[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__. +RCU has made good progress towards meeting this requirement, even for +context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is +room for further improvement.
+ +It is forbidden to hold any of the scheduler's runqueue or +priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless +interrupts have been disabled across the entire RCU read-side critical +section, that is, up to and including the matching ``rcu_read_lock()``. +Violating this restriction can result in deadlocks involving these +scheduler spinlocks. There was hope that this restriction might be +lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started +deferring the reporting of the resulting RCU-preempt quiescent state +until the end of the corresponding interrupts-disabled region. +Unfortunately, timely reporting of the corresponding quiescent state to +expedited grace periods requires a call to ``raise_softirq()``, which +can acquire these scheduler spinlocks. In addition, real-time systems +using RCU priority boosting need this restriction to remain in effect +because deferred quiescent-state reporting would also defer deboosting, +which in turn would degrade real-time latencies. + +In theory, if a given RCU read-side critical section could be guaranteed +to be less than one second in duration, holding a scheduler spinlock +across that critical section's ``rcu_read_unlock()`` would require only +that preemption be disabled across the entire RCU read-side critical +section, not interrupts. Unfortunately, given the possibility of vCPU +preemption, long-running interrupts, and so on, it is not possible in +practice to guarantee that a given RCU read-side critical section will +complete in less than one second. Therefore, as noted above, if +scheduler spinlocks are held across a given call to +``rcu_read_unlock()``, interrupts must be disabled across the entire RCU +read-side critical section. + +Tracing and RCU +~~~~~~~~~~~~~~~ + +It is possible to use tracing on RCU code, but tracing itself uses RCU.
+For this reason, ``rcu_dereference_raw_check()`` is provided for use +by tracing, which avoids the destructive recursion that could otherwise +ensue. This API is also used by virtualization in some architectures, +where RCU readers execute in environments in which tracing cannot be +used. The tracing folks both located the requirement and provided the +needed fix, so this surprise requirement was relatively painless. + +Accesses to User Memory and RCU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The kernel needs to access user-space memory, for example, to access data +referenced by system-call parameters. The ``get_user()`` macro does this job. + +However, user-space memory might well be paged out, which means that +``get_user()`` might well page-fault and thus block while waiting for the +resulting I/O to complete. It would be a very bad thing for the compiler to +reorder a ``get_user()`` invocation into an RCU read-side critical section. + +For example, suppose that the source code looked like this: + + :: + + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 v = p->value; + 4 rcu_read_unlock(); + 5 get_user(user_v, user_p); + 6 do_something_with(v, user_v); + +The compiler must not be permitted to transform this source code into +the following: + + :: + + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!! + 4 v = p->value; + 5 rcu_read_unlock(); + 6 do_something_with(v, user_v); + +If the compiler did make this transformation in a ``CONFIG_PREEMPT=n`` kernel +build, and if ``get_user()`` did page fault, the result would be a quiescent +state in the middle of an RCU read-side critical section. This misplaced +quiescent state could result in line 4 being a use-after-free access, +which could be bad for your kernel's actuarial statistics. Similar examples +can be constructed with the call to ``get_user()`` preceding the +``rcu_read_lock()``. 
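The constraint illustrated by the listings above can be modeled in userspace. The following is a minimal sketch, not kernel code: ``model_rcu_read_lock()``, ``model_rcu_read_unlock()``, ``model_get_user()``, the nesting and violation counters, and the two caller functions are all hypothetical stand-ins invented for illustration. It assumes a GCC-compatible compiler, and it places an empty ``asm`` with a memory clobber at each boundary of the modeled critical section, so that the compiler has no license to move the modeled user access into that section.

```c
#include <assert.h>

/* Compiler-only barrier: an empty asm with a "memory" clobber (GCC/Clang). */
#define barrier() __asm__ __volatile__("" ::: "memory")

struct foo {
	int value;
};

static struct foo model_foo = { .value = 42 };
static struct foo *gp = &model_foo;	/* stands in for the RCU-protected pointer */

int model_nesting;	/* hypothetical: depth of modeled read-side nesting */
int model_violations;	/* hypothetical: user accesses seen inside a reader */

static void model_rcu_read_lock(void)
{
	model_nesting++;
	barrier();	/* compiler may not move accesses across the reader's start */
}

static void model_rcu_read_unlock(void)
{
	barrier();	/* compiler may not move accesses across the reader's end */
	model_nesting--;
}

/* Stand-in for get_user(): the real one might block, so count misuse. */
static int model_get_user(int *dst, const int *src)
{
	if (model_nesting > 0)
		model_violations++;
	*dst = *src;
	return 0;
}

/* Mirrors the first listing: the user access follows the reader. */
int read_then_copy(const int *user_p)
{
	struct foo *p;
	int v, user_v;

	model_rcu_read_lock();
	p = gp;		/* stands in for rcu_dereference(gp) */
	v = p->value;
	model_rcu_read_unlock();
	model_get_user(&user_v, user_p);
	return v + user_v;
}

/* The similar case: the user access precedes the reader. */
int copy_then_read(const int *user_p)
{
	struct foo *p;
	int v, user_v;

	model_get_user(&user_v, user_p);
	model_rcu_read_lock();
	p = gp;		/* stands in for rcu_dereference(gp) */
	v = p->value;
	model_rcu_read_unlock();
	return v + user_v;
}
```

Calling either function with a pointer to an ``int`` and then checking that ``model_violations`` is still zero confirms that the modeled user access ran outside the modeled read-side critical section.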
+ +Unfortunately, ``get_user()`` doesn't have any particular ordering properties, +and in some architectures the underlying ``asm`` isn't even marked +``volatile``. And even if it were marked ``volatile``, the above access to +``p->value`` is not volatile, so the compiler would not have any reason to keep +those two accesses in order. + +Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and +``rcu_read_unlock()`` must act as compiler barriers, at least for outermost +instances of ``rcu_read_lock()`` and ``rcu_read_unlock()`` within a nested set +of RCU read-side critical sections. + +Energy Efficiency +~~~~~~~~~~~~~~~~~ + +Interrupting idle CPUs is considered socially unacceptable, especially +by people with battery-powered embedded systems. RCU therefore conserves +energy by detecting which CPUs are idle, including tracking CPUs that +have been interrupted from idle. This is a large part of the +energy-efficiency requirement, so I learned of this via an irate phone +call. + +Because RCU avoids interrupting idle CPUs, it is illegal to execute an +RCU read-side critical section on an idle CPU. (Kernels built with +``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()`` +macro and ``_rcuidle`` event tracing are provided to work around this +restriction. In addition, ``rcu_is_watching()`` may be used to test +whether or not it is currently legal to run RCU read-side critical +sections on this CPU. I learned of the need for diagnostics on the one +hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code. +Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite +heavily in the idle loop. However, there are some restrictions on the +code placed within ``RCU_NONIDLE()``: + +#. Blocking is prohibited. In practice, this is not a serious + restriction given that idle tasks are prohibited from blocking to + begin with. +#. Although nesting ``RCU_NONIDLE()`` is permitted, they cannot nest + indefinitely deeply.
However, given that they can be nested on the + order of a million deep, even on 32-bit systems, this should not be a + serious restriction. This nesting limit would probably be reached + long after the compiler OOMed or the stack overflowed. +#. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that + same ``RCU_NONIDLE()``. For example, the following is grossly + illegal: + + :: + + 1 RCU_NONIDLE({ + 2 do_something(); + 3 goto bad_idea; /* BUG!!! */ + 4 do_something_else();}); + 5 bad_idea: + + + It is just as illegal to transfer control into the middle of + ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in + as long as you also transferred out, but in practice you could also + expect to get sharply worded review comments. + +It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU +running in userspace. RCU must therefore track ``nohz_full`` userspace +execution. RCU must therefore be able to sample state at two points in +time, and be able to determine whether or not some other CPU spent any +time idle and/or executing in userspace. + +These energy-efficiency requirements have proven quite difficult to +understand and to meet, for example, there have been more than five +clean-sheet rewrites of RCU's energy-efficiency code, the last of which +was finally able to demonstrate `real energy savings running on real +hardware +[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__. +As noted earlier, I learned of many of these requirements via angry +phone calls: Flaming me on the Linux-kernel mailing list was apparently +not sufficient to fully vent their ire at RCU's energy-efficiency bugs! + +Scheduling-Clock Interrupts and RCU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The kernel transitions between in-kernel non-idle execution, userspace +execution, and the idle loop. 
Depending on kernel configuration, RCU +handles these states differently: + ++-----------------+------------------+------------------+-----------------+ +| ``HZ`` Kconfig | In-Kernel | Usermode | Idle | ++=================+==================+==================+=================+ +| ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on | +| | sometimes rely | RCU's | RCU's | +| | on | dyntick-idle | dyntick-idle | +| | scheduling-clock | detection. | detection. | +| | interrupt. In | | | +| | other cases, it | | | +| | is necessary to | | | +| | bound kernel | | | +| | execution times | | | +| | and/or use | | | +| | IPIs. | | | ++-----------------+------------------+------------------+-----------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the | +| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and | +| ``NO_HZ_IDLE`` do? 
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because, as a performance optimization, ``NO_HZ_FULL`` does not | +| necessarily re-enable the scheduling-clock interrupt on entry to each | +| and every system call. | ++-----------------------------------------------------------------------+ + +However, RCU must be reliably informed as to whether any given CPU is +currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that +CPU is executing in usermode, as discussed +`earlier <#Energy%20Efficiency>`__. It also requires that the +scheduling-clock interrupt be enabled when RCU needs it to be: + +#. If a CPU is either idle or executing in usermode, and RCU believes it + is non-idle, the scheduling-clock tick had better be running. + Otherwise, you will get RCU CPU stall warnings. Or at best, very long + (11-second) grace periods, with a pointless IPI waking the CPU from + time to time. +#. If a CPU is in a portion of the kernel that executes RCU read-side + critical sections, and RCU believes this CPU to be idle, you will get + random memory corruption. **DON'T DO THIS!!!** + This is one reason to test with lockdep, which will complain about + this sort of thing. +#. If a CPU is in a portion of the kernel that is absolutely positively + no-joking guaranteed to never execute any RCU read-side critical + sections, and RCU believes this CPU to be idle, no problem. This + sort of thing is used by some architectures for light-weight + exception handlers, which can then avoid the overhead of + ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and + exit, respectively. Some go further and avoid the entireties of + ``irq_enter()`` and ``irq_exit()``.
+ Just make very sure you are running some of your tests with + ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in + fact joking about not doing RCU read-side critical sections. +#. If a CPU is executing in the kernel with the scheduling-clock + interrupt disabled and RCU believes this CPU to be non-idle, and if + the CPU goes idle (from an RCU perspective) every few jiffies, no + problem. It is usually OK for there to be the occasional gap between + idle periods of up to a second or so. + If the gap grows too long, you get RCU CPU stall warnings. +#. If a CPU is either idle or executing in usermode, and RCU believes it + to be idle, of course no problem. +#. If a CPU is executing in the kernel, the kernel code path is passing + through quiescent states at a reasonable frequency (preferably about + once per few jiffies, but the occasional excursion to a second or so + is usually OK) and the scheduling-clock interrupt is enabled, of + course no problem. + If the gap between a successive pair of quiescent states grows too + long, you get RCU CPU stall warnings. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what if my driver has a hardware interrupt handler that can run | +| for many seconds? I cannot invoke ``schedule()`` from a hardware | +| interrupt handler, after all! | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so | +| often. But given that long-running interrupt handlers can cause other | +| problems, not least for response time, shouldn't you work to keep | +| your interrupt handler's runtime within reasonable bounds?
| ++-----------------------------------------------------------------------+ + +But as long as RCU is properly informed of kernel state transitions +between in-kernel execution, usermode execution, and idle, and as long +as the scheduling-clock interrupt is enabled when RCU needs it to be, +you can rest assured that the bugs you encounter will be in some other +part of RCU or some other part of the kernel! + +Memory Efficiency +~~~~~~~~~~~~~~~~~ + +Although small-memory non-realtime systems can simply use Tiny RCU, code +size is only one aspect of memory efficiency. Another aspect is the size +of the ``rcu_head`` structure used by ``call_rcu()`` and +``kfree_rcu()``. Although this structure contains nothing more than a +pair of pointers, it does appear in many RCU-protected data structures, +including some that are size critical. The ``page`` structure is a case +in point, as evidenced by the many occurrences of the ``union`` keyword +within that structure. + +This need for memory efficiency is one reason that RCU uses hand-crafted +singly linked lists to track the ``rcu_head`` structures that are +waiting for a grace period to elapse. It is also the reason why +``rcu_head`` structures do not contain debug information, such as fields +tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that +posted them. Although this information might appear in debug-only kernel +builds at some point, in the meantime, the ``->func`` field will often +provide the needed debug information. + +However, in some cases, the need for memory efficiency leads to even +more extreme measures. Returning to the ``page`` structure, the +``rcu_head`` field shares storage with a great many other structures +that are used at various points in the corresponding page's lifetime. 
In +order to correctly resolve certain `race +conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__, +the Linux kernel's memory-management subsystem needs a particular bit to +remain zero during all phases of grace-period processing, and that bit +happens to map to the bottom bit of the ``rcu_head`` structure's +``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is +used to post the callback, as opposed to ``kfree_rcu()`` or some future +“lazy” variant of ``call_rcu()`` that might one day be created for +energy-efficiency purposes. + +That said, there are limits. RCU requires that the ``rcu_head`` +structure be aligned to a two-byte boundary, and passing a misaligned +``rcu_head`` structure to one of the ``call_rcu()`` family of functions +will result in a splat. It is therefore necessary to exercise caution +when packing structures containing fields of type ``rcu_head``. Why not +a four-byte or even eight-byte alignment requirement? Because the m68k +architecture provides only two-byte alignment, and thus acts as +alignment's least common denominator. + +The reason for reserving the bottom bit of pointers to ``rcu_head`` +structures is to leave the door open to “lazy” callbacks whose +invocations can safely be deferred. Deferring invocation could +potentially have energy-efficiency benefits, but only if the rate of +non-lazy callbacks decreases significantly for some important workload. +In the meantime, reserving the bottom bit keeps this option open in case +it one day becomes useful. + +Performance, Scalability, Response Time, and Reliability +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Expanding on the `earlier +discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by +hot code paths in performance-critical portions of the Linux kernel's +networking, security, virtualization, and scheduling code paths. 
RCU +must therefore use efficient implementations, especially in its +read-side primitives. To that end, it would be good if preemptible RCU's +implementation of ``rcu_read_lock()`` could be inlined, however, doing +this requires resolving ``#include`` issues with the ``task_struct`` +structure. + +The Linux kernel supports hardware configurations with up to 4096 CPUs, +which means that RCU must be extremely scalable. Algorithms that involve +frequent acquisitions of global locks or frequent atomic operations on +global variables simply cannot be tolerated within the RCU +implementation. RCU therefore makes heavy use of a combining tree based +on the ``rcu_node`` structure. RCU is required to tolerate all CPUs +continuously invoking any combination of RCU's runtime primitives with +minimal per-operation overhead. In fact, in many cases, increasing load +must *decrease* the per-operation overhead, witness the batching +optimizations for ``synchronize_rcu()``, ``call_rcu()``, +``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general +rule, RCU must cheerfully accept whatever the rest of the Linux kernel +decides to throw at it. + +The Linux kernel is used for real-time workloads, especially in +conjunction with the `-rt +patchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The +real-time-latency response requirements are such that the traditional +approach of disabling preemption across RCU read-side critical sections +is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use +an RCU implementation that allows RCU read-side critical sections to be +preempted. This requirement made its presence known after users made it +clear that an earlier `real-time +patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in +conjunction with some `RCU +issues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__ +encountered by a very early version of the -rt patchset. 
+ +In addition, RCU must make do with a sub-100-microsecond real-time +latency budget. In fact, on smaller systems with the -rt patchset, the +Linux kernel provides sub-20-microsecond real-time latencies for the +whole kernel, including RCU. RCU's scalability and latency must +therefore be sufficient for these sorts of configurations. To my +surprise, the sub-100-microsecond real-time latency budget `applies to +even the largest systems +[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__, +up to and including systems with 4096 CPUs. This real-time requirement +motivated the grace-period kthread, which also simplified handling of a +number of race conditions. + +RCU must avoid degrading real-time response for CPU-bound threads, +whether executing in usermode (which is one use case for +``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in +the kernel must execute ``cond_resched()`` at least once per few tens of +milliseconds in order to avoid receiving an IPI from RCU. + +Finally, RCU's status as a synchronization primitive means that any RCU +failure can result in arbitrary memory corruption that can be extremely +difficult to debug. This means that RCU must be extremely reliable, +which in practice also means that RCU must have an aggressive +stress-test suite. This stress-test suite is called ``rcutorture``. + +Although the need for ``rcutorture`` was no surprise, the current +immense popularity of the Linux kernel is posing interesting—and perhaps +unprecedented—validation challenges. To see this, keep in mind that +there are well over one billion instances of the Linux kernel running +today, given Android smartphones, Linux-powered televisions, and +servers. This number can be expected to increase sharply with the advent +of the celebrated Internet of Things. + +Suppose that RCU contains a race condition that manifests on average +once per million years of runtime. 
This bug will be occurring about +three times per *day* across the installed base. RCU could simply hide +behind hardware error rates, given that no one should really expect +their smartphone to last for a million years. However, anyone taking too +much comfort from this thought should consider the fact that in most +jurisdictions, a successful multi-year test of a given mechanism, which +might include a Linux kernel, suffices for a number of types of +safety-critical certifications. In fact, rumor has it that the Linux +kernel is already being used in production for safety-critical +applications. I don't know about you, but I would feel quite bad if a +bug in RCU killed someone. Which might explain my recent focus on +validation and verification. + +Other RCU Flavors +----------------- + +One of the more surprising things about RCU is that there are now no +fewer than five *flavors*, or API families. In addition, the primary +flavor that has been the sole focus up to this point has two different +implementations, non-preemptible and preemptible. The other four flavors +are listed below, with requirements for each described in a separate +section. + +#. `Bottom-Half Flavor (Historical)`_ +#. `Sched Flavor (Historical)`_ +#. `Sleepable RCU`_ +#. `Tasks RCU`_ + +Bottom-Half Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-bh flavor of RCU has since been expressed in terms of the other +RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +softirq and to be accounted for by lockdep. Much of the material in this +section is therefore strictly historical in nature. + +The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations) +flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a +flavor of RCU that could withstand the network-based denial-of-service +attacks researched by Robert Olsson. 
These attacks placed so much +networking load on the system that some of the CPUs never exited softirq +execution, which in turn prevented those CPUs from ever executing a +context switch, which, in the RCU implementation of that time, prevented +grace periods from ever ending. The result was an out-of-memory +condition and a system hang. + +The solution was the creation of RCU-bh, which does +``local_bh_disable()`` across its read-side critical sections, and which +uses the transition from one type of softirq processing to another as a +quiescent state in addition to context switch, idle, user mode, and +offline. This means that RCU-bh grace periods can complete even when +some of the CPUs execute in softirq indefinitely, thus allowing +algorithms based on RCU-bh to withstand network-based denial-of-service +attacks. + +Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and +re-enable softirq handlers, any attempt to start a softirq handler +during the RCU-bh read-side critical section will be deferred. In this +case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can +take considerable time. One can of course argue that this softirq +overhead should be associated with the code following the RCU-bh +read-side critical section rather than ``rcu_read_unlock_bh()``, but the +fact is that most profiling tools cannot be expected to make this sort +of fine distinction. For example, suppose that a three-millisecond-long +RCU-bh read-side critical section executes during a time of heavy +networking load. There will very likely be an attempt to invoke at least +one softirq handler during that three milliseconds, but any such +invocation will be delayed until the time of the +``rcu_read_unlock_bh()``. This can of course make it appear at first +glance as if ``rcu_read_unlock_bh()`` was executing very slowly.
+ +The `RCU-bh +API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ +includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``, +``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``, +``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``, +``call_rcu_bh()``, ``rcu_barrier_bh()``, and +``rcu_read_lock_bh_held()``. However, the update-side APIs are now +simple wrappers for other RCU flavors, namely RCU-sched in +``CONFIG_PREEMPT=n`` kernels and RCU-preempt otherwise. + +Sched Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-sched flavor of RCU has since been expressed in terms of the +other RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +preemption and to be accounted for by lockdep. Much of the material in +this section is therefore strictly historical in nature. + +Before preemptible RCU, waiting for an RCU grace period had the side +effect of also waiting for all pre-existing interrupt and NMI handlers. +However, there are legitimate preemptible-RCU implementations that do +not have this property, given that any point in the code outside of an +RCU read-side critical section can be a quiescent state. Therefore, +*RCU-sched* was created, which follows “classic” RCU in that an +RCU-sched grace period waits for pre-existing interrupt and NMI +handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and +RCU-sched APIs have identical implementations, while kernels built with +``CONFIG_PREEMPT=y`` provide a separate implementation for each. + +Note well that in ``CONFIG_PREEMPT=y`` kernels, +``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and +re-enable preemption, respectively. This means that if there was a +preemption attempt during the RCU-sched read-side critical section, +``rcu_read_unlock_sched()`` will enter the scheduler, with all the +latency and overhead entailed.
Just as with ``rcu_read_unlock_bh()``, +this can make it look as if ``rcu_read_unlock_sched()`` was executing +very slowly. However, the highest-priority task won't be preempted, so +that task will enjoy low-overhead ``rcu_read_unlock_sched()`` +invocations. + +The `RCU-sched +API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ +includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``, +``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``, +``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``, +``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``, +``call_rcu_sched()``, ``rcu_barrier_sched()``, and +``rcu_read_lock_sched_held()``. However, anything that disables +preemption also marks an RCU-sched read-side critical section, including +``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and +``local_irq_restore()``, and so on. + +Sleepable RCU +~~~~~~~~~~~~~ + +For well over a decade, someone saying “I need to block within an RCU +read-side critical section” was a reliable indication that this someone +did not understand RCU. After all, if you are always blocking in an RCU +read-side critical section, you can probably afford to use a +higher-overhead synchronization mechanism. However, that changed with +the advent of the Linux kernel's notifiers, whose RCU read-side critical +sections almost never sleep, but sometimes need to. This resulted in the +introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or +*SRCU*. + +SRCU allows different domains to be defined, with each such domain +defined by an instance of an ``srcu_struct`` structure. A pointer to +this structure must be passed in to each SRCU function, for example, +``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct`` +structure. The key benefit of these domains is that a slow SRCU reader +in one domain does not delay an SRCU grace period in some other domain. 
+That said, one consequence of these domains is that read-side code must +pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for +example, as follows: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 srcu_read_unlock(&ss, idx); + +As noted above, it is legal to block within SRCU read-side critical +sections; however, with great power comes great responsibility. If you +block forever in one of a given domain's SRCU read-side critical +sections, then that domain's grace periods will also be blocked forever. +Of course, one good way to block forever is to deadlock, which can +happen if any operation in a given domain's SRCU read-side critical +section can wait, either directly or indirectly, for that domain's grace +period to elapse. For example, this results in a self-deadlock: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 synchronize_srcu(&ss); + 6 srcu_read_unlock(&ss, idx); + +However, if line 5 acquired a mutex that was held across a +``synchronize_srcu()`` for domain ``ss``, deadlock would still be +possible. Furthermore, if line 5 acquired a mutex that was held across a +``synchronize_srcu()`` for some other domain ``ss1``, and if an +``ss1``-domain SRCU read-side critical section acquired another mutex +that was held across an ``ss``-domain ``synchronize_srcu()``, deadlock +would again be possible. Such a deadlock cycle could extend across an +arbitrarily large number of different SRCU domains. Again, with great +power comes great responsibility. + +Unlike the other RCU flavors, SRCU read-side critical sections can run +on idle and even offline CPUs. This ability requires that +``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers, +which means that SRCU readers will run a bit slower than would RCU +readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API, +which, in combination with ``srcu_read_unlock()``, guarantees a full +memory barrier.
+ +Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be +invoked from CPU-hotplug notifiers, due to the fact that SRCU grace +periods make use of timers and the possibility of timers being +temporarily “stranded” on the outgoing CPU. This stranding of timers +means that timers posted to the outgoing CPU will not fire until late in +the CPU-hotplug process. The problem is that if a notifier is waiting on +an SRCU grace period, that grace period is waiting on a timer, and that +timer is stranded on the outgoing CPU, then the notifier will never be +awakened, in other words, deadlock has occurred. This same situation of +course also prohibits ``srcu_barrier()`` from being invoked from +CPU-hotplug notifiers. + +SRCU also differs from other RCU flavors in that SRCU's expedited and +non-expedited grace periods are implemented by the same mechanism. This +means that in the current SRCU implementation, expediting a future grace +period has the side effect of expediting all prior grace periods that +have not yet completed. (But please note that this is a property of the +current implementation, not necessarily of future implementations.) In +addition, if SRCU has been idle for longer than the interval specified +by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds +by default), and if a ``synchronize_srcu()`` invocation ends this idle +period, that invocation will be automatically expedited. + +As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a +locking bottleneck present in prior kernel versions. Although this will +allow users to put much heavier stress on ``call_srcu()``, it is +important to note that SRCU does not yet take any special steps to deal +with callback flooding. So if you are posting (say) 10,000 SRCU +callbacks per second per CPU, you are probably totally OK, but if you +intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please +run some tests first. 
SRCU just might need a few adjustments to deal with +that sort of load. Of course, your mileage may vary based on the speed +of your CPUs and the size of your memory. + +The `SRCU +API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ +includes ``srcu_read_lock()``, ``srcu_read_unlock()``, +``srcu_dereference()``, ``srcu_dereference_check()``, +``synchronize_srcu()``, ``synchronize_srcu_expedited()``, +``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It +also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and +``init_srcu_struct()`` APIs for defining and initializing +``srcu_struct`` structures. + +Tasks RCU +~~~~~~~~~ + +Some forms of tracing use “trampolines” to handle the binary rewriting +required to install different types of probes. It would be good to be +able to free old trampolines, which sounds like a job for some form of +RCU. However, because it is necessary to be able to install a trace +anywhere in the code, it is not possible to use read-side markers such +as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does +not work to have these markers in the trampoline itself, because there +would need to be instructions following ``rcu_read_unlock()``. Although +``synchronize_rcu()`` would guarantee that execution reached the +``rcu_read_unlock()``, it would not be able to guarantee that execution +had completely left the trampoline. + +The solution, in the form of `Tasks +RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side +critical sections that are delimited by voluntary context switches, that +is, calls to ``schedule()``, ``cond_resched()``, and +``synchronize_rcu_tasks()``. In addition, transitions to and from +userspace execution also delimit tasks-RCU read-side critical sections. + +The tasks-RCU API is quite compact, consisting only of +``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and +``rcu_barrier_tasks()``.
In ``CONFIG_PREEMPT=n`` kernels, trampolines +cannot be preempted, so these APIs map to ``call_rcu()``, +``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In +``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these +three APIs are therefore implemented by separate functions that check +for voluntary context switches. + +Possible Future Changes +----------------------- + +One of the tricks that RCU uses to attain update-side scalability is to +increase grace-period latency with increasing numbers of CPUs. If this +becomes a serious problem, it will be necessary to rework the +grace-period state machine so as to avoid the need for the additional +latency. + +RCU disables CPU hotplug in a few places, perhaps most notably in the +``rcu_barrier()`` operations. If there is a strong reason to use +``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to +avoid disabling CPU hotplug. This would introduce some complexity, so +there had better be a *very* good reason. + +The tradeoff between grace-period latency on the one hand and +interruptions of other CPUs on the other hand may need to be +re-examined. The desire is of course for zero grace-period latency as +well as zero interprocessor interrupts undertaken during an expedited +grace period operation. While this ideal is unlikely to be achievable, +it is quite possible that further improvements can be made. + +The multiprocessor implementations of RCU use a combining tree that +groups CPUs so as to reduce lock contention and increase cache locality. +However, this combining tree does not spread its memory across NUMA +nodes nor does it align the CPU groups with hardware features such as +sockets or cores. Such spreading and alignment is currently believed to +be unnecessary because the hotpath read-side primitives do not access +the combining tree, nor does ``call_rcu()`` in the common case. 
If you +believe that your architecture needs such spreading and alignment, then +your architecture should also benefit from the +``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the +number of CPUs in a socket, NUMA node, or whatever. If the number of +CPUs is too large, use a fraction of the number of CPUs. If the number +of CPUs is a large prime number, well, that certainly is an +“interesting” architectural choice! More flexible arrangements might be +considered, but only if ``rcutree.rcu_fanout_leaf`` has proven +inadequate, and only if the inadequacy has been demonstrated by a +carefully run and realistic system-level workload. + +Please note that arrangements that require RCU to remap CPU numbers will +require extremely good demonstration of need and full exploration of +alternatives. + +RCU's various kthreads are reasonably recent additions. It is quite +likely that adjustments will be required to more gracefully handle +extreme loads. It might also be necessary to be able to relate CPU +utilization by RCU's kthreads and softirq handlers to the code that +instigated this CPU utilization. For example, RCU callback overhead +might be charged back to the originating ``call_rcu()`` instance, though +probably not in production kernels. + +Additional work may be required to provide reasonable forward-progress +guarantees under heavy load for grace periods and for callback +invocation. + +Summary +------- + +This document has presented more than two decades' worth of RCU +requirements. Given that the requirements keep changing, this will not +be the last word on this subject, but at least it serves to get an +important subset of the requirements set forth. + +Acknowledgments +--------------- + +I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg +Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy +Lutomirski for their help in rendering this article human readable, and +to Michelle Rankin for her support of this effort.
Other contributions +are acknowledged in the Linux kernel's git archive. diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index 340a9725676c..5c99185710fa 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -5,12 +5,17 @@ RCU concepts ============ .. toctree:: - :maxdepth: 1 + :maxdepth: 3 rcu listRCU UP + Design/Memory-Ordering/Tree-RCU-Memory-Ordering + Design/Expedited-Grace-Periods/Expedited-Grace-Periods + Design/Requirements/Requirements + Design/Data-Structures/Data-Structures + .. only:: subproject and html Indices diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.txt index da51d3068850..89db949eeca0 100644 --- a/Documentation/RCU/lockdep.txt +++ b/Documentation/RCU/lockdep.txt @@ -96,7 +96,17 @@ other flavors of rcu_dereference(). On the other hand, it is illegal to use rcu_dereference_protected() if either the RCU-protected pointer or the RCU-protected data that it points to can change concurrently. -There are currently only "universal" versions of the rcu_assign_pointer() -and RCU list-/tree-traversal primitives, which do not (yet) check for -being in an RCU read-side critical section. In the future, separate -versions of these primitives might be created. +Like rcu_dereference(), when lockdep is enabled, RCU list and hlist +traversal primitives check for being called from within an RCU read-side +critical section. However, a lockdep expression can be passed to them +as an additional optional argument. With this lockdep expression, these +traversal primitives will complain only if the lockdep expression is +false and they are called from outside any RCU read-side critical section. + +For example, the workqueue for_each_pwq() macro is intended to be used +either within an RCU read-side critical section or with wq->mutex held.
+It is thus implemented as follows: + + #define for_each_pwq(pwq, wq) \ + list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node, \ + lock_is_held(&(wq->mutex).dep_map)) diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 7e1a8721637a..58ba05c4d97f 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -290,7 +290,7 @@ rcu_dereference() at any time, including immediately after the rcu_dereference(). And, again like rcu_assign_pointer(), rcu_dereference() is typically used indirectly, via the _rcu list-manipulation - primitives, such as list_for_each_entry_rcu(). + primitives, such as list_for_each_entry_rcu() [2]. [1] The variant rcu_dereference_protected() can be used outside of an RCU read-side critical section as long as the usage is @@ -302,9 +302,17 @@ rcu_dereference() must prohibit. The rcu_dereference_protected() variant takes a lockdep expression to indicate which locks must be acquired by the caller. If the indicated protection is not provided, - a lockdep splat is emitted. See RCU/Design/Requirements/Requirements.html + a lockdep splat is emitted. See Documentation/RCU/Design/Requirements/Requirements.rst and the API's code comments for more details and example usage. + [2] If the list_for_each_entry_rcu() instance might be used by + update-side code as well as by RCU readers, then an additional + lockdep expression can be added to its list of arguments. + For example, given an additional "lock_is_held(&mylock)" argument, + the RCU lockdep code would complain only if this instance was + invoked outside of an RCU read-side critical section and without + the protection of mylock. + The following diagram shows how each API communicates among the reader, updater, and reclaimer. @@ -630,7 +638,7 @@ been able to write-acquire the lock otherwise.
The smp_mb__after_spinlock() promotes synchronize_rcu() to a full memory barrier in compliance with the "Memory-Barrier Guarantees" listed in: - Documentation/RCU/Design/Requirements/Requirements.html. + Documentation/RCU/Design/Requirements/Requirements.rst It is possible to nest rcu_read_lock(), since reader-writer locks may be recursively acquired. Note also that rcu_read_lock() is immune diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 5361ebec3361..007ba86aef78 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1334,7 +1334,7 @@ PAGE_SIZE multiple when read back. pgdeactivate - Amount of pages moved to the inactive LRU lis + Amount of pages moved to the inactive LRU list pglazyfree diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst index a30aa91b5fbe..594095b54b29 100644 --- a/Documentation/admin-guide/device-mapper/dm-integrity.rst +++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst @@ -177,6 +177,11 @@ bitmap_flush_interval:number The bitmap flush interval in milliseconds. The metadata buffers are synchronized when this interval expires. +fix_padding + Use a smaller padding of the tag area that is more + space-efficient. If this option is not present, large padding is + used - that is for compatibility with older kernels. + The journal mode (D/J), buffer_sectors, journal_watermark, commit_time can be changed when reloading the target (load an inactive table and swap the diff --git a/Documentation/admin-guide/device-mapper/dm-raid.rst b/Documentation/admin-guide/device-mapper/dm-raid.rst index 2fe255b130fb..f6344675e395 100644 --- a/Documentation/admin-guide/device-mapper/dm-raid.rst +++ b/Documentation/admin-guide/device-mapper/dm-raid.rst @@ -417,3 +417,5 @@ Version History deadlock/potential data corruption. 
Update superblock when specific devices are requested via rebuild. Fix RAID leg rebuild errors. + 1.15.0 Fix size extensions not being synchronized in case of new MD bitmap + pages allocated; also fix those not occurring after previous reductions diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 49311f3da6f2..0795e3c2643f 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -12,3 +12,5 @@ are configurable at compile, boot or run time. spectre l1tf mds + tsx_async_abort + multihit.rst diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst index e3a796c0d3a2..2d19c9f4c1fe 100644 --- a/Documentation/admin-guide/hw-vuln/mds.rst +++ b/Documentation/admin-guide/hw-vuln/mds.rst @@ -265,8 +265,11 @@ time with the option "mds=". The valid arguments for this option are: ============ ============================================================= -Not specifying this option is equivalent to "mds=full". - +Not specifying this option is equivalent to "mds=full". For processors +that are affected by both TAA (TSX Asynchronous Abort) and MDS, +specifying just "mds=off" without an accompanying "tsx_async_abort=off" +will have no effect as the same mitigation is used for both +vulnerabilities. Mitigation selection guide -------------------------- diff --git a/Documentation/admin-guide/hw-vuln/multihit.rst b/Documentation/admin-guide/hw-vuln/multihit.rst new file mode 100644 index 000000000000..ba9988d8bce5 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/multihit.rst @@ -0,0 +1,163 @@ +iTLB multihit +============= + +iTLB multihit is an erratum where some processors may incur a machine check +error, possibly resulting in an unrecoverable CPU lockup, when an +instruction fetch hits multiple entries in the instruction TLB. This can +occur when the page size is changed along with either the physical address +or cache type.
A malicious guest running on a virtualized system can +exploit this erratum to perform a denial of service attack. + + +Affected processors +------------------- + +Variations of this erratum are present on most Intel Core and Xeon processor +models. The erratum is not present on: + + - non-Intel processors + + - Some Atoms (Airmont, Bonnell, Goldmont, GoldmontPlus, Saltwell, Silvermont) + + - Intel processors that have the PSCHANGE_MC_NO bit set in the + IA32_ARCH_CAPABILITIES MSR. + + +Related CVEs +------------ + +The following CVE entry is related to this issue: + + ============== ================================================= + CVE-2018-12207 Machine Check Error Avoidance on Page Size Change + ============== ================================================= + + +Problem +------- + +Privileged software, including OS and virtual machine managers (VMM), is in +charge of memory management. A key component in memory management is the control +of the page tables. Modern processors use virtual memory, a technique that creates +the illusion of a very large memory for processors. This virtual space is split +into pages of a given size. Page tables translate virtual addresses to physical +addresses. + +To reduce latency when performing a virtual to physical address translation, +processors include a structure, called TLB, that caches recent translations. +There are separate TLBs for instruction (iTLB) and data (dTLB). + +Under this erratum, instructions are fetched from a linear address translated +using a 4 KB translation cached in the iTLB. Privileged software modifies the +paging structure so that the same linear address uses a large page size (2 MB, 4 +MB, 1 GB) with a different physical address or memory type. After the page +structure modification but before the software invalidates any iTLB entries for +the linear address, a code fetch that happens on the same linear address may +cause a machine-check error which can result in a system hang or shutdown.
+ + +Attack scenarios +---------------- + +Attacks against the iTLB multihit erratum can be mounted from malicious +guests in a virtualized system. + + +iTLB multihit system information +-------------------------------- + +The Linux kernel provides a sysfs interface to enumerate the current iTLB +multihit status of the system: whether the system is vulnerable and which +mitigations are active. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/itlb_multihit + +The possible values in this file are: + +.. list-table:: + + * - Not affected + - The processor is not vulnerable. + * - KVM: Mitigation: Split huge pages + - Software changes mitigate this issue. + * - KVM: Vulnerable + - The processor is vulnerable, but no mitigation is enabled. + + +Enumeration of the erratum +-------------------------------- + +A new bit (PSCHANGE_MC_NO) has been allocated in the IA32_ARCH_CAPABILITIES MSR +and will be set on CPUs that are mitigated against this issue. + + ======================================= =========== =============================== + IA32_ARCH_CAPABILITIES MSR Not present Possibly vulnerable,check model + IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '0' Likely vulnerable,check model + IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '1' Not vulnerable + ======================================= =========== =============================== + + +Mitigation mechanism +------------------------- + +This erratum can be mitigated by restricting the use of large page sizes to +non-executable pages. This forces all iTLB entries to be 4K, and removes +the possibility of multiple hits. + +In order to mitigate the vulnerability, KVM initially marks all huge pages +as non-executable. If the guest attempts to execute in one of those pages, +the page is broken down into 4K pages, which are then marked executable. + +If EPT is disabled or not available on the host, KVM is in control of TLB +flushes and the problematic situation cannot happen.
However, the shadow +EPT paging mechanism used by nested virtualization is vulnerable, because +the nested guest can trigger multiple iTLB hits by modifying its own +(non-nested) page tables. For simplicity, KVM will make large pages +non-executable in all shadow paging modes. + +Mitigation control on the kernel command line and KVM - module parameter +------------------------------------------------------------------------ + +The KVM hypervisor mitigation mechanism for marking huge pages as +non-executable can be controlled with a module parameter "nx_huge_pages=". +The kernel command line allows to control the iTLB multihit mitigations at +boot time with the option "kvm.nx_huge_pages=". + +The valid arguments for these options are: + + ========== ================================================================ + force Mitigation is enabled. In this case, the mitigation implements + non-executable huge pages in Linux kernel KVM module. All huge + pages in the EPT are marked as non-executable. + If a guest attempts to execute in one of those pages, the page is + broken down into 4K pages, which are then marked executable. + + off Mitigation is disabled. + + auto Enable mitigation only if the platform is affected and the kernel + was not booted with the "mitigations=off" command line parameter. + This is the default option. + ========== ================================================================ + + +Mitigation selection guide +-------------------------- + +1. No virtualization in use +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The system is protected by the kernel unconditionally and no further + action is required. + +2. Virtualization with trusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + If the guest comes from a trusted source, you may assume that the guest will + not attempt to maliciously exploit these errata and no further action is + required. + +3. 
Virtualization with untrusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + If the guest comes from an untrusted source, the guest host kernel will need + to apply iTLB multihit mitigation via the kernel command line or kvm + module parameter. diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst new file mode 100644 index 000000000000..af6865b822d2 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst @@ -0,0 +1,279 @@ +.. SPDX-License-Identifier: GPL-2.0 + +TAA - TSX Asynchronous Abort +====================================== + +TAA is a hardware vulnerability that allows unprivileged speculative access to +data which is available in various CPU internal buffers by using asynchronous +aborts within an Intel TSX transactional region. + +Affected processors +------------------- + +This vulnerability only affects Intel processors that support Intel +Transactional Synchronization Extensions (TSX) when the TAA_NO bit (bit 8) +is 0 in the IA32_ARCH_CAPABILITIES MSR. On processors where the MDS_NO bit +(bit 5) is 0 in the IA32_ARCH_CAPABILITIES MSR, the existing MDS mitigations +also mitigate against TAA. + +Whether a processor is affected or not can be read out from the TAA +vulnerability file in sysfs. See :ref:`tsx_async_abort_sys_info`. + +Related CVEs +------------ + +The following CVE entry is related to this TAA issue: + + ============== ===== =================================================== + CVE-2019-11135 TAA TSX Asynchronous Abort (TAA) condition on some + microprocessors utilizing speculative execution may + allow an authenticated user to potentially enable + information disclosure via a side channel with + local access. + ============== ===== =================================================== + +Problem +------- + +When performing store, load or L1 refill operations, processors write +data into temporary microarchitectural structures (buffers). 
The data in +those buffers can be forwarded to load operations as an optimization. + +Intel TSX is an extension to the x86 instruction set architecture that adds +hardware transactional memory support to improve performance of multi-threaded +software. TSX lets the processor expose and exploit concurrency hidden in an +application due to dynamically avoiding unnecessary synchronization. + +TSX supports atomic memory transactions that are either committed (success) or +aborted. During an abort, operations that happened within the transactional region +are rolled back. An asynchronous abort takes place, among other options, when a +different thread accesses a cache line that is also used within the transactional +region when that access might lead to a data race. + +Immediately after an uncompleted asynchronous abort, certain speculatively +executed loads may read data from those internal buffers and pass it to dependent +operations. This can then be used to infer the value via a cache side channel +attack. + +Because the buffers are potentially shared between Hyper-Threads, cross +Hyper-Thread attacks are possible. + +The victim of a malicious actor does not need to make use of TSX. Only the +attacker needs to begin a TSX transaction and raise an asynchronous abort +which in turn potentially leaks data stored in the buffers. + +More detailed technical information is available in the TAA specific x86 +architecture section: :ref:`Documentation/x86/tsx_async_abort.rst <tsx_async_abort>`. + + +Attack scenarios +---------------- + +Attacks against the TAA vulnerability can be implemented from unprivileged +applications running on hosts or guests. + +As for MDS, the attacker has no control over the memory addresses that can +be leaked. Only the victim is responsible for bringing data to the CPU. As +a result, the malicious actor has to sample as much data as possible and +then postprocess it to try to infer any useful information from it.
+ +A potential attacker only has read access to the data. Also, there is no direct +privilege escalation by using this technique. + + +.. _tsx_async_abort_sys_info: + +TAA system information +----------------------- + +The Linux kernel provides a sysfs interface to enumerate the current TAA status +of mitigated systems. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/tsx_async_abort + +The possible values in this file are: + +.. list-table:: + + * - 'Vulnerable' + - The CPU is affected by this vulnerability and the microcode and kernel mitigation are not applied. + * - 'Vulnerable: Clear CPU buffers attempted, no microcode' + - The system tries to clear the buffers but the microcode might not support the operation. + * - 'Mitigation: Clear CPU buffers' + - The microcode has been updated to clear the buffers. TSX is still enabled. + * - 'Mitigation: TSX disabled' + - TSX is disabled. + * - 'Not affected' + - The CPU is not affected by this issue. + +.. _ucode_needed: + +Best effort mitigation mode +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If the processor is vulnerable, but the availability of the microcode-based +mitigation mechanism is not advertised via CPUID the kernel selects a best +effort mitigation mode. This mode invokes the mitigation instructions +without a guarantee that they clear the CPU buffers. + +This is done to address virtualization scenarios where the host has the +microcode update applied, but the hypervisor is not yet updated to expose the +CPUID to the guest. If the host has updated microcode the protection takes +effect; otherwise a few CPU cycles are wasted pointlessly. + +The state in the tsx_async_abort sysfs file reflects this situation +accordingly. + + +Mitigation mechanism +-------------------- + +The kernel detects the affected CPUs and the presence of the microcode which is +required. If a CPU is affected and the microcode is available, then the kernel +enables the mitigation by default. 
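The status strings listed under "TAA system information" can be checked from user space. The following is a minimal Python sketch, not part of the patch: the helper names `classify_taa_status` and `read_taa_status` and the coarse state labels are assumptions made for illustration. Matching is done on prefixes on the assumption that a kernel may append further detail to a status line.

```python
from pathlib import Path

# sysfs file named in the "TAA system information" section.
TAA_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/tsx_async_abort")


def classify_taa_status(text):
    """Map a status string to a coarse state (hypothetical helper).

    Returns one of "not_affected", "mitigated", "vulnerable", "unknown".
    """
    status = text.strip()
    if status == "Not affected":
        return "not_affected"
    if status.startswith("Mitigation:"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_taa_status(path=TAA_SYSFS):
    """Classify the sysfs file content, or return "unknown" if unreadable."""
    try:
        return classify_taa_status(path.read_text())
    except OSError:
        # Kernels without this file, or non-Linux systems, end up here.
        return "unknown"
```

Calling `read_taa_status()` on a kernel that lacks the file returns `"unknown"` rather than raising, so the helper can be used in scripts that run across mixed kernel versions.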
+ + +The mitigation can be controlled at boot time via a kernel command line option. +See :ref:`taa_mitigation_control_command_line`. + +.. _virt_mechanism: + +Virtualization mitigation +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Affected systems where the host has TAA microcode and TAA is mitigated by +having disabled TSX previously, are not vulnerable regardless of the status +of the VMs. + +In all other cases, if the host either does not have the TAA microcode or +the kernel is not mitigated, the system might be vulnerable. + + +.. _taa_mitigation_control_command_line: + +Mitigation control on the kernel command line +--------------------------------------------- + +The kernel command line allows to control the TAA mitigations at boot time with +the option "tsx_async_abort=". The valid arguments for this option are: + + ============ ============================================================= + off This option disables the TAA mitigation on affected platforms. + If the system has TSX enabled (see next parameter) and the CPU + is affected, the system is vulnerable. + + full TAA mitigation is enabled. If TSX is enabled, on an affected + system it will clear CPU buffers on ring transitions. On + systems which are MDS-affected and deploy MDS mitigation, + TAA is also mitigated. Specifying this option on those + systems will have no effect. + + full,nosmt The same as tsx_async_abort=full, with SMT disabled on + vulnerable CPUs that have TSX enabled. This is the complete + mitigation. When TSX is disabled, SMT is not disabled because + CPU is not vulnerable to cross-thread TAA attacks. + ============ ============================================================= + +Not specifying this option is equivalent to "tsx_async_abort=full". For +processors that are affected by both TAA and MDS, specifying just +"tsx_async_abort=off" without an accompanying "mds=off" will have no +effect as the same mitigation is used for both vulnerabilities. 
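The interaction between "tsx_async_abort=" and "mds=" described above can be modelled in a few lines. This Python sketch is a simplified illustration, not kernel code. It assumes a CPU affected by both TAA and MDS, assumes that the last occurrence of an option on the command line wins, and ignores other switches such as "mitigations=". The function names are hypothetical.

```python
# Arguments accepted by both options according to the documentation.
VALID_ARGS = ("off", "full", "full,nosmt")


def option_value(cmdline, name, default="full"):
    """Return the value of the last 'name=value' token, or the default.

    Assumption: a later occurrence overrides an earlier one.
    """
    value = default
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


def buffer_clearing_active(cmdline):
    """Model for a CPU affected by both TAA and MDS (hypothetical helper).

    Both mitigations use the same buffer clearing mechanism, so it stays
    active unless both options are set to "off".
    """
    taa = option_value(cmdline, "tsx_async_abort")
    mds = option_value(cmdline, "mds")
    for name, val in (("tsx_async_abort", taa), ("mds", mds)):
        if val not in VALID_ARGS:
            raise ValueError(f"unsupported argument {name}={val}")
    return taa != "off" or mds != "off"
```

In this model `buffer_clearing_active("tsx_async_abort=off")` is still true, which mirrors the statement that "tsx_async_abort=off" alone has no effect on such processors.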
+ +The kernel command line also allows to control the TSX feature using the +parameter "tsx=" on CPUs which support TSX control. MSR_IA32_TSX_CTRL is used +to control the TSX feature and the enumeration of the TSX feature bits (RTM +and HLE) in CPUID. + +The valid options are: + + ============ ============================================================= + off Disables TSX on the system. + + Note that this option takes effect only on newer CPUs which are + not vulnerable to MDS, i.e., have MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 + and which get the new IA32_TSX_CTRL MSR through a microcode + update. This new MSR allows for the reliable deactivation of + the TSX functionality. + + on Enables TSX. + + Although there are mitigations for all known security + vulnerabilities, TSX has been known to be an accelerator for + several previous speculation-related CVEs, and so there may be + unknown security risks associated with leaving it enabled. + + auto Disables TSX if X86_BUG_TAA is present, otherwise enables TSX + on the system. + ============ ============================================================= + +Not specifying this option is equivalent to "tsx=off". + +The following combinations of the "tsx_async_abort" and "tsx" are possible. For +affected platforms tsx=auto is equivalent to tsx=off and the result will be: + + ========= ========================== ========================================= + tsx=on tsx_async_abort=full The system will use VERW to clear CPU + buffers. Cross-thread attacks are still + possible on SMT machines. + tsx=on tsx_async_abort=full,nosmt As above, cross-thread attacks on SMT + mitigated. + tsx=on tsx_async_abort=off The system is vulnerable. + tsx=off tsx_async_abort=full TSX might be disabled if microcode + provides a TSX control MSR. If so, + system is not vulnerable. 
+ tsx=off tsx_async_abort=full,nosmt Ditto + tsx=off tsx_async_abort=off ditto + ========= ========================== ========================================= + + +For unaffected platforms "tsx=on" and "tsx_async_abort=full" does not clear CPU +buffers. For platforms without TSX control (MSR_IA32_ARCH_CAPABILITIES.MDS_NO=0) +"tsx" command line argument has no effect. + +For the affected platforms below table indicates the mitigation status for the +combinations of CPUID bit MD_CLEAR and IA32_ARCH_CAPABILITIES MSR bits MDS_NO +and TSX_CTRL_MSR. + + ======= ========= ============= ======================================== + MDS_NO MD_CLEAR TSX_CTRL_MSR Status + ======= ========= ============= ======================================== + 0 0 0 Vulnerable (needs microcode) + 0 1 0 MDS and TAA mitigated via VERW + 1 1 0 MDS fixed, TAA vulnerable if TSX enabled + because MD_CLEAR has no meaning and + VERW is not guaranteed to clear buffers + 1 X 1 MDS fixed, TAA can be mitigated by + VERW or TSX_CTRL_MSR + ======= ========= ============= ======================================== + +Mitigation selection guide +-------------------------- + +1. Trusted userspace and guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If all user space applications are from a trusted source and do not execute +untrusted code which is supplied externally, then the mitigation can be +disabled. The same applies to virtualized environments with trusted guests. + + +2. Untrusted userspace and guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If there are untrusted applications or guests on the system, enabling TSX +might allow a malicious actor to leak data from the host or from other +processes running on the same physical core. + +If the microcode is available and the TSX is disabled on the host, attacks +are prevented in a virtualized environment as well, even if the VMs do not +explicitly enable the mitigation. + + +.. 
_taa_default_mitigations: + +Default mitigations +------------------- + +The kernel's default action for vulnerable processors is: + + - Deploy TSX disable mitigation (tsx_async_abort=full tsx=off). diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst index 5d63b18bd6d1..4f0462af3ca7 100644 --- a/Documentation/admin-guide/iostats.rst +++ b/Documentation/admin-guide/iostats.rst @@ -121,6 +121,15 @@ Field 15 -- # of milliseconds spent discarding This is the total number of milliseconds spent by all discards (as measured from __make_request() to end_that_request_last()). +Field 16 -- # of flush requests completed + This is the total number of flush requests completed successfully. + + Block layer combines flush requests and executes at most one at a time. + This counts flush requests executed by disk. Not tracked for partitions. + +Field 17 -- # of milliseconds spent flushing + This is the total number of milliseconds spent by all flush requests. + To avoid introducing performance bottlenecks, no locks are held while modifying these counters. This implies that minor inaccuracies may be introduced when changes collide, so (for instance) adding up all the diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index d05d531b4ec9..6d421694d98e 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -127,6 +127,7 @@ parameter is applicable:: NET Appropriate network support is enabled. NUMA NUMA support is enabled. NFS Appropriate NFS support is enabled. + OF Devicetree is enabled. OSS OSS sound support is enabled. PV_OPS A paravirtualized kernel is enabled. PARIDE The ParIDE (parallel port IDE) subsystem is enabled. 
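Fields 16 and 17 from the iostats.rst change above can be read back from a statistics line. The Python sketch below is illustrative only: it assumes the /proc/diskstats layout in which three identifying columns (major, minor, device name) precede Field 1, and the function name `parse_flush_stats` is hypothetical.

```python
def parse_flush_stats(line):
    """Return (device, flush_requests, flush_ms) from one diskstats line.

    Returns None when the line has fewer than 20 columns, which is what
    a kernel without the flush fields would produce.
    """
    tokens = line.split()
    # Major, minor and name come first, so Field N sits at index N + 2.
    if len(tokens) < 20:
        return None
    device = tokens[2]
    flush_requests = int(tokens[18])  # Field 16
    flush_ms = int(tokens[19])        # Field 17
    return device, flush_requests, flush_ms
```

Because flush requests are not tracked for partitions, a partition line is expected to report zero in both fields.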
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a84a83f8881e..b25b47a47acd 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1168,7 +1168,8 @@ Format: {"off" | "on" | "skip[mbr]"} efi= [EFI] - Format: { "old_map", "nochunk", "noruntime", "debug" } + Format: { "old_map", "nochunk", "noruntime", "debug", + "nosoftreserve" } old_map [X86-64]: switch to the old ioremap-based EFI runtime services mapping. 32-bit still uses this one by default. @@ -1177,6 +1178,12 @@ firmware implementations. noruntime : disable EFI runtime services support debug: enable misc debug output + nosoftreserve: The EFI_MEMORY_SP (Specific Purpose) + attribute may cause the kernel to reserve the + memory range for a memory mapping driver to + claim. Specify efi=nosoftreserve to disable this + reservation and treat the memory by its base type + (i.e. EFI_CONVENTIONAL_MEMORY / "System RAM"). efi_no_storage_paranoia [EFI; X86] Using this parameter you can use more than 50% of @@ -1189,15 +1196,21 @@ updating original EFI memory map. Region of memory which aa attribute is added to is from ss to ss+nn. + If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000 is specified, EFI_MEMORY_MORE_RELIABLE(0x10000) attribute is added to range 0x100000000-0x180000000 and 0x10a0000000-0x1120000000. + If efi_fake_mem=8G@9G:0x40000 is specified, the + EFI_MEMORY_SP(0x40000) attribute is added to + range 0x240000000-0x43fffffff. + Using this parameter you can do debugging of EFI memmap - related feature. For example, you can do debugging of + related features. For example, you can do debugging of Address Range Mirroring feature even if your box - doesn't support it. + doesn't support it, or mark specific memory as + "soft reserved". efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT that is to be dynamically loaded by Linux. 
If there are @@ -2055,6 +2068,25 @@ KVM MMU at runtime. Default is 0 (off) + kvm.nx_huge_pages= + [KVM] Controls the software workaround for the + X86_BUG_ITLB_MULTIHIT bug. + force : Always deploy workaround. + off : Never deploy workaround. + auto : Deploy workaround based on the presence of + X86_BUG_ITLB_MULTIHIT. + + Default is 'auto'. + + If the software workaround is enabled for the host, + guests do not need to enable it for nested guests. + + kvm.nx_huge_pages_recovery_ratio= + [KVM] Controls how many 4KiB pages are periodically zapped + back to huge pages. 0 disables the recovery, otherwise if + the value is N KVM will zap 1/Nth of the 4KiB pages every + minute. The default is 60. + kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM. Default is 1 (enabled) @@ -2454,6 +2486,12 @@ SMT on vulnerable CPUs off - Unconditionally disable MDS mitigation + On TAA-affected machines, mds=off can be prevented by + an active TAA mitigation as both vulnerabilities are + mitigated with the same mechanism so in order to disable + this mitigation, you need to specify tsx_async_abort=off + too. + Not specifying this option is equivalent to mds=full. @@ -2636,6 +2674,13 @@ ssbd=force-off [ARM64] l1tf=off [X86] mds=off [X86] + tsx_async_abort=off [X86] + kvm.nx_huge_pages=off [X86] + + Exceptions: + This does not have any effect on + kvm.nx_huge_pages when + kvm.nx_huge_pages=force. auto (default) Mitigate all CPU vulnerabilities, but leave SMT @@ -2651,6 +2696,7 @@ be fully mitigated, even if it means losing SMT. Equivalent to: l1tf=flush,nosmt [X86] mds=full,nosmt [X86] + tsx_async_abort=full,nosmt [X86] mminit_loglevel= [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this @@ -3083,9 +3129,9 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
- steal time is computed, but won't influence scheduler - behaviour + no-steal-acc [X86,KVM,ARM64] Disable paravirtualized steal time + accounting. steal time is computed, but won't + influence scheduler behaviour nolapic [X86-32,APIC] Do not enable or use the local APIC. @@ -3194,6 +3240,12 @@ This can be set from sysctl after boot. See Documentation/admin-guide/sysctl/vm.rst for details. + of_devlink [OF, KNL] Create device links between consumer and + supplier devices by scanning the devicetree to infer the + consumer/supplier relationships. A consumer device + will not be probed until all the supplier devices have + probed successfully. + ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. See Documentation/debugging-via-ohci1394.txt for more info. @@ -4848,6 +4900,76 @@ interruptions from clocksource watchdog are not acceptable). + tsx= [X86] Control Transactional Synchronization + Extensions (TSX) feature in Intel processors that + support TSX control. + + This parameter controls the TSX feature. The options are: + + on - Enable TSX on the system. Although there are + mitigations for all known security vulnerabilities, + TSX has been known to be an accelerator for + several previous speculation-related CVEs, and + so there may be unknown security risks associated + with leaving it enabled. + + off - Disable TSX on the system. (Note that this + option takes effect only on newer CPUs which are + not vulnerable to MDS, i.e., have + MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 and which get + the new IA32_TSX_CTRL MSR through a microcode + update. This new MSR allows for the reliable + deactivation of the TSX functionality.) + + auto - Disable TSX if X86_BUG_TAA is present, + otherwise enable TSX on the system. + + Not specifying this option is equivalent to tsx=off. + + See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst + for more details. + + tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async + Abort (TAA) vulnerability.
+ + Similar to Micro-architectural Data Sampling (MDS) + certain CPUs that support Transactional + Synchronization Extensions (TSX) are vulnerable to an + exploit against CPU internal buffers which can forward + information to a disclosure gadget under certain + conditions. + + In vulnerable processors, the speculatively forwarded + data can be used in a cache side channel attack, to + access data to which the attacker does not have direct + access. + + This parameter controls the TAA mitigation. The + options are: + + full - Enable TAA mitigation on vulnerable CPUs + if TSX is enabled. + + full,nosmt - Enable TAA mitigation and disable SMT on + vulnerable CPUs. If TSX is disabled, SMT + is not disabled because CPU is not + vulnerable to cross-thread TAA attacks. + off - Unconditionally disable TAA mitigation + + On MDS-affected machines, tsx_async_abort=off can be + prevented by an active MDS mitigation as both vulnerabilities + are mitigated with the same mechanism so in order to disable + this mitigation, you need to specify mds=off too. + + Not specifying this option is equivalent to + tsx_async_abort=full. On CPUs which are MDS affected + and deploy MDS mitigation, TAA mitigation is not + required and doesn't provide any additional + mitigation. 
+ + For details see: + Documentation/admin-guide/hw-vuln/tsx_async_abort.rst + turbografx.map[2|3]= [HW,JOY] TurboGraFX parallel port interface Format: @@ -4998,13 +5120,13 @@ Flags is a set of characters, each corresponding to a common usb-storage quirk flag as follows: a = SANE_SENSE (collect more than 18 bytes - of sense data); + of sense data, not on uas); b = BAD_SENSE (don't collect more than 18 - bytes of sense data); + bytes of sense data, not on uas); c = FIX_CAPACITY (decrease the reported device capacity by one sector); d = NO_READ_DISC_INFO (don't use - READ_DISC_INFO command); + READ_DISC_INFO command, not on uas); e = NO_READ_CAPACITY_16 (don't use READ_CAPACITY_16 command); f = NO_REPORT_OPCODES (don't use report opcodes @@ -5019,17 +5141,18 @@ j = NO_REPORT_LUNS (don't use report luns command, uas only); l = NOT_LOCKABLE (don't try to lock and - unlock ejectable media); + unlock ejectable media, not on uas); m = MAX_SECTORS_64 (don't transfer more - than 64 sectors = 32 KB at a time); + than 64 sectors = 32 KB at a time, + not on uas); n = INITIAL_READ10 (force a retry of the - initial READ(10) command); + initial READ(10) command, not on uas); o = CAPACITY_OK (accept the capacity - reported by the device); + reported by the device, not on uas); p = WRITE_CACHE (the device cache is ON - by default); + by default, not on uas); r = IGNORE_RESIDUE (the device reports - bogus residue values); + bogus residue values, not on uas); s = SINGLE_LUN (the device has only one Logical Unit); t = NO_ATA_1X (don't allow ATA(12) and ATA(16) @@ -5038,7 +5161,8 @@ w = NO_WP_DETECT (don't test whether the medium is write-protected). 
y = ALWAYS_SYNC (issue a SYNCHRONIZE_CACHE - even if the device claims no cache) + even if the device claims no cache, + not on uas) Example: quirks=0419:aaf5:rl,0421:0433:rc user_debug= [KNL,ARM] diff --git a/Documentation/admin-guide/perf/imx-ddr.rst b/Documentation/admin-guide/perf/imx-ddr.rst index 517a205abad6..90056e4e8859 100644 --- a/Documentation/admin-guide/perf/imx-ddr.rst +++ b/Documentation/admin-guide/perf/imx-ddr.rst @@ -17,7 +17,8 @@ The "format" directory describes format of the config (event ID) and config1 (AXI filtering) fields of the perf_event_attr structure, see /sys/bus/event_source/ devices/imx8_ddr0/format/. The "events" directory describes the events types hardware supported that can be used with perf tool, see /sys/bus/event_source/ -devices/imx8_ddr0/events/. +devices/imx8_ddr0/events/. The "caps" directory describes filter features implemented +in DDR PMU, see /sys/bus/events_source/devices/imx8_ddr0/caps/. e.g.:: perf stat -a -e imx8_ddr0/cycles/ cmd perf stat -a -e imx8_ddr0/read/,imx8_ddr0/write/ cmd @@ -25,9 +26,12 @@ devices/imx8_ddr0/events/. AXI filtering is only used by CSV modes 0x41 (axid-read) and 0x42 (axid-write) to count reading or writing matches filter setting. Filter setting is various from different DRAM controller implementations, which is distinguished by quirks -in the driver. +in the driver. You also can dump info from userspace, filter in "caps" directory +indicates whether PMU supports AXI ID filter or not; enhanced_filter indicates +whether PMU supports enhanced AXI ID filter or not. Value 0 for un-supported, and +value 1 for supported. -* With DDR_CAP_AXI_ID_FILTER quirk. +* With DDR_CAP_AXI_ID_FILTER quirk(filter: 1, enhanced_filter: 0). Filter is defined with two configuration parts: --AXI_ID defines AxID matching value. --AXI_MASKING defines which bits of AxID are meaningful for the matching. @@ -50,3 +54,8 @@ in the driver. axi_id to monitor a specific id, rather than having to specify axi_mask. 
e.g.:: perf stat -a -e imx8_ddr0/axid-read,axi_id=0x12/ cmd, which will monitor ARID=0x12 + +* With DDR_CAP_AXI_ID_FILTER_ENHANCED quirk(filter: 1, enhanced_filter: 1). + This is an extension to the DDR_CAP_AXI_ID_FILTER quirk which permits + counting the number of bytes (as opposed to the number of bursts) from DDR + read and write transactions concurrently with another set of data counters. diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst index 08e33675853a..01f158238ae1 100644 --- a/Documentation/admin-guide/perf/thunderx2-pmu.rst +++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst @@ -3,24 +3,26 @@ Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE) ============================================================= The ThunderX2 SoC PMU consists of independent, system-wide, per-socket -PMUs such as the Level 3 Cache (L3C) and DDR4 Memory Controller (DMC). +PMUs such as the Level 3 Cache (L3C), DDR4 Memory Controller (DMC) and +Cavium Coherent Processor Interconnect (CCPI2). The DMC has 8 interleaved channels and the L3C has 16 interleaved tiles. Events are counted for the default channel (i.e. channel 0) and prorated to the total number of channels/tiles. -The DMC and L3C support up to 4 counters. Counters are independently -programmable and can be started and stopped individually. Each counter -can be set to a different event. Counters are 32-bit and do not support -an overflow interrupt; they are read every 2 seconds. +The DMC and L3C support up to 4 counters, while the CCPI2 supports up to 8 +counters. Counters are independently programmable to different events and +can be started and stopped individually. None of the counters support an +overflow interrupt. DMC and L3C counters are 32-bit and read every 2 seconds. +The CCPI2 counters are 64-bit and assumed not to overflow in normal operation. 
PMU UNCORE (perf) driver: The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and -L3C devices. Each PMU can be used to count up to 4 events -simultaneously. The PMUs provide a description of their available events -and configuration options under sysfs, see -/sys/devices/uncore_<l3c_S/dmc_S/>; S is the socket id. +L3C devices. Each PMU can be used to count up to 4 (DMC/L3C) or up to 8 +(CCPI2) events simultaneously. The PMUs provide a description of their +available events and configuration options under sysfs, see +/sys/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id. The driver does not support sampling, therefore "perf record" will not work. Per-task perf sessions are also not supported. diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst index 2b20f5f7380d..0310db624964 100644 --- a/Documentation/admin-guide/ras.rst +++ b/Documentation/admin-guide/ras.rst @@ -330,9 +330,12 @@ There can be multiple csrows and multiple channels. .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely used to refer to a memory module, although there are other memory - packaging alternatives, like SO-DIMM, SIMM, etc. Along this document, - and inside the EDAC system, the term "dimm" is used for all memory - modules, even when they use a different kind of packaging. + packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI + specification (Version 2.7) defines a memory module in the Common + Platform Error Record (CPER) section to be an SMBIOS Memory Device + (Type 17). Along this document, and inside the EDAC subsystem, the term + "dimm" is used for all memory modules, even when they use a + different kind of packaging. Memory controllers allow for several csrows, with 8 csrows being a typical value. Yet, the actual number of csrows depends on the layout of @@ -349,12 +352,14 @@ controllers. 
The following example will assume 2 channels: | | ``ch0`` | ``ch1`` | +============+===========+===========+ | ``csrow0`` | DIMM_A0 | DIMM_B0 | - +------------+ | | - | ``csrow1`` | | | + | | rank0 | rank0 | + +------------+ - | - | + | ``csrow1`` | rank1 | rank1 | +------------+-----------+-----------+ | ``csrow2`` | DIMM_A1 | DIMM_B1 | - +------------+ | | - | ``csrow3`` | | | + | | rank0 | rank0 | + +------------+ - | - | + | ``csrow3`` | rank1 | rank1 | +------------+-----------+-----------+ In the above example, there are 4 physical slots on the motherboard @@ -374,11 +379,13 @@ which the memory DIMM is placed. Thus, when 1 DIMM is placed in each Channel, the csrows cross both DIMMs. Memory DIMMs come single or dual "ranked". A rank is a populated csrow. -Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above -will have just one csrow (csrow0). csrow1 will be empty. On the other -hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0 -and csrow1 will be populated. The pattern repeats itself for csrow2 and -csrow3. +In the example above 2 dual ranked DIMMs are similarly placed. Thus, +both csrow0 and csrow1 are populated. On the other hand, when 2 single +ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will +have just one csrow (csrow0) and csrow1 will be empty. The pattern +repeats itself for csrow2 and csrow3. Also note that some memory +controllers don't have any logic to identify the memory module, see +``rankX`` directories below. The representation of the above is reflected in the directory tree in EDAC's sysfs interface. Starting in directory diff --git a/Documentation/arm64/booting.rst b/Documentation/arm64/booting.rst index d3f3a60fbf25..5d78a6f5b0ae 100644 --- a/Documentation/arm64/booting.rst +++ b/Documentation/arm64/booting.rst @@ -213,6 +213,9 @@ Before jumping into the kernel, the following conditions must be met: - ICC_SRE_EL3.Enable (bit 3) must be initialiased to 0b1. 
- ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b1. + - ICC_CTLR_EL3.PMHE (bit 6) must be set to the same value across + all CPUs the kernel is executing on, and must stay constant + for the lifetime of the kernel. - If the kernel is entered at EL1: diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst index 2955287e9acc..b6e44884e3ad 100644 --- a/Documentation/arm64/cpu-feature-registers.rst +++ b/Documentation/arm64/cpu-feature-registers.rst @@ -168,8 +168,15 @@ infrastructure: +------------------------------+---------+---------+ - 3) MIDR_EL1 - Main ID Register + 3) ID_AA64PFR1_EL1 - Processor Feature Register 1 + +------------------------------+---------+---------+ + | Name | bits | visible | + +------------------------------+---------+---------+ + | SSBS | [7-4] | y | + +------------------------------+---------+---------+ + + 4) MIDR_EL1 - Main ID Register +------------------------------+---------+---------+ | Name | bits | visible | +------------------------------+---------+---------+ @@ -188,11 +195,15 @@ infrastructure: as available on the CPU where it is fetched and is not a system wide safe value. 
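The "bits" column in the register tables above, for example SSBS at [7-4] in ID_AA64PFR1_EL1, names an inclusive bit range inside a 64-bit register value. The Python sketch below shows how such a field could be extracted once a raw value is available. How the value is obtained is outside its scope, and both the helper name `extract_field` and the sample value are made up for illustration.

```python
def extract_field(value, high, low):
    """Return bits [high:low], inclusive, of an integer register value."""
    if not 0 <= low <= high:
        raise ValueError("expected 0 <= low <= high")
    width = high - low + 1
    # Shift the field down to bit 0, then mask off everything above it.
    return (value >> low) & ((1 << width) - 1)


# Hypothetical raw value of ID_AA64PFR1_EL1, chosen for illustration.
sample_pfr1 = 0x20
ssbs = extract_field(sample_pfr1, 7, 4)
```

With the sample value, the SSBS field evaluates to 2 because only bit 5 of the register value is set.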
- 4) ID_AA64ISAR1_EL1 - Instruction set attribute register 1 + 5) ID_AA64ISAR1_EL1 - Instruction set attribute register 1 +------------------------------+---------+---------+ | Name | bits | visible | +------------------------------+---------+---------+ + | SB | [39-36] | y | + +------------------------------+---------+---------+ + | FRINTTS | [35-32] | y | + +------------------------------+---------+---------+ | GPI | [31-28] | y | +------------------------------+---------+---------+ | GPA | [27-24] | y | @@ -210,7 +221,7 @@ infrastructure: | DPB | [3-0] | y | +------------------------------+---------+---------+ - 5) ID_AA64MMFR2_EL1 - Memory model feature register 2 + 6) ID_AA64MMFR2_EL1 - Memory model feature register 2 +------------------------------+---------+---------+ | Name | bits | visible | @@ -218,7 +229,7 @@ infrastructure: | AT | [35-32] | y | +------------------------------+---------+---------+ - 6) ID_AA64ZFR0_EL1 - SVE feature ID register 0 + 7) ID_AA64ZFR0_EL1 - SVE feature ID register 0 +------------------------------+---------+---------+ | Name | bits | visible | diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst index 91f79529c58c..7fa3d215ae6a 100644 --- a/Documentation/arm64/elf_hwcaps.rst +++ b/Documentation/arm64/elf_hwcaps.rst @@ -119,10 +119,6 @@ HWCAP_LRCPC HWCAP_DCPOP Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0001. -HWCAP2_DCPODP - - Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010. - HWCAP_SHA3 Functionality implied by ID_AA64ISAR0_EL1.SHA3 == 0b0001. @@ -141,30 +137,6 @@ HWCAP_SHA512 HWCAP_SVE Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001. -HWCAP2_SVE2 - - Functionality implied by ID_AA64ZFR0_EL1.SVEVer == 0b0001. - -HWCAP2_SVEAES - - Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0001. - -HWCAP2_SVEPMULL - - Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0010. - -HWCAP2_SVEBITPERM - - Functionality implied by ID_AA64ZFR0_EL1.BitPerm == 0b0001. 
- -HWCAP2_SVESHA3 - - Functionality implied by ID_AA64ZFR0_EL1.SHA3 == 0b0001. - -HWCAP2_SVESM4 - - Functionality implied by ID_AA64ZFR0_EL1.SM4 == 0b0001. - HWCAP_ASIMDFHM Functionality implied by ID_AA64ISAR0_EL1.FHM == 0b0001. @@ -180,13 +152,12 @@ HWCAP_ILRCPC HWCAP_FLAGM Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0001. -HWCAP2_FLAGM2 - - Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0010. - HWCAP_SSBS Functionality implied by ID_AA64PFR1_EL1.SSBS == 0b0010. +HWCAP_SB + Functionality implied by ID_AA64ISAR1_EL1.SB == 0b0001. + HWCAP_PACA Functionality implied by ID_AA64ISAR1_EL1.APA == 0b0001 or ID_AA64ISAR1_EL1.API == 0b0001, as described by @@ -197,6 +168,38 @@ HWCAP_PACG ID_AA64ISAR1_EL1.GPI == 0b0001, as described by Documentation/arm64/pointer-authentication.rst. +HWCAP2_DCPODP + + Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010. + +HWCAP2_SVE2 + + Functionality implied by ID_AA64ZFR0_EL1.SVEVer == 0b0001. + +HWCAP2_SVEAES + + Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0001. + +HWCAP2_SVEPMULL + + Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0010. + +HWCAP2_SVEBITPERM + + Functionality implied by ID_AA64ZFR0_EL1.BitPerm == 0b0001. + +HWCAP2_SVESHA3 + + Functionality implied by ID_AA64ZFR0_EL1.SHA3 == 0b0001. + +HWCAP2_SVESM4 + + Functionality implied by ID_AA64ZFR0_EL1.SM4 == 0b0001. + +HWCAP2_FLAGM2 + + Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0010. + HWCAP2_FRINT Functionality implied by ID_AA64ISAR1_EL1.FRINTTS == 0b0001. diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst index ab7ed2fd072f..99b2545455ff 100644 --- a/Documentation/arm64/silicon-errata.rst +++ b/Documentation/arm64/silicon-errata.rst @@ -70,8 +70,12 @@ stable kernels. 
+----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A57 | #834220 | ARM64_ERRATUM_834220 | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A57 | #1319537 | ARM64_ERRATUM_1319367 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A72 | #853709 | N/A | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A72 | #1319367 | ARM64_ERRATUM_1319367 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A73 | #858921 | ARM64_ERRATUM_858921 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A55 | #1024718 | ARM64_ERRATUM_1024718 | @@ -88,9 +92,16 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N1 | #1349291 | N/A | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Neoverse-N1 | #1542419 | ARM64_ERRATUM_1542419 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | MMU-500 | #841119,826419 | N/A | +----------------+-----------------+-----------------+-----------------------------+ +----------------+-----------------+-----------------+-----------------------------+ +| Broadcom | Brahma-B53 | N/A | ARM64_ERRATUM_845719 | ++----------------+-----------------+-----------------+-----------------------------+ +| Broadcom | Brahma-B53 | N/A | ARM64_ERRATUM_843419 | ++----------------+-----------------+-----------------+-----------------------------+ ++----------------+-----------------+-----------------+-----------------------------+ | Cavium | ThunderX ITS | #22375,24313 | CAVIUM_ERRATUM_22375 | +----------------+-----------------+-----------------+-----------------------------+ | Cavium | ThunderX ITS | #23144 | 
CAVIUM_ERRATUM_23144 | @@ -126,7 +137,7 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | Qualcomm Tech. | Kryo/Falkor v1 | E1003 | QCOM_FALKOR_ERRATUM_1003 | +----------------+-----------------+-----------------+-----------------------------+ -| Qualcomm Tech. | Falkor v1 | E1009 | QCOM_FALKOR_ERRATUM_1009 | +| Qualcomm Tech. | Kryo/Falkor v1 | E1009 | QCOM_FALKOR_ERRATUM_1009 | +----------------+-----------------+-----------------+-----------------------------+ | Qualcomm Tech. | QDF2400 ITS | E0065 | QCOM_QDF2400_ERRATUM_0065 | +----------------+-----------------+-----------------+-----------------------------+ diff --git a/Documentation/asm-annotations.rst b/Documentation/asm-annotations.rst new file mode 100644 index 000000000000..f55c2bb74d00 --- /dev/null +++ b/Documentation/asm-annotations.rst @@ -0,0 +1,216 @@ +Assembler Annotations +===================== + +Copyright (c) 2017-2019 Jiri Slaby + +This document describes the new macros for annotation of data and code in +assembly. In particular, it contains information about ``SYM_FUNC_START``, +``SYM_FUNC_END``, ``SYM_CODE_START``, and similar. + +Rationale +--------- +Some code like entries, trampolines, or boot code needs to be written in +assembly. The same as in C, such code is grouped into functions and +accompanied with data. Standard assemblers do not force users into precisely +marking these pieces as code, data, or even specifying their length. +Nevertheless, assemblers provide developers with such annotations to aid +debuggers throughout assembly. On top of that, developers also want to mark +some functions as *global* in order to be visible outside of their translation +units. + +Over time, the Linux kernel has adopted macros from various projects (like +``binutils``) to facilitate such annotations. So for historic reasons, +developers have been using ``ENTRY``, ``END``, ``ENDPROC``, and other +annotations in assembly. 
Due to the lack of their documentation, the macros +are used in rather wrong contexts at some locations. Clearly, ``ENTRY`` was +intended to denote the beginning of global symbols (be it data or code). +``END`` used to mark the end of data or end of special functions with +*non-standard* calling convention. In contrast, ``ENDPROC`` should annotate +only ends of *standard* functions. + +When these macros are used correctly, they help assemblers generate a nice +object with both sizes and types set correctly. For example, the result of +``arch/x86/lib/putuser.S``:: + + Num: Value Size Type Bind Vis Ndx Name + 25: 0000000000000000 33 FUNC GLOBAL DEFAULT 1 __put_user_1 + 29: 0000000000000030 37 FUNC GLOBAL DEFAULT 1 __put_user_2 + 32: 0000000000000060 36 FUNC GLOBAL DEFAULT 1 __put_user_4 + 35: 0000000000000090 37 FUNC GLOBAL DEFAULT 1 __put_user_8 + +This is not only important for debugging purposes. When there are properly +annotated objects like this, tools can be run on them to generate more useful +information. In particular, on properly annotated objects, ``objtool`` can be +run to check and fix the object if needed. Currently, ``objtool`` can report +missing frame pointer setup/destruction in functions. It can also +automatically generate annotations for :doc:`ORC unwinder <x86/orc-unwinder>` +for most code. Both of these are especially important to support reliable +stack traces which are in turn necessary for :doc:`Kernel live patching +<livepatch/livepatch>`. + +Caveat and Discussion +--------------------- +As one might realize, there were only three macros previously. 
That is indeed +insufficient to cover all the combinations of cases: + +* standard/non-standard function +* code/data +* global/local symbol + +There was a discussion_ and instead of extending the current ``ENTRY/END*`` +macros, it was decided that brand new macros should be introduced instead:: + + So how about using macro names that actually show the purpose, instead + of importing all the crappy, historic, essentially randomly chosen + debug symbol macro names from the binutils and older kernels? + +.. _discussion: https://lkml.kernel.org/r/20170217104757.28588-1-jslaby@suse.cz + +Macros Description +------------------ + +The new macros are prefixed with the ``SYM_`` prefix and can be divided into +three main groups: + +1. ``SYM_FUNC_*`` -- to annotate C-like functions. This means functions with + standard C calling conventions, i.e. the stack contains a return address at + the predefined place and a return from the function can happen in a + standard way. When frame pointers are enabled, save/restore of frame + pointer shall happen at the start/end of a function, respectively, too. + + Checking tools like ``objtool`` should ensure such marked functions conform + to these rules. The tools can also easily annotate these functions with + debugging information (like *ORC data*) automatically. + +2. ``SYM_CODE_*`` -- special functions called with special stack. Be it + interrupt handlers with special stack content, trampolines, or startup + functions. + + Checking tools mostly ignore checking of these functions. But some debug + information still can be generated automatically. For correct debug data, + this code needs hints like ``UNWIND_HINT_REGS`` provided by developers. + +3. ``SYM_DATA*`` -- obviously data belonging to ``.data`` sections and not to + ``.text``. Data do not contain instructions, so they have to be treated + specially by the tools: they should not treat the bytes as instructions, + nor assign any debug information to them. 
+ +Instruction Macros +~~~~~~~~~~~~~~~~~~ +This section covers ``SYM_FUNC_*`` and ``SYM_CODE_*`` enumerated above. + +* ``SYM_FUNC_START`` and ``SYM_FUNC_START_LOCAL`` are supposed to be **the + most frequent markings**. They are used for functions with standard calling + conventions -- global and local. Like in C, they both align the functions to + architecture specific ``__ALIGN`` bytes. There are also ``_NOALIGN`` variants + for special cases where developers do not want this implicit alignment. + + ``SYM_FUNC_START_WEAK`` and ``SYM_FUNC_START_WEAK_NOALIGN`` markings are + also offered as an assembler counterpart to the *weak* attribute known from + C. + + All of these **shall** be coupled with ``SYM_FUNC_END``. First, it marks + the sequence of instructions as a function and computes its size to the + generated object file. Second, it also eases checking and processing such + object files as the tools can trivially find exact function boundaries. + + So in most cases, developers should write something like in the following + example, having some asm instructions in between the macros, of course:: + + SYM_FUNC_START(memset) + ... asm insns ... + SYM_FUNC_END(memset) + + In fact, this kind of annotation corresponds to the now deprecated ``ENTRY`` + and ``ENDPROC`` macros. + +* ``SYM_FUNC_START_ALIAS`` and ``SYM_FUNC_START_LOCAL_ALIAS`` serve for those + who decided to have two or more names for one function. The typical use is:: + + SYM_FUNC_START_ALIAS(__memset) + SYM_FUNC_START(memset) + ... asm insns ... + SYM_FUNC_END(memset) + SYM_FUNC_END_ALIAS(__memset) + + In this example, one can call ``__memset`` or ``memset`` with the same + result, except the debug information for the instructions is generated to + the object file only once -- for the non-``ALIAS`` case. + +* ``SYM_CODE_START`` and ``SYM_CODE_START_LOCAL`` should be used only in + special cases -- if you know what you are doing. 
This is used exclusively + for interrupt handlers and similar where the calling convention is not the C + one. ``_NOALIGN`` variants exist too. The use is the same as for the ``FUNC`` + category above:: + + SYM_CODE_START_LOCAL(bad_put_user) + ... asm insns ... + SYM_CODE_END(bad_put_user) + + Again, every ``SYM_CODE_START*`` **shall** be coupled by ``SYM_CODE_END``. + + To some extent, this category corresponds to deprecated ``ENTRY`` and + ``END``. Except ``END`` had several other meanings too. + +* ``SYM_INNER_LABEL*`` is used to denote a label inside some + ``SYM_{CODE,FUNC}_START`` and ``SYM_{CODE,FUNC}_END``. They are very similar + to C labels, except they can be made global. An example of use:: + + SYM_CODE_START(ftrace_caller) + /* save_mcount_regs fills in first two parameters */ + ... + + SYM_INNER_LABEL(ftrace_caller_op_ptr, SYM_L_GLOBAL) + /* Load the ftrace_ops into the 3rd parameter */ + ... + + SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) + call ftrace_stub + ... + retq + SYM_CODE_END(ftrace_caller) + +Data Macros +~~~~~~~~~~~ +Similar to instructions, there is a couple of macros to describe data in the +assembly. + +* ``SYM_DATA_START`` and ``SYM_DATA_START_LOCAL`` mark the start of some data + and shall be used in conjunction with either ``SYM_DATA_END``, or + ``SYM_DATA_END_LABEL``. The latter adds also a label to the end, so that + people can use ``lstack`` and (local) ``lstack_end`` in the following + example:: + + SYM_DATA_START_LOCAL(lstack) + .skip 4096 + SYM_DATA_END_LABEL(lstack, SYM_L_LOCAL, lstack_end) + +* ``SYM_DATA`` and ``SYM_DATA_LOCAL`` are variants for simple, mostly one-line + data:: + + SYM_DATA(HEAP, .long rm_heap) + SYM_DATA(heap_end, .long rm_stack) + + In the end, they expand to ``SYM_DATA_START`` with ``SYM_DATA_END`` + internally. + +Support Macros +~~~~~~~~~~~~~~ +All the above reduce themselves to some invocation of ``SYM_START``, +``SYM_END``, or ``SYM_ENTRY`` at last. Normally, developers should avoid using +these. 
+ +Further, in the above examples, one could see ``SYM_L_LOCAL``. There are also +``SYM_L_GLOBAL`` and ``SYM_L_WEAK``. All are intended to denote linkage of a +symbol marked by them. They are used either in ``_LABEL`` variants of the +earlier macros, or in ``SYM_START``. + + +Overriding Macros +~~~~~~~~~~~~~~~~~ +Architectures can also override any of the macros in their own +``asm/linkage.h``, including macros specifying the type of a symbol +(``SYM_T_FUNC``, ``SYM_T_OBJECT``, and ``SYM_T_NONE``). As every macro +described in this file is surrounded by ``#ifdef`` + ``#endif``, it is enough +to define the macros differently in the aforementioned architecture-dependent +header. diff --git a/Documentation/block/stat.rst b/Documentation/block/stat.rst index 9c07bc22b0bc..77311335c08b 100644 --- a/Documentation/block/stat.rst +++ b/Documentation/block/stat.rst @@ -41,6 +41,8 @@ discard I/Os requests number of discard I/Os processed discard merges requests number of discard I/Os merged with in-queue I/O discard sectors sectors number of sectors discarded discard ticks milliseconds total wait time for discard requests +flush I/Os requests number of flush I/Os processed +flush ticks milliseconds total wait time for flush requests =============== ============= ================================================= read I/Os, write I/Os, discard I/Os @@ -48,6 +50,14 @@ read I/Os, write I/Os, discard I/Os These values increment when an I/O request completes. +flush I/Os +========== + +These values increment when a flush I/O request completes. + +The block layer combines flush requests and executes at most one at a time. +This counts flush requests executed by the disk. Not tracked for partitions. + read merges, write merges, discard merges ========================================= @@ -62,8 +72,8 @@ discarded from this block device. The "sectors" in question are the standard UNIX 512-byte sectors, not any device- or filesystem-specific block size. 
The counters are incremented when the I/O completes. -read ticks, write ticks, discard ticks -====================================== +read ticks, write ticks, discard ticks, flush ticks +=================================================== These values count the number of milliseconds that I/O requests have waited on this block device. If there are multiple I/O requests waiting, diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index 801a6ed3f2e5..4f5410b61441 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -47,6 +47,15 @@ Program types prog_flow_dissector +Testing BPF +=========== + +.. toctree:: + :maxdepth: 1 + + s390 + + .. Links: .. _Documentation/networking/filter.txt: ../networking/filter.txt .. _man-pages: https://www.kernel.org/doc/man-pages/ diff --git a/Documentation/bpf/prog_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst index a78bf036cadd..4d86780ab0f1 100644 --- a/Documentation/bpf/prog_flow_dissector.rst +++ b/Documentation/bpf/prog_flow_dissector.rst @@ -142,3 +142,6 @@ BPF flow dissector doesn't support exporting all the metadata that in-kernel C-based implementation can export. Notable example is single VLAN (802.1Q) and double VLAN (802.1AD) tags. Please refer to the ``struct bpf_flow_keys`` for a set of information that can currently be exported from the BPF context. + +When BPF flow dissector is attached to the root network namespace (machine-wide +policy), users can't override it in their child network namespaces. diff --git a/Documentation/bpf/s390.rst b/Documentation/bpf/s390.rst new file mode 100644 index 000000000000..21ecb309daea --- /dev/null +++ b/Documentation/bpf/s390.rst @@ -0,0 +1,205 @@ +=================== +Testing BPF on s390 +=================== + +1. Introduction +*************** + +IBM Z are mainframe computers, which are descendants of IBM System/360 from +year 1964. They are supported by the Linux kernel under the name "s390". 
This +document describes how to test BPF in an s390 QEMU guest. + +2. One-time setup +***************** + +The following is required to build and run the test suite: + + * s390 GCC + * s390 development headers and libraries + * Clang with BPF support + * QEMU with s390 support + * Disk image with s390 rootfs + +Debian supports installing compiler and libraries for s390 out of the box. +Users of other distros may use debootstrap in order to set up a Debian chroot:: + + sudo debootstrap \ + --variant=minbase \ + --include=sudo \ + testing \ + ./s390-toolchain + sudo mount --rbind /dev ./s390-toolchain/dev + sudo mount --rbind /proc ./s390-toolchain/proc + sudo mount --rbind /sys ./s390-toolchain/sys + sudo chroot ./s390-toolchain + +Once on Debian, the build prerequisites can be installed as follows:: + + sudo dpkg --add-architecture s390x + sudo apt-get update + sudo apt-get install \ + bc \ + bison \ + cmake \ + debootstrap \ + dwarves \ + flex \ + g++ \ + gcc \ + g++-s390x-linux-gnu \ + gcc-s390x-linux-gnu \ + gdb-multiarch \ + git \ + make \ + python3 \ + qemu-system-misc \ + qemu-utils \ + rsync \ + libcap-dev:s390x \ + libelf-dev:s390x \ + libncurses-dev + +Latest Clang targeting BPF can be installed as follows:: + + git clone https://github.com/llvm/llvm-project.git + ln -s ../../clang llvm-project/llvm/tools/ + mkdir llvm-project-build + cd llvm-project-build + cmake \ + -DLLVM_TARGETS_TO_BUILD=BPF \ + -DCMAKE_BUILD_TYPE=Release \ + -DCMAKE_INSTALL_PREFIX=/opt/clang-bpf \ + ../llvm-project/llvm + make + sudo make install + export PATH=/opt/clang-bpf/bin:$PATH + +The disk image can be prepared using a loopback mount and debootstrap:: + + qemu-img create -f raw ./s390.img 1G + sudo losetup -f ./s390.img + sudo mkfs.ext4 /dev/loopX + mkdir ./s390.rootfs + sudo mount /dev/loopX ./s390.rootfs + sudo debootstrap \ + --foreign \ + --arch=s390x \ + --variant=minbase \ + --include=" \ + iproute2, \ + iputils-ping, \ + isc-dhcp-client, \ + kmod, \ + libcap2, \ + 
libelf1, \ + netcat, \ + procps" \ + testing \ + ./s390.rootfs + sudo umount ./s390.rootfs + sudo losetup -d /dev/loopX + +3. Compilation +************** + +In addition to the usual Kconfig options required to run the BPF test suite, it +is also helpful to select:: + + CONFIG_NET_9P=y + CONFIG_9P_FS=y + CONFIG_NET_9P_VIRTIO=y + CONFIG_VIRTIO_PCI=y + +as that would enable a very easy way to share files with the s390 virtual +machine. + +Compiling kernel, modules and testsuite, as well as preparing gdb scripts to +simplify debugging, can be done using the following commands:: + + make ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- menuconfig + make ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- bzImage modules scripts_gdb + make ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- \ + -C tools/testing/selftests \ + TARGETS=bpf \ + INSTALL_PATH=$PWD/tools/testing/selftests/kselftest_install \ + install + +4. Running the test suite +************************* + +The virtual machine can be started as follows:: + + qemu-system-s390x \ + -cpu max,zpci=on \ + -smp 2 \ + -m 4G \ + -kernel linux/arch/s390/boot/compressed/vmlinux \ + -drive file=./s390.img,if=virtio,format=raw \ + -nographic \ + -append 'root=/dev/vda rw console=ttyS1' \ + -virtfs local,path=./linux,security_model=none,mount_tag=linux \ + -object rng-random,filename=/dev/urandom,id=rng0 \ + -device virtio-rng-ccw,rng=rng0 \ + -netdev user,id=net0 \ + -device virtio-net-ccw,netdev=net0 + +When using this on a real IBM Z, ``-enable-kvm`` may be added for better +performance. 
When starting the virtual machine for the first time, disk image +setup must be finalized using the following command:: + + /debootstrap/debootstrap --second-stage + +Directory with the code built on the host as well as ``/proc`` and ``/sys`` +need to be mounted as follows:: + + mkdir -p /linux + mount -t 9p linux /linux + mount -t proc proc /proc + mount -t sysfs sys /sys + +After that, the test suite can be run using the following commands:: + + cd /linux/tools/testing/selftests/kselftest_install + ./run_kselftest.sh + +As usual, tests can be also run individually:: + + cd /linux/tools/testing/selftests/bpf + ./test_verifier + +5. Debugging +************ + +It is possible to debug the s390 kernel using QEMU GDB stub, which is activated +by passing ``-s`` to QEMU. + +It is preferable to turn KASLR off, so that gdb would know where to find the +kernel image in memory, by building the kernel with:: + + RANDOMIZE_BASE=n + +GDB can then be attached using the following command:: + + gdb-multiarch -ex 'target remote localhost:1234' ./vmlinux + +6. Network +********** + +In case one needs to use the network in the virtual machine in order to e.g. +install additional packages, it can be configured using:: + + dhclient eth0 + +7. 
Links +******** + +This document is a compilation of techniques, whose more comprehensive +descriptions can be found by following these links: + +- `Debootstrap <https://wiki.debian.org/EmDebian/CrossDebootstrap>`_ +- `Multiarch <https://wiki.debian.org/Multiarch/HOWTO>`_ +- `Building LLVM <https://llvm.org/docs/CMake.html>`_ +- `Cross-compiling the kernel <https://wiki.gentoo.org/wiki/Embedded_Handbook/General/Cross-compiling_the_kernel>`_ +- `QEMU s390x Guest Support <https://wiki.qemu.org/Documentation/Platforms/S390X>`_ +- `Plan 9 folder sharing over Virtio <https://wiki.qemu.org/Documentation/9psetup>`_ +- `Using GDB with QEMU <https://wiki.osdev.org/Kernel_Debugging#Use_GDB_with_QEMU>`_ diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index ecbebf4ca8e7..ea21dd4b9bad 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -79,6 +79,18 @@ has the added benefit of providing a unique identifier. On 64-bit machines the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it gathers enough entropy. If you *really* want the address see %px below. +Error Pointers +-------------- + +:: + + %pe -ENOSPC + +For printing error pointers (i.e. a pointer for which IS_ERR() is true) +as a symbolic error name. Error values for which no symbolic name is +known are printed in decimal, while a non-ERR_PTR passed as the +argument to %pe gets treated as ordinary %p. + Symbols/Function Pointers ------------------------- @@ -86,8 +98,6 @@ Symbols/Function Pointers %pS versatile_init+0x0/0x110 %ps versatile_init - %pF versatile_init+0x0/0x110 - %pf versatile_init %pSR versatile_init+0x9/0x110 (with __builtin_extract_return_addr() translation) %pB prev_fn_of_versatile_init+0x88/0x88 @@ -97,14 +107,6 @@ The ``S`` and ``s`` specifiers are used for printing a pointer in symbolic format. They result in the symbol name with (S) or without (s) offsets. 
If KALLSYMS are disabled then the symbol address is printed instead. -Note, that the ``F`` and ``f`` specifiers are identical to ``S`` (``s``) -and thus deprecated. We have ``F`` and ``f`` because on ia64, ppc64 and -parisc64 function pointers are indirect and, in fact, are function -descriptors, which require additional dereferencing before we can lookup -the symbol. As of now, ``S`` and ``s`` perform dereferencing on those -platforms (when needed), so ``F`` and ``f`` exist for compatibility -reasons only. - The ``B`` specifier results in the symbol name with offsets and should be used when printing stack backtraces. The specifier takes into consideration the effect of compiler optimisations which may occur @@ -428,6 +430,30 @@ Examples:: Passed by reference. +Fwnode handles +-------------- + +:: + + %pfw[fP] + +For printing information on fwnode handles. The default is to print the full +node name, including the path. The modifiers are functionally equivalent to +%pOF above. + + - f - full name of the node, including the path + - P - the name of the node including an address (if there is one) + +Examples (ACPI):: + + %pfwf \_SB.PCI0.CIO2.port@1.endpoint@0 - Full node name + %pfwP endpoint@0 - Node name + +Examples (OF):: + + %pfwf /ocp@68000000/i2c@48072000/camera@10/port/endpoint - Full name + %pfwP endpoint - Node name + Time and date (struct rtc_time) ------------------------------- diff --git a/Documentation/crypto/api-skcipher.rst b/Documentation/crypto/api-skcipher.rst index 20ba08dddf2e..1aaf8985894b 100644 --- a/Documentation/crypto/api-skcipher.rst +++ b/Documentation/crypto/api-skcipher.rst @@ -5,7 +5,7 @@ Block Cipher Algorithm Definitions :doc: Block Cipher Algorithm Definitions .. kernel-doc:: include/linux/crypto.h - :functions: crypto_alg ablkcipher_alg blkcipher_alg cipher_alg compress_alg + :functions: crypto_alg cipher_alg compress_alg Symmetric Key Cipher API ------------------------ @@ -33,30 +33,3 @@ Single Block Cipher API .. 
kernel-doc:: include/linux/crypto.h :functions: crypto_alloc_cipher crypto_free_cipher crypto_has_cipher crypto_cipher_blocksize crypto_cipher_setkey crypto_cipher_encrypt_one crypto_cipher_decrypt_one - -Asynchronous Block Cipher API - Deprecated ------------------------------------------- - -.. kernel-doc:: include/linux/crypto.h - :doc: Asynchronous Block Cipher API - -.. kernel-doc:: include/linux/crypto.h - :functions: crypto_free_ablkcipher crypto_has_ablkcipher crypto_ablkcipher_ivsize crypto_ablkcipher_blocksize crypto_ablkcipher_setkey crypto_ablkcipher_reqtfm crypto_ablkcipher_encrypt crypto_ablkcipher_decrypt - -Asynchronous Cipher Request Handle - Deprecated ------------------------------------------------ - -.. kernel-doc:: include/linux/crypto.h - :doc: Asynchronous Cipher Request Handle - -.. kernel-doc:: include/linux/crypto.h - :functions: crypto_ablkcipher_reqsize ablkcipher_request_set_tfm ablkcipher_request_alloc ablkcipher_request_free ablkcipher_request_set_callback ablkcipher_request_set_crypt - -Synchronous Block Cipher API - Deprecated ------------------------------------------ - -.. kernel-doc:: include/linux/crypto.h - :doc: Synchronous Block Cipher API - -.. 
kernel-doc:: include/linux/crypto.h - :functions: crypto_alloc_blkcipher crypto_free_blkcipher crypto_has_blkcipher crypto_blkcipher_name crypto_blkcipher_ivsize crypto_blkcipher_blocksize crypto_blkcipher_setkey crypto_blkcipher_encrypt crypto_blkcipher_encrypt_iv crypto_blkcipher_decrypt crypto_blkcipher_decrypt_iv crypto_blkcipher_set_iv crypto_blkcipher_get_iv diff --git a/Documentation/crypto/architecture.rst b/Documentation/crypto/architecture.rst index 3eae1ae7f798..646c3380a7ed 100644 --- a/Documentation/crypto/architecture.rst +++ b/Documentation/crypto/architecture.rst @@ -201,10 +201,6 @@ the aforementioned cipher types: - CRYPTO_ALG_TYPE_AEAD Authenticated Encryption with Associated Data (MAC) -- CRYPTO_ALG_TYPE_BLKCIPHER Synchronous multi-block cipher - -- CRYPTO_ALG_TYPE_ABLKCIPHER Asynchronous multi-block cipher - - CRYPTO_ALG_TYPE_KPP Key-agreement Protocol Primitive (KPP) such as an ECDH or DH implementation diff --git a/Documentation/crypto/crypto_engine.rst b/Documentation/crypto/crypto_engine.rst index 3baa23c2cd08..25cf9836c336 100644 --- a/Documentation/crypto/crypto_engine.rst +++ b/Documentation/crypto/crypto_engine.rst @@ -63,8 +63,6 @@ request by using: When your driver receives a crypto_request, you must transfer it to the crypto engine via one of: -* crypto_transfer_ablkcipher_request_to_engine() - * crypto_transfer_aead_request_to_engine() * crypto_transfer_akcipher_request_to_engine() @@ -75,8 +73,6 @@ the crypto engine via one of: At the end of the request process, a call to one of the following functions is needed: -* crypto_finalize_ablkcipher_request() - * crypto_finalize_aead_request() * crypto_finalize_akcipher_request() diff --git a/Documentation/crypto/devel-algos.rst b/Documentation/crypto/devel-algos.rst index c45c6f400dbd..f9d288015acc 100644 --- a/Documentation/crypto/devel-algos.rst +++ b/Documentation/crypto/devel-algos.rst @@ -128,25 +128,20 @@ process requests that are unaligned. 
This implies, however, additional overhead as the kernel crypto API needs to perform the realignment of the data which may imply moving of data. -Cipher Definition With struct blkcipher_alg and ablkcipher_alg -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Cipher Definition With struct skcipher_alg +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Struct blkcipher_alg defines a synchronous block cipher whereas struct -ablkcipher_alg defines an asynchronous block cipher. +Struct skcipher_alg defines a multi-block cipher, or more generally, a +length-preserving symmetric cipher algorithm. -Please refer to the single block cipher description for schematics of -the block cipher usage. +Scatterlist handling +~~~~~~~~~~~~~~~~~~~~ -Specifics Of Asynchronous Multi-Block Cipher -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -There are a couple of specifics to the asynchronous interface. - -First of all, some of the drivers will want to use the Generic -ScatterWalk in case the hardware needs to be fed separate chunks of the -scatterlist which contains the plaintext and will contain the -ciphertext. Please refer to the ScatterWalk interface offered by the -Linux kernel scatter / gather list implementation. +Some drivers will want to use the Generic ScatterWalk in case the +hardware needs to be fed separate chunks of the scatterlist which +contains the plaintext and will contain the ciphertext. Please refer +to the ScatterWalk interface offered by the Linux kernel scatter / +gather list implementation. Hashing [HASH] -------------- diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst index b0522a4dd107..09dee10d2592 100644 --- a/Documentation/dev-tools/index.rst +++ b/Documentation/dev-tools/index.rst @@ -24,6 +24,7 @@ whole; patches welcome! gdb-kernel-debugging kgdb kselftest + kunit/index .. 
only:: subproject and html diff --git a/Documentation/dev-tools/kunit/api/index.rst b/Documentation/dev-tools/kunit/api/index.rst new file mode 100644 index 000000000000..9b9bffe5d41a --- /dev/null +++ b/Documentation/dev-tools/kunit/api/index.rst @@ -0,0 +1,16 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +API Reference +============= +.. toctree:: + + test + +This section documents the KUnit kernel testing API. It is divided into the +following sections: + +================================= ============================================== +:doc:`test` documents all of the standard testing API + excluding mocking or mocking related features. +================================= ============================================== diff --git a/Documentation/dev-tools/kunit/api/test.rst b/Documentation/dev-tools/kunit/api/test.rst new file mode 100644 index 000000000000..aaa97f17e5b3 --- /dev/null +++ b/Documentation/dev-tools/kunit/api/test.rst @@ -0,0 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======== +Test API +======== + +This file documents all of the standard testing API excluding mocking or mocking +related features. + +.. kernel-doc:: include/kunit/test.h + :internal: diff --git a/Documentation/dev-tools/kunit/faq.rst b/Documentation/dev-tools/kunit/faq.rst new file mode 100644 index 000000000000..bf2095112d89 --- /dev/null +++ b/Documentation/dev-tools/kunit/faq.rst @@ -0,0 +1,62 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +Frequently Asked Questions +========================== + +How is this different from Autotest, kselftest, etc? +==================================================== +KUnit is a unit testing framework. Autotest, kselftest (and some others) are +not. + +A `unit test <https://martinfowler.com/bliki/UnitTest.html>`_ is supposed to +test a single unit of code in isolation, hence the name. 
A unit test should be +the finest granularity of testing and as such should allow all possible code +paths to be tested in the code under test; this is only possible if the code +under test is very small and does not have any external dependencies outside of +the test's control like hardware. + +There are no testing frameworks currently available for the kernel that do not +require installing the kernel on a test machine or in a VM and all require +tests to be written in userspace and run on the kernel under test; this is true +for Autotest, kselftest, and some others, disqualifying any of them from being +considered unit testing frameworks. + +Does KUnit support running on architectures other than UML? +=========================================================== + +Yes, well, mostly. + +For the most part, the KUnit core framework (what you use to write the tests) +can compile to any architecture; it compiles like just another part of the +kernel and runs when the kernel boots. However, there is some infrastructure, +like the KUnit Wrapper (``tools/testing/kunit/kunit.py``) that does not support +other architectures. + +In short, this means that, yes, you can run KUnit on other architectures, but +it might require more work than using KUnit on UML. + +For more information, see :ref:`kunit-on-non-uml`. + +What is the difference between a unit test and these other kinds of tests? +========================================================================== +Most existing tests for the Linux kernel would be categorized as an integration +test, or an end-to-end test. + +- A unit test is supposed to test a single unit of code in isolation, hence the + name. A unit test should be the finest granularity of testing and as such + should allow all possible code paths to be tested in the code under test; this + is only possible if the code under test is very small and does not have any + external dependencies outside of the test's control like hardware. 
+- An integration test tests the interaction between a minimal set of components, + usually just two or three. For example, someone might write an integration + test to test the interaction between a driver and a piece of hardware, or to + test the interaction between the userspace libraries the kernel provides and + the kernel itself; however, one of these tests would probably not test the + entire kernel along with hardware interactions and interactions with the + userspace. +- An end-to-end test usually tests the entire system from the perspective of the + code under test. For example, someone might write an end-to-end test for the + kernel by installing a production configuration of the kernel on production + hardware with a production userspace and then trying to exercise some behavior + that depends on interactions between the hardware, the kernel, and userspace. diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst new file mode 100644 index 000000000000..26ffb46bdf99 --- /dev/null +++ b/Documentation/dev-tools/kunit/index.rst @@ -0,0 +1,79 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +KUnit - Unit Testing for the Linux Kernel +========================================= + +.. toctree:: + :maxdepth: 2 + + start + usage + api/index + faq + +What is KUnit? +============== + +KUnit is a lightweight unit testing and mocking framework for the Linux kernel. +These tests are able to be run locally on a developer's workstation without a VM +or special hardware. + +KUnit is heavily inspired by JUnit, Python's unittest.mock, and +Googletest/Googlemock for C++. KUnit provides facilities for defining unit test +cases, grouping related test cases into test suites, providing common +infrastructure for running tests, and much more. + +Get started now: :doc:`start` + +Why KUnit? +========== + +A unit test is supposed to test a single unit of code in isolation, hence the +name. 
A unit test should be the finest granularity of testing and as such should +allow all possible code paths to be tested in the code under test; this is only +possible if the code under test is very small and does not have any external +dependencies outside of the test's control like hardware. + +Outside of KUnit, there are no testing frameworks currently +available for the kernel that do not require installing the kernel on a test +machine or in a VM and all require tests to be written in userspace running on +the kernel; this is true for Autotest, and kselftest, disqualifying +any of them from being considered unit testing frameworks. + +KUnit addresses the problem of being able to run tests without needing a virtual +machine or actual hardware with User Mode Linux. User Mode Linux is a Linux +architecture, like ARM or x86; however, unlike other architectures it compiles +to a standalone program that can be run like any other program directly inside +of a host operating system; to be clear, it does not require any virtualization +support; it is just a regular program. + +KUnit is fast. Excluding build time, from invocation to completion KUnit can run +several dozen tests in only 10 to 20 seconds; this might not sound like a big +deal to some people, but having such fast and easy to run tests fundamentally +changes the way you go about testing and even writing code in the first place. +Linus himself said in his `git talk at Google +<https://gist.github.com/lorn/1272686/revisions#diff-53c65572127855f1b003db4064a94573R874>`_: + + "... a lot of people seem to think that performance is about doing the + same thing, just doing it faster, and that is not true. That is not what + performance is all about. If you can do something really fast, really + well, people will start using it differently." + +In this context Linus was talking about branching and merging, +but this point also applies to testing. 
If your tests are slow, unreliable, +difficult to write, and require a special setup or special hardware to run, +then you wait a lot longer to write tests, and you wait a lot longer to run +tests; this means that tests are likely to break, unlikely to test a lot of +things, and are unlikely to be rerun once they pass. If your tests are really +fast, you run them all the time, every time you make a change, and every time +someone sends you some code. Why trust that someone ran all their tests +correctly on every change when you can just run them yourself in less time than +it takes to read their test log? + +How do I use it? +================ + +* :doc:`start` - for new users of KUnit +* :doc:`usage` - for a more detailed explanation of KUnit features +* :doc:`api/index` - for the list of KUnit APIs used for testing diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst new file mode 100644 index 000000000000..aeeddfafeea2 --- /dev/null +++ b/Documentation/dev-tools/kunit/start.rst @@ -0,0 +1,180 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Getting Started +=============== + +Installing dependencies +======================= +KUnit has the same dependencies as the Linux kernel. As long as you can build +the kernel, you can run KUnit. + +KUnit Wrapper +============= +Included with KUnit is a simple Python wrapper that makes KUnit output easier +to use and read. It handles building and running the kernel, as +well as formatting the output. + +The wrapper can be run with: + +.. code-block:: bash + + ./tools/testing/kunit/kunit.py run + +Creating a kunitconfig +====================== +The Python script is a thin wrapper around Kbuild; as such, it needs to be +configured with a ``kunitconfig`` file. This file essentially contains the +regular kernel config, with the specific test targets as well. + +..
code-block:: bash + + git clone -b master https://kunit.googlesource.com/kunitconfig $PATH_TO_KUNITCONFIG_REPO + cd $PATH_TO_LINUX_REPO + ln -s $PATH_TO_KUNITCONFIG_REPO/kunitconfig kunitconfig + +You may want to add kunitconfig to your local gitignore. + +Verifying KUnit Works +--------------------- + +To make sure that everything is set up correctly, simply invoke the Python +wrapper from your kernel repo: + +.. code-block:: bash + + ./tools/testing/kunit/kunit.py run + +.. note:: + You may want to run ``make mrproper`` first. + +If everything worked correctly, you should see the following: + +.. code-block:: bash + + Generating .config ... + Building KUnit Kernel ... + Starting KUnit Kernel ... + +followed by a list of tests that are run. All of them should be passing. + +.. note:: + Because it is building a lot of sources for the first time, the ``Building + KUnit Kernel`` step may take a while. + +Writing your first test +======================= + +In your kernel repo let's add some code that we can test. Create a file +``drivers/misc/example.h`` with the contents: + +.. code-block:: c + + int misc_example_add(int left, int right); + +Then create a file ``drivers/misc/example.c``: + +.. code-block:: c + + #include <linux/errno.h> + + #include "example.h" + + int misc_example_add(int left, int right) + { + return left + right; + } + +Now add the following lines to ``drivers/misc/Kconfig``: + +.. code-block:: kconfig + + config MISC_EXAMPLE + bool "My example" + +and the following lines to ``drivers/misc/Makefile``: + +.. code-block:: make + + obj-$(CONFIG_MISC_EXAMPLE) += example.o + +Now we are ready to write the test. The test will be in +``drivers/misc/example-test.c``: + +.. code-block:: c + + #include <kunit/test.h> + #include "example.h" + + /* Define the test cases.
*/ + + static void misc_example_add_test_basic(struct kunit *test) + { + KUNIT_EXPECT_EQ(test, 1, misc_example_add(1, 0)); + KUNIT_EXPECT_EQ(test, 2, misc_example_add(1, 1)); + KUNIT_EXPECT_EQ(test, 0, misc_example_add(-1, 1)); + KUNIT_EXPECT_EQ(test, INT_MAX, misc_example_add(0, INT_MAX)); + KUNIT_EXPECT_EQ(test, -1, misc_example_add(INT_MAX, INT_MIN)); + } + + static void misc_example_test_failure(struct kunit *test) + { + KUNIT_FAIL(test, "This test never passes."); + } + + static struct kunit_case misc_example_test_cases[] = { + KUNIT_CASE(misc_example_add_test_basic), + KUNIT_CASE(misc_example_test_failure), + {} + }; + + static struct kunit_suite misc_example_test_suite = { + .name = "misc-example", + .test_cases = misc_example_test_cases, + }; + kunit_test_suite(misc_example_test_suite); + +Now add the following to ``drivers/misc/Kconfig``: + +.. code-block:: kconfig + + config MISC_EXAMPLE_TEST + bool "Test for my example" + depends on MISC_EXAMPLE && KUNIT + +and the following to ``drivers/misc/Makefile``: + +.. code-block:: make + + obj-$(CONFIG_MISC_EXAMPLE_TEST) += example-test.o + +Now add it to your ``kunitconfig``: + +.. code-block:: none + + CONFIG_MISC_EXAMPLE=y + CONFIG_MISC_EXAMPLE_TEST=y + +Now you can run the test: + +.. code-block:: bash + + ./tools/testing/kunit/kunit.py run + +You should see the following failure: + +.. code-block:: none + + ... + [16:08:57] [PASSED] misc-example:misc_example_add_test_basic + [16:08:57] [FAILED] misc-example:misc_example_test_failure + [16:08:57] EXPECTATION FAILED at drivers/misc/example-test.c:17 + [16:08:57] This test never passes. + ... + +Congrats! You just wrote your first KUnit test! + +Next Steps +========== +* Check out the :doc:`usage` page for a more + in-depth explanation of KUnit.
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst new file mode 100644 index 000000000000..c6e69634e274 --- /dev/null +++ b/Documentation/dev-tools/kunit/usage.rst @@ -0,0 +1,576 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========== +Using KUnit +=========== + +The purpose of this document is to describe what KUnit is, how it works, how it +is intended to be used, and all the concepts and terminology that are needed to +understand it. This guide assumes a working knowledge of the Linux kernel and +some basic knowledge of testing. + +For a high level introduction to KUnit, including setting up KUnit for your +project, see :doc:`start`. + +Organization of this document +============================= + +This document is organized into two main sections: Testing and Isolating +Behavior. The first covers what a unit test is and how to use KUnit to write +them. The second covers how to use KUnit to isolate code and make it possible +to unit test code that was otherwise un-unit-testable. + +Testing +======= + +What is KUnit? +-------------- + +"K" is short for "kernel" so "KUnit" is the "(Linux) Kernel Unit Testing +Framework." KUnit is intended first and foremost for writing unit tests; it is +general enough that it can be used to write integration tests; however, this is +a secondary goal. KUnit has no ambition of being the only testing framework for +the kernel; for example, it does not intend to be an end-to-end testing +framework. + +What is Unit Testing? +--------------------- + +A `unit test <https://martinfowler.com/bliki/UnitTest.html>`_ is a test that +tests code at the smallest possible scope, a *unit* of code. In the C +programming language that's a function. + +Unit tests should be written for all the publicly exposed functions in a +compilation unit; so that is all the functions that are exported in either a +*class* (defined below) or all functions which are **not** static. 
+ +Writing Tests +------------- + +Test Cases +~~~~~~~~~~ + +The fundamental unit in KUnit is the test case. A test case is a function with +the signature ``void (*)(struct kunit *test)``. It calls a function to be tested +and then sets *expectations* for what should happen. For example: + +.. code-block:: c + + void example_test_success(struct kunit *test) + { + } + + void example_test_failure(struct kunit *test) + { + KUNIT_FAIL(test, "This test never passes."); + } + +In the above example ``example_test_success`` always passes because it does +nothing; no expectations are set, so all expectations pass. On the other hand +``example_test_failure`` always fails because it calls ``KUNIT_FAIL``, which is +a special expectation that logs a message and causes the test case to fail. + +Expectations +~~~~~~~~~~~~ +An *expectation* is a way to specify that you expect a piece of code to do +something in a test. An expectation is called like a function. A test is made +by setting expectations about the behavior of a piece of code under test; when +one or more of the expectations fail, the test case fails and information about +the failure is logged. For example: + +.. code-block:: c + + void add_test_basic(struct kunit *test) + { + KUNIT_EXPECT_EQ(test, 1, add(1, 0)); + KUNIT_EXPECT_EQ(test, 2, add(1, 1)); + } + +In the above example ``add_test_basic`` makes a number of assertions about the +behavior of a function called ``add``; the first parameter is always of type +``struct kunit *``, which contains information about the current test context; +the second parameter, in this case, is what the value is expected to be; the +last value is what the value actually is. If ``add`` passes all of these +expectations, the test case, ``add_test_basic`` will pass; if any one of these +expectations fail, the test case will fail. 
+ +It is important to understand that a test case *fails* when any expectation is +violated; however, the test will continue running, potentially trying other +expectations until the test case ends or is otherwise terminated. This is as +opposed to *assertions* which are discussed later. + +To learn about more expectations supported by KUnit, see :doc:`api/test`. + +.. note:: + A single test case should be pretty short, pretty easy to understand, and + focused on a single behavior. + +For example, if we wanted to properly test the add function above, we would +create additional test cases which would each test a different property that an +add function should have like this: + +.. code-block:: c + + void add_test_basic(struct kunit *test) + { + KUNIT_EXPECT_EQ(test, 1, add(1, 0)); + KUNIT_EXPECT_EQ(test, 2, add(1, 1)); + } + + void add_test_negative(struct kunit *test) + { + KUNIT_EXPECT_EQ(test, 0, add(-1, 1)); + } + + void add_test_max(struct kunit *test) + { + KUNIT_EXPECT_EQ(test, INT_MAX, add(0, INT_MAX)); + KUNIT_EXPECT_EQ(test, -1, add(INT_MAX, INT_MIN)); + } + + void add_test_overflow(struct kunit *test) + { + KUNIT_EXPECT_EQ(test, INT_MIN, add(INT_MAX, 1)); + } + +Notice how it is immediately obvious which properties we are testing +for. + +Assertions +~~~~~~~~~~ + +KUnit also has the concept of an *assertion*. An assertion is just like an +expectation except the assertion immediately terminates the test case if it is +not satisfied. + +For example: + +..
code-block:: c + + static void mock_test_do_expect_default_return(struct kunit *test) + { + struct mock_test_context *ctx = test->priv; + struct mock *mock = ctx->mock; + int param0 = 5, param1 = -5; + const char *two_param_types[] = {"int", "int"}; + const void *two_params[] = {&param0, &param1}; + const void *ret; + + ret = mock->do_expect(mock, + "test_printk", test_printk, + two_param_types, two_params, + ARRAY_SIZE(two_params)); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ret); + KUNIT_EXPECT_EQ(test, -4, *((int *) ret)); + } + +In this example, the method under test should return a pointer to a value, so +if the pointer returned by the method is null or an errno, we don't want to +bother continuing the test since the following expectation could crash the test +case. ``KUNIT_ASSERT_NOT_ERR_OR_NULL(...)`` allows us to bail out of the test case if +the appropriate conditions have not been satisfied to complete the test. + +Test Suites +~~~~~~~~~~~ + +Now obviously one unit test isn't very helpful; the power comes from having +many test cases covering all of your behaviors. Consequently it is common to +have many *similar* tests; in order to reduce duplication in these closely +related tests most unit testing frameworks provide the concept of a *test +suite*, and KUnit uses the same name; it is just a collection of +test cases for a unit of code with a set up function that gets invoked before +every test case and then a tear down function that gets invoked after every +test case completes. + +Example: + +..
code-block:: c + + static struct kunit_case example_test_cases[] = { + KUNIT_CASE(example_test_foo), + KUNIT_CASE(example_test_bar), + KUNIT_CASE(example_test_baz), + {} + }; + + static struct kunit_suite example_test_suite = { + .name = "example", + .init = example_test_init, + .exit = example_test_exit, + .test_cases = example_test_cases, + }; + kunit_test_suite(example_test_suite); + +In the above example the test suite, ``example_test_suite``, would run the test +cases ``example_test_foo``, ``example_test_bar``, and ``example_test_baz``; +each would have ``example_test_init`` called immediately before it and would +have ``example_test_exit`` called immediately after it. +``kunit_test_suite(example_test_suite)`` registers the test suite with the +KUnit test framework. + +.. note:: + A test case will only be run if it is associated with a test suite. + +For more information on these types of things, see :doc:`api/test`. + +Isolating Behavior +================== + +The most important aspect of unit testing that other forms of testing do not +provide is the ability to limit the amount of code under test to a single unit. +In practice, this is only possible by being able to control what code gets run +when the unit under test calls a function and this is usually accomplished +through some sort of indirection where a function is exposed as part of an API +such that the definition of that function can be changed without affecting the +rest of the code base. In the kernel this primarily comes from two constructs: +classes, which are structs that contain function pointers provided by the +implementer, and architecture-specific functions, which have definitions selected +at compile time. + +Classes +------- + +Classes are not a construct that is built into the C programming language; +however, it is an easily derived concept.
Accordingly, pretty much every project +that does not use a standardized object oriented library (like GNOME's GObject) +has their own slightly different way of doing object oriented programming; the +Linux kernel is no exception. + +The central concept in kernel object oriented programming is the class. In the +kernel, a *class* is a struct that contains function pointers. This creates a +contract between *implementers* and *users* since it forces them to use the +same function signature without having to call the function directly. In order +for it to truly be a class, the function pointers must specify that a pointer +to the class, known as a *class handle*, be one of the parameters; this makes +it possible for the member functions (also known as *methods*) to have access +to member variables (more commonly known as *fields*) allowing the same +implementation to have multiple *instances*. + +Typically a class can be *overridden* by *child classes* by embedding the +*parent class* in the child class. Then when a method provided by the child +class is called, the child implementation knows that the pointer passed to it is +of a parent contained within the child; because of this, the child can compute +the pointer to itself because the pointer to the parent is always a fixed offset +from the pointer to the child; this offset is the offset of the parent contained +in the child struct. For example: + +.. 
code-block:: c + + struct shape { + int (*area)(struct shape *this); + }; + + struct rectangle { + struct shape parent; + int length; + int width; + }; + + int rectangle_area(struct shape *this) + { + struct rectangle *self = container_of(this, struct rectangle, parent); + + return self->length * self->width; + } + + void rectangle_new(struct rectangle *self, int length, int width) + { + self->parent.area = rectangle_area; + self->length = length; + self->width = width; + } + +In this example (as in most kernel code) the operation of computing the pointer +to the child from the pointer to the parent is done by ``container_of``. + +Faking Classes +~~~~~~~~~~~~~~ + +In order to unit test a piece of code that calls a method in a class, the +behavior of the method must be controllable, otherwise the test ceases to be a +unit test and becomes an integration test. + +A fake just provides an implementation of a piece of code that is different from +what runs in a production instance, but behaves identically from the standpoint +of the callers; this is usually done to replace a dependency that is hard to +deal with, or is slow. + +A good example for this might be implementing a fake EEPROM that just stores the +"contents" in an internal buffer. For example, let's assume we have a class that +represents an EEPROM: + +.. code-block:: c + + struct eeprom { + ssize_t (*read)(struct eeprom *this, size_t offset, char *buffer, size_t count); + ssize_t (*write)(struct eeprom *this, size_t offset, const char *buffer, size_t count); + }; + +And we want to test some code that buffers writes to the EEPROM: + +.. code-block:: c + + struct eeprom_buffer { + ssize_t (*write)(struct eeprom_buffer *this, const char *buffer, size_t count); + int (*flush)(struct eeprom_buffer *this); + size_t flush_count; /* Flushes when buffer exceeds flush_count.
*/ + }; + + struct eeprom_buffer *new_eeprom_buffer(struct eeprom *eeprom); + void destroy_eeprom_buffer(struct eeprom_buffer *eeprom_buffer); + +We can easily test this code by *faking out* the underlying EEPROM: + +.. code-block:: c + + struct fake_eeprom { + struct eeprom parent; + char contents[FAKE_EEPROM_CONTENTS_SIZE]; + }; + + ssize_t fake_eeprom_read(struct eeprom *parent, size_t offset, char *buffer, size_t count) + { + struct fake_eeprom *this = container_of(parent, struct fake_eeprom, parent); + + count = min(count, FAKE_EEPROM_CONTENTS_SIZE - offset); + memcpy(buffer, this->contents + offset, count); + + return count; + } + + ssize_t fake_eeprom_write(struct eeprom *parent, size_t offset, const char *buffer, size_t count) + { + struct fake_eeprom *this = container_of(parent, struct fake_eeprom, parent); + + count = min(count, FAKE_EEPROM_CONTENTS_SIZE - offset); + memcpy(this->contents + offset, buffer, count); + + return count; + } + + void fake_eeprom_init(struct fake_eeprom *this) + { + this->parent.read = fake_eeprom_read; + this->parent.write = fake_eeprom_write; + memset(this->contents, 0, FAKE_EEPROM_CONTENTS_SIZE); + } + +We can now use it to test ``struct eeprom_buffer``: + +..
code-block:: c + + struct eeprom_buffer_test { + struct fake_eeprom *fake_eeprom; + struct eeprom_buffer *eeprom_buffer; + }; + + static void eeprom_buffer_test_does_not_write_until_flush(struct kunit *test) + { + struct eeprom_buffer_test *ctx = test->priv; + struct eeprom_buffer *eeprom_buffer = ctx->eeprom_buffer; + struct fake_eeprom *fake_eeprom = ctx->fake_eeprom; + char buffer[] = {0xff}; + + eeprom_buffer->flush_count = SIZE_MAX; + + eeprom_buffer->write(eeprom_buffer, buffer, 1); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[0], 0); + + eeprom_buffer->write(eeprom_buffer, buffer, 1); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[1], 0); + + eeprom_buffer->flush(eeprom_buffer); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[0], 0xff); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[1], 0xff); + } + + static void eeprom_buffer_test_flushes_after_flush_count_met(struct kunit *test) + { + struct eeprom_buffer_test *ctx = test->priv; + struct eeprom_buffer *eeprom_buffer = ctx->eeprom_buffer; + struct fake_eeprom *fake_eeprom = ctx->fake_eeprom; + char buffer[] = {0xff}; + + eeprom_buffer->flush_count = 2; + + eeprom_buffer->write(eeprom_buffer, buffer, 1); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[0], 0); + + eeprom_buffer->write(eeprom_buffer, buffer, 1); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[0], 0xff); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[1], 0xff); + } + + static void eeprom_buffer_test_flushes_increments_of_flush_count(struct kunit *test) + { + struct eeprom_buffer_test *ctx = test->priv; + struct eeprom_buffer *eeprom_buffer = ctx->eeprom_buffer; + struct fake_eeprom *fake_eeprom = ctx->fake_eeprom; + char buffer[] = {0xff, 0xff}; + + eeprom_buffer->flush_count = 2; + + eeprom_buffer->write(eeprom_buffer, buffer, 1); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[0], 0); + + eeprom_buffer->write(eeprom_buffer, buffer, 2); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[0], 0xff); + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[1], 0xff); + /* 
Should have only flushed the first two bytes. */ + KUNIT_EXPECT_EQ(test, fake_eeprom->contents[2], 0); + } + + static int eeprom_buffer_test_init(struct kunit *test) + { + struct eeprom_buffer_test *ctx; + + ctx = kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ctx); + + ctx->fake_eeprom = kunit_kzalloc(test, sizeof(*ctx->fake_eeprom), GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ctx->fake_eeprom); + fake_eeprom_init(ctx->fake_eeprom); + + ctx->eeprom_buffer = new_eeprom_buffer(&ctx->fake_eeprom->parent); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ctx->eeprom_buffer); + + test->priv = ctx; + + return 0; + } + + static void eeprom_buffer_test_exit(struct kunit *test) + { + struct eeprom_buffer_test *ctx = test->priv; + + destroy_eeprom_buffer(ctx->eeprom_buffer); + } + +.. _kunit-on-non-uml: + +KUnit on non-UML architectures +============================== + +By default KUnit uses UML as a way to provide dependencies for code under test. +Under most circumstances KUnit's usage of UML should be treated as an +implementation detail of how KUnit works under the hood. Nevertheless, there +are instances where being able to run architecture specific code, or test +against real hardware is desirable. For these reasons KUnit supports running on +other architectures. + +Running existing KUnit tests on non-UML architectures +----------------------------------------------------- + +There are some special considerations when running existing KUnit tests on +non-UML architectures: + +* Hardware may not be deterministic, so a test that always passes or fails + when run under UML may not always do so on real hardware. +* Hardware and VM environments may not be hermetic. KUnit tries its best to + provide a hermetic environment to run tests; however, it cannot manage state + that it doesn't know about outside of the kernel. Consequently, tests that + may be hermetic on UML may not be hermetic on other architectures. 
+* Some features and tooling may not be supported outside of UML. +* Hardware and VMs are slower than UML. + +None of these are reasons not to run your KUnit tests on real hardware; they are +only things to be aware of when doing so. + +The biggest impediment will likely be that certain KUnit features and +infrastructure may not support your target environment. For example, at this +time the KUnit Wrapper (``tools/testing/kunit/kunit.py``) does not work outside +of UML. Unfortunately, there is no way around this. Using UML (or even just a +particular architecture) allows us to make a lot of assumptions that make it +possible to do things which might otherwise be impossible. + +Nevertheless, all core KUnit framework features are fully supported on all +architectures, and using them is straightforward: all you need to do is to take +your kunitconfig, your Kconfig options for the tests you would like to run, and +merge them into whatever config you are using for your platform. That's it! + +For example, let's say you have the following kunitconfig: + +.. code-block:: none + + CONFIG_KUNIT=y + CONFIG_KUNIT_EXAMPLE_TEST=y + +If you wanted to run this test on an x86 VM, you might add the following config +options to your ``.config``: + +.. code-block:: none + + CONFIG_KUNIT=y + CONFIG_KUNIT_EXAMPLE_TEST=y + CONFIG_SERIAL_8250=y + CONFIG_SERIAL_8250_CONSOLE=y + +All these new options do is enable support for a common serial console needed +for logging. + +Next, you could build a kernel with these tests as follows: + + +.. code-block:: bash + + make ARCH=x86 olddefconfig + make ARCH=x86 + +Once you have built a kernel, you could run it on QEMU as follows: + +.. code-block:: bash + + qemu-system-x86_64 -enable-kvm \ + -m 1024 \ + -kernel arch/x86_64/boot/bzImage \ + -append 'console=ttyS0' \ + --nographic + +Interspersed in the kernel logs you might see the following: + +..
code-block:: none + + TAP version 14 + # Subtest: example + 1..1 + # example_simple_test: initializing + ok 1 - example_simple_test + ok 1 - example + +Congratulations, you just ran a KUnit test on the x86 architecture! + +Writing new tests for other architectures +----------------------------------------- + +The first thing you must do is ask yourself whether it is necessary to write a +KUnit test for a specific architecture, and then whether it is necessary to +write that test for a particular piece of hardware. In general, writing a test +that depends on having access to a particular piece of hardware or software (not +included in the Linux source repo) should be avoided at all costs. + +Even if you only ever plan on running your KUnit test on your hardware +configuration, other people may want to run your tests and may not have access +to your hardware. If you write your test to run on UML, then anyone can run your +tests without knowing anything about your particular setup, and you can still +run your tests on your hardware setup just by compiling for your architecture. + +.. important:: + Always prefer tests that run on UML to tests that only run under a particular + architecture, and always prefer tests that run under QEMU or another easy + (and monetarily free) to obtain software environment to a specific piece of + hardware. + +Nevertheless, there are still valid reasons to write an architecture or hardware +specific test: for example, you might want to test some code that really belongs +in ``arch/some-arch/*``. Even so, try your best to write the test so that it +does not depend on physical hardware: if some of your test cases don't need the +hardware, only require the hardware for tests that actually need it. + +Now that you have narrowed down exactly what bits are hardware specific, the +actual procedure for writing and running the tests is pretty much the same as +writing normal KUnit tests.
One special caveat is that you have to reset +hardware state in between test cases; if this is not possible, you may only be +able to run one test case per invocation. + +.. TODO(brendanhiggins@google.com): Add an actual example of an architecture + dependent KUnit test. diff --git a/Documentation/devicetree/bindings/arm/coresight.txt b/Documentation/devicetree/bindings/arm/coresight.txt index fcc3bacfd8bc..d02c42d21f2f 100644 --- a/Documentation/devicetree/bindings/arm/coresight.txt +++ b/Documentation/devicetree/bindings/arm/coresight.txt @@ -87,6 +87,15 @@ its hardware characteristcs. * port or ports: see "Graph bindings for Coresight" below. +* Optional properties for all components: + + * arm,coresight-loses-context-with-cpu : boolean. Indicates that the + hardware will lose register context on CPU power down (e.g. CPUIdle). + An example of where this may be needed are systems which contain a + coresight component and CPU in the same power domain. When the CPU + powers down the coresight component also powers down and loses its + context. This property is currently only used for the ETM 4.x driver. 
+ * Optional properties for ETM/PTMs: * arm,cp14: must be present if the system accesses ETM/PTM management diff --git a/Documentation/devicetree/bindings/arm/omap/omap.txt b/Documentation/devicetree/bindings/arm/omap/omap.txt index b301f753ed2c..e77635c5422c 100644 --- a/Documentation/devicetree/bindings/arm/omap/omap.txt +++ b/Documentation/devicetree/bindings/arm/omap/omap.txt @@ -43,7 +43,7 @@ SoC Families: - OMAP2 generic - defaults to OMAP2420 compatible = "ti,omap2" -- OMAP3 generic - defaults to OMAP3430 +- OMAP3 generic compatible = "ti,omap3" - OMAP4 generic - defaults to OMAP4430 compatible = "ti,omap4" @@ -51,6 +51,8 @@ SoC Families: compatible = "ti,omap5" - DRA7 generic - defaults to DRA742 compatible = "ti,dra7" +- AM33x generic + compatible = "ti,am33xx" - AM43x generic - defaults to AM4372 compatible = "ti,am43" @@ -63,12 +65,14 @@ SoCs: - OMAP3430 compatible = "ti,omap3430", "ti,omap3" + legacy: "ti,omap34xx" - please do not use any more - AM3517 compatible = "ti,am3517", "ti,omap3" - OMAP3630 - compatible = "ti,omap36xx", "ti,omap3" -- AM33xx - compatible = "ti,am33xx", "ti,omap3" + compatible = "ti,omap3630", "ti,omap3" + legacy: "ti,omap36xx" - please do not use any more +- AM335x + compatible = "ti,am33xx" - OMAP4430 compatible = "ti,omap4430", "ti,omap4" @@ -110,19 +114,19 @@ SoCs: - AM4372 compatible = "ti,am4372", "ti,am43" -Boards: +Boards (incomplete list of examples): - OMAP3 BeagleBoard : Low cost community board - compatible = "ti,omap3-beagle", "ti,omap3" + compatible = "ti,omap3-beagle", "ti,omap3430", "ti,omap3" - OMAP3 Tobi with Overo : Commercial expansion board with daughter board - compatible = "gumstix,omap3-overo-tobi", "gumstix,omap3-overo", "ti,omap3" + compatible = "gumstix,omap3-overo-tobi", "gumstix,omap3-overo", "ti,omap3430", "ti,omap3" - OMAP4 SDP : Software Development Board - compatible = "ti,omap4-sdp", "ti,omap4430" + compatible = "ti,omap4-sdp", "ti,omap4430", "ti,omap4" - OMAP4 PandaBoard : Low cost community 
board - compatible = "ti,omap4-panda", "ti,omap4430" + compatible = "ti,omap4-panda", "ti,omap4430", "ti,omap4" - OMAP4 DuoVero with Parlor : Commercial expansion board with daughter board compatible = "gumstix,omap4-duovero-parlor", "gumstix,omap4-duovero", "ti,omap4430", "ti,omap4"; @@ -134,16 +138,16 @@ Boards: compatible = "variscite,var-dvk-om44", "variscite,var-som-om44", "ti,omap4460", "ti,omap4"; - OMAP3 EVM : Software Development Board for OMAP35x, AM/DM37x - compatible = "ti,omap3-evm", "ti,omap3" + compatible = "ti,omap3-evm", "ti,omap3630", "ti,omap3" - AM335X EVM : Software Development Board for AM335x - compatible = "ti,am335x-evm", "ti,am33xx", "ti,omap3" + compatible = "ti,am335x-evm", "ti,am33xx" - AM335X Bone : Low cost community board - compatible = "ti,am335x-bone", "ti,am33xx", "ti,omap3" + compatible = "ti,am335x-bone", "ti,am33xx" - AM3359 ICEv2 : Low cost Industrial Communication Engine EVM. - compatible = "ti,am3359-icev2", "ti,am33xx", "ti,omap3" + compatible = "ti,am3359-icev2", "ti,am33xx" - AM335X OrionLXm : Substation Automation Platform compatible = "novatech,am335x-lxm", "ti,am33xx" diff --git a/Documentation/devicetree/bindings/counter/ti-eqep.yaml b/Documentation/devicetree/bindings/counter/ti-eqep.yaml new file mode 100644 index 000000000000..85f1ff83afe7 --- /dev/null +++ b/Documentation/devicetree/bindings/counter/ti-eqep.yaml @@ -0,0 +1,50 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/counter/ti-eqep.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Texas Instruments Enhanced Quadrature Encoder Pulse (eQEP) Module + +maintainers: + - David Lechner <david@lechnology.com> + +properties: + compatible: + const: ti,am3352-eqep + + reg: + maxItems: 1 + + interrupts: + description: The eQEP event interrupt + maxItems: 1 + + clocks: + description: The clock that determines the SYSCLKOUT rate for the eQEP + peripheral. 
+ maxItems: 1 + + clock-names: + const: sysclkout + +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + +additionalProperties: false + +examples: + - | + eqep0: counter@180 { + compatible = "ti,am3352-eqep"; + reg = <0x180 0x80>; + clocks = <&l4ls_gclk>; + clock-names = "sysclkout"; + interrupts = <79>; + }; + +... diff --git a/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt b/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt index 0c38e4b8fc51..1758051798fe 100644 --- a/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt +++ b/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt @@ -15,12 +15,16 @@ In 'cpus' nodes: In 'operating-points-v2' table: - compatible: Should be - - 'operating-points-v2-ti-cpu' for am335x, am43xx, and dra7xx/am57xx SoCs + - 'operating-points-v2-ti-cpu' for am335x, am43xx, and dra7xx/am57xx, + omap34xx, omap36xx and am3517 SoCs - syscon: A phandle pointing to a syscon node representing the control module register space of the SoC. Optional properties: -------------------- +- "vdd-supply", "vbb-supply": to define two regulators for dra7xx +- "cpu0-supply", "vbb-supply": to define two regulators for omap36xx + For each opp entry in 'operating-points-v2' table: - opp-supported-hw: Two bitfields indicating: 1. 
Which revision of the SoC the OPP is supported by diff --git a/Documentation/devicetree/bindings/crypto/allwinner,sun8i-ss.yaml b/Documentation/devicetree/bindings/crypto/allwinner,sun8i-ss.yaml new file mode 100644 index 000000000000..8a29d36edf26 --- /dev/null +++ b/Documentation/devicetree/bindings/crypto/allwinner,sun8i-ss.yaml @@ -0,0 +1,60 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/crypto/allwinner,sun8i-ss.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Allwinner Security System v2 driver + +maintainers: + - Corentin Labbe <corentin.labbe@gmail.com> + +properties: + compatible: + enum: + - allwinner,sun8i-a83t-crypto + - allwinner,sun9i-a80-crypto + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + items: + - description: Bus clock + - description: Module clock + + clock-names: + items: + - const: bus + - const: mod + + resets: + maxItems: 1 + +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + - resets + +additionalProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/sun8i-a83t-ccu.h> + #include <dt-bindings/reset/sun8i-a83t-ccu.h> + + crypto: crypto@1c15000 { + compatible = "allwinner,sun8i-a83t-crypto"; + reg = <0x01c15000 0x1000>; + interrupts = <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH>; + resets = <&ccu RST_BUS_SS>; + clocks = <&ccu CLK_BUS_SS>, <&ccu CLK_SS>; + clock-names = "bus", "mod"; + }; diff --git a/Documentation/devicetree/bindings/crypto/amlogic,gxl-crypto.yaml b/Documentation/devicetree/bindings/crypto/amlogic,gxl-crypto.yaml new file mode 100644 index 000000000000..5becc60a0e28 --- /dev/null +++ b/Documentation/devicetree/bindings/crypto/amlogic,gxl-crypto.yaml @@ -0,0 +1,52 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/crypto/amlogic,gxl-crypto.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Amlogic 
GXL Cryptographic Offloader + +maintainers: + - Corentin Labbe <clabbe@baylibre.com> + +properties: + compatible: + items: + - const: amlogic,gxl-crypto + + reg: + maxItems: 1 + + interrupts: + items: + - description: "Interrupt for flow 0" + - description: "Interrupt for flow 1" + + clocks: + maxItems: 1 + + clock-names: + const: blkmv + +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + +additionalProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/gxbb-clkc.h> + + crypto: crypto-engine@c883e000 { + compatible = "amlogic,gxl-crypto"; + reg = <0x0 0xc883e000 0x0 0x36>; + interrupts = <GIC_SPI 188 IRQ_TYPE_EDGE_RISING>, <GIC_SPI 189 IRQ_TYPE_EDGE_RISING>; + clocks = <&clkc CLKID_BLKMV>; + clock-names = "blkmv"; + }; diff --git a/Documentation/devicetree/bindings/devfreq/event/exynos-ppmu.txt b/Documentation/devicetree/bindings/devfreq/event/exynos-ppmu.txt index 3e36c1d11386..fb46b491791c 100644 --- a/Documentation/devicetree/bindings/devfreq/event/exynos-ppmu.txt +++ b/Documentation/devicetree/bindings/devfreq/event/exynos-ppmu.txt @@ -10,14 +10,23 @@ The Exynos PPMU driver uses the devfreq-event class to provide event data to various devfreq devices. The devfreq devices would use the event data when derterming the current state of each IP. -Required properties: +Required properties for PPMU device: - compatible: Should be "samsung,exynos-ppmu" or "samsung,exynos-ppmu-v2. - reg: physical base address of each PPMU and length of memory mapped region. 
-Optional properties: +Optional properties for PPMU device: - clock-names : the name of clock used by the PPMU, "ppmu" - clocks : phandles for clock specified in "clock-names" property +Required properties for 'events' child node of PPMU device: +- event-name : the unique event name among PPMU device +Optional properties for 'events' child node of PPMU device: +- event-data-type : Define the type of data which shall be counted +by the counter. You can check include/dt-bindings/pmu/exynos_ppmu.h for +all possible types, i.e. count read requests, count write data in bytes, +etc. This field is optional and when it is missing, the driver code +will use default data type. + Example1 : PPMUv1 nodes in exynos3250.dtsi are listed below. ppmu_dmc0: ppmu_dmc0@106a0000 { @@ -145,3 +154,16 @@ Example3 : PPMUv2 nodes in exynos5433.dtsi are listed below. reg = <0x104d0000 0x2000>; status = "disabled"; }; + +Example4 : 'event-data-type' in exynos4412-ppmu-common.dtsi are listed below. + + &ppmu_dmc0 { + status = "okay"; + events { + ppmu_dmc0_3: ppmu-event3-dmc0 { + event-name = "ppmu-event3-dmc0"; + event-data-type = <(PPMU_RO_DATA_CNT | + PPMU_WO_DATA_CNT)>; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/devfreq/exynos-bus.txt b/Documentation/devicetree/bindings/devfreq/exynos-bus.txt index f8e946471a58..e71f752cc18f 100644 --- a/Documentation/devicetree/bindings/devfreq/exynos-bus.txt +++ b/Documentation/devicetree/bindings/devfreq/exynos-bus.txt @@ -50,8 +50,6 @@ Required properties only for passive bus device: Optional properties only for parent bus device: - exynos,saturation-ratio: the percentage value which is used to calibrate the performance count against total cycle count. -- exynos,voltage-tolerance: the percentage value for bus voltage tolerance - which is used to calculate the max voltage.
Detailed correlation between sub-blocks and power line according to Exynos SoC: - In case of Exynos3250, there are two power line as following: diff --git a/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml b/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml index 47950fced28d..dafc0980c4fa 100644 --- a/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml +++ b/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml @@ -36,6 +36,9 @@ properties: resets: maxItems: 1 + vcc-dsi-supply: + description: VCC-DSI power supply of the DSI encoder + phys: maxItems: 1 @@ -64,6 +67,7 @@ required: - phys - phy-names - resets + - vcc-dsi-supply - port additionalProperties: false @@ -79,6 +83,7 @@ examples: resets = <&ccu 4>; phys = <&dphy0>; phy-names = "dphy"; + vcc-dsi-supply = <&reg_dcdc1>; #address-cells = <1>; #size-cells = <0>; diff --git a/Documentation/devicetree/bindings/display/arm,malidp.txt b/Documentation/devicetree/bindings/display/arm,malidp.txt index 2f7870983ef1..7a97a2b48c2a 100644 --- a/Documentation/devicetree/bindings/display/arm,malidp.txt +++ b/Documentation/devicetree/bindings/display/arm,malidp.txt @@ -37,6 +37,8 @@ Optional properties: Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt) to be used for the framebuffer; if not present, the framebuffer may be located anywhere in memory. + - arm,malidp-arqos-high-level: integer of u32 value describing the ARQoS + levels of DP500's QoS signaling.
Example: @@ -54,6 +56,7 @@ Example: clocks = <&oscclk2>, <&fpgaosc0>, <&fpgaosc1>, <&fpgaosc1>; clock-names = "pxlclk", "mclk", "aclk", "pclk"; arm,malidp-output-port-lines = /bits/ 8 <8 8 8>; + arm,malidp-arqos-high-level = <0xd000d000>; port { dp0_output: endpoint { remote-endpoint = <&tda998x_2_input>; diff --git a/Documentation/devicetree/bindings/display/bridge/anx7814.txt b/Documentation/devicetree/bindings/display/bridge/anx7814.txt index dbd7c84ee584..17258747fff6 100644 --- a/Documentation/devicetree/bindings/display/bridge/anx7814.txt +++ b/Documentation/devicetree/bindings/display/bridge/anx7814.txt @@ -6,7 +6,11 @@ designed for portable devices. Required properties: - - compatible : "analogix,anx7814" + - compatible : Must be one of: + "analogix,anx7808" + "analogix,anx7812" + "analogix,anx7814" + "analogix,anx7818" - reg : I2C address of the device - interrupts : Should contain the INTP interrupt - hpd-gpios : Which GPIO to use for hpd diff --git a/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt b/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt index db680413e89c..819f3e31013c 100644 --- a/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt +++ b/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt @@ -13,6 +13,7 @@ Required properties: - compatible : Shall contain one or more of - "renesas,r8a774a1-hdmi" for R8A774A1 (RZ/G2M) compatible HDMI TX + - "renesas,r8a774b1-hdmi" for R8A774B1 (RZ/G2N) compatible HDMI TX - "renesas,r8a7795-hdmi" for R8A7795 (R-Car H3) compatible HDMI TX - "renesas,r8a7796-hdmi" for R8A7796 (R-Car M3-W) compatible HDMI TX - "renesas,r8a77965-hdmi" for R8A77965 (R-Car M3-N) compatible HDMI TX diff --git a/Documentation/devicetree/bindings/display/bridge/renesas,lvds.txt b/Documentation/devicetree/bindings/display/bridge/renesas,lvds.txt index c6a196d0b075..c62ce2494ed9 100644 --- a/Documentation/devicetree/bindings/display/bridge/renesas,lvds.txt +++ 
b/Documentation/devicetree/bindings/display/bridge/renesas,lvds.txt @@ -10,6 +10,7 @@ Required properties: - "renesas,r8a7743-lvds" for R8A7743 (RZ/G1M) compatible LVDS encoders - "renesas,r8a7744-lvds" for R8A7744 (RZ/G1N) compatible LVDS encoders - "renesas,r8a774a1-lvds" for R8A774A1 (RZ/G2M) compatible LVDS encoders + - "renesas,r8a774b1-lvds" for R8A774B1 (RZ/G2N) compatible LVDS encoders - "renesas,r8a774c0-lvds" for R8A774C0 (RZ/G2E) compatible LVDS encoders - "renesas,r8a7790-lvds" for R8A7790 (R-Car H2) compatible LVDS encoders - "renesas,r8a7791-lvds" for R8A7791 (R-Car M2-W) compatible LVDS encoders diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,disp.txt b/Documentation/devicetree/bindings/display/mediatek/mediatek,disp.txt index 8469de510001..b91e709db7a4 100644 --- a/Documentation/devicetree/bindings/display/mediatek/mediatek,disp.txt +++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,disp.txt @@ -27,19 +27,22 @@ Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.txt. 
Required properties (all function blocks): - compatible: "mediatek,<chip>-disp-<function>", one of - "mediatek,<chip>-disp-ovl" - overlay (4 layers, blending, csc) - "mediatek,<chip>-disp-rdma" - read DMA / line buffer - "mediatek,<chip>-disp-wdma" - write DMA - "mediatek,<chip>-disp-color" - color processor - "mediatek,<chip>-disp-aal" - adaptive ambient light controller - "mediatek,<chip>-disp-gamma" - gamma correction - "mediatek,<chip>-disp-merge" - merge streams from two RDMA sources - "mediatek,<chip>-disp-split" - split stream to two encoders - "mediatek,<chip>-disp-ufoe" - data compression engine - "mediatek,<chip>-dsi" - DSI controller, see mediatek,dsi.txt - "mediatek,<chip>-dpi" - DPI controller, see mediatek,dpi.txt - "mediatek,<chip>-disp-mutex" - display mutex - "mediatek,<chip>-disp-od" - overdrive + "mediatek,<chip>-disp-ovl" - overlay (4 layers, blending, csc) + "mediatek,<chip>-disp-ovl-2l" - overlay (2 layers, blending, csc) + "mediatek,<chip>-disp-rdma" - read DMA / line buffer + "mediatek,<chip>-disp-wdma" - write DMA + "mediatek,<chip>-disp-ccorr" - color correction + "mediatek,<chip>-disp-color" - color processor + "mediatek,<chip>-disp-dither" - dither + "mediatek,<chip>-disp-aal" - adaptive ambient light controller + "mediatek,<chip>-disp-gamma" - gamma correction + "mediatek,<chip>-disp-merge" - merge streams from two RDMA sources + "mediatek,<chip>-disp-split" - split stream to two encoders + "mediatek,<chip>-disp-ufoe" - data compression engine + "mediatek,<chip>-dsi" - DSI controller, see mediatek,dsi.txt + "mediatek,<chip>-dpi" - DPI controller, see mediatek,dpi.txt + "mediatek,<chip>-disp-mutex" - display mutex + "mediatek,<chip>-disp-od" - overdrive the supported chips are mt2701, mt2712 and mt8173. 
- reg: Physical base address and length of the function block register space - interrupts: The interrupt signal from the function block (required, except for @@ -49,6 +52,7 @@ Required properties (all function blocks): For most function blocks this is just a single clock input. Only the DSI and DPI controller nodes have multiple clock inputs. These are documented in mediatek,dsi.txt and mediatek,dpi.txt, respectively. + An exception is that the mt8183 mutex is always free running with no clocks property. Required properties (DMA function blocks): - compatible: Should be one of diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt b/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt index fadf327c7cdf..a19a6cc375ed 100644 --- a/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt +++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt @@ -7,7 +7,7 @@ channel output. Required properties: - compatible: "mediatek,<chip>-dsi" - the supported chips are mt2701 and mt8173. + the supported chips are mt2701, mt8173 and mt8183. - reg: Physical base address and length of the controller's registers - interrupts: The interrupt signal from the function block. - clocks: device clocks @@ -26,7 +26,7 @@ The MIPI TX configuration module controls the MIPI D-PHY. Required properties: - compatible: "mediatek,<chip>-mipi-tx" - the supported chips are mt2701 and mt8173. + the supported chips are mt2701, mt8173 and mt8183. 
- reg: Physical base address and length of the controller's registers - clocks: PLL reference clock - clock-output-names: name of the output clock line to the DSI encoder diff --git a/Documentation/devicetree/bindings/display/renesas,du.txt b/Documentation/devicetree/bindings/display/renesas,du.txt index c97dfacad281..17cb2771364b 100644 --- a/Documentation/devicetree/bindings/display/renesas,du.txt +++ b/Documentation/devicetree/bindings/display/renesas,du.txt @@ -8,6 +8,7 @@ Required Properties: - "renesas,du-r8a7745" for R8A7745 (RZ/G1E) compatible DU - "renesas,du-r8a77470" for R8A77470 (RZ/G1C) compatible DU - "renesas,du-r8a774a1" for R8A774A1 (RZ/G2M) compatible DU + - "renesas,du-r8a774b1" for R8A774B1 (RZ/G2N) compatible DU - "renesas,du-r8a774c0" for R8A774C0 (RZ/G2E) compatible DU - "renesas,du-r8a7779" for R8A7779 (R-Car H1) compatible DU - "renesas,du-r8a7790" for R8A7790 (R-Car H2) compatible DU @@ -60,6 +61,7 @@ corresponding to each DU output. R8A7745 (RZ/G1E) DPAD 0 DPAD 1 - - R8A77470 (RZ/G1C) DPAD 0 DPAD 1 LVDS 0 - R8A774A1 (RZ/G2M) DPAD 0 HDMI 0 LVDS 0 - + R8A774B1 (RZ/G2N) DPAD 0 HDMI 0 LVDS 0 - R8A774C0 (RZ/G2E) DPAD 0 LVDS 0 LVDS 1 - R8A7779 (R-Car H1) DPAD 0 DPAD 1 - - R8A7790 (R-Car H2) DPAD 0 LVDS 0 LVDS 1 - diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt b/Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt index 4f58c5a2d195..8b3a5f514205 100644 --- a/Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt +++ b/Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt @@ -20,6 +20,10 @@ Required properties: "rockchip,rk3228-vop"; "rockchip,rk3328-vop"; +- reg: Must contain one entry corresponding to the base address and length + of the register space. Can optionally contain a second entry + corresponding to the CRTC gamma LUT address. + - interrupts: should contain a list of all VOP IP block interrupts in the order: VSYNC, LCD_SYSTEM. 
The interrupt specifier format depends on the interrupt controller used. @@ -48,7 +52,7 @@ Example: SoC specific DT entry: vopb: vopb@ff930000 { compatible = "rockchip,rk3288-vop"; - reg = <0xff930000 0x19c>; + reg = <0x0 0xff930000 0x0 0x19c>, <0x0 0xff931000 0x0 0x1000>; interrupts = <GIC_SPI 15 IRQ_TYPE_LEVEL_HIGH>; clocks = <&cru ACLK_VOP0>, <&cru DCLK_VOP0>, <&cru HCLK_VOP0>; clock-names = "aclk_vop", "dclk_vop", "hclk_vop"; diff --git a/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt b/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt index 372f0eeb5a2a..f1f95f678739 100644 --- a/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt +++ b/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt @@ -8,6 +8,7 @@ Required Properties: - "renesas,r8a7745-usb-dmac" (RZ/G1E) - "renesas,r8a77470-usb-dmac" (RZ/G1C) - "renesas,r8a774a1-usb-dmac" (RZ/G2M) + - "renesas,r8a774b1-usb-dmac" (RZ/G2N) - "renesas,r8a774c0-usb-dmac" (RZ/G2E) - "renesas,r8a7790-usb-dmac" (R-Car H2) - "renesas,r8a7791-usb-dmac" (R-Car M2-W) diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-aspeed.txt b/Documentation/devicetree/bindings/fsi/fsi-master-aspeed.txt new file mode 100644 index 000000000000..b758f91914f7 --- /dev/null +++ b/Documentation/devicetree/bindings/fsi/fsi-master-aspeed.txt @@ -0,0 +1,24 @@ +Device-tree bindings for AST2600 FSI master +------------------------------------------- + +The AST2600 contains two identical FSI masters. They share a clock and have a +separate interrupt line and output pins. 
+ +Required properties: + - compatible: "aspeed,ast2600-fsi-master" + - reg: base address and length + - clocks: phandle and clock number + - interrupts: platform dependent interrupt description + - pinctrl-0: phandle to pinctrl node + - pinctrl-names: pinctrl state + +Examples: + + fsi-master { + compatible = "aspeed,ast2600-fsi-master", "fsi-master"; + reg = <0x1e79b000 0x94>; + interrupts = <GIC_SPI 100 IRQ_TYPE_LEVEL_HIGH>; + pinctrl-names = "default"; + pinctrl-0 = <&pinctrl_fsi1_default>; + clocks = <&syscon ASPEED_CLK_GATE_FSICLK>; + }; diff --git a/Documentation/devicetree/bindings/hwmon/adi,ltc2947.yaml b/Documentation/devicetree/bindings/hwmon/adi,ltc2947.yaml new file mode 100644 index 000000000000..ae04903f34bf --- /dev/null +++ b/Documentation/devicetree/bindings/hwmon/adi,ltc2947.yaml @@ -0,0 +1,104 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/bindings/hwmon/adi,ltc2947.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Analog Devices LTC2947 high precision power and energy monitor + +maintainers: + - Nuno Sá <nuno.sa@analog.com> + +description: | + Analog Devices LTC2947 high precision power and energy monitor over SPI or I2C. + + https://www.analog.com/media/en/technical-documentation/data-sheets/LTC2947.pdf + +properties: + compatible: + enum: + - adi,ltc2947 + + reg: + maxItems: 1 + + clocks: + description: + The LTC2947 uses either a trimmed internal oscillator or an external clock + as the time base for determining the integration period to represent time, + charge and energy. When an external clock is used, this property must be + set accordingly. + maxItems: 1 + + adi,accumulator-ctl-pol: + description: + This property controls the polarity of current that is accumulated to + calculate charge and energy so that, they can be only accumulated for + positive current for example. 
Since there are two sets of registers for + the accumulated values, this entry can also have two items which set + energy1/charge1 and energy2/charge2 respectively. Check table 12 of the + datasheet for more information on the supported options. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - minItems: 2 + maxItems: 2 + items: + enum: [0, 1, 2, 3] + default: 0 + + adi,accumulation-deadband-microamp: + description: + This property controls the Accumulation Dead band which allows setting the + level of current below which no accumulation takes place. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + maximum: 255 + default: 0 + + adi,gpio-out-pol: + description: + This property controls the GPIO polarity. Setting it to one makes the GPIO + active high, setting it to zero makes it active low. When this property + is present, the GPIO is automatically configured as output and set to + control a fan as a function of measured temperature. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + enum: [0, 1] + default: 0 + + adi,gpio-in-accum: + description: + When set, this property sets the GPIO as input. It is then used to control + the accumulation of charge, energy and time. This function can be + enabled/configured separately for each of the two sets of accumulation + registers. Check table 13 of the datasheet for more information on the + supported options. This property cannot be used together with + adi,gpio-out-pol. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - minItems: 2 + maxItems: 2 + items: + enum: [0, 1, 2] + default: 0 + +required: + - compatible + - reg + + +examples: + - | + spi { + #address-cells = <1>; + #size-cells = <0>; + + ltc2947_spi: ltc2947@0 { + compatible = "adi,ltc2947"; + reg = <0>; + /* accumulation takes place always for energy1/charge1. */ + /* accumulation only on positive current for energy2/charge2. */ + adi,accumulator-ctl-pol = <0 1>; + }; + }; +...
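The value constraints that the adi,ltc2947 schema above places on its optional properties can be restated as a small check. The following is only a sketch under stated assumptions: it is not dt-schema and not part of this commit, the function name check_ltc2947_node is hypothetical, and a node is modelled as a plain Python dict with cell lists written as Python lists. Only the property names and limits are taken from the binding text.

```python
# Hypothetical helper (not dt-schema): restates the limits written in the
# adi,ltc2947 binding. A node is modelled as a dict of property -> value,
# and multi-cell properties are assumed to be Python lists.

def check_ltc2947_node(props):
    """Return a list of problems found in the given node properties."""
    errors = []

    # "required: compatible, reg" in the binding.
    for required in ("compatible", "reg"):
        if required not in props:
            errors.append(f"missing required property: {required}")

    def check_pair(name, allowed):
        # Both array properties use minItems: 2 / maxItems: 2 plus an enum.
        value = props.get(name)
        if value is None:
            return
        if len(value) != 2:
            errors.append(f"{name}: expected exactly 2 items, got {len(value)}")
        elif any(item not in allowed for item in value):
            errors.append(f"{name}: items must be one of {sorted(allowed)}")

    check_pair("adi,accumulator-ctl-pol", {0, 1, 2, 3})
    check_pair("adi,gpio-in-accum", {0, 1, 2})

    # uint32 with "maximum: 255".
    deadband = props.get("adi,accumulation-deadband-microamp")
    if deadband is not None and not 0 <= deadband <= 255:
        errors.append("adi,accumulation-deadband-microamp: must be in 0..255")

    # uint32 with "enum: [0, 1]".
    polarity = props.get("adi,gpio-out-pol")
    if polarity is not None and polarity not in (0, 1):
        errors.append("adi,gpio-out-pol: must be 0 or 1")

    # Stated only in the description text: the two GPIO modes are exclusive.
    if "adi,gpio-out-pol" in props and "adi,gpio-in-accum" in props:
        errors.append("adi,gpio-in-accum cannot be used together with adi,gpio-out-pol")

    return errors


# The example node from the binding: energy1/charge1 always accumulate,
# energy2/charge2 only on positive current.
example = {
    "compatible": "adi,ltc2947",
    "reg": 0,
    "adi,accumulator-ctl-pol": [0, 1],
}
print(check_ltc2947_node(example))  # prints []
```

The mutual exclusion of adi,gpio-out-pol and adi,gpio-in-accum appears only in the property description, not as a schema keyword, so the sketch checks it explicitly; whether the schema tooling enforces it is not something this diff shows.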
diff --git a/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt b/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt index 1036f65fb778..d9a2719f9243 100644 --- a/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt +++ b/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt @@ -5,6 +5,9 @@ Required properties: - compatible : Must be one of the following: "ibm,cffps1" "ibm,cffps2" + or "ibm,cffps" if the system + must support any version of the + power supply - reg = < I2C bus address >; : Address of the power supply on the I2C bus. diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp513.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp513.yaml new file mode 100644 index 000000000000..168235ad5d81 --- /dev/null +++ b/Documentation/devicetree/bindings/hwmon/ti,tmp513.yaml @@ -0,0 +1,93 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- + +$id: http://devicetree.org/schemas/hwmon/ti,tmp513.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: TMP513/512 system monitor sensor + +maintainers: + - Eric Tremblay <etremblay@distech-controls.com> + +description: | + The TMP512 (dual-channel) and TMP513 (triple-channel) are system monitors + that include remote sensors, a local temperature sensor, and a high-side + current shunt monitor. These system monitors have the capability of measuring + remote temperatures, on-chip temperatures, and system voltage/power/current + consumption. + + Datasheets: + http://www.ti.com/lit/gpn/tmp513 + http://www.ti.com/lit/gpn/tmp512 + + +properties: + compatible: + enum: + - ti,tmp512 + - ti,tmp513 + + reg: + maxItems: 1 + + shunt-resistor-micro-ohms: + description: | + If 0, the calibration process will be skipped and the current and power + measurement engine will not work. Temperature and voltage measurement + will continue to work. The shunt value also needs to respect: + rshunt <= pga-gain * 40 * 1000 * 1000. + If not, it's not possible to compute a valid calibration value.
+ default: 1000 + + ti,pga-gain: + description: | + The gain value for the PGA function. This is 8, 4, 2 or 1. + The PGA gain affects the shunt voltage range. + The range will be equal to: pga-gain * 40mV + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + enum: [1, 2, 4, 8] + default: 8 + + ti,bus-range-microvolt: + description: | + This is the operating range of the bus voltage in microvolts + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + enum: [16000000, 32000000] + default: 32000000 + + ti,nfactor: + description: | + Array of three (TMP513) or two (TMP512) n-Factor values, one for each remote + temperature channel. + See datasheet Table 11 for n-Factor range list and value interpretation. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - minItems: 2 + maxItems: 3 + items: + default: 0x00 + minimum: 0x00 + maximum: 0xFF + +required: + - compatible + - reg + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + tmp513@5c { + compatible = "ti,tmp513"; + reg = <0x5C>; + shunt-resistor-micro-ohms = <330000>; + ti,bus-range-microvolt = <32000000>; + ti,pga-gain = <8>; + ti,nfactor = <0x1 0xF3 0x00>; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/adc/adi,ad7292.yaml b/Documentation/devicetree/bindings/iio/adc/adi,ad7292.yaml new file mode 100644 index 000000000000..b68be3aaf587 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/adc/adi,ad7292.yaml @@ -0,0 +1,104 @@ +# SPDX-License-Identifier: GPL-2.0-only +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/adc/adi,ad7292.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Analog Devices AD7292 10-Bit Monitor and Control System + +maintainers: + - Marcelo Schmitt <marcelo.schmitt1@gmail.com> + +description: | + Analog Devices AD7292 10-Bit Monitor and Control System with ADC, DACs, + Temperature Sensor, and GPIOs + + Specifications about the part can be found at: +
https://www.analog.com/media/en/technical-documentation/data-sheets/ad7292.pdf + +properties: + compatible: + enum: + - adi,ad7292 + + reg: + maxItems: 1 + + vref-supply: + description: | + The regulator supply for ADC and DAC reference voltage. + + spi-cpha: true + + '#address-cells': + const: 1 + + '#size-cells': + const: 0 + +required: + - compatible + - reg + - spi-cpha + +patternProperties: + "^channel@[0-7]$": + type: object + description: | + Represents the external channels which are connected to the ADC. + See Documentation/devicetree/bindings/iio/adc/adc.txt. + + properties: + reg: + description: | + The channel number. It can have up to 8 channels numbered from 0 to 7. + items: + maximum: 7 + + diff-channels: + description: see Documentation/devicetree/bindings/iio/adc/adc.txt + maxItems: 1 + + required: + - reg + +examples: + - | + spi { + #address-cells = <1>; + #size-cells = <0>; + + ad7292: adc@0 { + compatible = "adi,ad7292"; + reg = <0>; + spi-max-frequency = <25000000>; + vref-supply = <&adc_vref>; + spi-cpha; + + #address-cells = <1>; + #size-cells = <0>; + + channel@0 { + reg = <0>; + diff-channels = <0 1>; + }; + channel@2 { + reg = <2>; + }; + channel@3 { + reg = <3>; + }; + channel@4 { + reg = <4>; + }; + channel@5 { + reg = <5>; + }; + channel@6 { + reg = <6>; + }; + channel@7 { + reg = <7>; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/adc/ingenic,adc.txt b/Documentation/devicetree/bindings/iio/adc/ingenic,adc.txt index f01159f20d87..cd9048cf9dcf 100644 --- a/Documentation/devicetree/bindings/iio/adc/ingenic,adc.txt +++ b/Documentation/devicetree/bindings/iio/adc/ingenic,adc.txt @@ -5,6 +5,7 @@ Required properties: - compatible: Should be one of: * ingenic,jz4725b-adc * ingenic,jz4740-adc + * ingenic,jz4770-adc - reg: ADC controller registers location and length. - clocks: phandle to the SoC's ADC clock. - clock-names: Must be set to "adc". 
diff --git a/Documentation/devicetree/bindings/iio/adc/max1027-adc.txt b/Documentation/devicetree/bindings/iio/adc/max1027-adc.txt deleted file mode 100644 index e680c61dfb84..000000000000 --- a/Documentation/devicetree/bindings/iio/adc/max1027-adc.txt +++ /dev/null @@ -1,20 +0,0 @@ -* Maxim 1027/1029/1031 Analog to Digital Converter (ADC) - -Required properties: - - compatible: Should be "maxim,max1027" or "maxim,max1029" or "maxim,max1031" - - reg: SPI chip select number for the device - - interrupts: IRQ line for the ADC - see: Documentation/devicetree/bindings/interrupt-controller/interrupts.txt - -Recommended properties: -- spi-max-frequency: Definition as per - Documentation/devicetree/bindings/spi/spi-bus.txt - -Example: -adc@0 { - compatible = "maxim,max1027"; - reg = <0>; - interrupt-parent = <&gpio5>; - interrupts = <15 IRQ_TYPE_EDGE_RISING>; - spi-max-frequency = <1000000>; -}; diff --git a/Documentation/devicetree/bindings/iio/adc/mcp3911.txt b/Documentation/devicetree/bindings/iio/adc/mcp3911.txt deleted file mode 100644 index 3071f48fb30b..000000000000 --- a/Documentation/devicetree/bindings/iio/adc/mcp3911.txt +++ /dev/null @@ -1,30 +0,0 @@ -* Microchip MCP3911 Dual channel analog front end (ADC) - -Required properties: - - compatible: Should be "microchip,mcp3911" - - reg: SPI chip select number for the device - -Recommended properties: - - spi-max-frequency: Definition as per - Documentation/devicetree/bindings/spi/spi-bus.txt. - Max frequency for this chip is 20MHz. - -Optional properties: - - clocks: Phandle and clock identifier for sampling clock - - interrupt-parent: Phandle to the parent interrupt controller - - interrupts: IRQ line for the ADC - - microchip,device-addr: Device address when multiple MCP3911 chips are present on the - same SPI bus. Valid values are 0-3. Defaults to 0. - - vref-supply: Phandle to the external reference voltage supply. 
- -Example: -adc@0 { - compatible = "microchip,mcp3911"; - reg = <0>; - interrupt-parent = <&gpio5>; - interrupts = <15 IRQ_TYPE_EDGE_RISING>; - spi-max-frequency = <20000000>; - microchip,device-addr = <0>; - vref-supply = <&vref_reg>; - clocks = <&xtal>; -}; diff --git a/Documentation/devicetree/bindings/iio/adc/microchip,mcp3911.yaml b/Documentation/devicetree/bindings/iio/adc/microchip,mcp3911.yaml new file mode 100644 index 000000000000..881059b80d61 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/adc/microchip,mcp3911.yaml @@ -0,0 +1,71 @@ +# SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +# Copyright 2019 Marcus Folkesson <marcus.folkesson@gmail.com> +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/bindings/iio/adc/microchip,mcp3911.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Microchip MCP3911 Dual channel analog front end (ADC) + +maintainers: + - Marcus Folkesson <marcus.folkesson@gmail.com> + - Kent Gustavsson <nedo80@gmail.com> + +description: | + Bindings for the Microchip MCP3911 Dual channel ADC device. Datasheet can be + found here: https://ww1.microchip.com/downloads/en/DeviceDoc/20002286C.pdf + +properties: + compatible: + enum: + - microchip,mcp3911 + + reg: + maxItems: 1 + + spi-max-frequency: + maximum: 20000000 + + clocks: + description: | + Phandle and clock identifier for external sampling clock. + If not specified, the internal crystal oscillator will be used. + maxItems: 1 + + interrupts: + description: IRQ line of the ADC + maxItems: 1 + + microchip,device-addr: + description: Device address when multiple MCP3911 chips are present on the same SPI bus. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [0, 1, 2, 3] + - default: 0 + + vref-supply: + description: | + Phandle to the external reference voltage supply. + If not specified, the internal voltage reference (1.2V) will be used. 
+ +required: + - compatible + - reg + +examples: + - | + spi { + #address-cells = <1>; + #size-cells = <0>; + + adc@0 { + compatible = "microchip,mcp3911"; + reg = <0>; + interrupt-parent = <&gpio5>; + interrupts = <15 2>; + spi-max-frequency = <20000000>; + microchip,device-addr = <0>; + vref-supply = <&vref_reg>; + clocks = <&xtal>; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt index 4c0da8c74bb2..8de933146771 100644 --- a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt +++ b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt @@ -53,6 +53,8 @@ Optional properties: analog input switches on stm32mp1. - st,syscfg: Phandle to system configuration controller. It can be used to control the analog circuitry on stm32mp1. +- st,max-clk-rate-hz: Allow to specify desired max clock rate used by analog + circuitry. Contents of a stm32 adc child node: ----------------------------------- diff --git a/Documentation/devicetree/bindings/iio/dac/lltc,ltc1660.yaml b/Documentation/devicetree/bindings/iio/dac/lltc,ltc1660.yaml new file mode 100644 index 000000000000..13d005b68931 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/dac/lltc,ltc1660.yaml @@ -0,0 +1,49 @@ +# SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +# Copyright 2019 Marcus Folkesson <marcus.folkesson@gmail.com> +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/bindings/iio/dac/lltc,ltc1660.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Linear Technology Micropower octal 8-Bit and 10-Bit DACs + +maintainers: + - Marcus Folkesson <marcus.folkesson@gmail.com> + +description: | + Bindings for the Linear Technology Micropower octal 8-Bit and 10-Bit DAC. 
+ Datasheet can be found here: https://www.analog.com/media/en/technical-documentation/data-sheets/166560fa.pdf + +properties: + compatible: + enum: + - lltc,ltc1660 + - lltc,ltc1665 + + reg: + maxItems: 1 + + spi-max-frequency: + maximum: 5000000 + + vref-supply: + description: Phandle to the external reference voltage supply. + +required: + - compatible + - reg + - vref-supply + +examples: + - | + spi { + #address-cells = <1>; + #size-cells = <0>; + + dac@0 { + compatible = "lltc,ltc1660"; + reg = <0>; + spi-max-frequency = <5000000>; + vref-supply = <&vref_reg>; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/dac/ltc1660.txt b/Documentation/devicetree/bindings/iio/dac/ltc1660.txt deleted file mode 100644 index c5b5f22d6c64..000000000000 --- a/Documentation/devicetree/bindings/iio/dac/ltc1660.txt +++ /dev/null @@ -1,21 +0,0 @@ -* Linear Technology Micropower octal 8-Bit and 10-Bit DACs - -Required properties: - - compatible: Must be one of the following: - "lltc,ltc1660" - "lltc,ltc1665" - - reg: SPI chip select number for the device - - vref-supply: Phandle to the voltage reference supply - -Recommended properties: - - spi-max-frequency: Definition as per - Documentation/devicetree/bindings/spi/spi-bus.txt. - Max frequency for this chip is 5 MHz. - -Example: -dac@0 { - compatible = "lltc,ltc1660"; - reg = <0>; - spi-max-frequency = <5000000>; - vref-supply = <&vref_reg>; -}; diff --git a/Documentation/devicetree/bindings/iio/iio-bindings.txt b/Documentation/devicetree/bindings/iio/iio-bindings.txt index 68d6f8ce063b..af33267727f4 100644 --- a/Documentation/devicetree/bindings/iio/iio-bindings.txt +++ b/Documentation/devicetree/bindings/iio/iio-bindings.txt @@ -18,12 +18,17 @@ Required properties: with a single IIO output and 1 for nodes with multiple IIO outputs. +Optional properties: +label: A symbolic name for the device. 
+ + Example for a simple configuration with no trigger: adc: voltage-sensor@35 { compatible = "maxim,max1139"; reg = <0x35>; #io-channel-cells = <1>; + label = "voltage_feedback_group1"; }; Example for a configuration with trigger: diff --git a/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt b/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt index 268bf7568e19..c5ee8a20af9f 100644 --- a/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt +++ b/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt @@ -21,6 +21,7 @@ Required properties: bindings. Optional properties: + - vdd-supply: regulator phandle for VDD supply - vddio-supply: regulator phandle for VDDIO supply - mount-matrix: an optional 3x3 mounting rotation matrix - i2c-gate node. These devices also support an auxiliary i2c bus. This is diff --git a/Documentation/devicetree/bindings/iio/imu/nxp,fxos8700.yaml b/Documentation/devicetree/bindings/iio/imu/nxp,fxos8700.yaml new file mode 100644 index 000000000000..63bcb73ae309 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/imu/nxp,fxos8700.yaml @@ -0,0 +1,76 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/imu/nxp,fxos8700.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Freescale FXOS8700 Inertial Measurement Unit + +maintainers: + - Robert Jones <rjones@gateworks.com> + +description: | + Accelerometer and magnetometer combo device with an i2c and SPI interface. 
+ https://www.nxp.com/products/sensors/motion-sensors/6-axis/digital-motion-sensor-3d-accelerometer-2g-4g-8g-plus-3d-magnetometer:FXOS8700CQ + +properties: + compatible: + enum: + - nxp,fxos8700 + + reg: + maxItems: 1 + + interrupts: + minItems: 1 + maxItems: 2 + + interrupt-names: + minItems: 1 + maxItems: 2 + items: + enum: + - INT1 + - INT2 + + drive-open-drain: + type: boolean + +required: + - compatible + - reg + +examples: + - | + #include <dt-bindings/gpio/gpio.h> + #include <dt-bindings/interrupt-controller/irq.h> + i2c0 { + #address-cells = <1>; + #size-cells = <0>; + + fxos8700@1e { + compatible = "nxp,fxos8700"; + reg = <0x1e>; + + interrupt-parent = <&gpio2>; + interrupts = <7 IRQ_TYPE_EDGE_RISING>; + interrupt-names = "INT1"; + }; + }; + - | + #include <dt-bindings/gpio/gpio.h> + #include <dt-bindings/interrupt-controller/irq.h> + spi0 { + #address-cells = <1>; + #size-cells = <0>; + + fxos8700@0 { + compatible = "nxp,fxos8700"; + reg = <0>; + + spi-max-frequency = <1000000>; + interrupt-parent = <&gpio1>; + interrupts = <7 IRQ_TYPE_EDGE_RISING>; + interrupt-names = "INT2"; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt b/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt index 6d0c050d89fe..cef4bc16fce1 100644 --- a/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt +++ b/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt @@ -14,6 +14,8 @@ Required properties: "st,lsm6ds3tr-c" "st,ism330dhcx" "st,lsm9ds1-imu" + "st,lsm6ds0" + "st,lsm6dsrx" - reg: i2c address of the sensor / spi cs line Optional properties: @@ -31,6 +33,7 @@ Optional properties: - interrupts: interrupt mapping for IRQ. It should be configured with flags IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_EDGE_RISING, IRQ_TYPE_LEVEL_LOW or IRQ_TYPE_EDGE_FALLING. +- wakeup-source: Enables wake up of host system on event. Refer to interrupt-controller/interrupts.txt for generic interrupt client node bindings. 
diff --git a/Documentation/devicetree/bindings/iio/light/adux1020.yaml b/Documentation/devicetree/bindings/iio/light/adux1020.yaml new file mode 100644 index 000000000000..69bd5c06319d --- /dev/null +++ b/Documentation/devicetree/bindings/iio/light/adux1020.yaml @@ -0,0 +1,47 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/light/adux1020.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Analog Devices ADUX1020 Photometric sensor + +maintainers: + - Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> + +description: | + Photometric sensor over an i2c interface. + https://www.analog.com/media/en/technical-documentation/data-sheets/ADUX1020.pdf + +properties: + compatible: + enum: + - adi,adux1020 + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + +required: + - compatible + - reg + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + + i2c { + + #address-cells = <1>; + #size-cells = <0>; + + adux1020@64 { + compatible = "adi,adux1020"; + reg = <0x64>; + interrupt-parent = <&msmgpio>; + interrupts = <24 IRQ_TYPE_LEVEL_HIGH>; + }; + }; +... 
diff --git a/Documentation/devicetree/bindings/iio/light/bh1750.txt b/Documentation/devicetree/bindings/iio/light/bh1750.txt deleted file mode 100644 index 1e7685797d7a..000000000000 --- a/Documentation/devicetree/bindings/iio/light/bh1750.txt +++ /dev/null @@ -1,18 +0,0 @@ -ROHM BH1750 - ALS, Ambient light sensor - -Required properties: - -- compatible: Must be one of: - "rohm,bh1710" - "rohm,bh1715" - "rohm,bh1721" - "rohm,bh1750" - "rohm,bh1751" -- reg: the I2C address of the sensor - -Example: - -light-sensor@23 { - compatible = "rohm,bh1750"; - reg = <0x23>; -}; diff --git a/Documentation/devicetree/bindings/iio/light/bh1750.yaml b/Documentation/devicetree/bindings/iio/light/bh1750.yaml new file mode 100644 index 000000000000..1cc60d7ecfa0 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/light/bh1750.yaml @@ -0,0 +1,43 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/light/bh1750.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: ROHM BH1750 ambient light sensor + +maintainers: + - Tomasz Duszynski <tduszyns@gmail.com> + +description: | + Ambient light sensor with an i2c interface. + +properties: + compatible: + enum: + - rohm,bh1710 + - rohm,bh1715 + - rohm,bh1721 + - rohm,bh1750 + - rohm,bh1751 + + reg: + maxItems: 1 + +required: + - compatible + - reg + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + light-sensor@23 { + compatible = "rohm,bh1750"; + reg = <0x23>; + }; + }; + +... 
diff --git a/Documentation/devicetree/bindings/iio/light/veml6030.yaml b/Documentation/devicetree/bindings/iio/light/veml6030.yaml new file mode 100644 index 000000000000..0ff9b11f9d18 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/light/veml6030.yaml @@ -0,0 +1,62 @@ +# SPDX-License-Identifier: GPL-2.0+ +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/light/veml6030.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: VEML6030 Ambient Light Sensor (ALS) + +maintainers: + - Rishi Gupta <gupt21@gmail.com> + +description: | + Bindings for the ambient light sensor veml6030 from Vishay + Semiconductors over an i2c interface. + + Irrespective of whether interrupt is used or not, application + can get the ALS and White channel reading from IIO raw interface. + + If the interrupts are used, application will receive an IIO event + whenever configured threshold is crossed. + + Specifications about the sensor can be found at: + https://www.vishay.com/docs/84366/veml6030.pdf + +properties: + compatible: + enum: + - vishay,veml6030 + + reg: + description: + I2C address of the device. + enum: + - 0x10 # ADDR pin pulled down + - 0x48 # ADDR pin pulled up + + interrupts: + description: + interrupt mapping for IRQ. Configure with IRQ_TYPE_LEVEL_LOW. + Refer to interrupt-controller/interrupts.txt for generic + interrupt client node bindings. + maxItems: 1 + +required: + - compatible + - reg + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + + i2c { + #address-cells = <1>; + #size-cells = <0>; + + light-sensor@10 { + compatible = "vishay,veml6030"; + reg = <0x10>; + interrupts = <12 IRQ_TYPE_LEVEL_LOW>; + }; + }; +... 
diff --git a/Documentation/devicetree/bindings/iio/proximity/maxbotix,mb1232.txt b/Documentation/devicetree/bindings/iio/proximity/maxbotix,mb1232.txt deleted file mode 100644 index dd1058fbe9c3..000000000000 --- a/Documentation/devicetree/bindings/iio/proximity/maxbotix,mb1232.txt +++ /dev/null @@ -1,29 +0,0 @@ -* MaxBotix I2CXL-MaxSonar ultrasonic distance sensor of type mb1202, - mb1212, mb1222, mb1232, mb1242, mb7040 or mb7137 using the i2c interface - for ranging - -Required properties: - - compatible: "maxbotix,mb1202", - "maxbotix,mb1212", - "maxbotix,mb1222", - "maxbotix,mb1232", - "maxbotix,mb1242", - "maxbotix,mb7040" or - "maxbotix,mb7137" - - - reg: i2c address of the device, see also i2c/i2c.txt - -Optional properties: - - interrupts: Interrupt used to announce the preceding reading - request has finished and that data is available. - If no interrupt is specified the device driver - falls back to wait a fixed amount of time until - data can be retrieved. - -Example: -proximity@70 { - compatible = "maxbotix,mb1232"; - reg = <0x70>; - interrupt-parent = <&gpio2>; - interrupts = <2 IRQ_TYPE_EDGE_FALLING>; -}; diff --git a/Documentation/devicetree/bindings/iio/proximity/maxbotix,mb1232.yaml b/Documentation/devicetree/bindings/iio/proximity/maxbotix,mb1232.yaml new file mode 100644 index 000000000000..3eac248f291d --- /dev/null +++ b/Documentation/devicetree/bindings/iio/proximity/maxbotix,mb1232.yaml @@ -0,0 +1,60 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/proximity/maxbotix,mb1232.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: MaxBotix I2CXL-MaxSonar ultrasonic distance sensor + +maintainers: + - Andreas Klinger <ak@it-klinger.de> + +description: | + MaxBotix I2CXL-MaxSonar ultrasonic distance sensor of type mb1202, + mb1212, mb1222, mb1232, mb1242, mb7040 or mb7137 using the i2c interface + for ranging + + Specifications about the devices can be found at: + 
https://www.maxbotix.com/documents/I2CXL-MaxSonar-EZ_Datasheet.pdf + +properties: + compatible: + enum: + - maxbotix,mb1202 + - maxbotix,mb1212 + - maxbotix,mb1222 + - maxbotix,mb1232 + - maxbotix,mb1242 + - maxbotix,mb7040 + - maxbotix,mb7137 + + reg: + maxItems: 1 + + interrupts: + description: + Interrupt used to announce the preceding reading request has finished + and that data is available. If no interrupt is specified the device + driver falls back to wait a fixed amount of time until data can be + retrieved. + maxItems: 1 + +required: + - compatible + - reg + +additionalProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + i2c { + #address-cells = <1>; + #size-cells = <0>; + proximity@70 { + compatible = "maxbotix,mb1232"; + reg = <0x70>; + interrupt-parent = <&gpio2>; + interrupts = <2 IRQ_TYPE_EDGE_FALLING>; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/temperature/adi,ltc2983.yaml b/Documentation/devicetree/bindings/iio/temperature/adi,ltc2983.yaml new file mode 100644 index 000000000000..d4922f9f0376 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/temperature/adi,ltc2983.yaml @@ -0,0 +1,480 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/temperature/adi,ltc2983.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Analog Devices LTC2983 Multi-sensor Temperature system + +maintainers: + - Nuno Sá <nuno.sa@analog.com> + +description: | + Analog Devices LTC2983 Multi-Sensor Digital Temperature Measurement System + https://www.analog.com/media/en/technical-documentation/data-sheets/2983fc.pdf + +properties: + compatible: + enum: + - adi,ltc2983 + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + adi,mux-delay-config-us: + description: + The LTC2983 performs 2 or 3 internal conversion cycles per temperature + result. 
Each conversion cycle is performed with different excitation and + input multiplexer configurations. Prior to each conversion, these + excitation circuits and input switch configurations are changed and an + internal 1ms delay ensures settling prior to the conversion cycle in most + cases. An extra delay can be configured using this property. The value is + rounded to nearest 100us. + maximum: 255 + + adi,filter-notch-freq: + description: + Set's the default setting of the digital filter. The default is + simultaneous 50/60Hz rejection. + 0 - 50/60Hz rejection + 1 - 60Hz rejection + 2 - 50Hz rejection + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 2 + + '#address-cells': + const: 1 + + '#size-cells': + const: 0 + +patternProperties: + "@([1-9]|1[0-9]|20)$": + type: object + + properties: + reg: + description: + The channel number. It can be connected to one of the 20 channels of + the device. + minimum: 1 + maximum: 20 + + adi,sensor-type: + description: Identifies the type of sensor connected to the device. + $ref: /schemas/types.yaml#/definitions/uint32 + + required: + - reg + - adi,sensor-type + + "^thermocouple@": + type: object + description: + Represents a thermocouple sensor which is connected to one of the device + channels. + + properties: + adi,sensor-type: + description: | + 1 - Type J Thermocouple + 2 - Type K Thermocouple + 3 - Type E Thermocouple + 4 - Type N Thermocouple + 5 - Type R Thermocouple + 6 - Type S Thermocouple + 7 - Type T Thermocouple + 8 - Type B Thermocouple + 9 - Custom Thermocouple + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + minimum: 1 + maximum: 9 + + adi,single-ended: + description: + Boolean property which set's the thermocouple as single-ended. + type: boolean + + adi,sensor-oc-current-microamp: + description: + This property set's the pulsed current value applied during + open-circuit detect. 
+ enum: [10, 100, 500, 1000] + + adi,cold-junction-handle: + description: + Phandle which points to a sensor object responsible for measuring + the thermocouple cold junction temperature. + $ref: "/schemas/types.yaml#/definitions/phandle" + + adi,custom-thermocouple: + description: + This is a table, where each entry should be a pair of + voltage(mv)-temperature(K). The entries must be given in nv and uK + so that, the original values must be multiplied by 1000000. For + more details look at table 69 and 70. + Note should be signed, but dtc doesn't currently maintain the + sign. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint64-matrix + items: + minItems: 3 + maxItems: 64 + items: + minItems: 2 + maxItems: 2 + + "^diode@": + type: object + description: + Represents a diode sensor which is connected to one of the device + channels. + + properties: + adi,sensor-type: + description: Identifies the sensor as a diode. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + const: 28 + + adi,single-ended: + description: Boolean property which set's the diode as single-ended. + type: boolean + + adi,three-conversion-cycles: + description: + Boolean property which set's three conversion cycles removing + parasitic resistance effects between the LTC2983 and the diode. + type: boolean + + adi,average-on: + description: + Boolean property which enables a running average of the diode + temperature reading. This reduces the noise when the diode is used + as a cold junction temperature element on an isothermal block + where temperatures change slowly. + type: boolean + + adi,excitation-current-microamp: + description: + This property controls the magnitude of the excitation current + applied to the diode. Depending on the number of conversions + cycles, this property will assume different predefined values on + each cycle. Just set the value of the first cycle (1l). 
+ enum: [10, 20, 40, 80] + + adi,ideal-factor-value: + description: + This property sets the diode ideality factor. The real value must + be multiplied by 1000000 to remove the fractional part. For more + information look at table 20 of the datasheet. + $ref: /schemas/types.yaml#/definitions/uint32 + + "^rtd@": + type: object + description: + Represents a rtd sensor which is connected to one of the device channels. + + properties: + reg: + minimum: 2 + maximum: 20 + + adi,sensor-type: + description: | + 10 - RTD PT-10 + 11 - RTD PT-50 + 12 - RTD PT-100 + 13 - RTD PT-200 + 14 - RTD PT-500 + 15 - RTD PT-1000 + 16 - RTD PT-1000 (0.00375) + 17 - RTD NI-120 + 18 - RTD Custom + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + minimum: 10 + maximum: 18 + + adi,rsense-handle: + description: + Phandle pointing to a rsense object associated with this RTD. + $ref: "/schemas/types.yaml#/definitions/phandle" + + adi,number-of-wires: + description: + Identifies the number of wires used by the RTD. Setting this + property to 5 means 4 wires with Kelvin Rsense. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [2, 3, 4, 5] + + adi,rsense-share: + description: + Boolean property which enables Rsense sharing, where one sense + resistor is used for multiple 2-, 3-, and/or 4-wire RTDs. + type: boolean + + adi,current-rotate: + description: + Boolean property which enables excitation current rotation to + automatically remove parasitic thermocouple effects. Note that + this property is not allowed for 2- and 3-wire RTDs. + type: boolean + + adi,excitation-current-microamp: + description: + This property controls the magnitude of the excitation current + applied to the RTD. + enum: [5, 10, 25, 50, 100, 250, 500, 1000] + + adi,rtd-curve: + description: + This property set the RTD curve used and the corresponding + Callendar-VanDusen constants. Look at table 30 of the datasheet. 
+ allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 3 + + adi,custom-rtd: + description: + This is a table, where each entry should be a pair of + resistance(ohm)-temperature(K). The entries added here are in uohm + and uK. For more details values look at table 74 and 75. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint64-matrix + items: + minItems: 3 + maxItems: 64 + items: + minItems: 2 + maxItems: 2 + + required: + - adi,rsense-handle + + dependencies: + adi,current-rotate: [ adi,rsense-share ] + + "^thermistor@": + type: object + description: + Represents a thermistor sensor which is connected to one of the device + channels. + + properties: + adi,sensor-type: + description: + 19 - Thermistor 44004/44033 2.252kohm at 25°C + 20 - Thermistor 44005/44030 3kohm at 25°C + 21 - Thermistor 44007/44034 5kohm at 25°C + 22 - Thermistor 44006/44031 10kohm at 25°C + 23 - Thermistor 44008/44032 30kohm at 25°C + 24 - Thermistor YSI 400 2.252kohm at 25°C + 25 - Thermistor Spectrum 1003k 1kohm + 26 - Thermistor Custom Steinhart-Hart + 27 - Custom Thermistor + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + minimum: 19 + maximum: 27 + + adi,rsense-handle: + description: + Phandle pointing to a rsense object associated with this + thermistor. + $ref: "/schemas/types.yaml#/definitions/phandle" + + adi,single-ended: + description: + Boolean property which set's the thermistor as single-ended. + type: boolean + + adi,rsense-share: + description: + Boolean property which enables Rsense sharing, where one sense + resistor is used for multiple thermistors. Note that this property + is ignored if adi,single-ended is set. + type: boolean + + adi,current-rotate: + description: + Boolean property which enables excitation current rotation to + automatically remove parasitic thermocouple effects. 
+ type: boolean + + adi,excitation-current-nanoamp: + description: + This property controls the magnitude of the excitation current + applied to the thermistor. Value 0 set's the sensor in auto-range + mode. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [0, 250, 500, 1000, 5000, 10000, 25000, 50000, 100000, + 250000, 500000, 1000000] + + adi,custom-thermistor: + description: + This is a table, where each entry should be a pair of + resistance(ohm)-temperature(K). The entries added here are in uohm + and uK only for custom thermistors. For more details look at table + 78 and 79. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint64-matrix + items: + minItems: 3 + maxItems: 64 + items: + minItems: 2 + maxItems: 2 + + adi,custom-steinhart: + description: + Steinhart-Hart coefficients are also supported and can + be programmed into the device memory using this property. For + Steinhart sensors the coefficients are given in the raw + format. Look at table 82 for more information. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + items: + minItems: 6 + maxItems: 6 + + required: + - adi,rsense-handle + + dependencies: + adi,current-rotate: [ adi,rsense-share ] + + "^adc@": + type: object + description: Represents a channel which is being used as a direct adc. + + properties: + adi,sensor-type: + description: Identifies the sensor as a direct adc. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + const: 30 + + adi,single-ended: + description: Boolean property which set's the adc as single-ended. + type: boolean + + "^rsense@": + type: object + description: + Represents a rsense which is connected to one of the device channels. + Rsense are used by thermistors and RTD's. + + properties: + reg: + minimum: 2 + maximum: 20 + + adi,sensor-type: + description: Identifies the sensor as a rsense. 
+ allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + const: 29 + + adi,rsense-val-milli-ohms: + description: + Sets the value of the sense resistor. Look at table 20 of the + datasheet for information. + + required: + - adi,rsense-val-milli-ohms + +required: + - compatible + - reg + - interrupts + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + spi { + #address-cells = <1>; + #size-cells = <0>; + + sensor_ltc2983: ltc2983@0 { + compatible = "adi,ltc2983"; + reg = <0>; + + #address-cells = <1>; + #size-cells = <0>; + + interrupts = <20 IRQ_TYPE_EDGE_RISING>; + interrupt-parent = <&gpio>; + + thermocouple@18 { + reg = <18>; + adi,sensor-type = <8>; //Type B + adi,sensor-oc-current-microamp = <10>; + adi,cold-junction-handle = <&diode5>; + }; + + diode5: diode@5 { + reg = <5>; + adi,sensor-type = <28>; + }; + + rsense2: rsense@2 { + reg = <2>; + adi,sensor-type = <29>; + adi,rsense-val-milli-ohms = <1200000>; //1.2Kohms + }; + + rtd@14 { + reg = <14>; + adi,sensor-type = <15>; //PT1000 + /*2-wire, internal gnd, no current rotation*/ + adi,number-of-wires = <2>; + adi,rsense-share; + adi,excitation-current-microamp = <500>; + adi,rsense-handle = <&rsense2>; + }; + + adc@10 { + reg = <10>; + adi,sensor-type = <30>; + adi,single-ended; + }; + + thermistor@12 { + reg = <12>; + adi,sensor-type = <26>; //Steinhart + adi,rsense-handle = <&rsense2>; + adi,custom-steinhart = <0x00F371EC 0x12345678 + 0x2C0F8733 0x10018C66 0xA0FEACCD + 0x90021D99>; //6 entries + }; + + thermocouple@20 { + reg = <20>; + adi,sensor-type = <9>; //custom thermocouple + adi,single-ended; + adi,custom-thermocouple = /bits/ 64 + <(-50220000) 0 + (-30200000) 99100000 + (-5300000) 135400000 + 0 273150000 + 40200000 361200000 + 55300000 522100000 + 88300000 720300000 + 132200000 811200000 + 188700000 922500000 + 460400000 1000000000>; //10 pairs + }; + + }; + }; +... 
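Note on the adi,custom-thermocouple table in the LTC2983 binding above: the description asks for voltage/temperature pairs scaled by 1000000 (mV to nV, K to uK), with 3 to 64 pairs, and remarks that dtc does not currently preserve the sign. The sketch below is only an illustration of that arithmetic. The helper names (`to_dt_pairs`, `as_u64`, `from_u64`) are hypothetical and are not part of the kernel driver, dtc, or dt-schema, and the two's-complement reading of negative cells is an assumption about how a consumer would recover the sign.

```python
"""Sketch: scale an LTC2983 custom thermocouple table for a device tree."""

SCALE = 1_000_000              # mV -> nV and K -> uK, per the binding text
MIN_PAIRS, MAX_PAIRS = 3, 64   # minItems/maxItems of the outer array
U64_MASK = (1 << 64) - 1


def to_dt_pairs(table_mv_k):
    """Convert (millivolt, kelvin) pairs to integer (nV, uK) pairs."""
    if not MIN_PAIRS <= len(table_mv_k) <= MAX_PAIRS:
        raise ValueError(
            f"expected {MIN_PAIRS}..{MAX_PAIRS} pairs, got {len(table_mv_k)}"
        )
    # round() absorbs the small binary floating-point error of the scaling.
    return [(round(mv * SCALE), round(k * SCALE)) for mv, k in table_mv_k]


def as_u64(value):
    """Two's-complement view of a signed value as an unsigned 64-bit cell."""
    return value & U64_MASK


def from_u64(cell):
    """Recover the signed value from an unsigned 64-bit cell (assumption)."""
    return cell - (1 << 64) if cell >> 63 else cell


if __name__ == "__main__":
    # First, fourth and fifth rows of the example table, in mV and K.
    for nv, uk in to_dt_pairs([(-50.22, 0.0), (0.0, 273.15), (40.2, 361.2)]):
        # Negative cells are parenthesised, as in the binding example.
        print(f"({nv}) {uk}" if nv < 0 else f"{nv} {uk}")
```

The scaled values match the corresponding rows of the binding example, e.g. -50.22 mV at 0 K becomes `(-50220000) 0`.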
diff --git a/Documentation/devicetree/bindings/interconnect/qcom,msm8974.yaml b/Documentation/devicetree/bindings/interconnect/qcom,msm8974.yaml new file mode 100644 index 000000000000..9af3c6e59cff --- /dev/null +++ b/Documentation/devicetree/bindings/interconnect/qcom,msm8974.yaml @@ -0,0 +1,62 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/interconnect/qcom,msm8974.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Qualcomm MSM8974 Network-On-Chip Interconnect + +maintainers: + - Brian Masney <masneyb@onstation.org> + +description: | + The Qualcomm MSM8974 interconnect providers support setting system + bandwidth requirements between various network-on-chip fabrics. + +properties: + reg: + maxItems: 1 + + compatible: + enum: + - qcom,msm8974-bimc + - qcom,msm8974-cnoc + - qcom,msm8974-mmssnoc + - qcom,msm8974-ocmemnoc + - qcom,msm8974-pnoc + - qcom,msm8974-snoc + + '#interconnect-cells': + const: 1 + + clock-names: + items: + - const: bus + - const: bus_a + + clocks: + items: + - description: Bus Clock + - description: Bus A Clock + +required: + - compatible + - reg + - '#interconnect-cells' + - clock-names + - clocks + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/qcom,rpmcc.h> + + bimc: interconnect@fc380000 { + reg = <0xfc380000 0x6a000>; + compatible = "qcom,msm8974-bimc"; + #interconnect-cells = <1>; + clock-names = "bus", "bus_a"; + clocks = <&rpmcc RPM_SMD_BIMC_CLK>, + <&rpmcc RPM_SMD_BIMC_A_CLK>; + }; diff --git a/Documentation/devicetree/bindings/media/allwinner,sun8i-h3-deinterlace.yaml b/Documentation/devicetree/bindings/media/allwinner,sun8i-h3-deinterlace.yaml new file mode 100644 index 000000000000..2e40f700e84f --- /dev/null +++ b/Documentation/devicetree/bindings/media/allwinner,sun8i-h3-deinterlace.yaml @@ -0,0 +1,76 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: 
http://devicetree.org/schemas/media/allwinner,sun8i-h3-deinterlace.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Allwinner H3 Deinterlace Device Tree Bindings + +maintainers: + - Jernej Skrabec <jernej.skrabec@siol.net> + - Chen-Yu Tsai <wens@csie.org> + - Maxime Ripard <mripard@kernel.org> + +description: |- + The Allwinner H3 and later SoCs have a deinterlace core used for + deinterlacing interlaced video content. + +properties: + compatible: + const: allwinner,sun8i-h3-deinterlace + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + items: + - description: Deinterlace interface clock + - description: Deinterlace module clock + - description: Deinterlace DRAM clock + + clock-names: + items: + - const: bus + - const: mod + - const: ram + + resets: + maxItems: 1 + + interconnects: + maxItems: 1 + + interconnect-names: + const: dma-mem + +required: + - compatible + - reg + - interrupts + - clocks + +additionalProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/sun8i-h3-ccu.h> + #include <dt-bindings/reset/sun8i-h3-ccu.h> + + deinterlace: deinterlace@1400000 { + compatible = "allwinner,sun8i-h3-deinterlace"; + reg = <0x01400000 0x20000>; + clocks = <&ccu CLK_BUS_DEINTERLACE>, + <&ccu CLK_DEINTERLACE>, + <&ccu CLK_DRAM_DEINTERLACE>; + clock-names = "bus", "mod", "ram"; + resets = <&ccu RST_BUS_DEINTERLACE>; + interrupts = <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH>; + interconnects = <&mbus 9>; + interconnect-names = "dma-mem"; + }; + +...
diff --git a/Documentation/devicetree/bindings/media/i2c/ad5820.txt b/Documentation/devicetree/bindings/media/i2c/ad5820.txt index 5940ca11c021..5764cbedf9b7 100644 --- a/Documentation/devicetree/bindings/media/i2c/ad5820.txt +++ b/Documentation/devicetree/bindings/media/i2c/ad5820.txt @@ -2,12 +2,20 @@ Required Properties: - - compatible: Must contain "adi,ad5820" + - compatible: Must contain one of: + - "adi,ad5820" + - "adi,ad5821" + - "adi,ad5823" - reg: I2C slave address - VANA-supply: supply of voltage for VANA pin +Optional properties: + + - enable-gpios : GPIO spec for the XSHUTDOWN pin. The XSHUTDOWN signal is +active low, a high level on the pin enables the device. + Example: ad5820: coil@c { @@ -15,5 +23,6 @@ Example: reg = <0x0c>; VANA-supply = <&vaux4>; + enable-gpios = <&msmgpio 26 GPIO_ACTIVE_HIGH>; }; diff --git a/Documentation/devicetree/bindings/media/i2c/imx290.txt b/Documentation/devicetree/bindings/media/i2c/imx290.txt new file mode 100644 index 000000000000..a3cc21410f7c --- /dev/null +++ b/Documentation/devicetree/bindings/media/i2c/imx290.txt @@ -0,0 +1,57 @@ +* Sony IMX290 1/2.8-Inch CMOS Image Sensor + +The Sony IMX290 is a 1/2.8-Inch CMOS Solid-state image sensor with +Square Pixel for Color Cameras. It is programmable through I2C and 4-wire +interfaces. The sensor output is available via CMOS logic parallel SDR output, +Low voltage LVDS DDR output and CSI-2 serial data output. The CSI-2 bus is the +default. No bindings have been defined for the other busses. + +Required Properties: +- compatible: Should be "sony,imx290" +- reg: I2C bus address of the device +- clocks: Reference to the xclk clock. +- clock-names: Should be "xclk". +- clock-frequency: Frequency of the xclk clock in Hz. +- vdddo-supply: Sensor digital IO regulator. +- vdda-supply: Sensor analog regulator. +- vddd-supply: Sensor digital core regulator. 
+ +Optional Properties: +- reset-gpios: Sensor reset GPIO + +The imx290 device node should contain one 'port' child node with +an 'endpoint' subnode. For further reading on port node refer to +Documentation/devicetree/bindings/media/video-interfaces.txt. + +Required Properties on endpoint: +- data-lanes: check ../video-interfaces.txt +- link-frequencies: check ../video-interfaces.txt +- remote-endpoint: check ../video-interfaces.txt + +Example: + &i2c1 { + ... + imx290: camera-sensor@1a { + compatible = "sony,imx290"; + reg = <0x1a>; + + reset-gpios = <&msmgpio 35 GPIO_ACTIVE_LOW>; + pinctrl-names = "default"; + pinctrl-0 = <&camera_rear_default>; + + clocks = <&gcc GCC_CAMSS_MCLK0_CLK>; + clock-names = "xclk"; + clock-frequency = <37125000>; + + vdddo-supply = <&camera_vdddo_1v8>; + vdda-supply = <&camera_vdda_2v8>; + vddd-supply = <&camera_vddd_1v5>; + + port { + imx290_ep: endpoint { + data-lanes = <1 2 3 4>; + link-frequencies = /bits/ 64 <445500000>; + remote-endpoint = <&csiphy0_ep>; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/media/i2c/nokia,smia.txt b/Documentation/devicetree/bindings/media/i2c/nokia,smia.txt index c3c3479233c4..10ece8108081 100644 --- a/Documentation/devicetree/bindings/media/i2c/nokia,smia.txt +++ b/Documentation/devicetree/bindings/media/i2c/nokia,smia.txt @@ -27,8 +27,6 @@ Mandatory properties Optional properties ------------------- -- nokia,nvm-size: The size of the NVM, in bytes. If the size is not given, - the NVM contents will not be read. - reset-gpios: XSHUTDOWN GPIO - flash-leds: See ../video-interfaces.txt - lens-focus: See ../video-interfaces.txt diff --git a/Documentation/devicetree/bindings/media/i2c/ov2659.txt b/Documentation/devicetree/bindings/media/i2c/ov2659.txt index cabc7d827dfb..92989a619f29 100644 --- a/Documentation/devicetree/bindings/media/i2c/ov2659.txt +++ b/Documentation/devicetree/bindings/media/i2c/ov2659.txt @@ -12,6 +12,12 @@ Required Properties: - clock-names: should be "xvclk". 
- link-frequencies: target pixel clock frequency. +Optional Properties: +- powerdown-gpios: reference to the GPIO connected to the pwdn pin, if any. + Active high with internal pull down resistor. +- reset-gpios: reference to the GPIO connected to the resetb pin, if any. + Active low with internal pull up resistor. + For further reading on port node refer to Documentation/devicetree/bindings/media/video-interfaces.txt. @@ -27,6 +33,9 @@ Example: clocks = <&clk_ov2659 0>; clock-names = "xvclk"; + powerdown-gpios = <&gpio6 14 GPIO_ACTIVE_HIGH>; + reset-gpios = <&gpio6 15 GPIO_ACTIVE_LOW>; + port { ov2659_0: endpoint { remote-endpoint = <&vpfe_ep>; diff --git a/Documentation/devicetree/bindings/media/rc.yaml b/Documentation/devicetree/bindings/media/rc.yaml index 9054555e6608..cdfc8ee023ac 100644 --- a/Documentation/devicetree/bindings/media/rc.yaml +++ b/Documentation/devicetree/bindings/media/rc.yaml @@ -39,6 +39,7 @@ properties: - rc-avermedia-rm-ks - rc-avertv-303 - rc-azurewave-ad-tu700 + - rc-beelink-gs1 - rc-behold - rc-behold-columbus - rc-budget-ci-old diff --git a/Documentation/devicetree/bindings/media/renesas,csi2.txt b/Documentation/devicetree/bindings/media/renesas,csi2.txt index 331409259752..2da6f60b2b56 100644 --- a/Documentation/devicetree/bindings/media/renesas,csi2.txt +++ b/Documentation/devicetree/bindings/media/renesas,csi2.txt @@ -9,6 +9,7 @@ Mandatory properties -------------------- - compatible: Must be one or more of the following - "renesas,r8a774a1-csi2" for the R8A774A1 device. + - "renesas,r8a774b1-csi2" for the R8A774B1 device. - "renesas,r8a774c0-csi2" for the R8A774C0 device. - "renesas,r8a7795-csi2" for the R8A7795 device. - "renesas,r8a7796-csi2" for the R8A7796 device. 
diff --git a/Documentation/devicetree/bindings/media/renesas,vin.txt b/Documentation/devicetree/bindings/media/renesas,vin.txt index aa217b096279..e30b0d4eefdd 100644 --- a/Documentation/devicetree/bindings/media/renesas,vin.txt +++ b/Documentation/devicetree/bindings/media/renesas,vin.txt @@ -14,6 +14,7 @@ on Gen3 and RZ/G2 platforms to a CSI-2 receiver. - "renesas,vin-r8a7744" for the R8A7744 device - "renesas,vin-r8a7745" for the R8A7745 device - "renesas,vin-r8a774a1" for the R8A774A1 device + - "renesas,vin-r8a774b1" for the R8A774B1 device - "renesas,vin-r8a774c0" for the R8A774C0 device - "renesas,vin-r8a7778" for the R8A7778 device - "renesas,vin-r8a7779" for the R8A7779 device @@ -43,7 +44,7 @@ on Gen3 and RZ/G2 platforms to a CSI-2 receiver. Additionally, an alias named vinX will need to be created to specify which video input device this is. -The per-board settings Gen2 platforms: +The per-board settings for Gen2 and RZ/G1 platforms: - port - sub-node describing a single endpoint connected to the VIN from external SoC pins as described in video-interfaces.txt[1]. @@ -63,7 +64,7 @@ The per-board settings Gen2 platforms: - data-enable-active: polarity of CLKENB signal, see [1] for description. Default is active high. 
-The per-board settings Gen3 and RZ/G2 platforms: +The per-board settings for Gen3 and RZ/G2 platforms: Gen3 and RZ/G2 platforms can support both a single connected parallel input source from external SoC pins (port@0) and/or multiple parallel input sources diff --git a/Documentation/devicetree/bindings/media/sh_mobile_ceu.txt b/Documentation/devicetree/bindings/media/sh_mobile_ceu.txt deleted file mode 100644 index cfa4ffada8ae..000000000000 --- a/Documentation/devicetree/bindings/media/sh_mobile_ceu.txt +++ /dev/null @@ -1,17 +0,0 @@ -Bindings, specific for the sh_mobile_ceu_camera.c driver: - - compatible: Should be "renesas,sh-mobile-ceu" - - reg: register base and size - - interrupts: the interrupt number - - renesas,max-width: maximum image width, supported on this SoC - - renesas,max-height: maximum image height, supported on this SoC - -Example: - -ceu0: ceu@fe910000 { - compatible = "renesas,sh-mobile-ceu"; - reg = <0xfe910000 0xa0>; - interrupt-parent = <&intcs>; - interrupts = <0x880>; - renesas,max-width = <8188>; - renesas,max-height = <8188>; -}; diff --git a/Documentation/devicetree/bindings/media/ti,vpe.yaml b/Documentation/devicetree/bindings/media/ti,vpe.yaml new file mode 100644 index 000000000000..f3a8a350e85f --- /dev/null +++ b/Documentation/devicetree/bindings/media/ti,vpe.yaml @@ -0,0 +1,64 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/media/ti,vpe.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Texas Instruments DRA7x Video Processing Engine (VPE) Device Tree Bindings + +maintainers: + - Benoit Parrot <bparrot@ti.com> + +description: |- + The Video Processing Engine (VPE) is a key component for image post + processing applications. The VPE consists of a single memory to memory + path which can perform chroma up/down sampling, deinterlacing, + scaling and color space conversion.
+ +properties: + compatible: + const: ti,dra7-vpe + + reg: + items: + - description: The VPE main register region + - description: Scaler (SC) register region + - description: Color Space Conversion (CSC) register region + - description: Video Port Direct Memory Access (VPDMA) register region + + reg-names: + items: + - const: vpe_top + - const: sc + - const: csc + - const: vpdma + + interrupts: + maxItems: 1 + +required: + - compatible + - reg + - reg-names + - interrupts + +additionalProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + + vpe: vpe@489d0000 { + compatible = "ti,dra7-vpe"; + reg = <0x489d0000 0x120>, + <0x489d0700 0x80>, + <0x489d5700 0x18>, + <0x489dd000 0x400>; + reg-names = "vpe_top", + "sc", + "csc", + "vpdma"; + interrupts = <GIC_SPI 354 IRQ_TYPE_LEVEL_HIGH>; + }; + +... diff --git a/Documentation/devicetree/bindings/mfd/ab8500.txt b/Documentation/devicetree/bindings/mfd/ab8500.txt index cd9e90c5d171..b6bc30d7777e 100644 --- a/Documentation/devicetree/bindings/mfd/ab8500.txt +++ b/Documentation/devicetree/bindings/mfd/ab8500.txt @@ -69,6 +69,18 @@ Required child device properties: - compatible : "stericsson,ab8500-[bm|btemp|charger|fg|gpadc|gpio|ponkey| pwm|regulator|rtc|sysctrl|usb]"; + A few child devices require ADC channels from the GPADC node. Those follow the + standard bindings from iio/iio-bindings.txt and iio/adc/adc.txt + + abx500-temp : io-channels "aux1" and "aux2" for measuring external + temperatures. + ab8500-fg : io-channel "main_bat_v" for measuring main battery voltage, + ab8500-btemp : io-channels "btemp_ball" and "bat_ctrl" for measuring the + battery voltage. + ab8500-charger : io-channels "main_charger_v", "main_charger_c", "vbus_v", + "usb_charger_c" for measuring voltage and current of the + different charging supplies. 
+ Optional child device properties: - interrupts : contains the device IRQ(s) using the 2-cell format (see above) - interrupt-names : contains names of IRQ resource in the order in which they were @@ -102,8 +114,115 @@ ab8500 { 39 0x4>; interrupt-names = "HW_CONV_END", "SW_CONV_END"; vddadc-supply = <&ab8500_ldo_tvout_reg>; + #address-cells = <1>; + #size-cells = <0>; + #io-channel-cells = <1>; + + /* GPADC channels */ + bat_ctrl: channel@1 { + reg = <0x01>; + }; + btemp_ball: channel@2 { + reg = <0x02>; + }; + main_charger_v: channel@3 { + reg = <0x03>; + }; + acc_detect1: channel@4 { + reg = <0x04>; + }; + acc_detect2: channel@5 { + reg = <0x05>; + }; + adc_aux1: channel@6 { + reg = <0x06>; + }; + adc_aux2: channel@7 { + reg = <0x07>; + }; + main_batt_v: channel@8 { + reg = <0x08>; + }; + vbus_v: channel@9 { + reg = <0x09>; + }; + main_charger_c: channel@a { + reg = <0x0a>; + }; + usb_charger_c: channel@b { + reg = <0x0b>; + }; + bk_bat_v: channel@c { + reg = <0x0c>; + }; + die_temp: channel@d { + reg = <0x0d>; + }; + usb_id: channel@e { + reg = <0x0e>; + }; + xtal_temp: channel@12 { + reg = <0x12>; + }; + vbat_true_meas: channel@13 { + reg = <0x13>; + }; + bat_ctrl_and_ibat: channel@1c { + reg = <0x1c>; + }; + vbat_meas_and_ibat: channel@1d { + reg = <0x1d>; + }; + vbat_true_meas_and_ibat: channel@1e { + reg = <0x1e>; + }; + bat_temp_and_ibat: channel@1f { + reg = <0x1f>; + }; }; + ab8500_temp { + compatible = "stericsson,abx500-temp"; + io-channels = <&gpadc 0x06>, + <&gpadc 0x07>; + io-channel-names = "aux1", "aux2"; + }; + + ab8500_battery: ab8500_battery { + stericsson,battery-type = "LIPO"; + thermistor-on-batctrl; + }; + + ab8500_fg { + compatible = "stericsson,ab8500-fg"; + battery = <&ab8500_battery>; + io-channels = <&gpadc 0x08>; + io-channel-names = "main_bat_v"; + }; + + ab8500_btemp { + compatible = "stericsson,ab8500-btemp"; + battery = <&ab8500_battery>; + io-channels = <&gpadc 0x02>, + <&gpadc 0x01>; + io-channel-names = "btemp_ball", + "bat_ctrl";
+ }; + + ab8500_charger { + compatible = "stericsson,ab8500-charger"; + battery = <&ab8500_battery>; + vddadc-supply = <&ab8500_ldo_tvout_reg>; + io-channels = <&gpadc 0x03>, + <&gpadc 0x0a>, + <&gpadc 0x09>, + <&gpadc 0x0b>; + io-channel-names = "main_charger_v", + "main_charger_c", + "vbus_v", + "usb_charger_c"; + }; + ab8500-usb { compatible = "stericsson,ab8500-usb"; interrupts = < 90 0x4 diff --git a/Documentation/devicetree/bindings/mfd/da9062.txt b/Documentation/devicetree/bindings/mfd/da9062.txt index edca653a5777..bc4b59de6a55 100644 --- a/Documentation/devicetree/bindings/mfd/da9062.txt +++ b/Documentation/devicetree/bindings/mfd/da9062.txt @@ -66,6 +66,9 @@ Sub-nodes: details of individual regulator device can be found in: Documentation/devicetree/bindings/regulator/regulator.txt + regulator-initial-mode may be specified for buck regulators using mode values + from include/dt-bindings/regulator/dlg,da9063-regulator.h. + - rtc : This node defines settings required for the Real-Time Clock associated with the DA9062. There are currently no entries in this binding, however compatible = "dlg,da9062-rtc" should be added if a node is created. @@ -96,6 +99,7 @@ Example: regulator-max-microvolt = <1570000>; regulator-min-microamp = <500000>; regulator-max-microamp = <2000000>; + regulator-initial-mode = <DA9063_BUCK_MODE_SYNC>; regulator-boot-on; }; DA9062_LDO1: ldo1 { diff --git a/Documentation/devicetree/bindings/mips/ralink.txt b/Documentation/devicetree/bindings/mips/ralink.txt index a16e8d7fe56c..8cc0ab41578c 100644 --- a/Documentation/devicetree/bindings/mips/ralink.txt +++ b/Documentation/devicetree/bindings/mips/ralink.txt @@ -16,3 +16,17 @@ value must be one of the following values: ralink,mt7620a-soc ralink,mt7620n-soc ralink,mt7628a-soc + ralink,mt7688a-soc + +2. Boards + +GARDENA smart Gateway (MT7688) + +This board is based on the MediaTek MT7688 and equipped with 128 MiB +of DDR and 8 MiB of flash (SPI NOR) and additional 128MiB SPI NAND +storage.
+ +------------------------------ +Required root node properties: +- compatible = "gardena,smart-gateway-mt7688", "ralink,mt7688a-soc", + "ralink,mt7628a-soc"; diff --git a/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt b/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt index 7ca0aa7ccc0b..428685eb2ded 100644 --- a/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt +++ b/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt @@ -15,10 +15,15 @@ Required Properties: - "arasan,sdhci-5.1": generic Arasan SDHCI 5.1 PHY - "rockchip,rk3399-sdhci-5.1", "arasan,sdhci-5.1": rk3399 eMMC PHY For this device it is strongly suggested to include arasan,soc-ctl-syscon. + - "xlnx,zynqmp-8.9a": ZynqMP SDHCI 8.9a PHY + For this device it is strongly suggested to include clock-output-names and + #clock-cells. - "ti,am654-sdhci-5.1", "arasan,sdhci-5.1": TI AM654 MMC PHY Note: This binding has been deprecated and moved to [5]. - "intel,lgm-sdhci-5.1-emmc", "arasan,sdhci-5.1": Intel LGM eMMC PHY For this device it is strongly suggested to include arasan,soc-ctl-syscon. + - "intel,lgm-sdhci-5.1-sdxc", "arasan,sdhci-5.1": Intel LGM SDXC PHY + For this device it is strongly suggested to include arasan,soc-ctl-syscon. [5] Documentation/devicetree/bindings/mmc/sdhci-am654.txt @@ -38,15 +43,19 @@ Optional Properties: - clock-output-names: If specified, this will be the name of the card clock which will be exposed by this device. Required if #clock-cells is specified. - - #clock-cells: If specified this should be the value <0>. With this property - in place we will export a clock representing the Card Clock. This clock - is expected to be consumed by our PHY. You must also specify + - #clock-cells: If specified this should be the value <0> or <1>. With this + property in place we will export one or two clocks representing the Card + Clock. These clocks are expected to be consumed by our PHY. 
- xlnx,fails-without-test-cd: when present, the controller doesn't work when the CD line is not connected properly, and the line is not connected properly. Test mode can be used to force the controller to function. - xlnx,int-clock-stable-broken: when present, the controller always reports that the internal clock is stable even when it is not. + - xlnx,mio-bank: When specified, this will indicate the MIO bank number in + which the command and data lines are configured. If not specified, driver + will assume this as 0. + Example: sdhci@e0100000 { compatible = "arasan,sdhci-8.9a"; @@ -83,6 +92,18 @@ Example: #clock-cells = <0>; }; + sdhci: mmc@ff160000 { + compatible = "xlnx,zynqmp-8.9a", "arasan,sdhci-8.9a"; + interrupt-parent = <&gic>; + interrupts = <0 48 4>; + reg = <0x0 0xff160000 0x0 0x1000>; + clocks = <&clk200>, <&clk200>; + clock-names = "clk_xin", "clk_ahb"; + clock-output-names = "clk_out_sd0", "clk_in_sd0"; + #clock-cells = <1>; + clk-phase-sd-hs = <63>, <72>; + }; + emmc: sdhci@ec700000 { compatible = "intel,lgm-sdhci-5.1-emmc", "arasan,sdhci-5.1"; reg = <0xec700000 0x300>; @@ -97,3 +118,18 @@ Example: phy-names = "phy_arasan"; arasan,soc-ctl-syscon = <&sysconf>; }; + + sdxc: sdhci@ec600000 { + compatible = "intel,lgm-sdhci-5.1-sdxc", "arasan,sdhci-5.1"; + reg = <0xec600000 0x300>; + interrupt-parent = <&ioapic1>; + interrupts = <43 1>; + clocks = <&cgu0 LGM_CLK_SDIO>, <&cgu0 LGM_CLK_NGI>, + <&cgu0 LGM_GCLK_SDXC>; + clock-names = "clk_xin", "clk_ahb", "gate"; + clock-output-names = "sdxc_cardclock"; + #clock-cells = <0>; + phys = <&sdxc_phy>; + phy-names = "phy_arasan"; + arasan,soc-ctl-syscon = <&sysconf>; + }; diff --git a/Documentation/devicetree/bindings/mmc/fsl-imx-esdhc.txt b/Documentation/devicetree/bindings/mmc/fsl-imx-esdhc.txt index f707b8bee304..2fb466ca2a9d 100644 --- a/Documentation/devicetree/bindings/mmc/fsl-imx-esdhc.txt +++ b/Documentation/devicetree/bindings/mmc/fsl-imx-esdhc.txt @@ -18,6 +18,9 @@ Required properties:
"fsl,imx6ull-usdhc" "fsl,imx7d-usdhc" "fsl,imx7ulp-usdhc" + "fsl,imx8mq-usdhc" + "fsl,imx8mm-usdhc" + "fsl,imx8mn-usdhc" "fsl,imx8qxp-usdhc" Optional properties: diff --git a/Documentation/devicetree/bindings/mmc/jz4740.txt b/Documentation/devicetree/bindings/mmc/jz4740.txt index 8a6f87f13114..453d3b9d145d 100644 --- a/Documentation/devicetree/bindings/mmc/jz4740.txt +++ b/Documentation/devicetree/bindings/mmc/jz4740.txt @@ -1,14 +1,16 @@ -* Ingenic JZ47xx MMC controllers +* Ingenic XBurst MMC controllers This file documents the device tree properties used for the MMC controller in -Ingenic JZ4740/JZ4780 SoCs. These are in addition to the core MMC properties -described in mmc.txt. +Ingenic JZ4740/JZ4760/JZ4780/X1000 SoCs. These are in addition to the core MMC +properties described in mmc.txt. Required properties: - compatible: Should be one of the following: - "ingenic,jz4740-mmc" for the JZ4740 - "ingenic,jz4725b-mmc" for the JZ4725B + - "ingenic,jz4760-mmc" for the JZ4760 - "ingenic,jz4780-mmc" for the JZ4780 + - "ingenic,x1000-mmc" for the X1000 - reg: Should contain the MMC controller registers location and length. - interrupts: Should contain the interrupt specifier of the MMC controller. - clocks: Clock for the MMC controller. diff --git a/Documentation/devicetree/bindings/mmc/mmc-controller.yaml b/Documentation/devicetree/bindings/mmc/mmc-controller.yaml index 080754e0ef35..b130450c3b34 100644 --- a/Documentation/devicetree/bindings/mmc/mmc-controller.yaml +++ b/Documentation/devicetree/bindings/mmc/mmc-controller.yaml @@ -333,6 +333,19 @@ patternProperties: required: - reg + "^clk-phase-(legacy|sd-hs|mmc-(hs|hs[24]00|ddr52)|uhs-(sdr(12|25|50|104)|ddr50))$": + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + minItems: 2 + maxItems: 2 + items: + minimum: 0 + maximum: 359 + description: + Set the clock (phase) delays which are to be configured in the + controller while switching to particular speed mode. These values + are in pair of degrees. 
+ dependencies: cd-debounce-delay-ms: [ cd-gpios ] fixed-emmc-driver-type: [ non-removable ] @@ -351,6 +364,7 @@ examples: keep-power-in-suspend; wakeup-source; mmc-pwrseq = <&sdhci0_pwrseq>; + clk-phase-sd-hs = <63>, <72>; }; - | diff --git a/Documentation/devicetree/bindings/mmc/owl-mmc.yaml b/Documentation/devicetree/bindings/mmc/owl-mmc.yaml new file mode 100644 index 000000000000..12b40213426d --- /dev/null +++ b/Documentation/devicetree/bindings/mmc/owl-mmc.yaml @@ -0,0 +1,59 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/mmc/owl-mmc.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Actions Semi Owl SoCs SD/MMC/SDIO controller + +allOf: + - $ref: "mmc-controller.yaml" + +maintainers: + - Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> + +properties: + compatible: + const: actions,owl-mmc + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + minItems: 1 + + resets: + maxItems: 1 + + dmas: + maxItems: 1 + + dma-names: + const: mmc + +required: + - compatible + - reg + - interrupts + - clocks + - resets + - dmas + - dma-names + +examples: + - | + mmc0: mmc@e0330000 { + compatible = "actions,owl-mmc"; + reg = <0x0 0xe0330000 0x0 0x4000>; + interrupts = <0 42 4>; + clocks = <&cmu 56>; + resets = <&cmu 23>; + dmas = <&dma 2>; + dma-names = "mmc"; + bus-width = <4>; + }; + +... 
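The clk-phase-* pattern property added to mmc-controller.yaml above takes one two-cell entry per speed mode, with each cell between 0 and 359 degrees. A minimal sketch of a controller node setting phases for several modes might look like the fragment below. It is not part of the patch: the node name and unit address are placeholders, and every phase value other than the <63>, <72> pair reused from the patch's own example is hypothetical. What each of the two cells means is left to the individual controller binding.

```dts
/* Sketch only: phase values are illustrative, not tuned for real hardware. */
mmc@ff160000 {
        /* compatible, reg, clocks and other controller properties omitted */
        clk-phase-legacy = <0>, <0>;          /* hypothetical */
        clk-phase-sd-hs = <63>, <72>;         /* pair used in the patch's example */
        clk-phase-uhs-sdr104 = <0>, <135>;    /* hypothetical */
        clk-phase-mmc-hs200 = <0>, <90>;      /* hypothetical */
};
```

Each property name has to match the regular expression in the schema, so a mode outside the listed set (for example a made-up clk-phase-uhs-sdr200) would not be matched by the pattern.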
diff --git a/Documentation/devicetree/bindings/mmc/renesas,sdhi.txt b/Documentation/devicetree/bindings/mmc/renesas,sdhi.txt index dd08d038a65c..bc08fc43a9be 100644 --- a/Documentation/devicetree/bindings/mmc/renesas,sdhi.txt +++ b/Documentation/devicetree/bindings/mmc/renesas,sdhi.txt @@ -11,6 +11,7 @@ Required properties: "renesas,sdhi-r8a7744" - SDHI IP on R8A7744 SoC "renesas,sdhi-r8a7745" - SDHI IP on R8A7745 SoC "renesas,sdhi-r8a774a1" - SDHI IP on R8A774A1 SoC + "renesas,sdhi-r8a774b1" - SDHI IP on R8A774B1 SoC "renesas,sdhi-r8a774c0" - SDHI IP on R8A774C0 SoC "renesas,sdhi-r8a77470" - SDHI IP on R8A77470 SoC "renesas,sdhi-mmc-r8a77470" - SDHI/MMC IP on R8A77470 SoC diff --git a/Documentation/devicetree/bindings/mmc/sdhci-atmel.txt b/Documentation/devicetree/bindings/mmc/sdhci-atmel.txt index 1b662d7171a0..503c6dbac1b2 100644 --- a/Documentation/devicetree/bindings/mmc/sdhci-atmel.txt +++ b/Documentation/devicetree/bindings/mmc/sdhci-atmel.txt @@ -9,6 +9,11 @@ Required properties: - clocks: Phandlers to the clocks. - clock-names: Must be "hclock", "multclk", "baseclk"; +Optional properties: +- microchip,sdcal-inverted: when present, polarity on the SDCAL SoC pin is + inverted. The default polarity for this signal is described in the datasheet. + For instance on SAMA5D2, the pin is usually tied to the GND with a resistor + and a capacitor (see "SDMMC I/O Calibration" chapter). Example: diff --git a/Documentation/devicetree/bindings/mmc/sdhci-milbeaut.txt b/Documentation/devicetree/bindings/mmc/sdhci-milbeaut.txt new file mode 100644 index 000000000000..627ee89c125b --- /dev/null +++ b/Documentation/devicetree/bindings/mmc/sdhci-milbeaut.txt @@ -0,0 +1,30 @@ +* SOCIONEXT Milbeaut SDHCI controller + +This file documents differences between the core properties in mmc.txt +and the properties used by the sdhci_milbeaut driver. + +Required properties: +- compatible: "socionext,milbeaut-m10v-sdhci-3.0" +- clocks: Must contain an entry for each entry in clock-names. 
It is a + list of phandles and clock-specifier pairs. + See ../clocks/clock-bindings.txt for details. +- clock-names: Should contain the following two entries: + "iface" - clock used for sdhci interface + "core" - core clock for sdhci controller + +Optional properties: +- fujitsu,cmd-dat-delay-select: boolean property indicating that this host + requires the CMD_DAT_DELAY control to be enabled. + +Example: + sdhci3: mmc@1b010000 { + compatible = "socionext,milbeaut-m10v-sdhci-3.0"; + reg = <0x1b010000 0x10000>; + interrupts = <0 265 0x4>; + voltage-ranges = <3300 3300>; + bus-width = <4>; + clocks = <&clk 7>, <&ahb_clk>; + clock-names = "core", "iface"; + cap-sdio-irq; + fujitsu,cmd-dat-delay-select; + }; diff --git a/Documentation/devicetree/bindings/mtd/cadence-nand-controller.txt b/Documentation/devicetree/bindings/mtd/cadence-nand-controller.txt new file mode 100644 index 000000000000..f3893c4d3c6a --- /dev/null +++ b/Documentation/devicetree/bindings/mtd/cadence-nand-controller.txt @@ -0,0 +1,53 @@ +* Cadence NAND controller + +Required properties: + - compatible : "cdns,hp-nfc" + - reg : Contains two entries, each of which is a tuple consisting of a + physical address and length. The first entry is the address and + length of the controller register set. The second entry is the + address and length of the Slave DMA data port. + - reg-names: should contain "reg" and "sdma" + - #address-cells: should be 1. The cell encodes the chip select connection. + - #size-cells : should be 0. + - interrupts : The interrupt number. + - clocks: phandle of the controller core clock (nf_clk). + +Optional properties: + - dmas: shall reference DMA channel associated to the NAND controller + - cdns,board-delay-ps : Estimated Board delay. The value includes the total + round trip delay for the signals and is used for deciding on values + associated with data read capture. 
The example formula for SDR mode is + the following: + board delay = RE#PAD delay + PCB trace to device + PCB trace from device + + DQ PAD delay + +Child nodes represent the available NAND chips. + +Required properties of NAND chips: + - reg: shall contain the native Chip Select ids from 0 to max supported by + the cadence nand flash controller + +See Documentation/devicetree/bindings/mtd/nand.txt for more details on +generic bindings. + +Example: + +nand_controller: nand-controller@60000000 { + compatible = "cdns,hp-nfc"; + #address-cells = <1>; + #size-cells = <0>; + reg = <0x60000000 0x10000>, <0x80000000 0x10000>; + reg-names = "reg", "sdma"; + clocks = <&nf_clk>; + cdns,board-delay-ps = <4830>; + interrupts = <2 0>; + nand@0 { + reg = <0>; + label = "nand-1"; + }; + nand@1 { + reg = <1>; + label = "nand-2"; + }; + +}; diff --git a/Documentation/devicetree/bindings/mtd/intel,ixp4xx-flash.txt b/Documentation/devicetree/bindings/mtd/intel,ixp4xx-flash.txt new file mode 100644 index 000000000000..4bdcb92ae381 --- /dev/null +++ b/Documentation/devicetree/bindings/mtd/intel,ixp4xx-flash.txt @@ -0,0 +1,22 @@ +Flash device on Intel IXP4xx SoC + +This flash is regular CFI compatible (Intel or AMD extended) flash chips with +specific big-endian or mixed-endian memory access pattern. + +Required properties: +- compatible : must be "intel,ixp4xx-flash", "cfi-flash"; +- reg : memory address for the flash chip +- bank-width : width in bytes of flash interface, should be <2> + +For the rest of the properties, see mtd-physmap.txt. + +The device tree may optionally contain sub-nodes describing partitions of the +address space. See partition.txt for more detail. 
+ +Example: + +flash@50000000 { + compatible = "intel,ixp4xx-flash", "cfi-flash"; + reg = <0x50000000 0x01000000>; + bank-width = <2>; +}; diff --git a/Documentation/devicetree/bindings/net/brcm,bcm7445-switch-v4.0.txt b/Documentation/devicetree/bindings/net/brcm,bcm7445-switch-v4.0.txt index b7336b9d6a3c..48a7f916c5e4 100644 --- a/Documentation/devicetree/bindings/net/brcm,bcm7445-switch-v4.0.txt +++ b/Documentation/devicetree/bindings/net/brcm,bcm7445-switch-v4.0.txt @@ -44,6 +44,12 @@ Optional properties: Admission Control Block supports reporting the number of packets in-flight in a switch queue +- resets: a single phandle and reset identifier pair. See + Documentation/devicetree/bindings/reset/reset.txt for details. + +- reset-names: If the "resets" property is specified, this property should have + the value "switch" to denote the switch reset line. + Port subnodes: Optional properties: diff --git a/Documentation/devicetree/bindings/net/brcm,bcmgenet.txt b/Documentation/devicetree/bindings/net/brcm,bcmgenet.txt index 3956af1d30f3..33a0d67e4ce5 100644 --- a/Documentation/devicetree/bindings/net/brcm,bcmgenet.txt +++ b/Documentation/devicetree/bindings/net/brcm,bcmgenet.txt @@ -2,7 +2,7 @@ Required properties: - compatible: should contain one of "brcm,genet-v1", "brcm,genet-v2", - "brcm,genet-v3", "brcm,genet-v4", "brcm,genet-v5". + "brcm,genet-v3", "brcm,genet-v4", "brcm,genet-v5", "brcm,bcm2711-genet-v5".
- reg: address and length of the register set for the device - interrupts and/or interrupts-extended: must be two cells, the first cell is the general purpose interrupt line, while the second cell is the diff --git a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt index 4fa00e2eafcf..f16b99571af1 100644 --- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt +++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt @@ -14,6 +14,8 @@ Required properties: * "brcm,bcm4330-bt" * "brcm,bcm43438-bt" * "brcm,bcm4345c5" + * "brcm,bcm43540-bt" + * "brcm,bcm4335a0" Optional properties: diff --git a/Documentation/devicetree/bindings/net/ethernet-controller.yaml b/Documentation/devicetree/bindings/net/ethernet-controller.yaml index 0e7c31794ae6..ac471b60ed6a 100644 --- a/Documentation/devicetree/bindings/net/ethernet-controller.yaml +++ b/Documentation/devicetree/bindings/net/ethernet-controller.yaml @@ -121,6 +121,11 @@ properties: and is useful for determining certain configuration settings such as flow control thresholds. + sfp: + $ref: /schemas/types.yaml#definitions/phandle + description: + Specifies a reference to a node representing a SFP cage. + tx-fifo-depth: $ref: /schemas/types.yaml#definitions/uint32 description: diff --git a/Documentation/devicetree/bindings/net/ethernet-phy.yaml b/Documentation/devicetree/bindings/net/ethernet-phy.yaml index f70f18ff821f..8927941c74bb 100644 --- a/Documentation/devicetree/bindings/net/ethernet-phy.yaml +++ b/Documentation/devicetree/bindings/net/ethernet-phy.yaml @@ -153,6 +153,11 @@ properties: Delay after the reset was deasserted in microseconds. If this property is missing the delay will be skipped. + sfp: + $ref: /schemas/types.yaml#definitions/phandle + description: + Specifies a reference to a node representing a SFP cage. 
+ required: - reg diff --git a/Documentation/devicetree/bindings/net/ftgmac100.txt b/Documentation/devicetree/bindings/net/ftgmac100.txt index 72e7aaf7242e..f878c1103463 100644 --- a/Documentation/devicetree/bindings/net/ftgmac100.txt +++ b/Documentation/devicetree/bindings/net/ftgmac100.txt @@ -9,6 +9,7 @@ Required properties: - "aspeed,ast2400-mac" - "aspeed,ast2500-mac" + - "aspeed,ast2600-mac" - reg: Address and length of the register set for the device - interrupts: Should contain ethernet controller interrupt @@ -23,6 +24,13 @@ Optional properties: - no-hw-checksum: Used to disable HW checksum support. Here for backward compatibility as the driver now should have correct defaults based on the SoC. +- clocks: In accordance with the generic clock bindings. Must describe the MAC + IP clock, and optionally an RMII RCLK gate for the AST2500/AST2600. The + required MAC clock must be the first cell. +- clock-names: + + - "MACCLK": The MAC IP clock + - "RCLK": Clock gate for the RMII RCLK Example: diff --git a/Documentation/devicetree/bindings/net/lpc-eth.txt b/Documentation/devicetree/bindings/net/lpc-eth.txt index b92e927808b6..cfe0e5991d46 100644 --- a/Documentation/devicetree/bindings/net/lpc-eth.txt +++ b/Documentation/devicetree/bindings/net/lpc-eth.txt @@ -10,6 +10,11 @@ Optional properties: absent, "rmii" is assumed. 
- use-iram: Use LPC32xx internal SRAM (IRAM) for DMA buffering +Optional subnodes: +- mdio : specifies the mdio bus, used as a container for phy nodes according to + phy.txt in the same directory + + Example: mac: ethernet@31060000 { diff --git a/Documentation/devicetree/bindings/net/nfc/pn532.txt b/Documentation/devicetree/bindings/net/nfc/pn532.txt new file mode 100644 index 000000000000..a5507dc499bc --- /dev/null +++ b/Documentation/devicetree/bindings/net/nfc/pn532.txt @@ -0,0 +1,46 @@ +* NXP Semiconductors PN532 NFC Controller + +Required properties: +- compatible: Should be + - "nxp,pn532" Place a node with this inside the devicetree node of the bus + where the NFC chip is connected to. + Currently the kernel has phy bindings for uart and i2c. + - "nxp,pn532-i2c" (DEPRECATED) only works for the i2c binding. + - "nxp,pn533-i2c" (DEPRECATED) only works for the i2c binding. + +Required properties if connected on i2c: +- clock-frequency: I²C work frequency. +- reg: for the I²C bus address. This is fixed at 0x24 for the PN532. +- interrupts: GPIO interrupt to which the chip is connected + +Optional SoC Specific Properties: +- pinctrl-names: Contains only one value - "default". +- pintctrl-0: Specifies the pin control groups used for this controller. 
+ +Example (for ARM-based BeagleBone with PN532 on I2C2): + +&i2c2 { + + + pn532: nfc@24 { + + compatible = "nxp,pn532"; + + reg = <0x24>; + clock-frequency = <400000>; + + interrupt-parent = <&gpio1>; + interrupts = <17 IRQ_TYPE_EDGE_FALLING>; + + }; +}; + +Example (for PN532 connected via uart): + +uart4: serial@49042000 { + compatible = "ti,omap3-uart"; + + pn532: nfc { + compatible = "nxp,pn532"; + }; +}; diff --git a/Documentation/devicetree/bindings/net/nfc/pn533-i2c.txt b/Documentation/devicetree/bindings/net/nfc/pn533-i2c.txt deleted file mode 100644 index 2efe3886b95b..000000000000 --- a/Documentation/devicetree/bindings/net/nfc/pn533-i2c.txt +++ /dev/null @@ -1,29 +0,0 @@ -* NXP Semiconductors PN532 NFC Controller - -Required properties: -- compatible: Should be "nxp,pn532-i2c" or "nxp,pn533-i2c". -- clock-frequency: I²C work frequency. -- reg: address on the bus -- interrupts: GPIO interrupt to which the chip is connected - -Optional SoC Specific Properties: -- pinctrl-names: Contains only one value - "default". -- pintctrl-0: Specifies the pin control groups used for this controller. 
- -Example (for ARM-based BeagleBone with PN532 on I2C2): - -&i2c2 { - - - pn532: pn532@24 { - - compatible = "nxp,pn532-i2c"; - - reg = <0x24>; - clock-frequency = <400000>; - - interrupt-parent = <&gpio1>; - interrupts = <17 IRQ_TYPE_EDGE_FALLING>; - - }; -}; diff --git a/Documentation/devicetree/bindings/net/qca,ar803x.yaml b/Documentation/devicetree/bindings/net/qca,ar803x.yaml new file mode 100644 index 000000000000..5a6c9d20c0ba --- /dev/null +++ b/Documentation/devicetree/bindings/net/qca,ar803x.yaml @@ -0,0 +1,111 @@ +# SPDX-License-Identifier: GPL-2.0+ +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/qca,ar803x.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Qualcomm Atheros AR803x PHY + +maintainers: + - Andrew Lunn <andrew@lunn.ch> + - Florian Fainelli <f.fainelli@gmail.com> + - Heiner Kallweit <hkallweit1@gmail.com> + +description: | + Bindings for Qualcomm Atheros AR803x PHYs + +allOf: + - $ref: ethernet-phy.yaml# + +properties: + qca,clk-out-frequency: + description: Clock output frequency in Hertz. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [ 25000000, 50000000, 62500000, 125000000 ] + + qca,clk-out-strength: + description: Clock output driver strength. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [ 0, 1, 2 ] + + qca,keep-pll-enabled: + description: | + If set, keep the PLL enabled even if there is no link. Useful if you + want to use the clock output without an ethernet link. + + Only supported on the AR8031. + type: boolean + + vddio-supply: + description: | + RGMII I/O voltage regulator (see regulator/regulator.yaml). + + The PHY supports RGMII I/O voltages of 1.5V, 1.8V and 2.5V. You can + either connect this to the vddio-regulator (1.5V / 1.8V) or the + vddh-regulator (2.5V). + + Only supported on the AR8031. + + vddio-regulator: + type: object + description: + Initial data for the VDDIO regulator. Set this to 1.5V or 1.8V. 
+ allOf: + - $ref: /schemas/regulator/regulator.yaml + + vddh-regulator: + type: object + description: + Dummy subnode to model the external connection of the PHY VDDH + regulator to VDDIO. + allOf: + - $ref: /schemas/regulator/regulator.yaml + + +examples: + - | + #include <dt-bindings/net/qca-ar803x.h> + + ethernet { + #address-cells = <1>; + #size-cells = <0>; + + phy-mode = "rgmii-id"; + + ethernet-phy@0 { + reg = <0>; + + qca,clk-out-frequency = <125000000>; + qca,clk-out-strength = <AR803X_STRENGTH_FULL>; + + vddio-supply = <&vddio>; + + vddio: vddio-regulator { + regulator-min-microvolt = <1800000>; + regulator-max-microvolt = <1800000>; + }; + }; + }; + - | + #include <dt-bindings/net/qca-ar803x.h> + + ethernet { + #address-cells = <1>; + #size-cells = <0>; + + phy-mode = "rgmii-id"; + + ethernet-phy@0 { + reg = <0>; + + qca,clk-out-frequency = <50000000>; + qca,keep-pll-enabled; + + vddio-supply = <&vddh>; + + vddh: vddh-regulator { + }; + }; + }; diff --git a/Documentation/devicetree/bindings/net/renesas,ether.yaml b/Documentation/devicetree/bindings/net/renesas,ether.yaml new file mode 100644 index 000000000000..7f84df9790e2 --- /dev/null +++ b/Documentation/devicetree/bindings/net/renesas,ether.yaml @@ -0,0 +1,114 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/renesas,ether.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Renesas Electronics SH EtherMAC + +allOf: + - $ref: ethernet-controller.yaml# + +maintainers: + - Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> + +properties: + compatible: + oneOf: + - items: + - enum: + - renesas,gether-r8a7740 # device is a part of R8A7740 SoC + - renesas,gether-r8a77980 # device is a part of R8A77980 SoC + - renesas,ether-r7s72100 # device is a part of R7S72100 SoC + - renesas,ether-r7s9210 # device is a part of R7S9210 SoC + - items: + - enum: + - renesas,ether-r8a7778 # device is a part of R8A7778 SoC + - renesas,ether-r8a7779 # device 
is a part of R8A7779 SoC + - enum: + - renesas,rcar-gen1-ether # a generic R-Car Gen1 device + - items: + - enum: + - renesas,ether-r8a7745 # device is a part of R8A7745 SoC + - renesas,ether-r8a7743 # device is a part of R8A7743 SoC + - renesas,ether-r8a7790 # device is a part of R8A7790 SoC + - renesas,ether-r8a7791 # device is a part of R8A7791 SoC + - renesas,ether-r8a7793 # device is a part of R8A7793 SoC + - renesas,ether-r8a7794 # device is a part of R8A7794 SoC + - enum: + - renesas,rcar-gen2-ether # a generic R-Car Gen2 or RZ/G1 device + + reg: + items: + - description: E-DMAC/feLic registers + - description: TSU registers + minItems: 1 + + interrupts: + maxItems: 1 + + '#address-cells': + description: number of address cells for the MDIO bus + const: 1 + + '#size-cells': + description: number of size cells on the MDIO bus + const: 0 + + clocks: + maxItems: 1 + + pinctrl-0: true + + pinctrl-names: true + + renesas,no-ether-link: + type: boolean + description: + specify when a board does not provide a proper Ether LINK signal + + renesas,ether-link-active-low: + type: boolean + description: + specify when the Ether LINK signal is active-low instead of normal + active-high + +required: + - compatible + - reg + - interrupts + - phy-mode + - phy-handle + - '#address-cells' + - '#size-cells' + - clocks + - pinctrl-0 + +examples: + # Lager board + - | + #include <dt-bindings/clock/r8a7790-clock.h> + #include <dt-bindings/interrupt-controller/irq.h> + + ethernet@ee700000 { + compatible = "renesas,ether-r8a7790", "renesas,rcar-gen2-ether"; + reg = <0 0xee700000 0 0x400>; + interrupt-parent = <&gic>; + interrupts = <0 162 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&mstp8_clks R8A7790_CLK_ETHER>; + phy-mode = "rmii"; + phy-handle = <&phy1>; + pinctrl-0 = <&ether_pins>; + pinctrl-names = "default"; + renesas,ether-link-active-low; + #address-cells = <1>; + #size-cells = <0>; + + phy1: ethernet-phy@1 { + reg = <1>; + interrupt-parent = <&irqc0>; + interrupts = <0
IRQ_TYPE_LEVEL_LOW>; + pinctrl-0 = <&phy1_pins>; + pinctrl-names = "default"; + }; + }; diff --git a/Documentation/devicetree/bindings/net/sh_eth.txt b/Documentation/devicetree/bindings/net/sh_eth.txt deleted file mode 100644 index abc36274227c..000000000000 --- a/Documentation/devicetree/bindings/net/sh_eth.txt +++ /dev/null @@ -1,69 +0,0 @@ -* Renesas Electronics SH EtherMAC - -This file provides information on what the device node for the SH EtherMAC -interface contains. - -Required properties: -- compatible: Must contain one or more of the following: - "renesas,gether-r8a7740" if the device is a part of R8A7740 SoC. - "renesas,ether-r8a7743" if the device is a part of R8A7743 SoC. - "renesas,ether-r8a7745" if the device is a part of R8A7745 SoC. - "renesas,ether-r8a7778" if the device is a part of R8A7778 SoC. - "renesas,ether-r8a7779" if the device is a part of R8A7779 SoC. - "renesas,ether-r8a7790" if the device is a part of R8A7790 SoC. - "renesas,ether-r8a7791" if the device is a part of R8A7791 SoC. - "renesas,ether-r8a7793" if the device is a part of R8A7793 SoC. - "renesas,ether-r8a7794" if the device is a part of R8A7794 SoC. - "renesas,gether-r8a77980" if the device is a part of R8A77980 SoC. - "renesas,ether-r7s72100" if the device is a part of R7S72100 SoC. - "renesas,ether-r7s9210" if the device is a part of R7S9210 SoC. - "renesas,rcar-gen1-ether" for a generic R-Car Gen1 device. - "renesas,rcar-gen2-ether" for a generic R-Car Gen2 or RZ/G1 - device. - - When compatible with the generic version, nodes must list - the SoC-specific version corresponding to the platform - first followed by the generic version. - -- reg: offset and length of (1) the E-DMAC/feLic register block (required), - (2) the TSU register block (optional). -- interrupts: interrupt specifier for the sole interrupt. -- phy-mode: see ethernet.txt file in the same directory. -- phy-handle: see ethernet.txt file in the same directory. 
-- #address-cells: number of address cells for the MDIO bus, must be equal to 1. -- #size-cells: number of size cells on the MDIO bus, must be equal to 0. -- clocks: clock phandle and specifier pair. -- pinctrl-0: phandle, referring to a default pin configuration node. - -Optional properties: -- pinctrl-names: pin configuration state name ("default"). -- renesas,no-ether-link: boolean, specify when a board does not provide a proper - Ether LINK signal. -- renesas,ether-link-active-low: boolean, specify when the Ether LINK signal is - active-low instead of normal active-high. - -Example (Lager board): - - ethernet@ee700000 { - compatible = "renesas,ether-r8a7790", - "renesas,rcar-gen2-ether"; - reg = <0 0xee700000 0 0x400>; - interrupt-parent = <&gic>; - interrupts = <0 162 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&mstp8_clks R8A7790_CLK_ETHER>; - phy-mode = "rmii"; - phy-handle = <&phy1>; - pinctrl-0 = <&ether_pins>; - pinctrl-names = "default"; - renesas,ether-link-active-low; - #address-cells = <1>; - #size-cells = <0>; - - phy1: ethernet-phy@1 { - reg = <1>; - interrupt-parent = <&irqc0>; - interrupts = <0 IRQ_TYPE_LEVEL_LOW>; - pinctrl-0 = <&phy1_pins>; - pinctrl-names = "default"; - }; - }; diff --git a/Documentation/devicetree/bindings/net/ti,cpsw-switch.yaml b/Documentation/devicetree/bindings/net/ti,cpsw-switch.yaml new file mode 100644 index 000000000000..81ae8cafabc1 --- /dev/null +++ b/Documentation/devicetree/bindings/net/ti,cpsw-switch.yaml @@ -0,0 +1,240 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/ti,cpsw-switch.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: TI SoC Ethernet Switch Controller (CPSW) Device Tree Bindings + +maintainers: + - Grygorii Strashko <grygorii.strashko@ti.com> + - Sekhar Nori <nsekhar@ti.com> + +description: + The 3-port switch gigabit ethernet subsystem provides ethernet packet + communication and can be configured as an ethernet switch.
It provides the + gigabit media independent interface (GMII),reduced gigabit media + independent interface (RGMII), reduced media independent interface (RMII), + the management data input output (MDIO) for physical layer device (PHY) + management. + +properties: + compatible: + oneOf: + - const: ti,cpsw-switch + - items: + - const: ti,am335x-cpsw-switch + - const: ti,cpsw-switch + - items: + - const: ti,am4372-cpsw-switch + - const: ti,cpsw-switch + - items: + - const: ti,dra7-cpsw-switch + - const: ti,cpsw-switch + + reg: + maxItems: 1 + description: + The physical base address and size of full the CPSW module IO range + + ranges: true + + clocks: + maxItems: 1 + description: CPSW functional clock + + clock-names: + maxItems: 1 + items: + - const: fck + + interrupts: + items: + - description: RX_THRESH interrupt + - description: RX interrupt + - description: TX interrupt + - description: MISC interrupt + + interrupt-names: + items: + - const: "rx_thresh" + - const: "rx" + - const: "tx" + - const: "misc" + + pinctrl-names: true + + syscon: + $ref: /schemas/types.yaml#definitions/phandle + description: + Phandle to the system control device node which provides access to + efuse IO range with MAC addresses + + + ethernet-ports: + type: object + properties: + '#address-cells': + const: 1 + '#size-cells': + const: 0 + + patternProperties: + "^port@[0-9]+$": + type: object + minItems: 1 + maxItems: 2 + description: CPSW external ports + + allOf: + - $ref: ethernet-controller.yaml# + + properties: + reg: + maxItems: 1 + enum: [1, 2] + description: CPSW port number + + phys: + $ref: /schemas/types.yaml#definitions/phandle-array + maxItems: 1 + description: phandle on phy-gmii-sel PHY + + label: + $ref: /schemas/types.yaml#/definitions/string-array + maxItems: 1 + description: label associated with this port + + ti,dual-emac-pvid: + $ref: /schemas/types.yaml#/definitions/uint32 + maxItems: 1 + minimum: 1 + maximum: 1024 + description: + Specifies default PORT VID to be 
used to segregate + ports. Default value - CPSW port number. + + required: + - reg + - phys + + mdio: + type: object + allOf: + - $ref: "ti,davinci-mdio.yaml#" + description: + CPSW MDIO bus. + + cpts: + type: object + description: + The Common Platform Time Sync (CPTS) module + + properties: + clocks: + maxItems: 1 + description: CPTS reference clock + + clock-names: + maxItems: 1 + items: + - const: cpts + + cpts_clock_mult: + $ref: /schemas/types.yaml#/definitions/uint32 + description: + Numerator to convert input clock ticks into ns + + cpts_clock_shift: + $ref: /schemas/types.yaml#/definitions/uint32 + description: + Denominator to convert input clock ticks into ns. + Mult and shift will be calculated basing on CPTS rftclk frequency if + both cpts_clock_shift and cpts_clock_mult properties are not provided. + + required: + - clocks + - clock-names + +required: + - compatible + - reg + - ranges + - clocks + - clock-names + - interrupts + - interrupt-names + - '#address-cells' + - '#size-cells' + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/dra7.h> + + mac_sw: switch@0 { + compatible = "ti,dra7-cpsw-switch","ti,cpsw-switch"; + reg = <0x0 0x4000>; + ranges = <0 0 0x4000>; + clocks = <&gmac_main_clk>; + clock-names = "fck"; + #address-cells = <1>; + #size-cells = <1>; + syscon = <&scm_conf>; + inctrl-names = "default", "sleep"; + + interrupts = <GIC_SPI 334 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 335 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 336 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 337 IRQ_TYPE_LEVEL_HIGH>; + interrupt-names = "rx_thresh", "rx", "tx", "misc"; + + ethernet-ports { + #address-cells = <1>; + #size-cells = <0>; + + cpsw_port1: port@1 { + reg = <1>; + label = "port1"; + mac-address = [ 00 00 00 00 00 00 ]; + phys = <&phy_gmii_sel 1>; + phy-handle = <&ethphy0_sw>; + phy-mode = "rgmii"; + ti,dual_emac_pvid = <1>; + }; + + cpsw_port2: port@2 { + reg = <2>; + label =
"wan"; + mac-address = [ 00 00 00 00 00 00 ]; + phys = <&phy_gmii_sel 2>; + phy-handle = <&ethphy1_sw>; + phy-mode = "rgmii"; + ti,dual_emac_pvid = <2>; + }; + }; + + davinci_mdio_sw: mdio@1000 { + compatible = "ti,cpsw-mdio","ti,davinci_mdio"; + reg = <0x1000 0x100>; + clocks = <&gmac_clkctrl DRA7_GMAC_GMAC_CLKCTRL 0>; + clock-names = "fck"; + #address-cells = <1>; + #size-cells = <0>; + bus_freq = <1000000>; + + ethphy0_sw: ethernet-phy@0 { + reg = <0>; + }; + + ethphy1_sw: ethernet-phy@1 { + reg = <1>; + }; + }; + + cpts { + clocks = <&gmac_clkctrl DRA7_GMAC_GMAC_CLKCTRL 25>; + clock-names = "cpts"; + }; + }; diff --git a/Documentation/devicetree/bindings/net/ti,dp83869.yaml b/Documentation/devicetree/bindings/net/ti,dp83869.yaml new file mode 100644 index 000000000000..6fe3e451da8a --- /dev/null +++ b/Documentation/devicetree/bindings/net/ti,dp83869.yaml @@ -0,0 +1,84 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2019 Texas Instruments Incorporated +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/net/ti,dp83869.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: TI DP83869 ethernet PHY + +allOf: + - $ref: "ethernet-controller.yaml#" + +maintainers: + - Dan Murphy <dmurphy@ti.com> + +description: | + The DP83869HM device is a robust, fully-featured Gigabit (PHY) transceiver + with integrated PMD sublayers that supports 10BASE-Te, 100BASE-TX and + 1000BASE-T Ethernet protocols. The DP83869 also supports 1000BASE-X and + 100BASE-FX Fiber protocols. + This device interfaces to the MAC layer through Reduced GMII (RGMII) and + SGMII The DP83869HM supports Media Conversion in Managed mode. In this mode, + the DP83869HM can run 1000BASE-X-to-1000BASE-T and 100BASE-FX-to-100BASE-TX + conversions. The DP83869HM can also support Bridge Conversion from RGMII to + SGMII and SGMII to RGMII.
+ + Specifications about the charger can be found at: + http://www.ti.com/lit/ds/symlink/dp83869hm.pdf + +properties: + reg: + maxItems: 1 + + ti,min-output-impedance: + type: boolean + description: | + MAC Interface Impedance control to set the programmable output impedance + to a minimum value (35 ohms). + + ti,max-output-impedance: + type: boolean + description: | + MAC Interface Impedance control to set the programmable output impedance + to a maximum value (70 ohms). + + tx-fifo-depth: + $ref: /schemas/types.yaml#definitions/uint32 + description: | + Transmitt FIFO depth see dt-bindings/net/ti-dp83869.h for values + + rx-fifo-depth: + $ref: /schemas/types.yaml#definitions/uint32 + description: | + Receive FIFO depth see dt-bindings/net/ti-dp83869.h for values + + ti,clk-output-sel: + $ref: /schemas/types.yaml#definitions/uint32 + description: | + Muxing option for CLK_OUT pin see dt-bindings/net/ti-dp83869.h for values. + + ti,op-mode: + $ref: /schemas/types.yaml#definitions/uint32 + description: | + Operational mode for the PHY. If this is not set then the operational + mode is set by the straps. 
see dt-bindings/net/ti-dp83869.h for values + +required: + - reg + +examples: + - | + #include <dt-bindings/net/ti-dp83869.h> + mdio0 { + #address-cells = <1>; + #size-cells = <0>; + ethphy0: ethernet-phy@0 { + reg = <0>; + tx-fifo-depth = <DP83869_PHYCR_FIFO_DEPTH_4_B_NIB>; + rx-fifo-depth = <DP83869_PHYCR_FIFO_DEPTH_4_B_NIB>; + ti,op-mode = <DP83869_RGMII_COPPER_ETHERNET>; + ti,max-output-impedance = "true"; + ti,clk-output-sel = <DP83869_CLK_O_SEL_CHN_A_RCLK>; + }; + }; diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt index ae661e65354e..017128394a3e 100644 --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt @@ -81,6 +81,12 @@ Optional properties: Definition: Name of external front end module used. Some valid FEM names for example: "microsemi-lx5586", "sky85703-11" and "sky85803" etc. +- qcom,snoc-host-cap-8bit-quirk: + Usage: Optional + Value type: <empty> + Definition: Quirk specifying that the firmware expects the 8bit version + of the host capability QMI request +- qcom,xo-cal-data: xo cal offset to be configured in xo trim register. 
Example (to supply PCI based wifi block details): diff --git a/Documentation/devicetree/bindings/net/wireless/ti,wl1251.txt b/Documentation/devicetree/bindings/net/wireless/ti,wl1251.txt index bb2fcde6f7ff..f38950560982 100644 --- a/Documentation/devicetree/bindings/net/wireless/ti,wl1251.txt +++ b/Documentation/devicetree/bindings/net/wireless/ti,wl1251.txt @@ -35,3 +35,29 @@ Examples: ti,power-gpio = <&gpio3 23 GPIO_ACTIVE_HIGH>; /* 87 */ }; }; + +&mmc3 { + vmmc-supply = <&wlan_en>; + + bus-width = <4>; + non-removable; + ti,non-removable; + cap-power-off-card; + + pinctrl-names = "default"; + pinctrl-0 = <&mmc3_pins>; + + #address-cells = <1>; + #size-cells = <0>; + + wlan: wifi@1 { + compatible = "ti,wl1251"; + + reg = <1>; + + interrupt-parent = <&gpio1>; + interrupts = <21 IRQ_TYPE_LEVEL_HIGH>; /* GPIO_21 */ + + ti,wl1251-has-eeprom; + }; +}; diff --git a/Documentation/devicetree/bindings/nvmem/rockchip-otp.txt b/Documentation/devicetree/bindings/nvmem/rockchip-otp.txt new file mode 100644 index 000000000000..40f649f7c2e5 --- /dev/null +++ b/Documentation/devicetree/bindings/nvmem/rockchip-otp.txt @@ -0,0 +1,25 @@ +Rockchip internal OTP (One Time Programmable) memory device tree bindings + +Required properties: +- compatible: Should be one of the following. + - "rockchip,px30-otp" - for PX30 SoCs. + - "rockchip,rk3308-otp" - for RK3308 SoCs. +- reg: Should contain the registers location and size +- clocks: Must contain an entry for each entry in clock-names. +- clock-names: Should be "otp", "apb_pclk" and "phy". +- resets: Must contain an entry for each entry in reset-names. + See ../../reset/reset.txt for details. +- reset-names: Should be "phy". + +See nvmem.txt for more information. 
+ +Example: + otp: otp@ff290000 { + compatible = "rockchip,px30-otp"; + reg = <0x0 0xff290000 0x0 0x4000>; + #address-cells = <1>; + #size-cells = <1>; + clocks = <&cru SCLK_OTP_USR>, <&cru PCLK_OTP_NS>, + <&cru PCLK_OTP_PHY>; + clock-names = "otp", "apb_pclk", "phy"; + }; diff --git a/Documentation/devicetree/bindings/nvmem/sprd-efuse.txt b/Documentation/devicetree/bindings/nvmem/sprd-efuse.txt new file mode 100644 index 000000000000..96b6feec27f0 --- /dev/null +++ b/Documentation/devicetree/bindings/nvmem/sprd-efuse.txt @@ -0,0 +1,39 @@ += Spreadtrum eFuse device tree bindings = + +Required properties: +- compatible: Should be "sprd,ums312-efuse". +- reg: Specify the address offset of efuse controller. +- clock-names: Should be "enable". +- clocks: The phandle and specifier referencing the controller's clock. +- hwlocks: Reference to a phandle of a hwlock provider node. + += Data cells = +Are child nodes of eFuse, bindings of which as described in +bindings/nvmem/nvmem.txt + +Example: + + ap_efuse: efuse@32240000 { + compatible = "sprd,ums312-efuse"; + reg = <0 0x32240000 0 0x10000>; + clock-names = "enable"; + hwlocks = <&hwlock 8>; + clocks = <&aonapb_gate CLK_EFUSE_EB>; + + /* Data cells */ + thermal_calib: calib@10 { + reg = <0x10 0x2>; + }; + }; + += Data consumers = +Are device nodes which consume nvmem data cells. + +Example: + + thermal { + ... 
+ + nvmem-cells = <&thermal_calib>; + nvmem-cell-names = "calibration"; + }; diff --git a/Documentation/devicetree/bindings/perf/arm-ccn.txt b/Documentation/devicetree/bindings/perf/arm-ccn.txt index 43b5a71a5a9d..1c53b5aa3317 100644 --- a/Documentation/devicetree/bindings/perf/arm-ccn.txt +++ b/Documentation/devicetree/bindings/perf/arm-ccn.txt @@ -6,6 +6,7 @@ Required properties: "arm,ccn-502" "arm,ccn-504" "arm,ccn-508" + "arm,ccn-512" - reg: (standard registers property) physical address and size (16MB) of the configuration registers block diff --git a/Documentation/devicetree/bindings/perf/fsl-imx-ddr.txt b/Documentation/devicetree/bindings/perf/fsl-imx-ddr.txt index d77e3f26f9e6..7822a806ea0a 100644 --- a/Documentation/devicetree/bindings/perf/fsl-imx-ddr.txt +++ b/Documentation/devicetree/bindings/perf/fsl-imx-ddr.txt @@ -5,6 +5,7 @@ Required properties: - compatible: should be one of: "fsl,imx8-ddr-pmu" "fsl,imx8m-ddr-pmu" + "fsl,imx8mp-ddr-pmu" - reg: physical address and size diff --git a/Documentation/devicetree/bindings/phy/allwinner,sun50i-h6-usb3-phy.yaml b/Documentation/devicetree/bindings/phy/allwinner,sun50i-h6-usb3-phy.yaml new file mode 100644 index 000000000000..e5922b427342 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/allwinner,sun50i-h6-usb3-phy.yaml @@ -0,0 +1,47 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +# Copyright 2019 Ondrej Jirman <megous@megous.com> +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/phy/allwinner,sun50i-h6-usb3-phy.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Allwinner H6 USB3 PHY + +maintainers: + - Ondrej Jirman <megous@megous.com> + +properties: + compatible: + enum: + - allwinner,sun50i-h6-usb3-phy + + reg: + maxItems: 1 + + clocks: + maxItems: 1 + + resets: + maxItems: 1 + + "#phy-cells": + const: 0 + +required: + - compatible + - reg + - clocks + - resets + - "#phy-cells" + +examples: + - | + #include <dt-bindings/clock/sun50i-h6-ccu.h> + #include 
<dt-bindings/reset/sun50i-h6-ccu.h> + phy@5210000 { + compatible = "allwinner,sun50i-h6-usb3-phy"; + reg = <0x5210000 0x10000>; + clocks = <&ccu CLK_USB_PHY1>; + resets = <&ccu RST_USB_PHY1>; + #phy-cells = <0>; + }; diff --git a/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt b/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt index 00639baae74a..541f5298827c 100644 --- a/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt +++ b/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt @@ -2,6 +2,7 @@ ROCKCHIP USB2.0 PHY WITH INNO IP BLOCK Required properties (phy (parent) node): - compatible : should be one of the listed compatibles: + * "rockchip,px30-usb2phy" * "rockchip,rk3228-usb2phy" * "rockchip,rk3328-usb2phy" * "rockchip,rk3366-usb2phy" diff --git a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt index 085fbd676cfc..eac9ad3cbbc8 100644 --- a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt +++ b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt @@ -14,7 +14,8 @@ Required properties: "qcom,msm8998-qmp-pcie-phy" for PCIe QMP phy on msm8998, "qcom,sdm845-qmp-usb3-phy" for USB3 QMP V3 phy on sdm845, "qcom,sdm845-qmp-usb3-uni-phy" for USB3 QMP V3 UNI phy on sdm845, - "qcom,sdm845-qmp-ufs-phy" for UFS QMP phy on sdm845. + "qcom,sdm845-qmp-ufs-phy" for UFS QMP phy on sdm845, + "qcom,sm8150-qmp-ufs-phy" for UFS QMP phy on sm8150. - reg: - index 0: address and length of register set for PHY's common @@ -57,6 +58,8 @@ Required properties: "aux", "cfg_ahb", "ref", "com_aux". For "qcom,sdm845-qmp-ufs-phy" must contain: "ref", "ref_aux". + For "qcom,sm8150-qmp-ufs-phy" must contain: + "ref", "ref_aux". - resets: a list of phandles and reset controller specifier pairs, one for each entry in reset-names. @@ -83,6 +86,8 @@ Required properties: "phy", "common". For "qcom,sdm845-qmp-ufs-phy": must contain: "ufsphy". 
+ For "qcom,sm8150-qmp-ufs-phy": must contain: + "ufsphy". - vdda-phy-supply: Phandle to a regulator supply to PHY core block. - vdda-pll-supply: Phandle to 1.8V regulator supply to PHY refclk pll block. diff --git a/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb2.txt b/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb2.txt index 503a8cfb3184..7734b219d9aa 100644 --- a/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb2.txt +++ b/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb2.txt @@ -10,6 +10,8 @@ Required properties: SoC. "renesas,usb2-phy-r8a774a1" if the device is a part of an R8A774A1 SoC. + "renesas,usb2-phy-r8a774b1" if the device is a part of an R8A774B1 + SoC. "renesas,usb2-phy-r8a774c0" if the device is a part of an R8A774C0 SoC. "renesas,usb2-phy-r8a7795" if the device is a part of an R8A7795 diff --git a/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb3.txt b/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb3.txt index 9d9826609c2f..0fe433b9a592 100644 --- a/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb3.txt +++ b/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb3.txt @@ -9,6 +9,8 @@ need this driver. Required properties: - compatible: "renesas,r8a774a1-usb3-phy" if the device is a part of an R8A774A1 SoC. + "renesas,r8a774b1-usb3-phy" if the device is a part of an R8A774B1 + SoC. "renesas,r8a7795-usb3-phy" if the device is a part of an R8A7795 SoC. 
"renesas,r8a7796-usb3-phy" if the device is a part of an R8A7796 diff --git a/Documentation/devicetree/bindings/phy/rockchip,px30-dsi-dphy.yaml b/Documentation/devicetree/bindings/phy/rockchip,px30-dsi-dphy.yaml new file mode 100644 index 000000000000..bb0da87bcd84 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/rockchip,px30-dsi-dphy.yaml @@ -0,0 +1,75 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/phy/rockchip,px30-dsi-dphy.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Rockchip MIPI DPHY with additional LVDS/TTL modes + +maintainers: + - Heiko Stuebner <heiko@sntech.de> + +properties: + "#phy-cells": + const: 0 + + "#clock-cells": + const: 0 + + compatible: + enum: + - rockchip,px30-dsi-dphy + - rockchip,rk3128-dsi-dphy + - rockchip,rk3368-dsi-dphy + + reg: + maxItems: 1 + + clocks: + items: + - description: PLL reference clock + - description: Module clock + + clock-names: + items: + - const: ref + - const: pclk + + power-domains: + maxItems: 1 + description: phandle to the associated power domain + + resets: + items: + - description: exclusive PHY reset line + + reset-names: + items: + - const: apb + +required: + - "#phy-cells" + - "#clock-cells" + - compatible + - reg + - clocks + - clock-names + - resets + - reset-names + +additionalProperties: false + +examples: + - | + dsi_dphy: phy@ff2e0000 { + compatible = "rockchip,px30-video-phy"; + reg = <0x0 0xff2e0000 0x0 0x10000>; + clocks = <&pmucru 13>, <&cru 12>; + clock-names = "ref", "pclk"; + #clock-cells = <0>; + resets = <&cru 12>; + reset-names = "apb"; + #phy-cells = <0>; + }; + +... 
diff --git a/Documentation/devicetree/bindings/pinctrl/allwinner,sun4i-a10-pinctrl.yaml b/Documentation/devicetree/bindings/pinctrl/allwinner,sun4i-a10-pinctrl.yaml new file mode 100644 index 000000000000..cd0503b6fe36 --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/allwinner,sun4i-a10-pinctrl.yaml @@ -0,0 +1,243 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/pinctrl/allwinner,sun4i-a10-pinctrl.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Allwinner A10 Pin Controller Device Tree Bindings + +maintainers: + - Chen-Yu Tsai <wens@csie.org> + - Maxime Ripard <maxime.ripard@bootlin.com> + +properties: + "#gpio-cells": + const: 3 + description: + GPIO consumers must use three arguments, first the number of the + bank, then the pin number inside that bank, and finally the GPIO + flags. + + "#interrupt-cells": + const: 3 + description: + Interrupts consumers must use three arguments, first the number + of the bank, then the pin number inside that bank, and finally + the interrupts flags. 
+ + compatible: + enum: + - allwinner,sun4i-a10-pinctrl + - allwinner,sun5i-a10s-pinctrl + - allwinner,sun5i-a13-pinctrl + - allwinner,sun6i-a31-pinctrl + - allwinner,sun6i-a31-r-pinctrl + - allwinner,sun6i-a31s-pinctrl + - allwinner,sun7i-a20-pinctrl + - allwinner,sun8i-a23-pinctrl + - allwinner,sun8i-a23-r-pinctrl + - allwinner,sun8i-a33-pinctrl + - allwinner,sun8i-a83t-pinctrl + - allwinner,sun8i-a83t-r-pinctrl + - allwinner,sun8i-h3-pinctrl + - allwinner,sun8i-h3-r-pinctrl + - allwinner,sun8i-r40-pinctrl + - allwinner,sun8i-v3-pinctrl + - allwinner,sun8i-v3s-pinctrl + - allwinner,sun9i-a80-pinctrl + - allwinner,sun9i-a80-r-pinctrl + - allwinner,sun50i-a64-pinctrl + - allwinner,sun50i-a64-r-pinctrl + - allwinner,sun50i-h5-pinctrl + - allwinner,sun50i-h6-pinctrl + - allwinner,sun50i-h6-r-pinctrl + - allwinner,suniv-f1c100s-pinctrl + - nextthing,gr8-pinctrl + + reg: + maxItems: 1 + + interrupts: + minItems: 1 + maxItems: 5 + description: + One interrupt per external interrupt bank supported on the + controller, sorted by bank number ascending order. 
+ + clocks: + items: + - description: Bus Clock + - description: High Frequency Oscillator + - description: Low Frequency Oscillator + + clock-names: + items: + - const: apb + - const: hosc + - const: losc + + resets: + maxItems: 1 + + gpio-controller: true + interrupt-controller: true + gpio-line-names: true + + input-debounce: + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - minItems: 1 + maxItems: 5 + description: + Debouncing periods in microseconds, one period per interrupt + bank found in the controller + +patternProperties: + # It's pretty scary, but the basic idea is that: + # - One node name can start with either s- or r- for PRCM nodes, + # - Then, the name itself can be any repetition of <string>- (to + # accomodate with nodes like uart4-rts-cts-pins), where each + # string can be either starting with 'p' but in a string longer + # than 3, or something that doesn't start with 'p', + # - Then, the bank name is optional and will be between pa and pg, + # pl or pm. Some pins groups that have several options will have + # the pin numbers then, + # - Finally, the name will end with either -pin or pins. + + "^([rs]-)?(([a-z0-9]{3,}|[a-oq-z][a-z0-9]*?)?-)+?(p[a-ilm][0-9]*?-)??pins?$": + type: object + + properties: + pins: true + function: true + bias-disable: true + bias-pull-up: true + bias-pull-down: true + + drive-strength: + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [ 10, 20, 30, 40 ] + + required: + - pins + - function + + additionalProperties: false + + "^vcc-p[a-hlm]-supply$": + description: + Power supplies for pin banks. + +required: + - "#gpio-cells" + - "#interrupt-cells" + - compatible + - reg + - interrupts + - clocks + - clock-names + - gpio-controller + - interrupt-controller + +allOf: + # FIXME: We should have the pin bank supplies here, but not a lot of + # boards are defining it at the moment so it would generate a lot of + # warnings. 
+ + - if: + properties: + compatible: + enum: + - allwinner,sun9i-a80-pinctrl + + then: + properties: + interrupts: + minItems: 5 + maxItems: 5 + + else: + if: + properties: + compatible: + enum: + - allwinner,sun6i-a31-pinctrl + - allwinner,sun6i-a31s-pinctrl + - allwinner,sun50i-h6-pinctrl + + then: + properties: + interrupts: + minItems: 4 + maxItems: 4 + + else: + if: + properties: + compatible: + enum: + - allwinner,sun8i-a23-pinctrl + - allwinner,sun8i-a83t-pinctrl + - allwinner,sun50i-a64-pinctrl + - allwinner,sun50i-h5-pinctrl + - allwinner,suniv-f1c100s-pinctrl + + then: + properties: + interrupts: + minItems: 3 + maxItems: 3 + + else: + if: + properties: + compatible: + enum: + - allwinner,sun6i-a31-r-pinctrl + - allwinner,sun8i-a33-pinctrl + - allwinner,sun8i-h3-pinctrl + - allwinner,sun8i-v3-pinctrl + - allwinner,sun8i-v3s-pinctrl + - allwinner,sun9i-a80-r-pinctrl + - allwinner,sun50i-h6-r-pinctrl + + then: + properties: + interrupts: + minItems: 2 + maxItems: 2 + + else: + properties: + interrupts: + minItems: 1 + maxItems: 1 + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/sun5i-ccu.h> + + pio: pinctrl@1c20800 { + compatible = "allwinner,sun5i-a13-pinctrl"; + reg = <0x01c20800 0x400>; + interrupts = <28>; + clocks = <&ccu CLK_APB0_PIO>, <&osc24M>, <&osc32k>; + clock-names = "apb", "hosc", "losc"; + gpio-controller; + interrupt-controller; + #interrupt-cells = <3>; + #gpio-cells = <3>; + + uart1_pe_pins: uart1-pe-pins { + pins = "PE10", "PE11"; + function = "uart1"; + }; + + uart1_pg_pins: uart1-pg-pins { + pins = "PG3", "PG4"; + function = "uart1"; + }; + }; diff --git a/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt deleted file mode 100644 index 328585c6da58..000000000000 --- a/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt +++ /dev/null @@ -1,164 +0,0 @@ -* Allwinner A1X Pin Controller - -The pins 
controlled by sunXi pin controller are organized in banks, -each bank has 32 pins. Each pin has 7 multiplexing functions, with -the first two functions being GPIO in and out. The configuration on -the pins includes drive strength and pull-up. - -Required properties: -- compatible: Should be one of the following (depending on your SoC): - "allwinner,sun4i-a10-pinctrl" - "allwinner,sun5i-a10s-pinctrl" - "allwinner,sun5i-a13-pinctrl" - "allwinner,sun6i-a31-pinctrl" - "allwinner,sun6i-a31s-pinctrl" - "allwinner,sun6i-a31-r-pinctrl" - "allwinner,sun7i-a20-pinctrl" - "allwinner,sun8i-a23-pinctrl" - "allwinner,sun8i-a23-r-pinctrl" - "allwinner,sun8i-a33-pinctrl" - "allwinner,sun9i-a80-pinctrl" - "allwinner,sun9i-a80-r-pinctrl" - "allwinner,sun8i-a83t-pinctrl" - "allwinner,sun8i-a83t-r-pinctrl" - "allwinner,sun8i-h3-pinctrl" - "allwinner,sun8i-h3-r-pinctrl" - "allwinner,sun8i-r40-pinctrl" - "allwinner,sun8i-v3-pinctrl" - "allwinner,sun8i-v3s-pinctrl" - "allwinner,sun50i-a64-pinctrl" - "allwinner,sun50i-a64-r-pinctrl" - "allwinner,sun50i-h5-pinctrl" - "allwinner,sun50i-h6-pinctrl" - "allwinner,sun50i-h6-r-pinctrl" - "allwinner,suniv-f1c100s-pinctrl" - "nextthing,gr8-pinctrl" - -- reg: Should contain the register physical address and length for the - pin controller. - -- clocks: phandle to the clocks feeding the pin controller: - - "apb": the gated APB parent clock - - "hosc": the high frequency oscillator in the system - - "losc": the low frequency oscillator in the system - -Note: For backward compatibility reasons, the hosc and losc clocks are only -required if you need to use the optional input-debounce property. Any new -device tree should set them. 
- -Each pin bank, depending on the SoC, can have an associated regulator: - -- vcc-pa-supply: for the A10, A20, A31, A31s, A80 and R40 SoCs -- vcc-pb-supply: for the A31, A31s, A80 and V3s SoCs -- vcc-pc-supply: for the A10, A20, A31, A31s, A64, A80, H5, R40 and V3s SoCs -- vcc-pd-supply: for the A23, A31, A31s, A64, A80, A83t, H3, H5 and R40 SoCs -- vcc-pe-supply: for the A10, A20, A31, A31s, A64, A80, R40 and V3s SoCs -- vcc-pf-supply: for the A10, A20, A31, A31s, A80, R40 and V3s SoCs -- vcc-pg-supply: for the A10, A20, A31, A31s, A64, A80, H3, H5, R40 and V3s SoCs -- vcc-ph-supply: for the A31, A31s and A80 SoCs -- vcc-pl-supply: for the r-pinctrl of the A64, A80 and A83t SoCs -- vcc-pm-supply: for the r-pinctrl of the A31, A31s and A80 SoCs - -Optional properties: - - input-debounce: Array of debouncing periods in microseconds. One period per - irq bank found in the controller. 0 if no setup required. - - -Please refer to pinctrl-bindings.txt in this directory for details of the -common pinctrl bindings used by client devices. - -A pinctrl node should contain at least one subnodes representing the -pinctrl groups available on the machine. Each subnode will list the -pins it needs, and how they should be configured, with regard to muxer -configuration, drive strength and pullups. If one of these options is -not set, its actual value will be unspecified. - -Allwinner A1X Pin Controller supports the generic pin multiplexing and -configuration bindings. For details on each properties, you can refer to - ./pinctrl-bindings.txt. - -Required sub-node properties: - - pins - - function - -Optional sub-node properties: - - bias-disable - - bias-pull-up - - bias-pull-down - - drive-strength - -*** Deprecated pin configuration and multiplexing binding - -Required subnode-properties: - -- allwinner,pins: List of strings containing the pin name. -- allwinner,function: Function to mux the pins listed above to. - -Optional subnode-properties: -- allwinner,drive: Integer. 
Represents the current sent to the pin - 0: 10 mA - 1: 20 mA - 2: 30 mA - 3: 40 mA -- allwinner,pull: Integer. - 0: No resistor - 1: Pull-up resistor - 2: Pull-down resistor - -Examples: - -pio: pinctrl@1c20800 { - compatible = "allwinner,sun5i-a13-pinctrl"; - reg = <0x01c20800 0x400>; - #address-cells = <1>; - #size-cells = <0>; - - uart1_pins_a: uart1@0 { - allwinner,pins = "PE10", "PE11"; - allwinner,function = "uart1"; - allwinner,drive = <0>; - allwinner,pull = <0>; - }; - - uart1_pins_b: uart1@1 { - allwinner,pins = "PG3", "PG4"; - allwinner,function = "uart1"; - allwinner,drive = <0>; - allwinner,pull = <0>; - }; -}; - - -GPIO and interrupt controller ------------------------------ - -This hardware also acts as a GPIO controller and an interrupt -controller. - -Consumers that would want to refer to one or the other (or both) -should provide through the usual *-gpios and interrupts properties a -cell with 3 arguments, first the number of the bank, then the pin -inside that bank, and finally the flags for the GPIO/interrupts. 
- -Example: - -xio: gpio@38 { - compatible = "nxp,pcf8574a"; - reg = <0x38>; - - gpio-controller; - #gpio-cells = <2>; - - interrupt-parent = <&pio>; - interrupts = <6 0 IRQ_TYPE_EDGE_FALLING>; - interrupt-controller; - #interrupt-cells = <2>; -}; - -reg_usb1_vbus: usb1-vbus { - compatible = "regulator-fixed"; - regulator-name = "usb1-vbus"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - gpio = <&pio 7 6 GPIO_ACTIVE_HIGH>; -}; diff --git a/Documentation/devicetree/bindings/pinctrl/intel,lgm-pinctrl.yaml b/Documentation/devicetree/bindings/pinctrl/intel,lgm-pinctrl.yaml new file mode 100644 index 000000000000..240d429f773b --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/intel,lgm-pinctrl.yaml @@ -0,0 +1,116 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/bindings/pinctrl/intel,lgm-pinctrl.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Intel Lightning Mountain SoC pinmux & GPIO controller binding + +maintainers: + - Rahul Tanwar <rahul.tanwar@linux.intel.com> + +description: | + Pinmux & GPIO controller controls pin multiplexing & configuration including + GPIO function selection & GPIO attributes configuration. + + Please refer to [1] for details of the common pinctrl bindings used by the + client devices. + + [1] Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt + +properties: + compatible: + const: intel,lgm-io + + reg: + maxItems: 1 + +# Client device subnode's properties +patternProperties: + '-pins$': + type: object + description: + Pinctrl node's client devices use subnodes for desired pin configuration. + Client device subnodes use below standard properties. + + properties: + function: + $ref: /schemas/types.yaml#/definitions/string + description: + A string containing the name of the function to mux to the group. 
+ + groups: + $ref: /schemas/types.yaml#/definitions/string-array + description: + An array of strings identifying the list of groups. + + pins: + $ref: /schemas/types.yaml#/definitions/uint32-array + description: + List of pins to select with this function. + + pinmux: + description: The applicable mux group. + allOf: + - $ref: "/schemas/types.yaml#/definitions/uint32-array" + + bias-pull-up: + type: boolean + + bias-pull-down: + type: boolean + + drive-strength: + description: | + Selects the drive strength for the specified pins in mA. + 0: 2 mA + 1: 4 mA + 2: 8 mA + 3: 12 mA + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [0, 1, 2, 3] + + slew-rate: + type: boolean + description: | + Sets slew rate for specified pins. + 0: slow slew + 1: fast slew + + drive-open-drain: + type: boolean + + output-enable: + type: boolean + + required: + - function + - groups + + additionalProperties: false + +required: + - compatible + - reg + +additionalProperties: false + +examples: + # Pinmux controller node + - | + pinctrl: pinctrl@e2880000 { + compatible = "intel,lgm-pinctrl"; + reg = <0xe2880000 0x100000>; + + uart0-pins { + pins = <64>, /* UART_RX0 */ + <65>; /* UART_TX0 */ + function = "CONSOLE_UART0"; + pinmux = <1>, + <1>; + groups = "CONSOLE_UART0"; + }; + }; + +... 
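Review note: both new pinctrl schemas in this diff select pin-group subnodes by node name through patternProperties. intel,lgm-pinctrl.yaml uses '-pins$', and allwinner,sun4i-a10-pinctrl.yaml uses the longer pattern explained in its comment block. A quick way to see what these patterns accept is to try them on node names taken from the examples in this diff. The sketch below uses Python's re module as a stand-in, on the assumption that the constructs involved (character classes, lazy quantifiers, anchors) behave the same as in the regex dialect used by the schema tooling. Schema patterns are unanchored, so re.search is the closest equivalent. The helper name matches is hypothetical.

```python
import re

# Patterns copied verbatim from the patternProperties keys in this diff.
LGM_PATTERN = r"-pins$"
SUNXI_PATTERN = (
    r"^([rs]-)?(([a-z0-9]{3,}|[a-oq-z][a-z0-9]*?)?-)+?"
    r"(p[a-ilm][0-9]*?-)??pins?$"
)


def matches(pattern, node_name):
    """Hypothetical helper: unanchored search, as patternProperties does."""
    return re.search(pattern, node_name) is not None


if __name__ == "__main__":
    # Node names taken from the binding examples and comments in this diff;
    # "uart1@0" is the old-style name from the deleted sunxi text binding.
    names = [
        "uart0-pins",
        "uart1-pe-pins",
        "uart1-pg-pins",
        "uart4-rts-cts-pins",
        "uart1@0",
    ]
    for name in names:
        print(
            f"{name}: lgm={matches(LGM_PATTERN, name)} "
            f"sunxi={matches(SUNXI_PATTERN, name)}"
        )
```

Under these assumptions the new-style names match, while the unit-address style name from the deleted text binding ("uart1@0") does not. That suggests device trees still using the old naming would not have those subnodes validated as pin groups by the new schema.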
diff --git a/Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt index 10dc4f7176ca..0aff1f28495c 100644 --- a/Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt @@ -15,6 +15,7 @@ Required properties for the root node: "amlogic,meson-axg-aobus-pinctrl" "amlogic,meson-g12a-periphs-pinctrl" "amlogic,meson-g12a-aobus-pinctrl" + "amlogic,meson-a1-periphs-pinctrl" - reg: address and size of registers controlling irq functionality === GPIO sub-nodes === diff --git a/Documentation/devicetree/bindings/pinctrl/pincfg-node.yaml b/Documentation/devicetree/bindings/pinctrl/pincfg-node.yaml new file mode 100644 index 000000000000..13b7ab9dd6d5 --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/pincfg-node.yaml @@ -0,0 +1,140 @@ +# SPDX-License-Identifier: GPL-2.0-only +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/pinctrl/pincfg-node.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Generic pin configuration node schema + +maintainers: + - Linus Walleij <linus.walleij@linaro.org> + +description: + Many data items that are represented in a pin configuration node are common + and generic. Pin control bindings should use the properties defined below + where they are applicable; not all of these properties are relevant or useful + for all hardware or binding structures. Each individual binding document + should state which of these generic properties, if any, are used, and the + structure of the DT nodes that contain these properties. 
+ +properties: + bias-disable: + type: boolean + description: disable any pin bias + + bias-high-impedance: + type: boolean + description: high impedance mode ("third-state", "floating") + + bias-bus-hold: + type: boolean + description: latch weakly + + bias-pull-up: + oneOf: + - type: boolean + - $ref: /schemas/types.yaml#/definitions/uint32 + description: pull up the pin. Takes as optional argument on hardware + supporting it the pull strength in Ohm. + + bias-pull-down: + oneOf: + - type: boolean + - $ref: /schemas/types.yaml#/definitions/uint32 + description: pull down the pin. Takes as optional argument on hardware + supporting it the pull strength in Ohm. + + bias-pull-pin-default: + oneOf: + - type: boolean + - $ref: /schemas/types.yaml#/definitions/uint32 + description: use pin-default pull state. Takes as optional argument on + hardware supporting it the pull strength in Ohm. + + drive-push-pull: + type: boolean + description: drive actively high and low + + drive-open-drain: + type: boolean + description: drive with open drain + + drive-open-source: + type: boolean + description: drive with open source + + drive-strength: + $ref: /schemas/types.yaml#/definitions/uint32 + description: sink or source at most X mA + + drive-strength-microamp: + description: sink or source at most X uA + + input-enable: + type: boolean + description: enable input on pin (no effect on output, such as + enabling an input buffer) + + input-disable: + type: boolean + description: disable input on pin (no effect on output, such as + disabling an input buffer) + + input-schmitt-enable: + type: boolean + description: enable schmitt-trigger mode + + input-schmitt-disable: + type: boolean + description: disable schmitt-trigger mode + + input-debounce: + $ref: /schemas/types.yaml#/definitions/uint32 + description: Takes the debounce time in usec as argument or 0 to disable + debouncing + + power-source: + $ref: /schemas/types.yaml#/definitions/uint32 + description: select between 
different power supplies + + low-power-enable: + type: boolean + description: enable low power mode + + low-power-disable: + type: boolean + description: disable low power mode + + output-disable: + type: boolean + description: disable output on a pin (such as disable an output buffer) + + output-enable: + type: boolean + description: enable output on a pin without actively driving it + (such as enabling an output buffer) + + output-low: + type: boolean + description: set the pin to output mode with low level + + output-high: + type: boolean + description: set the pin to output mode with high level + + sleep-hardware-state: + type: boolean + description: indicate this is sleep related state which will be + programmed into the registers for the sleep state. + + slew-rate: + $ref: /schemas/types.yaml#/definitions/uint32 + description: set the slew rate + + skew-delay: + $ref: /schemas/types.yaml#/definitions/uint32 + description: + this affects the expected clock skew on input pins + and the delay before latching a value to an output + pin. Typically indicates how many double-inverters are + used to delay the signal. diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt index fcd37e93ed4d..4613bb17ace3 100644 --- a/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt @@ -141,196 +141,8 @@ controller device. 
== Generic pin multiplexing node content == -pin multiplexing nodes: - -function - the mux function to select -groups - the list of groups to select with this function - (either this or "pins" must be specified) -pins - the list of pins to select with this function (either - this or "groups" must be specified) - -Example: - -state_0_node_a { - uart0 { - function = "uart0"; - groups = "u0rxtx", "u0rtscts"; - }; -}; -state_1_node_a { - spi0 { - function = "spi0"; - groups = "spi0pins"; - }; -}; -state_2_node_a { - function = "i2c0"; - pins = "mfio29", "mfio30"; -}; - -Optionally an alternative binding can be used if more suitable depending on the -pin controller hardware. For hardware where there is a large number of identical -pin controller instances, naming each pin and function can easily become -unmaintainable. This is especially the case if the same controller is used for -different pins and functions depending on the SoC revision and packaging. - -For cases like this, the pin controller driver may use pinctrl-pin-array helper -binding with a hardware based index and a number of pin configuration values: - -pincontroller { - ... /* Standard DT properties for the device itself elided */ - #pinctrl-cells = <2>; - - state_0_node_a { - pinctrl-pin-array = < - 0 A_DELAY_PS(0) G_DELAY_PS(120) - 4 A_DELAY_PS(0) G_DELAY_PS(360) - ... - >; - }; - ... -}; - -Above #pinctrl-cells specifies the number of value cells in addition to the -index of the registers. This is similar to the interrupts-extended binding with -one exception. There is no need to specify the phandle for each entry as that -is already known as the defined pins are always children of the pin controller -node. Further having the phandle pointing to another pin controller would not -currently work as the pinctrl framework uses named modes to group pins for each -pin control device. 
- -The index for pinctrl-pin-array must relate to the hardware for the pinctrl -registers, and must not be a virtual index of pin instances. The reason for -this is to avoid mapping of the index in the dts files and the pin controller -driver as it can change. - -For hardware where pin multiplexing configurations have to be specified for -each single pin the number of required sub-nodes containing "pin" and -"function" properties can quickly escalate and become hard to write and -maintain. - -For cases like this, the pin controller driver may use the pinmux helper -property, where the pin identifier is provided with mux configuration settings -in a pinmux group. A pinmux group consists of the pin identifier and mux -settings represented as a single integer or an array of integers. - -The pinmux property accepts an array of pinmux groups, each of them describing -a single pin multiplexing configuration. - -pincontroller { - state_0_node_a { - pinmux = <PINMUX_GROUP>, <PINMUX_GROUP>, ...; - }; -}; - -Each individual pin controller driver bindings documentation shall specify -how pin IDs and pin multiplexing configuration are defined and assembled -together in a pinmux group. +See pinmux-node.yaml == Generic pin configuration node content == -Many data items that are represented in a pin configuration node are common -and generic. Pin control bindings should use the properties defined below -where they are applicable; not all of these properties are relevant or useful -for all hardware or binding structures. Each individual binding document -should state which of these generic properties, if any, are used, and the -structure of the DT nodes that contain these properties. 
- -Supported generic properties are: - -pins - the list of pins that properties in the node - apply to (either this, "group" or "pinmux" has to be - specified) -group - the group to apply the properties to, if the driver - supports configuration of whole groups rather than - individual pins (either this, "pins" or "pinmux" has - to be specified) -pinmux - the list of numeric pin ids and their mux settings - that properties in the node apply to (either this, - "pins" or "groups" have to be specified) -bias-disable - disable any pin bias -bias-high-impedance - high impedance mode ("third-state", "floating") -bias-bus-hold - latch weakly -bias-pull-up - pull up the pin -bias-pull-down - pull down the pin -bias-pull-pin-default - use pin-default pull state -drive-push-pull - drive actively high and low -drive-open-drain - drive with open drain -drive-open-source - drive with open source -drive-strength - sink or source at most X mA -drive-strength-microamp - sink or source at most X uA -input-enable - enable input on pin (no effect on output, such as - enabling an input buffer) -input-disable - disable input on pin (no effect on output, such as - disabling an input buffer) -input-schmitt-enable - enable schmitt-trigger mode -input-schmitt-disable - disable schmitt-trigger mode -input-debounce - debounce mode with debound time X -power-source - select between different power supplies -low-power-enable - enable low power mode -low-power-disable - disable low power mode -output-disable - disable output on a pin (such as disable an output - buffer) -output-enable - enable output on a pin without actively driving it - (such as enabling an output buffer) -output-low - set the pin to output mode with low level -output-high - set the pin to output mode with high level -sleep-hardware-state - indicate this is sleep related state which will be programmed - into the registers for the sleep state. 
-slew-rate - set the slew rate -skew-delay - this affects the expected clock skew on input pins - and the delay before latching a value to an output - pin. Typically indicates how many double-inverters are - used to delay the signal. - -For example: - -state_0_node_a { - cts_rxd { - pins = "GPIO0_AJ5", "GPIO2_AH4"; /* CTS+RXD */ - bias-pull-up; - }; -}; -state_1_node_a { - rts_txd { - pins = "GPIO1_AJ3", "GPIO3_AH3"; /* RTS+TXD */ - output-high; - }; -}; -state_2_node_a { - foo { - group = "foo-group"; - bias-pull-up; - }; -}; -state_3_node_a { - mux { - pinmux = <GPIOx_PINm_MUXn>, <GPIOx_PINj_MUXk)>; - input-enable; - }; -}; - -Some of the generic properties take arguments. For those that do, the -arguments are described below. - -- pins takes a list of pin names or IDs as a required argument. The specific - binding for the hardware defines: - - Whether the entries are integers or strings, and their meaning. - -- pinmux takes a list of pin IDs and mux settings as required argument. The - specific bindings for the hardware defines: - - How pin IDs and mux settings are defined and assembled together in a single - integer or an array of integers. - -- bias-pull-up, -down and -pin-default take as optional argument on hardware - supporting it the pull strength in Ohm. bias-disable will disable the pull. - -- drive-strength takes as argument the target strength in mA. - -- drive-strength-microamp takes as argument the target strength in uA. 
- -- input-debounce takes the debounce time in usec as argument - or 0 to disable debouncing - -More in-depth documentation on these parameters can be found in -<include/linux/pinctrl/pinconf-generic.h> +See pincfg-node.yaml diff --git a/Documentation/devicetree/bindings/pinctrl/pinmux-node.yaml b/Documentation/devicetree/bindings/pinctrl/pinmux-node.yaml new file mode 100644 index 000000000000..777623a57fd5 --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/pinmux-node.yaml @@ -0,0 +1,132 @@ +# SPDX-License-Identifier: GPL-2.0-only +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/pinctrl/pinmux-node.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Generic pin multiplexing node schema + +maintainers: + - Linus Walleij <linus.walleij@linaro.org> + +description: | + The contents of the pin configuration child nodes are defined by the binding + for the individual pin controller device. The pin configuration nodes need not + be direct children of the pin controller device; they may be grandchildren, + for example. Whether this is legal, and whether there is any interaction + between the child and intermediate parent nodes, is again defined entirely by + the binding for the individual pin controller device. + + While not required to be used, there are 3 generic forms of pin muxing nodes + which pin controller devices can use. + + pin multiplexing nodes: + + Example: + + state_0_node_a { + uart0 { + function = "uart0"; + groups = "u0rxtx", "u0rtscts"; + }; + }; + state_1_node_a { + spi0 { + function = "spi0"; + groups = "spi0pins"; + }; + }; + state_2_node_a { + function = "i2c0"; + pins = "mfio29", "mfio30"; + }; + + Optionally an alternative binding can be used if more suitable depending on the + pin controller hardware. For hardware where there is a large number of identical + pin controller instances, naming each pin and function can easily become + unmaintainable. 
This is especially the case if the same controller is used for + different pins and functions depending on the SoC revision and packaging. + + For cases like this, the pin controller driver may use pinctrl-pin-array helper + binding with a hardware based index and a number of pin configuration values: + + pincontroller { + ... /* Standard DT properties for the device itself elided */ + #pinctrl-cells = <2>; + + state_0_node_a { + pinctrl-pin-array = < + 0 A_DELAY_PS(0) G_DELAY_PS(120) + 4 A_DELAY_PS(0) G_DELAY_PS(360) + ... + >; + }; + ... + }; + + Above #pinctrl-cells specifies the number of value cells in addition to the + index of the registers. This is similar to the interrupts-extended binding with + one exception. There is no need to specify the phandle for each entry as that + is already known as the defined pins are always children of the pin controller + node. Further having the phandle pointing to another pin controller would not + currently work as the pinctrl framework uses named modes to group pins for each + pin control device. + + The index for pinctrl-pin-array must relate to the hardware for the pinctrl + registers, and must not be a virtual index of pin instances. The reason for + this is to avoid mapping of the index in the dts files and the pin controller + driver as it can change. + + For hardware where pin multiplexing configurations have to be specified for + each single pin the number of required sub-nodes containing "pin" and + "function" properties can quickly escalate and become hard to write and + maintain. + + For cases like this, the pin controller driver may use the pinmux helper + property, where the pin identifier is provided with mux configuration settings + in a pinmux group. A pinmux group consists of the pin identifier and mux + settings represented as a single integer or an array of integers. + + The pinmux property accepts an array of pinmux groups, each of them describing + a single pin multiplexing configuration. 
+ + pincontroller { + state_0_node_a { + pinmux = <PINMUX_GROUP>, <PINMUX_GROUP>, ...; + }; + }; + + Each individual pin controller driver bindings documentation shall specify + how pin IDs and pin multiplexing configuration are defined and assembled + together in a pinmux group. + +properties: + function: + $ref: /schemas/types.yaml#/definitions/string + description: The mux function to select + + pins: + oneOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - $ref: /schemas/types.yaml#/definitions/string-array + description: + The list of pin identifiers that properties in the node apply to. The + specific binding for the hardware defines whether the entries are integers + or strings, and their meaning. + + group: + $ref: /schemas/types.yaml#/definitions/string-array + description: + the group to apply the properties to, if the driver supports + configuration of whole groups rather than individual pins (either + this, "pins" or "pinmux" has to be specified) + + pinmux: + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + description: + The list of numeric pin ids and their mux settings that properties in the + node apply to (either this, "pins" or "groups" have to be specified) + + pinctrl-pin-array: + $ref: /schemas/types.yaml#/definitions/uint32-array diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,msm8976-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/qcom,msm8976-pinctrl.txt new file mode 100644 index 000000000000..70d04d12f136 --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/qcom,msm8976-pinctrl.txt @@ -0,0 +1,183 @@ +Qualcomm MSM8976 TLMM block + +This binding describes the Top Level Mode Multiplexer block found in the +MSM8956 and MSM8976 platforms. + +- compatible: + Usage: required + Value type: <string> + Definition: must be "qcom,msm8976-pinctrl" + +- reg: + Usage: required + Value type: <prop-encoded-array> + Definition: the base address and size of the TLMM register space. 
+ +- interrupts: + Usage: required + Value type: <prop-encoded-array> + Definition: should specify the TLMM summary IRQ. + +- interrupt-controller: + Usage: required + Value type: <none> + Definition: identifies this node as an interrupt controller + +- #interrupt-cells: + Usage: required + Value type: <u32> + Definition: must be 2. Specifying the pin number and flags, as defined + in <dt-bindings/interrupt-controller/irq.h> + +- gpio-controller: + Usage: required + Value type: <none> + Definition: identifies this node as a gpio controller + +- #gpio-cells: + Usage: required + Value type: <u32> + Definition: must be 2. Specifying the pin number and flags, as defined + in <dt-bindings/gpio/gpio.h> + +- gpio-ranges: + Usage: required + Definition: see ../gpio/gpio.txt + +- gpio-reserved-ranges: + Usage: optional + Definition: see ../gpio/gpio.txt + +Please refer to ../gpio/gpio.txt and ../interrupt-controller/interrupts.txt for +a general description of GPIO and interrupt bindings. + +Please refer to pinctrl-bindings.txt in this directory for details of the +common pinctrl bindings used by client devices, including the meaning of the +phrase "pin configuration node". + +The pin configuration nodes act as a container for an arbitrary number of +subnodes. Each of these subnodes represents some desired configuration for a +pin, a group, or a list of pins or groups. This configuration can include the +mux function to select on those pin(s)/group(s), and various pin configuration +parameters, such as pull-up, drive strength, etc. + + +PIN CONFIGURATION NODES: + +The name of each subnode is not important; all subnodes should be enumerated +and processed purely based on their content. + +Each subnode only affects those parameters that are explicitly listed. In +other words, a subnode that lists a mux function but no pin configuration +parameters implies no information about any pin configuration parameters. 
+Similarly, a pin subnode that describes a pullup parameter implies no +information about e.g. the mux function. + + +The following generic properties as defined in pinctrl-bindings.txt are valid +to specify in a pin configuration subnode: + +- pins: + Usage: required + Value type: <string-array> + Definition: List of gpio pins affected by the properties specified in + this subnode. + + Valid pins are: + gpio0-gpio145 + Supports mux, bias and drive-strength + + sdc1_clk, sdc1_cmd, sdc1_data, + sdc2_clk, sdc2_cmd, sdc2_data, + sdc3_clk, sdc3_cmd, sdc3_data + Supports bias and drive-strength + +- function: + Usage: required + Value type: <string> + Definition: Specify the alternative function to be configured for the + specified pins. Functions are only valid for gpio pins. + Valid values are: + + gpio, blsp_uart1, blsp_spi1, smb_int, blsp_i2c1, blsp_spi2, + blsp_uart2, blsp_i2c2, gcc_gp1_clk_b, blsp_spi3, + qdss_tracedata_b, blsp_i2c3, gcc_gp2_clk_b, gcc_gp3_clk_b, + blsp_spi4, cap_int, blsp_i2c4, blsp_spi5, blsp_uart5, + qdss_traceclk_a, m_voc, blsp_i2c5, qdss_tracectl_a, + qdss_tracedata_a, blsp_spi6, blsp_uart6, qdss_tracectl_b, + blsp_i2c6, qdss_traceclk_b, mdp_vsync, pri_mi2s_mclk_a, + sec_mi2s_mclk_a, cam_mclk, cci0_i2c, cci1_i2c, blsp1_spi, + blsp3_spi, gcc_gp1_clk_a, gcc_gp2_clk_a, gcc_gp3_clk_a, + uim_batt, sd_write, uim1_data, uim1_clk, uim1_reset, + uim1_present, uim2_data, uim2_clk, uim2_reset, + uim2_present, ts_xvdd, mipi_dsi0, us_euro, ts_resout, + ts_sample, sec_mi2s_mclk_b, pri_mi2s, codec_reset, + cdc_pdm0, us_emitter, pri_mi2s_mclk_b, pri_mi2s_mclk_c, + lpass_slimbus, lpass_slimbus0, lpass_slimbus1, codec_int1, + codec_int2, wcss_bt, sdc3, wcss_wlan2, wcss_wlan1, + wcss_wlan0, wcss_wlan, wcss_fm, key_volp, key_snapshot, + key_focus, key_home, pwr_down, dmic0_clk, hdmi_int, + dmic0_data, wsa_vi, wsa_en, blsp_spi8, wsa_irq, blsp_i2c8, + pa_indicator, modem_tsync, ssbi_wtr1, gsm1_tx, gsm0_tx, + sdcard_det, sec_mi2s, ss_switch, + +- bias-disable: + 
Usage: optional + Value type: <none> + Definition: The specified pins should be configured as no pull. + +- bias-pull-down: + Usage: optional + Value type: <none> + Definition: The specified pins should be configured as pull down. + +- bias-pull-up: + Usage: optional + Value type: <none> + Definition: The specified pins should be configured as pull up. + +- output-high: + Usage: optional + Value type: <none> + Definition: The specified pins are configured in output mode, driven + high. + Not valid for sdc pins. + +- output-low: + Usage: optional + Value type: <none> + Definition: The specified pins are configured in output mode, driven + low. + Not valid for sdc pins. + +- drive-strength: + Usage: optional + Value type: <u32> + Definition: Selects the drive strength for the specified pins, in mA. + Valid values are: 2, 4, 6, 8, 10, 12, 14 and 16 + +Example: + + tlmm: pinctrl@1000000 { + compatible = "qcom,msm8976-pinctrl"; + reg = <0x1000000 0x300000>; + interrupts = <GIC_SPI 208 IRQ_TYPE_LEVEL_HIGH>; + gpio-controller; + #gpio-cells = <2>; + gpio-ranges = <&tlmm 0 0 145>; + interrupt-controller; + #interrupt-cells = <2>; + + blsp1_uart2_active: blsp1_uart2_active { + mux { + pins = "gpio4", "gpio5", "gpio6", "gpio7"; + function = "blsp_uart2"; + }; + + config { + pins = "gpio4", "gpio5", "gpio6", "gpio7"; + drive-strength = <2>; + bias-disable; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt index c32bf3237545..7be5de8d253f 100644 --- a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt +++ b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt @@ -15,14 +15,18 @@ PMIC's from Qualcomm. 
"qcom,pm8917-gpio" "qcom,pm8921-gpio" "qcom,pm8941-gpio" + "qcom,pm8950-gpio" "qcom,pm8994-gpio" "qcom,pm8998-gpio" "qcom,pma8084-gpio" + "qcom,pmi8950-gpio" "qcom,pmi8994-gpio" "qcom,pmi8998-gpio" "qcom,pms405-gpio" "qcom,pm8150-gpio" "qcom,pm8150b-gpio" + "qcom,pm6150-gpio" + "qcom,pm6150l-gpio" And must contain either "qcom,spmi-gpio" or "qcom,ssbi-gpio" if the device is on an spmi bus or an ssbi bus respectively @@ -91,15 +95,19 @@ to specify in a pin configuration subnode: gpio1-gpio38 for pm8917 gpio1-gpio44 for pm8921 gpio1-gpio36 for pm8941 + gpio1-gpio8 for pm8950 (hole on gpio3) gpio1-gpio22 for pm8994 gpio1-gpio26 for pm8998 gpio1-gpio22 for pma8084 + gpio1-gpio2 for pmi8950 gpio1-gpio10 for pmi8994 gpio1-gpio12 for pms405 (holes on gpio1, gpio9 and gpio10) gpio1-gpio10 for pm8150 (holes on gpio2, gpio5, gpio7 and gpio8) gpio1-gpio12 for pm8150b (holes on gpio3, gpio4, gpio7) gpio1-gpio12 for pm8150l (hole on gpio7) + gpio1-gpio10 for pm6150 + gpio1-gpio12 for pm6150l - function: Usage: required diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-mpp.txt b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-mpp.txt index 2ab95bc26066..448d36a85730 100644 --- a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-mpp.txt +++ b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-mpp.txt @@ -16,6 +16,8 @@ of PMIC's from Qualcomm. 
"qcom,pm8917-mpp", "qcom,pm8921-mpp", "qcom,pm8941-mpp", + "qcom,pm8950-mpp", + "qcom,pmi8950-mpp", "qcom,pm8994-mpp", "qcom,pma8084-mpp", @@ -80,6 +82,8 @@ to specify in a pin configuration subnode: mpp1-mpp4 for pm8841 mpp1-mpp4 for pm8916 mpp1-mpp8 for pm8941 + mpp1-mpp4 for pm8950 + mpp1-mpp4 for pmi8950 mpp1-mpp4 for pma8084 - function: diff --git a/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt index 3902efa18fd0..6eada23eaa31 100644 --- a/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt @@ -18,6 +18,7 @@ Required Properties: - "renesas,pfc-r8a7745": for R8A7745 (RZ/G1E) compatible pin-controller. - "renesas,pfc-r8a77470": for R8A77470 (RZ/G1C) compatible pin-controller. - "renesas,pfc-r8a774a1": for R8A774A1 (RZ/G2M) compatible pin-controller. + - "renesas,pfc-r8a774b1": for R8A774B1 (RZ/G2N) compatible pin-controller. - "renesas,pfc-r8a774c0": for R8A774C0 (RZ/G2E) compatible pin-controller. - "renesas,pfc-r8a7778": for R8A7778 (R-Car M1) compatible pin-controller. - "renesas,pfc-r8a7779": for R8A7779 (R-Car H1) compatible pin-controller. @@ -27,7 +28,8 @@ Required Properties: - "renesas,pfc-r8a7793": for R8A7793 (R-Car M2-N) compatible pin-controller. - "renesas,pfc-r8a7794": for R8A7794 (R-Car E2) compatible pin-controller. - "renesas,pfc-r8a7795": for R8A7795 (R-Car H3) compatible pin-controller. - - "renesas,pfc-r8a7796": for R8A7796 (R-Car M3-W) compatible pin-controller. + - "renesas,pfc-r8a7796": for R8A77960 (R-Car M3-W) compatible pin-controller. + - "renesas,pfc-r8a77961": for R8A77961 (R-Car M3-W+) compatible pin-controller. - "renesas,pfc-r8a77965": for R8A77965 (R-Car M3-N) compatible pin-controller. - "renesas,pfc-r8a77970": for R8A77970 (R-Car V3M) compatible pin-controller. - "renesas,pfc-r8a77980": for R8A77980 (R-Car V3H) compatible pin-controller. 
diff --git a/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt index 0919db294c17..2113cfaa26e6 100644 --- a/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt @@ -29,6 +29,7 @@ Required properties for iomux controller: "rockchip,rk3188-pinctrl": for Rockchip RK3188 "rockchip,rk3228-pinctrl": for Rockchip RK3228 "rockchip,rk3288-pinctrl": for Rockchip RK3288 + "rockchip,rk3308-pinctrl": for Rockchip RK3308 "rockchip,rk3328-pinctrl": for Rockchip RK3328 "rockchip,rk3368-pinctrl": for Rockchip RK3368 "rockchip,rk3399-pinctrl": for Rockchip RK3399 diff --git a/Documentation/devicetree/bindings/power/supply/cpcap-charger.txt b/Documentation/devicetree/bindings/power/supply/cpcap-charger.txt index 80bd873c3b1d..6048f636783f 100644 --- a/Documentation/devicetree/bindings/power/supply/cpcap-charger.txt +++ b/Documentation/devicetree/bindings/power/supply/cpcap-charger.txt @@ -5,7 +5,8 @@ Required properties: - interrupts: Interrupt specifier for each name in interrupt-names - interrupt-names: Should contain the following entries: "chrg_det", "rvrs_chrg", "chrg_se1b", "se0conn", - "rvrs_mode", "chrgcurr1", "vbusvld", "battdetb" + "rvrs_mode", "chrgcurr2", "chrgcurr1", "vbusvld", + "battdetb" - io-channels: IIO ADC channel specifier for each name in io-channel-names - io-channel-names: Should contain the following entries: "battdetb", "battp", "vbus", "chg_isense", "batti" @@ -21,11 +22,13 @@ cpcap_charger: charger { compatible = "motorola,mapphone-cpcap-charger"; interrupts-extended = < &cpcap 13 0 &cpcap 12 0 &cpcap 29 0 &cpcap 28 0 - &cpcap 22 0 &cpcap 20 0 &cpcap 19 0 &cpcap 54 0 + &cpcap 22 0 &cpcap 21 0 &cpcap 20 0 &cpcap 19 0 + &cpcap 54 0 >; interrupt-names = "chrg_det", "rvrs_chrg", "chrg_se1b", "se0conn", - "rvrs_mode", "chrgcurr1", "vbusvld", "battdetb"; + "rvrs_mode", "chrgcurr2", "chrgcurr1", "vbusvld", + 
"battdetb"; mode-gpios = <&gpio3 29 GPIO_ACTIVE_LOW &gpio3 23 GPIO_ACTIVE_LOW>; io-channels = <&cpcap_adc 0 &cpcap_adc 1 diff --git a/Documentation/devicetree/bindings/ptp/ptp-idtcm.yaml b/Documentation/devicetree/bindings/ptp/ptp-idtcm.yaml new file mode 100644 index 000000000000..9e21b83d717e --- /dev/null +++ b/Documentation/devicetree/bindings/ptp/ptp-idtcm.yaml @@ -0,0 +1,69 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/ptp/ptp-idtcm.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: IDT ClockMatrix (TM) PTP Clock Device Tree Bindings + +maintainers: + - Vincent Cheng <vincent.cheng.xh@renesas.com> + +properties: + compatible: + enum: + # For System Synchronizer + - idt,8a34000 + - idt,8a34001 + - idt,8a34002 + - idt,8a34003 + - idt,8a34004 + - idt,8a34005 + - idt,8a34006 + - idt,8a34007 + - idt,8a34008 + - idt,8a34009 + # For Port Synchronizer + - idt,8a34010 + - idt,8a34011 + - idt,8a34012 + - idt,8a34013 + - idt,8a34014 + - idt,8a34015 + - idt,8a34016 + - idt,8a34017 + - idt,8a34018 + - idt,8a34019 + # For Universal Frequency Translator (UFT) + - idt,8a34040 + - idt,8a34041 + - idt,8a34042 + - idt,8a34043 + - idt,8a34044 + - idt,8a34045 + - idt,8a34046 + - idt,8a34047 + - idt,8a34048 + - idt,8a34049 + + reg: + maxItems: 1 + description: + I2C slave address of the device. 
+ +required: + - compatible + - reg + +examples: + - | + i2c@1 { + compatible = "abc,acme-1234"; + reg = <0x01 0x400>; + #address-cells = <1>; + #size-cells = <0>; + phc@5b { + compatible = "idt,8a34000"; + reg = <0x5b>; + }; + }; diff --git a/Documentation/devicetree/bindings/regulator/fixed-regulator.yaml b/Documentation/devicetree/bindings/regulator/fixed-regulator.yaml index f32416968197..59b4b73d4051 100644 --- a/Documentation/devicetree/bindings/regulator/fixed-regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/fixed-regulator.yaml @@ -50,6 +50,10 @@ properties: description: startup time in microseconds $ref: /schemas/types.yaml#/definitions/uint32 + off-on-delay-us: + description: off delay time in microseconds + $ref: /schemas/types.yaml#/definitions/uint32 + enable-active-high: description: Polarity of GPIO is Active high. If this property is missing, diff --git a/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.txt b/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.txt index bab9f71140b8..97c3e0b7611c 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.txt +++ b/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.txt @@ -28,6 +28,8 @@ Supported regulator node names: PM8150L: smps1 - smps8, ldo1 - ldo11, bob, flash, rgb PM8998: smps1 - smps13, ldo1 - ldo28, lvs1 - lvs2 PMI8998: bob + PM6150: smps1 - smps5, ldo1 - ldo19 + PM6150L: smps1 - smps8, ldo1 - ldo11, bob ======================== First Level Nodes - PMIC @@ -43,6 +45,8 @@ First Level Nodes - PMIC "qcom,pm8150l-rpmh-regulators" "qcom,pm8998-rpmh-regulators" "qcom,pmi8998-rpmh-regulators" + "qcom,pm6150-rpmh-regulators" + "qcom,pm6150l-rpmh-regulators" - qcom,pmic-id Usage: required diff --git a/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.txt b/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.txt index 45025b5b67f6..d126df043403 100644 --- 
a/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.txt +++ b/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.txt @@ -22,6 +22,7 @@ Regulator nodes are identified by their compatible: "qcom,rpm-pm8841-regulators" "qcom,rpm-pm8916-regulators" "qcom,rpm-pm8941-regulators" + "qcom,rpm-pm8950-regulators" "qcom,rpm-pm8994-regulators" "qcom,rpm-pm8998-regulators" "qcom,rpm-pma8084-regulators" @@ -57,6 +58,26 @@ Regulator nodes are identified by their compatible: - vdd_s1-supply: - vdd_s2-supply: - vdd_s3-supply: +- vdd_s4-supply: +- vdd_s4-supply: +- vdd_s5-supply: +- vdd_s6-supply: +- vdd_l1_l19-supply: +- vdd_l2_l23-supply: +- vdd_l3-supply: +- vdd_l4_l5_l6_l7_l16-supply: +- vdd_l8_l11_l12_l17_l22-supply: +- vdd_l9_l10_l13_l14_l15_l18-supply: +- vdd_l20-supply: +- vdd_l21-supply: + Usage: optional (pm8950 only) + Value type: <phandle> + Definition: reference to regulator supplying the input pin, as + described in the data sheet + +- vdd_s1-supply: +- vdd_s2-supply: +- vdd_s3-supply: - vdd_l1_l3-supply: - vdd_l2_lvs1_2_3-supply: - vdd_l4_l11-supply: diff --git a/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt b/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt index 430b8622bda1..f5cdac8b2847 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt +++ b/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt @@ -4,10 +4,12 @@ Qualcomm SPMI Regulators Usage: required Value type: <string> Definition: must be one of: + "qcom,pm8004-regulators" "qcom,pm8005-regulators" "qcom,pm8841-regulators" "qcom,pm8916-regulators" "qcom,pm8941-regulators" + "qcom,pm8950-regulators" "qcom,pm8994-regulators" "qcom,pmi8994-regulators" "qcom,pms405-regulators" @@ -76,6 +78,26 @@ Qualcomm SPMI Regulators - vdd_s2-supply: - vdd_s3-supply: - vdd_s4-supply: +- vdd_s4-supply: +- vdd_s5-supply: +- vdd_s6-supply: +- vdd_l1_l19-supply: +- vdd_l2_l23-supply: +- vdd_l3-supply: +- 
vdd_l4_l5_l6_l7_l16-supply: +- vdd_l8_l11_l12_l17_l22-supply: +- vdd_l9_l10_l13_l14_l15_l18-supply: +- vdd_l20-supply: +- vdd_l21-supply: + Usage: optional (pm8950 only) + Value type: <phandle> + Definition: reference to regulator supplying the input pin, as + described in the data sheet + +- vdd_s1-supply: +- vdd_s2-supply: +- vdd_s3-supply: +- vdd_s4-supply: - vdd_s5-supply: - vdd_s6-supply: - vdd_s7-supply: @@ -140,6 +162,9 @@ sub-node is identified using the node's name, with valid values listed for each of the PMICs below. pm8005: + s2, s5 + +pm8005: s1, s2, s3, s4 pm8841: diff --git a/Documentation/devicetree/bindings/regulator/regulator.yaml b/Documentation/devicetree/bindings/regulator/regulator.yaml index 02c3043ce419..92ff2e8ad572 100644 --- a/Documentation/devicetree/bindings/regulator/regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/regulator.yaml @@ -38,7 +38,12 @@ properties: type: boolean regulator-boot-on: - description: bootloader/firmware enabled regulator + description: bootloader/firmware enabled regulator. + It's expected that this regulator was left on by the bootloader. + If the bootloader didn't leave it on then OS should turn it on + at boot but shouldn't prevent it from being turned off later. + This property is intended to only be used for regulators where + software cannot read the state of the regulator. 
type: boolean regulator-allow-bypass: diff --git a/Documentation/devicetree/bindings/rng/atmel-trng.txt b/Documentation/devicetree/bindings/rng/atmel-trng.txt index 4ac5aaa2d024..3900ee4f3532 100644 --- a/Documentation/devicetree/bindings/rng/atmel-trng.txt +++ b/Documentation/devicetree/bindings/rng/atmel-trng.txt @@ -1,7 +1,7 @@ Atmel TRNG (True Random Number Generator) block Required properties: -- compatible : Should be "atmel,at91sam9g45-trng" +- compatible : Should be "atmel,at91sam9g45-trng" or "microchip,sam9x60-trng" - reg : Offset and length of the register set of this block - interrupts : the interrupt number for the TRNG block - clocks: should contain the TRNG clk source diff --git a/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt b/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt new file mode 100644 index 000000000000..65c04172fc8c --- /dev/null +++ b/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt @@ -0,0 +1,12 @@ +NPCM SoC Random Number Generator + +Required properties: +- compatible : "nuvoton,npcm750-rng" for the NPCM7XX BMC. +- reg : Specifies physical base address and size of the registers. + +Example: + +rng: rng@f000b000 { + compatible = "nuvoton,npcm750-rng"; + reg = <0xf000b000 0x8>; +}; diff --git a/Documentation/devicetree/bindings/rng/omap3_rom_rng.txt b/Documentation/devicetree/bindings/rng/omap3_rom_rng.txt new file mode 100644 index 000000000000..f315c9723bd2 --- /dev/null +++ b/Documentation/devicetree/bindings/rng/omap3_rom_rng.txt @@ -0,0 +1,27 @@ +OMAP ROM RNG driver binding + +Secure SoCs may provide RNG via secure ROM calls like Nokia N900 does. The +implementation can depend on the SoC secure ROM used. 
+ +- compatible: + Usage: required + Value type: <string> + Definition: must be "nokia,n900-rom-rng" + +- clocks: + Usage: required + Value type: <prop-encoded-array> + Definition: reference to the RNG interface clock + +- clock-names: + Usage: required + Value type: <stringlist> + Definition: must be "ick" + +Example: + + rom_rng: rng { + compatible = "nokia,n900-rom-rng"; + clocks = <&rng_ick>; + clock-names = "ick"; + }; diff --git a/Documentation/devicetree/bindings/rng/samsung,exynos5250-trng.txt b/Documentation/devicetree/bindings/rng/samsung,exynos5250-trng.txt new file mode 100644 index 000000000000..5a613a4ec780 --- /dev/null +++ b/Documentation/devicetree/bindings/rng/samsung,exynos5250-trng.txt @@ -0,0 +1,17 @@ +Exynos True Random Number Generator + +Required properties: + +- compatible : Should be "samsung,exynos5250-trng". +- reg : Specifies base physical address and size of the registers map. +- clocks : Phandle to clock-controller plus clock-specifier pair. +- clock-names : "secss" as a clock name. + +Example: + + rng@10830600 { + compatible = "samsung,exynos5250-trng"; + reg = <0x10830600 0x100>; + clocks = <&clock CLK_SSS>; + clock-names = "secss"; + }; diff --git a/Documentation/devicetree/bindings/security/tpm/google,cr50.txt b/Documentation/devicetree/bindings/security/tpm/google,cr50.txt new file mode 100644 index 000000000000..cd69c2efdd37 --- /dev/null +++ b/Documentation/devicetree/bindings/security/tpm/google,cr50.txt @@ -0,0 +1,19 @@ +* H1 Secure Microcontroller with Cr50 Firmware on SPI Bus. + +H1 Secure Microcontroller running Cr50 firmware provides several +functions, including TPM-like functionality. It communicates over +SPI using the FIFO protocol described in the PTP Spec, section 6. + +Required properties: +- compatible: Should be "google,cr50". +- spi-max-frequency: Maximum SPI frequency. 
+ +Example: + +&spi0 { + tpm@0 { + compatible = "google,cr50"; + reg = <0>; + spi-max-frequency = <800000>; + }; +}; diff --git a/Documentation/devicetree/bindings/sound/adi,adau7118.yaml b/Documentation/devicetree/bindings/sound/adi,adau7118.yaml new file mode 100644 index 000000000000..75e0cbe6be70 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/adi,adau7118.yaml @@ -0,0 +1,85 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/sound/adi,adau7118.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + + +title: Analog Devices ADAU7118 8 Channel PDM to I2S/TDM Converter + +maintainers: + - Nuno Sá <nuno.sa@analog.com> + +description: | + Analog Devices ADAU7118 8 Channel PDM to I2S/TDM Converter over I2C or HW + standalone mode. + https://www.analog.com/media/en/technical-documentation/data-sheets/ADAU7118.pdf + +properties: + compatible: + enum: + - adi,adau7118 + + reg: + maxItems: 1 + + "#sound-dai-cells": + const: 0 + + iovdd-supply: + description: Digital Input/Output Power Supply. + + dvdd-supply: + description: Internal Core Digital Power Supply. + + adi,decimation-ratio: + description: | + This property sets the decimation ratio of PDM to PCM audio data. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [64, 32, 16] + default: 64 + + adi,pdm-clk-map: + description: | + The ADAU7118 has two PDM clocks for the four inputs. Each input must be + assigned to one of these two clocks. This property sets the mapping + between the clocks and the inputs. 
+ allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - minItems: 4 + maxItems: 4 + items: + maximum: 1 + default: [0, 0, 1, 1] + +required: + - "#sound-dai-cells" + - compatible + - iovdd-supply + - dvdd-supply + +examples: + - | + i2c { + /* example with i2c support */ + #address-cells = <1>; + #size-cells = <0>; + adau7118_codec: audio-codec@14 { + compatible = "adi,adau7118"; + reg = <0x14>; + #sound-dai-cells = <0>; + iovdd-supply = <&supply>; + dvdd-supply = <&supply>; + adi,pdm-clk-map = <1 1 0 0>; + adi,decimation-ratio = <16>; + }; + }; + + /* example with hw standalone mode */ + adau7118_codec_hw: adau7118-codec-hw { + compatible = "adi,adau7118"; + #sound-dai-cells = <0>; + iovdd-supply = <&supply>; + dvdd-supply = <&supply>; + }; diff --git a/Documentation/devicetree/bindings/sound/allwinner,sun4i-a10-codec.yaml b/Documentation/devicetree/bindings/sound/allwinner,sun4i-a10-codec.yaml new file mode 100644 index 000000000000..b8f89c7258eb --- /dev/null +++ b/Documentation/devicetree/bindings/sound/allwinner,sun4i-a10-codec.yaml @@ -0,0 +1,267 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/sound/allwinner,sun4i-a10-codec.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Allwinner A10 Codec Device Tree Bindings + +maintainers: + - Chen-Yu Tsai <wens@csie.org> + - Maxime Ripard <maxime.ripard@bootlin.com> + +properties: + "#sound-dai-cells": + const: 0 + + compatible: + enum: + - allwinner,sun4i-a10-codec + - allwinner,sun6i-a31-codec + - allwinner,sun7i-a20-codec + - allwinner,sun8i-a23-codec + - allwinner,sun8i-h3-codec + - allwinner,sun8i-v3s-codec + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + items: + - description: Bus Clock + - description: Module Clock + + clock-names: + items: + - const: apb + - const: codec + + dmas: + items: + - description: RX DMA Channel + - description: TX DMA Channel + + dma-names: + items: + - const: rx + - const: tx + + resets: 
+ maxItems: 1 + + allwinner,audio-routing: + description: |- + A list of the connections between audio components. Each entry + is a pair of strings, the first being the connection's sink, the + second being the connection's source. + allOf: + - $ref: /schemas/types.yaml#definitions/non-unique-string-array + - minItems: 2 + maxItems: 18 + items: + enum: + # Audio Pins on the SoC + - HP + - HPCOM + - LINEIN + - LINEOUT + - MIC1 + - MIC2 + - MIC3 + + # Microphone Biases from the SoC + - HBIAS + - MBIAS + + # Board Connectors + - Headphone + - Headset Mic + - Line In + - Line Out + - Mic + - Speaker + + allwinner,codec-analog-controls: + $ref: /schemas/types.yaml#/definitions/phandle + description: Phandle to the codec analog controls in the PRCM + + allwinner,pa-gpios: + description: GPIO to enable the external amplifier + +required: + - "#sound-dai-cells" + - compatible + - reg + - interrupts + - clocks + - clock-names + - dmas + - dma-names + +allOf: + - if: + properties: + compatible: + enum: + - allwinner,sun6i-a31-codec + - allwinner,sun8i-a23-codec + - allwinner,sun8i-h3-codec + - allwinner,sun8i-v3s-codec + + then: + if: + properties: + compatible: + const: allwinner,sun6i-a31-codec + + then: + required: + - resets + - allwinner,audio-routing + + else: + required: + - resets + - allwinner,audio-routing + - allwinner,codec-analog-controls + + - if: + properties: + compatible: + enum: + - allwinner,sun6i-a31-codec + + then: + properties: + allwinner,audio-routing: + items: + enum: + - HP + - HPCOM + - LINEIN + - LINEOUT + - MIC1 + - MIC2 + - MIC3 + - HBIAS + - MBIAS + - Headphone + - Headset Mic + - Line In + - Line Out + - Mic + - Speaker + + - if: + properties: + compatible: + enum: + - allwinner,sun8i-a23-codec + + then: + properties: + allwinner,audio-routing: + items: + enum: + - HP + - HPCOM + - LINEIN + - MIC1 + - MIC2 + - HBIAS + - MBIAS + - Headphone + - Headset Mic + - Line In + - Line Out + - Mic + - Speaker + + - if: + properties: + compatible: + 
enum: + - allwinner,sun8i-h3-codec + + then: + properties: + allwinner,audio-routing: + items: + enum: + - HP + - HPCOM + - LINEIN + - LINEOUT + - MIC1 + - MIC2 + - HBIAS + - MBIAS + - Headphone + - Headset Mic + - Line In + - Line Out + - Mic + - Speaker + + - if: + properties: + compatible: + enum: + - allwinner,sun8i-v3s-codec + + then: + properties: + allwinner,audio-routing: + items: + enum: + - HP + - HPCOM + - MIC1 + - HBIAS + - Headphone + - Headset Mic + - Line In + - Line Out + - Mic + - Speaker + +additionalProperties: false + +examples: + - | + codec@1c22c00 { + #sound-dai-cells = <0>; + compatible = "allwinner,sun7i-a20-codec"; + reg = <0x01c22c00 0x40>; + interrupts = <0 30 4>; + clocks = <&apb0_gates 0>, <&codec_clk>; + clock-names = "apb", "codec"; + dmas = <&dma 0 19>, <&dma 0 19>; + dma-names = "rx", "tx"; + }; + + - | + codec@1c22c00 { + #sound-dai-cells = <0>; + compatible = "allwinner,sun6i-a31-codec"; + reg = <0x01c22c00 0x98>; + interrupts = <0 29 4>; + clocks = <&ccu 61>, <&ccu 135>; + clock-names = "apb", "codec"; + resets = <&ccu 42>; + dmas = <&dma 15>, <&dma 15>; + dma-names = "rx", "tx"; + allwinner,audio-routing = + "Headphone", "HP", + "Speaker", "LINEOUT", + "LINEIN", "Line In", + "MIC1", "MBIAS", + "MIC1", "Mic", + "MIC2", "HBIAS", + "MIC2", "Headset Mic"; + }; + +... 
diff --git a/Documentation/devicetree/bindings/sound/allwinner,sun8i-a23-codec-analog.yaml b/Documentation/devicetree/bindings/sound/allwinner,sun8i-a23-codec-analog.yaml new file mode 100644 index 000000000000..85305b4c2729 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/allwinner,sun8i-a23-codec-analog.yaml @@ -0,0 +1,38 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/sound/allwinner,sun8i-a23-codec-analog.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Allwinner A23 Analog Codec Device Tree Bindings + +maintainers: + - Chen-Yu Tsai <wens@csie.org> + - Maxime Ripard <maxime.ripard@bootlin.com> + +properties: + compatible: + enum: + # FIXME: This is documented in the PRCM binding, but needs to be + # migrated here at some point + # - allwinner,sun8i-a23-codec-analog + - allwinner,sun8i-h3-codec-analog + - allwinner,sun8i-v3s-codec-analog + + reg: + maxItems: 1 + +required: + - compatible + - reg + +additionalProperties: false + +examples: + - | + codec_analog: codec-analog@1f015c0 { + compatible = "allwinner,sun8i-h3-codec-analog"; + reg = <0x01f015c0 0x4>; + }; + +... 
diff --git a/Documentation/devicetree/bindings/sound/arndale.txt b/Documentation/devicetree/bindings/sound/arndale.txt index 0e76946385ae..17530120ccfc 100644 --- a/Documentation/devicetree/bindings/sound/arndale.txt +++ b/Documentation/devicetree/bindings/sound/arndale.txt @@ -1,8 +1,9 @@ Audio Binding for Arndale boards Required properties: -- compatible : Can be the following, - "samsung,arndale-rt5631" +- compatible : Can be one of the following: + "samsung,arndale-rt5631", + "samsung,arndale-wm1811" - samsung,audio-cpu: The phandle of the Samsung I2S controller - samsung,audio-codec: The phandle of the audio codec diff --git a/Documentation/devicetree/bindings/sound/fsl,mqs.txt b/Documentation/devicetree/bindings/sound/fsl,mqs.txt new file mode 100644 index 000000000000..40353fc30255 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/fsl,mqs.txt @@ -0,0 +1,36 @@ +fsl,mqs audio CODEC + +Required properties: + - compatible : Must contain one of "fsl,imx6sx-mqs", "fsl,codec-mqs", + "fsl,imx8qm-mqs", "fsl,imx8qxp-mqs". + - clocks : A list of phandles + clock-specifiers, one for each entry in + clock-names + - clock-names : "mclk" - always required. + "core" - required if compatible is "fsl,imx8qm-mqs"; it + is used for register access. + - gpr : A phandle of General Purpose Registers in IOMUX Controller. + Required if compatible is "fsl,imx6sx-mqs". + +Required if compatible is "fsl,imx8qm-mqs": + - power-domains: A phandle of PM domain provider node. + - reg: Offset and length of the register set for the device. 
+ +Example: + +mqs: mqs { + compatible = "fsl,imx6sx-mqs"; + gpr = <&gpr>; + clocks = <&clks IMX6SX_CLK_SAI1>; + clock-names = "mclk"; + status = "disabled"; +}; + +mqs: mqs@59850000 { + compatible = "fsl,imx8qm-mqs"; + reg = <0x59850000 0x10000>; + clocks = <&clk IMX8QM_AUD_MQS_IPG>, + <&clk IMX8QM_AUD_MQS_HMCLK>; + clock-names = "core", "mclk"; + power-domains = <&pd_mqs0>; + status = "disabled"; +}; diff --git a/Documentation/devicetree/bindings/sound/google,cros-ec-codec.txt b/Documentation/devicetree/bindings/sound/google,cros-ec-codec.txt index 1084f7f22eea..8ca52dcc5572 100644 --- a/Documentation/devicetree/bindings/sound/google,cros-ec-codec.txt +++ b/Documentation/devicetree/bindings/sound/google,cros-ec-codec.txt @@ -1,4 +1,4 @@ -* Audio codec controlled by ChromeOS EC +Audio codec controlled by ChromeOS EC Google's ChromeOS EC codec is a digital mic codec provided by the Embedded Controller (EC) and is controlled via a host-command interface. @@ -9,10 +9,27 @@ Documentation/devicetree/bindings/mfd/cros-ec.txt). Required properties: - compatible: Must contain "google,cros-ec-codec" - #sound-dai-cells: Should be 1. The cell specifies number of DAIs. -- max-dmic-gain: A number for maximum gain in dB on digital microphone. + +Optional properties: +- reg: Physical base address and length of shared memory region from EC. + It contains 3 unsigned 32-bit integers. The first 2 integers + combine to become an unsigned 64-bit physical address. The last + integer is the length of the shared memory. +- memory-region: Shared memory region to EC. A "shared-dma-pool". See + ../reserved-memory/reserved-memory.txt for details. Example: +{ + ... 
+ + reserved_mem: reserved_mem { + compatible = "shared-dma-pool"; + reg = <0 0x52800000 0 0x100000>; + no-map; + }; +} + cros-ec@0 { compatible = "google,cros-ec-spi"; @@ -21,6 +38,7 @@ cros-ec@0 { cros_ec_codec: ec-codec { compatible = "google,cros-ec-codec"; #sound-dai-cells = <1>; - max-dmic-gain = <43>; + reg = <0x0 0x10500000 0x80000>; + memory-region = <&reserved_mem>; }; }; diff --git a/Documentation/devicetree/bindings/sound/mt8183-afe-pcm.txt b/Documentation/devicetree/bindings/sound/mt8183-afe-pcm.txt index 396ba38619f6..1f1cba4152ce 100644 --- a/Documentation/devicetree/bindings/sound/mt8183-afe-pcm.txt +++ b/Documentation/devicetree/bindings/sound/mt8183-afe-pcm.txt @@ -4,6 +4,10 @@ Required properties: - compatible = "mediatek,mt68183-audio"; - reg: register location and size - interrupts: should contain AFE interrupt +- resets: Must contain an entry for each entry in reset-names + See ../reset/reset.txt for details. +- reset-names: should have these reset names: + "audiosys"; - power-domains: should define the power domain - clocks: Must contain an entry for each entry in clock-names - clock-names: should have these clock names: @@ -20,6 +24,8 @@ Example: compatible = "mediatek,mt8183-audio"; reg = <0 0x11220000 0 0x1000>; interrupts = <GIC_SPI 161 IRQ_TYPE_LEVEL_LOW>; + resets = <&watchdog MT8183_TOPRGU_AUDIO_SW_RST>; + reset-names = "audiosys"; power-domains = <&scpsys MT8183_POWER_DOMAIN_AUDIO>; clocks = <&infrasys CLK_INFRA_AUDIO>, <&infrasys CLK_INFRA_AUDIO_26M_BCLK>, diff --git a/Documentation/devicetree/bindings/sound/mt8183-mt6358-ts3a227-max98357.txt b/Documentation/devicetree/bindings/sound/mt8183-mt6358-ts3a227-max98357.txt index d6d5207fa996..decaa013a07e 100644 --- a/Documentation/devicetree/bindings/sound/mt8183-mt6358-ts3a227-max98357.txt +++ b/Documentation/devicetree/bindings/sound/mt8183-mt6358-ts3a227-max98357.txt @@ -2,14 +2,19 @@ MT8183 with MT6358, TS3A227 and MAX98357 CODECS Required properties: - compatible : 
"mediatek,mt8183_mt6358_ts3a227_max98357" -- mediatek,headset-codec: the phandles of ts3a227 codecs - mediatek,platform: the phandle of MT8183 ASoC platform +Optional properties: +- mediatek,headset-codec: the phandles of ts3a227 codecs +- mediatek,ec-codec: the phandle of EC codecs. + See google,cros-ec-codec.txt for more details. + Example: sound { compatible = "mediatek,mt8183_mt6358_ts3a227_max98357"; mediatek,headset-codec = <&ts3a227>; + mediatek,ec-codec = <&ec_codec>; mediatek,platform = <&afe>; }; diff --git a/Documentation/devicetree/bindings/sound/renesas,fsi.txt b/Documentation/devicetree/bindings/sound/renesas,fsi.txt deleted file mode 100644 index 0cf0f819b823..000000000000 --- a/Documentation/devicetree/bindings/sound/renesas,fsi.txt +++ /dev/null @@ -1,31 +0,0 @@ -Renesas FSI - -Required properties: -- compatible : "renesas,fsi2-<soctype>", - "renesas,sh_fsi2" or "renesas,sh_fsi" as - fallback. - Examples with soctypes are: - - "renesas,fsi2-r8a7740" (R-Mobile A1) - - "renesas,fsi2-sh73a0" (SH-Mobile AG5) -- reg : Should contain the register physical address and length -- interrupts : Should contain FSI interrupt - -- fsia,spdif-connection : FSI is connected by S/PDIF -- fsia,stream-mode-support : FSI supports 16bit stream mode. -- fsia,use-internal-clock : FSI uses internal clock when master mode. 
- -- fsib,spdif-connection : same as fsia -- fsib,stream-mode-support : same as fsia -- fsib,use-internal-clock : same as fsia - -Example: - -sh_fsi2: sh_fsi2@ec230000 { - compatible = "renesas,sh_fsi2"; - reg = <0xec230000 0x400>; - interrupts = <0 146 0x4>; - - fsia,spdif-connection; - fsia,stream-mode-support; - fsia,use-internal-clock; -}; diff --git a/Documentation/devicetree/bindings/sound/renesas,fsi.yaml b/Documentation/devicetree/bindings/sound/renesas,fsi.yaml new file mode 100644 index 000000000000..140a37fc3c0b --- /dev/null +++ b/Documentation/devicetree/bindings/sound/renesas,fsi.yaml @@ -0,0 +1,76 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/sound/renesas,fsi.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Renesas FSI Sound Driver Device Tree Bindings + +maintainers: + - Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> + +properties: + $nodename: + pattern: "^sound@.*" + + compatible: + oneOf: + # for FSI2 SoC + - items: + - enum: + - renesas,fsi2-sh73a0 + - renesas,fsi2-r8a7740 + - enum: + - renesas,sh_fsi2 + # for Generic + - items: + - enum: + - renesas,sh_fsi + - renesas,sh_fsi2 + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + fsia,spdif-connection: + $ref: /schemas/types.yaml#/definitions/flag + description: FSI is connected by S/PDIF + + fsia,stream-mode-support: + $ref: /schemas/types.yaml#/definitions/flag + description: FSI supports 16bit stream mode + + fsia,use-internal-clock: + $ref: /schemas/types.yaml#/definitions/flag + description: FSI uses internal clock when master mode + + fsib,spdif-connection: + $ref: /schemas/types.yaml#/definitions/flag + description: same as fsia + + fsib,stream-mode-support: + $ref: /schemas/types.yaml#/definitions/flag + description: same as fsia + + fsib,use-internal-clock: + $ref: /schemas/types.yaml#/definitions/flag + description: same as fsia + +required: + - compatible + - reg + - interrupts + +examples: + - | + sh_fsi2: 
sound@ec230000 { + compatible = "renesas,fsi2-r8a7740", "renesas,sh_fsi2"; + reg = <0xec230000 0x400>; + interrupts = <0 146 0x4>; + + fsia,spdif-connection; + fsia,stream-mode-support; + fsia,use-internal-clock; + }; diff --git a/Documentation/devicetree/bindings/sound/renesas,rsnd.txt b/Documentation/devicetree/bindings/sound/renesas,rsnd.txt index 5c52182f7dcf..797fd035434c 100644 --- a/Documentation/devicetree/bindings/sound/renesas,rsnd.txt +++ b/Documentation/devicetree/bindings/sound/renesas,rsnd.txt @@ -268,6 +268,7 @@ Required properties: - "renesas,rcar_sound-r8a7745" (RZ/G1E) - "renesas,rcar_sound-r8a77470" (RZ/G1C) - "renesas,rcar_sound-r8a774a1" (RZ/G2M) + - "renesas,rcar_sound-r8a774b1" (RZ/G2N) - "renesas,rcar_sound-r8a774c0" (RZ/G2E) - "renesas,rcar_sound-r8a7778" (R-Car M1A) - "renesas,rcar_sound-r8a7779" (R-Car H1) diff --git a/Documentation/devicetree/bindings/sound/rockchip-max98090.txt b/Documentation/devicetree/bindings/sound/rockchip-max98090.txt index a805aa99ad75..e9c58b204399 100644 --- a/Documentation/devicetree/bindings/sound/rockchip-max98090.txt +++ b/Documentation/devicetree/bindings/sound/rockchip-max98090.txt @@ -5,15 +5,38 @@ Required properties: - rockchip,model: The user-visible name of this sound complex - rockchip,i2s-controller: The phandle of the Rockchip I2S controller that's connected to the CODEC -- rockchip,audio-codec: The phandle of the MAX98090 audio codec -- rockchip,headset-codec: The phandle of Ext chip for jack detection + +Optional properties: +- rockchip,audio-codec: The phandle of the MAX98090 audio codec. +- rockchip,headset-codec: The phandle of Ext chip for jack detection. This is + required if there is rockchip,audio-codec. +- rockchip,hdmi-codec: The phandle of HDMI device for HDMI codec. Example: +/* For max98090-only board. 
*/ +sound { + compatible = "rockchip,rockchip-audio-max98090"; + rockchip,model = "ROCKCHIP-I2S"; + rockchip,i2s-controller = <&i2s>; + rockchip,audio-codec = <&max98090>; + rockchip,headset-codec = <&headsetcodec>; +}; + +/* For HDMI-only board. */ +sound { + compatible = "rockchip,rockchip-audio-max98090"; + rockchip,model = "ROCKCHIP-I2S"; + rockchip,i2s-controller = <&i2s>; + rockchip,hdmi-codec = <&hdmi>; +}; + +/* For max98090 plus HDMI board. */ sound { compatible = "rockchip,rockchip-audio-max98090"; rockchip,model = "ROCKCHIP-I2S"; rockchip,i2s-controller = <&i2s>; rockchip,audio-codec = <&max98090>; rockchip,headset-codec = <&headsetcodec>; + rockchip,hdmi-codec = <&hdmi>; }; diff --git a/Documentation/devicetree/bindings/sound/rt1011.txt b/Documentation/devicetree/bindings/sound/rt1011.txt index 35a23e60d679..02d53b9aa247 100644 --- a/Documentation/devicetree/bindings/sound/rt1011.txt +++ b/Documentation/devicetree/bindings/sound/rt1011.txt @@ -20,6 +20,14 @@ Required properties: | 1 | 1 | 0x3b | ------------------------------------- +Optional properties: + +- realtek,temperature_calib + u32. The temperature was measured while doing the calibration. Units: Celsius degree + +- realtek,r0_calib + u32. This is r0 calibration data which was measured in factory mode. + Pins on the device (for linking into audio routes) for RT1011: * SPO @@ -29,4 +37,6 @@ Example: rt1011: codec@38 { compatible = "realtek,rt1011"; reg = <0x38>; + realtek,temperature_calib = <25>; + realtek,r0_calib = <0x224050>; }; diff --git a/Documentation/devicetree/bindings/sound/rt5682.txt b/Documentation/devicetree/bindings/sound/rt5682.txt index 312e9a129530..30e927a28369 100644 --- a/Documentation/devicetree/bindings/sound/rt5682.txt +++ b/Documentation/devicetree/bindings/sound/rt5682.txt @@ -27,6 +27,11 @@ Optional properties: - realtek,ldo1-en-gpios : The GPIO that controls the CODEC's LDO1_EN pin. +- realtek,btndet-delay + The debounce delay for push button. 
+ The delay time is realtek,btndet-delay value multiple of 8.192 ms. + If absent, the default is 16. + Pins on the device (for linking into audio routes) for RT5682: * DMIC L1 @@ -47,4 +52,5 @@ rt5682 { realtek,dmic1-data-pin = <1>; realtek,dmic1-clk-pin = <1>; realtek,jd-src = <1>; + realtek,btndet-delay = <16>; }; diff --git a/Documentation/devicetree/bindings/sound/samsung,odroid.txt b/Documentation/devicetree/bindings/sound/samsung,odroid.txt deleted file mode 100644 index e9da2200e173..000000000000 --- a/Documentation/devicetree/bindings/sound/samsung,odroid.txt +++ /dev/null @@ -1,54 +0,0 @@ -Samsung Exynos Odroid XU3/XU4 audio complex with MAX98090 codec - -Required properties: - - - compatible - "hardkernel,odroid-xu3-audio" - for Odroid XU3 board, - "hardkernel,odroid-xu4-audio" - for Odroid XU4 board (deprecated), - "samsung,odroid-xu3-audio" - for Odroid XU3 board (deprecated), - "samsung,odroid-xu4-audio" - for Odroid XU4 board (deprecated) - - model - the user-visible name of this sound complex - - clocks - should contain entries matching clock names in the clock-names - property - - samsung,audio-widgets - this property specifies off-codec audio elements - like headphones or speakers, for details see widgets.txt - - samsung,audio-routing - a list of the connections between audio - components; each entry is a pair of strings, the first being the - connection's sink, the second being the connection's source; - valid names for sources and sinks are the MAX98090's pins (as - documented in its binding), and the jacks on the board - - For Odroid X2: - "Headphone Jack", "Mic Jack", "DMIC" - - For Odroid U3, XU3: - "Headphone Jack", "Speakers" - - For Odroid XU4: - no entries - -Required sub-nodes: - - - 'cpu' subnode with a 'sound-dai' property containing the phandle of the I2S - controller - - 'codec' subnode with a 'sound-dai' property containing list of phandles - to the CODEC nodes, first entry must be corresponding to the MAX98090 - CODEC and the second 
entry must be the phandle of the HDMI IP block node - -Example: - -sound { - compatible = "hardkernel,odroid-xu3-audio"; - model = "Odroid-XU3"; - samsung,audio-routing = - "Headphone Jack", "HPL", - "Headphone Jack", "HPR", - "IN1", "Mic Jack", - "Mic Jack", "MICBIAS"; - - cpu { - sound-dai = <&i2s0 0>; - }; - codec { - sound-dai = <&hdmi>, <&max98090>; - }; -}; diff --git a/Documentation/devicetree/bindings/sound/samsung,odroid.yaml b/Documentation/devicetree/bindings/sound/samsung,odroid.yaml new file mode 100644 index 000000000000..c6b244352d05 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/samsung,odroid.yaml @@ -0,0 +1,91 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/sound/samsung,odroid.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Samsung Exynos Odroid XU3/XU4 audio complex with MAX98090 codec + +maintainers: + - Krzysztof Kozlowski <krzk@kernel.org> + - Sylwester Nawrocki <s.nawrocki@samsung.com> + +properties: + compatible: + oneOf: + - const: hardkernel,odroid-xu3-audio + + - const: hardkernel,odroid-xu4-audio + deprecated: true + + - const: samsung,odroid-xu3-audio + deprecated: true + + - const: samsung,odroid-xu4-audio + deprecated: true + + model: + $ref: /schemas/types.yaml#/definitions/string + description: The user-visible name of this sound complex. + + cpu: + type: object + properties: + sound-dai: + $ref: /schemas/types.yaml#/definitions/phandle-array + description: phandles to the I2S controllers + + codec: + type: object + properties: + sound-dai: + $ref: /schemas/types.yaml#/definitions/phandle-array + description: | + List of phandles to the CODEC nodes, + first entry must be corresponding to the MAX98090 CODEC and + the second entry must be the phandle of the HDMI IP block node. 
+ + samsung,audio-routing: + $ref: /schemas/types.yaml#/definitions/non-unique-string-array + description: | + List of the connections between audio + components; each entry is a pair of strings, the first being the + connection's sink, the second being the connection's source; + valid names for sources and sinks are the MAX98090's pins (as + documented in its binding), and the jacks on the board. + For Odroid X2: "Headphone Jack", "Mic Jack", "DMIC" + For Odroid U3, XU3: "Headphone Jack", "Speakers" + For Odroid XU4: no entries + + samsung,audio-widgets: + $ref: /schemas/types.yaml#/definitions/non-unique-string-array + description: | + This property specifies off-codec audio elements + like headphones or speakers, for details see widgets.txt + +required: + - compatible + - model + - cpu + - codec + +examples: + - | + sound { + compatible = "hardkernel,odroid-xu3-audio"; + model = "Odroid-XU3"; + samsung,audio-routing = + "Headphone Jack", "HPL", + "Headphone Jack", "HPR", + "IN1", "Mic Jack", + "Mic Jack", "MICBIAS"; + + cpu { + sound-dai = <&i2s0 0>; + }; + + codec { + sound-dai = <&hdmi>, <&max98090>; + }; + }; + diff --git a/Documentation/devicetree/bindings/sound/samsung-i2s.txt b/Documentation/devicetree/bindings/sound/samsung-i2s.txt deleted file mode 100644 index a88cb00fa096..000000000000 --- a/Documentation/devicetree/bindings/sound/samsung-i2s.txt +++ /dev/null @@ -1,84 +0,0 @@ -* Samsung I2S controller - -Required SoC Specific Properties: - -- compatible : should be one of the following. - - samsung,s3c6410-i2s: for 8/16/24bit stereo I2S. - - samsung,s5pv210-i2s: for 8/16/24bit multichannel(5.1) I2S with - secondary fifo, s/w reset control and internal mux for root clk src. - - samsung,exynos5420-i2s: for 8/16/24bit multichannel(5.1) I2S for - playback, stereo channel capture, secondary fifo using internal - or external dma, s/w reset control, internal mux for root clk src - and 7.1 channel TDM support for playback. 
TDM (Time division multiplexing) - is to allow transfer of multiple channel audio data on single data line. - - samsung,exynos7-i2s: with all the available features of exynos5 i2s, - exynos7 I2S has 7.1 channel TDM support for capture, secondary fifo - with only external dma and more no.of root clk sampling frequencies. - - samsung,exynos7-i2s1: I2S1 on previous samsung platforms supports - stereo channels. exynos7 i2s1 upgraded to 5.1 multichannel with - slightly modified bit offsets. - -- reg: physical base address of the controller and length of memory mapped - region. -- dmas: list of DMA controller phandle and DMA request line ordered pairs. -- dma-names: identifier string for each DMA request line in the dmas property. - These strings correspond 1:1 with the ordered pairs in dmas. -- clocks: Handle to iis clock and RCLK source clk. -- clock-names: - i2s0 uses some base clocks from CMU and some are from audio subsystem internal - clock controller. The clock names for i2s0 should be "iis", "i2s_opclk0" and - "i2s_opclk1" as shown in the example below. - i2s1 and i2s2 uses clocks from CMU. The clock names for i2s1 and i2s2 should - be "iis" and "i2s_opclk0". - "iis" is the i2s bus clock and i2s_opclk0, i2s_opclk1 are sources of the root - clk. i2s0 has internal mux to select the source of root clk and i2s1 and i2s2 - doesn't have any such mux. -- #clock-cells: should be 1, this property must be present if the I2S device - is a clock provider in terms of the common clock bindings, described in - ../clock/clock-bindings.txt. -- clock-output-names (deprecated): from the common clock bindings, names of - the CDCLK I2S output clocks, suggested values are "i2s_cdclk0", "i2s_cdclk1", - "i2s_cdclk3" for the I2S0, I2S1, I2S2 devices respectively. 
- -There are following clocks available at the I2S device nodes: - CLK_I2S_CDCLK - the CDCLK (CODECLKO) gate clock, - CLK_I2S_RCLK_PSR - the RCLK prescaler divider clock (corresponding to the - IISPSR register), - CLK_I2S_RCLK_SRC - the RCLKSRC mux clock (corresponding to RCLKSRC bit in - IISMOD register). - -Refer to the SoC datasheet for availability of the above clocks. -The CLK_I2S_RCLK_PSR and CLK_I2S_RCLK_SRC clocks are usually only available -in the IIS Multi Audio Interface. - -Note: Old DTs may not have the #clock-cells property and then not use the I2S -node as a clock supplier. - -Optional SoC Specific Properties: - -- samsung,idma-addr: Internal DMA register base address of the audio - sub system(used in secondary sound source). -- pinctrl-0: Should specify pin control groups used for this controller. -- pinctrl-names: Should contain only one value - "default". -- #sound-dai-cells: should be 1. - - -Example: - -i2s0: i2s@3830000 { - compatible = "samsung,s5pv210-i2s"; - reg = <0x03830000 0x100>; - dmas = <&pdma0 10 - &pdma0 9 - &pdma0 8>; - dma-names = "tx", "rx", "tx-sec"; - clocks = <&clock_audss EXYNOS_I2S_BUS>, - <&clock_audss EXYNOS_I2S_BUS>, - <&clock_audss EXYNOS_SCLK_I2S>; - clock-names = "iis", "i2s_opclk0", "i2s_opclk1"; - #clock-cells = <1>; - samsung,idma-addr = <0x03000000>; - pinctrl-names = "default"; - pinctrl-0 = <&i2s0_bus>; - #sound-dai-cells = <1>; -}; diff --git a/Documentation/devicetree/bindings/sound/samsung-i2s.yaml b/Documentation/devicetree/bindings/sound/samsung-i2s.yaml new file mode 100644 index 000000000000..53e3bad4178c --- /dev/null +++ b/Documentation/devicetree/bindings/sound/samsung-i2s.yaml @@ -0,0 +1,138 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/sound/samsung-i2s.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Samsung SoC I2S controller + +maintainers: + - Krzysztof Kozlowski <krzk@kernel.org> + - Sylwester Nawrocki <s.nawrocki@samsung.com> + 
+properties: + compatible: + description: | + samsung,s3c6410-i2s: for 8/16/24bit stereo I2S. + + samsung,s5pv210-i2s: for 8/16/24bit multichannel (5.1) I2S with + secondary FIFO, s/w reset control and internal mux for root clock + source. + + samsung,exynos5420-i2s: for 8/16/24bit multichannel (5.1) I2S for + playback, stereo channel capture, secondary FIFO using internal + or external DMA, s/w reset control, internal mux for root clock + source and 7.1 channel TDM support for playback; TDM (Time division + multiplexing) is to allow transfer of multiple channel audio data on + single data line. + + samsung,exynos7-i2s: with all the available features of Exynos5 I2S. + Exynos7 I2S has 7.1 channel TDM support for capture, secondary FIFO + with only external DMA and more number of root clock sampling + frequencies. + + samsung,exynos7-i2s1: I2S1 on previous samsung platforms supports + stereo channels. Exynos7 I2S1 upgraded to 5.1 multichannel with + slightly modified bit offsets. + enum: + - samsung,s3c6410-i2s + - samsung,s5pv210-i2s + - samsung,exynos5420-i2s + - samsung,exynos7-i2s + - samsung,exynos7-i2s1 + + reg: + maxItems: 1 + + dmas: + minItems: 2 + maxItems: 3 + + dma-names: + oneOf: + - items: + - const: tx + - const: rx + - items: + - const: tx + - const: rx + - const: tx-sec + + clocks: + minItems: 1 + maxItems: 3 + + clock-names: + oneOf: + - items: + - const: iis + - items: # for I2S0 + - const: iis + - const: i2s_opclk0 + - const: i2s_opclk1 + - items: # for I2S1 and I2S2 + - const: iis + - const: i2s_opclk0 + description: | + "iis" is the I2S bus clock and i2s_opclk0, i2s_opclk1 are sources + of the root clock. I2S0 has internal mux to select the source + of root clock and I2S1 and I2S2 doesn't have any such mux. 
+ + "#clock-cells": + const: 1 + + clock-output-names: + deprecated: true + oneOf: + - items: # for I2S0 + - const: i2s_cdclk0 + - items: # for I2S1 + - const: i2s_cdclk1 + - items: # for I2S2 + - const: i2s_cdclk2 + description: Names of the CDCLK I2S output clocks. + + samsung,idma-addr: + $ref: /schemas/types.yaml#/definitions/uint32 + description: | + Internal DMA register base address of the audio + subsystem (used in secondary sound source). + + pinctrl-0: + description: Should specify pin control groups used for this controller. + + pinctrl-names: + const: default + + "#sound-dai-cells": + const: 1 + +required: + - compatible + - reg + - dmas + - dma-names + - clocks + - clock-names + +examples: + - | + #include <dt-bindings/clock/exynos-audss-clk.h> + + i2s0: i2s@3830000 { + compatible = "samsung,s5pv210-i2s"; + reg = <0x03830000 0x100>; + dmas = <&pdma0 10>, + <&pdma0 9>, + <&pdma0 8>; + dma-names = "tx", "rx", "tx-sec"; + clocks = <&clock_audss EXYNOS_I2S_BUS>, + <&clock_audss EXYNOS_I2S_BUS>, + <&clock_audss EXYNOS_SCLK_I2S>; + clock-names = "iis", "i2s_opclk0", "i2s_opclk1"; + #clock-cells = <1>; + samsung,idma-addr = <0x03000000>; + pinctrl-names = "default"; + pinctrl-0 = <&i2s0_bus>; + #sound-dai-cells = <1>; + }; diff --git a/Documentation/devicetree/bindings/sound/sun4i-codec.txt b/Documentation/devicetree/bindings/sound/sun4i-codec.txt deleted file mode 100644 index 66579bbd3294..000000000000 --- a/Documentation/devicetree/bindings/sound/sun4i-codec.txt +++ /dev/null @@ -1,94 +0,0 @@ -* Allwinner A10 Codec - -Required properties: -- compatible: must be one of the following compatibles: - - "allwinner,sun4i-a10-codec" - - "allwinner,sun6i-a31-codec" - - "allwinner,sun7i-a20-codec" - - "allwinner,sun8i-a23-codec" - - "allwinner,sun8i-h3-codec" - - "allwinner,sun8i-v3s-codec" -- reg: must contain the registers location and length -- interrupts: must contain the codec interrupt -- dmas: DMA channels for tx and rx dma. 
See the DMA client binding, - Documentation/devicetree/bindings/dma/dma.txt -- dma-names: should include "tx" and "rx". -- clocks: a list of phandle + clock-specifer pairs, one for each entry - in clock-names. -- clock-names: should contain the following: - - "apb": the parent APB clock for this controller - - "codec": the parent module clock - -Optional properties: -- allwinner,pa-gpios: gpio to enable external amplifier - -Required properties for the following compatibles: - - "allwinner,sun6i-a31-codec" - - "allwinner,sun8i-a23-codec" - - "allwinner,sun8i-h3-codec" - - "allwinner,sun8i-v3s-codec" -- resets: phandle to the reset control for this device -- allwinner,audio-routing: A list of the connections between audio components. - Each entry is a pair of strings, the first being the - connection's sink, the second being the connection's - source. Valid names include: - - Audio pins on the SoC: - "HP" - "HPCOM" - "LINEIN" (not on sun8i-v3s) - "LINEOUT" (not on sun8i-a23 or sun8i-v3s) - "MIC1" - "MIC2" (not on sun8i-v3s) - "MIC3" (sun6i-a31 only) - - Microphone biases from the SoC: - "HBIAS" - "MBIAS" (not on sun8i-v3s) - - Board connectors: - "Headphone" - "Headset Mic" - "Line In" - "Line Out" - "Mic" - "Speaker" - -Required properties for the following compatibles: - - "allwinner,sun8i-a23-codec" - - "allwinner,sun8i-h3-codec" - - "allwinner,sun8i-v3s-codec" -- allwinner,codec-analog-controls: A phandle to the codec analog controls - block in the PRCM. 
- -Example: -codec: codec@1c22c00 { - #sound-dai-cells = <0>; - compatible = "allwinner,sun7i-a20-codec"; - reg = <0x01c22c00 0x40>; - interrupts = <0 30 4>; - clocks = <&apb0_gates 0>, <&codec_clk>; - clock-names = "apb", "codec"; - dmas = <&dma 0 19>, <&dma 0 19>; - dma-names = "rx", "tx"; -}; - -codec: codec@1c22c00 { - #sound-dai-cells = <0>; - compatible = "allwinner,sun6i-a31-codec"; - reg = <0x01c22c00 0x98>; - interrupts = <GIC_SPI 29 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&ccu CLK_APB1_CODEC>, <&ccu CLK_CODEC>; - clock-names = "apb", "codec"; - resets = <&ccu RST_APB1_CODEC>; - dmas = <&dma 15>, <&dma 15>; - dma-names = "rx", "tx"; - allwinner,audio-routing = - "Headphone", "HP", - "Speaker", "LINEOUT", - "LINEIN", "Line In", - "MIC1", "MBIAS", - "MIC1", "Mic", - "MIC2", "HBIAS", - "MIC2", "Headset Mic"; -}; diff --git a/Documentation/devicetree/bindings/sound/sun8i-codec-analog.txt b/Documentation/devicetree/bindings/sound/sun8i-codec-analog.txt deleted file mode 100644 index 07356758bd91..000000000000 --- a/Documentation/devicetree/bindings/sound/sun8i-codec-analog.txt +++ /dev/null @@ -1,17 +0,0 @@ -* Allwinner Codec Analog Controls - -Required properties: -- compatible: must be one of the following compatibles: - - "allwinner,sun8i-a23-codec-analog" - - "allwinner,sun8i-h3-codec-analog" - - "allwinner,sun8i-v3s-codec-analog" - -Required properties if not a sub-node of the PRCM node: -- reg: must contain the registers location and length - -Example: -prcm: prcm@1f01400 { - codec_analog: codec-analog { - compatible = "allwinner,sun8i-a23-codec-analog"; - }; -}; diff --git a/Documentation/devicetree/bindings/sound/tas2562.txt b/Documentation/devicetree/bindings/sound/tas2562.txt new file mode 100644 index 000000000000..658e1fb18a99 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/tas2562.txt @@ -0,0 +1,34 @@ +Texas Instruments TAS2562 Smart PA + +The TAS2562 is a mono, digital input Class-D audio amplifier optimized for +efficiently driving high 
peak power into small loudspeakers. +Integrated speaker voltage and current sense provides for +real time monitoring of loudspeaker behavior. + +Required properties: + - #address-cells - Should be <1>. + - #size-cells - Should be <0>. + - compatible: - Should contain "ti,tas2562". + - reg: - The i2c address. Should be 0x4c, 0x4d, 0x4e or 0x4f. + - ti,imon-slot-no:- TDM TX current sense time slot. + +Optional properties: +- interrupt-parent: phandle to the interrupt controller which provides + the interrupt. +- interrupts: (GPIO) interrupt to which the chip is connected. +- shut-down: GPIO used to control the state of the device. + +Examples: +tas2562@4c { + #address-cells = <1>; + #size-cells = <0>; + compatible = "ti,tas2562"; + reg = <0x4c>; + + interrupt-parent = <&gpio1>; + interrupts = <14>; + + shut-down = <&gpio1 15 0>; + ti,imon-slot-no = <0>; +}; + diff --git a/Documentation/devicetree/bindings/sound/tas2770.txt b/Documentation/devicetree/bindings/sound/tas2770.txt new file mode 100644 index 000000000000..ede6bb3d9637 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/tas2770.txt @@ -0,0 +1,37 @@ +Texas Instruments TAS2770 Smart PA + +The TAS2770 is a mono, digital input Class-D audio amplifier optimized for +efficiently driving high peak power into small loudspeakers. +Integrated speaker voltage and current sense provides for +real time monitoring of loudspeaker behavior. + +Required properties: + + - compatible: - Should contain "ti,tas2770". + - reg: - The i2c address. Should contain <0x4c>, <0x4d>,<0x4e>, or <0x4f>. + - #address-cells - Should be <1>. + - #size-cells - Should be <0>. + - ti,asi-format: - Sets TDM RX capture edge. 0->Rising; 1->Falling. + - ti,imon-slot-no:- TDM TX current sense time slot. + - ti,vmon-slot-no:- TDM TX voltage sense time slot. + +Optional properties: + +- interrupt-parent: the phandle to the interrupt controller which provides + the interrupt. +- interrupts: interrupt specification for data-ready. 
+ +Examples: + + tas2770@4c { + compatible = "ti,tas2770"; + reg = <0x4c>; + #address-cells = <1>; + #size-cells = <0>; + interrupt-parent = <&msm_gpio>; + interrupts = <97 0>; + ti,asi-format = <0>; + ti,imon-slot-no = <0>; + ti,vmon-slot-no = <2>; + }; + diff --git a/Documentation/devicetree/bindings/sound/ti,pcm3168a.txt b/Documentation/devicetree/bindings/sound/ti,pcm3168a.txt index 5d9cb84c661d..a02ecaab5183 100644 --- a/Documentation/devicetree/bindings/sound/ti,pcm3168a.txt +++ b/Documentation/devicetree/bindings/sound/ti,pcm3168a.txt @@ -25,6 +25,13 @@ Required properties: For required properties on SPI/I2C, consult SPI/I2C device tree documentation +Optional properties: + + - reset-gpios : Optional reset gpio line connected to RST pin of the codec. + The RST line is low active: + RST = low: device power-down + RST = high: device is enabled + Examples: i2c0: i2c0@0 { @@ -34,6 +41,7 @@ i2c0: i2c0@0 { pcm3168a: audio-codec@44 { compatible = "ti,pcm3168a"; reg = <0x44>; + reset-gpios = <&gpio0 4 GPIO_ACTIVE_LOW>; clocks = <&clk_core CLK_AUDIO>; clock-names = "scki"; VDD1-supply = <&supply3v3>; diff --git a/Documentation/devicetree/bindings/sound/tlv320aic31xx.txt b/Documentation/devicetree/bindings/sound/tlv320aic31xx.txt index 5b3c33bb99e5..e372303697dc 100644 --- a/Documentation/devicetree/bindings/sound/tlv320aic31xx.txt +++ b/Documentation/devicetree/bindings/sound/tlv320aic31xx.txt @@ -29,6 +29,11 @@ Optional properties: 3 or MICBIAS_AVDD - MICBIAS output is connected to AVDD If this node is not mentioned or if the value is unknown, then micbias is set to 2.0V. 
+- ai31xx-ocmv - output common-mode voltage setting + 0 - 1.35V, + 1 - 1.5V, + 2 - 1.65V, + 3 - 1.8V Deprecated properties: diff --git a/Documentation/devicetree/bindings/spi/renesas,hspi.yaml b/Documentation/devicetree/bindings/spi/renesas,hspi.yaml new file mode 100644 index 000000000000..c429cf4bea5b --- /dev/null +++ b/Documentation/devicetree/bindings/spi/renesas,hspi.yaml @@ -0,0 +1,57 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/renesas,hspi.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Renesas HSPI + +maintainers: + - Geert Uytterhoeven <geert+renesas@glider.be> + +allOf: + - $ref: spi-controller.yaml# + +properties: + compatible: + items: + - enum: + - renesas,hspi-r8a7778 # R-Car M1A + - renesas,hspi-r8a7779 # R-Car H1 + - const: renesas,hspi + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + maxItems: 1 + + power-domains: + maxItems: 1 + +required: + - compatible + - reg + - interrupts + - clocks + - '#address-cells' + - '#size-cells' + +examples: + - | + #include <dt-bindings/clock/r8a7778-clock.h> + #include <dt-bindings/interrupt-controller/irq.h> + + hspi0: spi@fffc7000 { + compatible = "renesas,hspi-r8a7778", "renesas,hspi"; + reg = <0xfffc7000 0x18>; + interrupts = <0 63 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&mstp0_clks R8A7778_CLK_HSPI>; + power-domains = <&cpg_clocks>; + #address-cells = <1>; + #size-cells = <0>; + }; + diff --git a/Documentation/devicetree/bindings/spi/renesas,rzn1-spi.txt b/Documentation/devicetree/bindings/spi/renesas,rzn1-spi.txt new file mode 100644 index 000000000000..fb1a6728638d --- /dev/null +++ b/Documentation/devicetree/bindings/spi/renesas,rzn1-spi.txt @@ -0,0 +1,11 @@ +Renesas RZ/N1 SPI Controller + +This controller is based on the Synopsys DW Synchronous Serial Interface and +inherits all properties defined in snps,dw-apb-ssi.txt except for the +compatible property. 
+ +Required properties: +- compatible : The device specific string followed by the generic RZ/N1 string. + Therefore it must be one of: + "renesas,r9a06g032-spi", "renesas,rzn1-spi" + "renesas,r9a06g033-spi", "renesas,rzn1-spi" diff --git a/Documentation/devicetree/bindings/spi/renesas,sh-msiof.yaml b/Documentation/devicetree/bindings/spi/renesas,sh-msiof.yaml new file mode 100644 index 000000000000..b6c1dd2a9c5e --- /dev/null +++ b/Documentation/devicetree/bindings/spi/renesas,sh-msiof.yaml @@ -0,0 +1,159 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/renesas,sh-msiof.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Renesas MSIOF SPI controller + +maintainers: + - Geert Uytterhoeven <geert+renesas@glider.be> + +allOf: + - $ref: spi-controller.yaml# + +properties: + compatible: + oneOf: + - items: + - const: renesas,msiof-sh73a0 # SH-Mobile AG5 + - const: renesas,sh-mobile-msiof # generic SH-Mobile compatible + # device + - items: + - enum: + - renesas,msiof-r8a7743 # RZ/G1M + - renesas,msiof-r8a7744 # RZ/G1N + - renesas,msiof-r8a7745 # RZ/G1E + - renesas,msiof-r8a77470 # RZ/G1C + - renesas,msiof-r8a7790 # R-Car H2 + - renesas,msiof-r8a7791 # R-Car M2-W + - renesas,msiof-r8a7792 # R-Car V2H + - renesas,msiof-r8a7793 # R-Car M2-N + - renesas,msiof-r8a7794 # R-Car E2 + - const: renesas,rcar-gen2-msiof # generic R-Car Gen2 and RZ/G1 + # compatible device + - items: + - enum: + - renesas,msiof-r8a774a1 # RZ/G2M + - renesas,msiof-r8a774b1 # RZ/G2N + - renesas,msiof-r8a774c0 # RZ/G2E + - renesas,msiof-r8a7795 # R-Car H3 + - renesas,msiof-r8a7796 # R-Car M3-W + - renesas,msiof-r8a77965 # R-Car M3-N + - renesas,msiof-r8a77970 # R-Car V3M + - renesas,msiof-r8a77980 # R-Car V3H + - renesas,msiof-r8a77990 # R-Car E3 + - renesas,msiof-r8a77995 # R-Car D3 + - const: renesas,rcar-gen3-msiof # generic R-Car Gen3 and RZ/G2 + # compatible device + - items: + - const: renesas,sh-msiof # deprecated + + reg: + 
minItems: 1 + maxItems: 2 + oneOf: + - items: + - description: CPU and DMA engine registers + - items: + - description: CPU registers + - description: DMA engine registers + + interrupts: + maxItems: 1 + + clocks: + maxItems: 1 + + num-cs: + description: | + Total number of chip selects (default is 1). + Up to 3 native chip selects are supported: + 0: MSIOF_SYNC + 1: MSIOF_SS1 + 2: MSIOF_SS2 + Hardware limitations related to chip selects: + - Native chip selects are always deasserted in between transfers + that are part of the same message. Use cs-gpios to work around + this. + - All slaves using native chip selects must use the same spi-cs-high + configuration. Use cs-gpios to work around this. + - When using GPIO chip selects, at least one native chip select must + be left unused, as it will be driven anyway. + minimum: 1 + maximum: 3 + default: 1 + + dmas: + minItems: 2 + maxItems: 4 + + dma-names: + minItems: 2 + maxItems: 4 + items: + enum: [ tx, rx ] + + renesas,dtdl: + description: delay sync signal (setup) in transmit mode. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: + - 0 # no bit delay + - 50 # 0.5-clock-cycle delay + - 100 # 1-clock-cycle delay + - 150 # 1.5-clock-cycle delay + - 200 # 2-clock-cycle delay + + renesas,syncdl: + description: delay sync signal (hold) in transmit mode + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: + - 0 # no bit delay + - 50 # 0.5-clock-cycle delay + - 100 # 1-clock-cycle delay + - 150 # 1.5-clock-cycle delay + - 200 # 2-clock-cycle delay + - 300 # 3-clock-cycle delay + + renesas,tx-fifo-size: + # deprecated for soctype-specific bindings + description: | + Override the default TX fifo size. Unit is words. Ignored if 0. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - maxItems: 1 + default: 64 + + renesas,rx-fifo-size: + # deprecated for soctype-specific bindings + description: | + Override the default RX fifo size. Unit is words. Ignored if 0. 
+ allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - maxItems: 1 + default: 64 + +required: + - compatible + - reg + - interrupts + - '#address-cells' + - '#size-cells' + +examples: + - | + #include <dt-bindings/clock/r8a7791-clock.h> + #include <dt-bindings/interrupt-controller/irq.h> + + msiof0: spi@e6e20000 { + compatible = "renesas,msiof-r8a7791", "renesas,rcar-gen2-msiof"; + reg = <0 0xe6e20000 0 0x0064>; + interrupts = <0 156 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&mstp0_clks R8A7791_CLK_MSIOF0>; + dmas = <&dmac0 0x51>, <&dmac0 0x52>; + dma-names = "tx", "rx"; + #address-cells = <1>; + #size-cells = <0>; + }; diff --git a/Documentation/devicetree/bindings/spi/sh-hspi.txt b/Documentation/devicetree/bindings/spi/sh-hspi.txt deleted file mode 100644 index b9d1e4d11a77..000000000000 --- a/Documentation/devicetree/bindings/spi/sh-hspi.txt +++ /dev/null @@ -1,26 +0,0 @@ -Renesas HSPI. - -Required properties: -- compatible : "renesas,hspi-<soctype>", "renesas,hspi" as fallback. - Examples with soctypes are: - - "renesas,hspi-r8a7778" (R-Car M1) - - "renesas,hspi-r8a7779" (R-Car H1) -- reg : Offset and length of the register set for the device -- interrupts : Interrupt specifier -- #address-cells : Must be <1> -- #size-cells : Must be <0> - -Pinctrl properties might be needed, too. See -Documentation/devicetree/bindings/pinctrl/renesas,*. 
- -Example: - - hspi0: spi@fffc7000 { - compatible = "renesas,hspi-r8a7778", "renesas,hspi"; - reg = <0xfffc7000 0x18>; - interrupt-parent = <&gic>; - interrupts = <0 63 IRQ_TYPE_LEVEL_HIGH>; - #address-cells = <1>; - #size-cells = <0>; - }; - diff --git a/Documentation/devicetree/bindings/spi/sh-msiof.txt b/Documentation/devicetree/bindings/spi/sh-msiof.txt deleted file mode 100644 index 18e14ee257b2..000000000000 --- a/Documentation/devicetree/bindings/spi/sh-msiof.txt +++ /dev/null @@ -1,105 +0,0 @@ -Renesas MSIOF spi controller - -Required properties: -- compatible : "renesas,msiof-r8a7743" (RZ/G1M) - "renesas,msiof-r8a7744" (RZ/G1N) - "renesas,msiof-r8a7745" (RZ/G1E) - "renesas,msiof-r8a77470" (RZ/G1C) - "renesas,msiof-r8a774a1" (RZ/G2M) - "renesas,msiof-r8a774c0" (RZ/G2E) - "renesas,msiof-r8a7790" (R-Car H2) - "renesas,msiof-r8a7791" (R-Car M2-W) - "renesas,msiof-r8a7792" (R-Car V2H) - "renesas,msiof-r8a7793" (R-Car M2-N) - "renesas,msiof-r8a7794" (R-Car E2) - "renesas,msiof-r8a7795" (R-Car H3) - "renesas,msiof-r8a7796" (R-Car M3-W) - "renesas,msiof-r8a77965" (R-Car M3-N) - "renesas,msiof-r8a77970" (R-Car V3M) - "renesas,msiof-r8a77980" (R-Car V3H) - "renesas,msiof-r8a77990" (R-Car E3) - "renesas,msiof-r8a77995" (R-Car D3) - "renesas,msiof-sh73a0" (SH-Mobile AG5) - "renesas,sh-mobile-msiof" (generic SH-Mobile compatibile device) - "renesas,rcar-gen2-msiof" (generic R-Car Gen2 and RZ/G1 compatible device) - "renesas,rcar-gen3-msiof" (generic R-Car Gen3 and RZ/G2 compatible device) - "renesas,sh-msiof" (deprecated) - - When compatible with the generic version, nodes - must list the SoC-specific version corresponding - to the platform first followed by the generic - version. - -- reg : A list of offsets and lengths of the register sets for - the device. - If only one register set is present, it is to be used - by both the CPU and the DMA engine. 
- If two register sets are present, the first is to be - used by the CPU, and the second is to be used by the - DMA engine. -- interrupts : Interrupt specifier -- #address-cells : Must be <1> -- #size-cells : Must be <0> - -Optional properties: -- clocks : Must contain a reference to the functional clock. -- num-cs : Total number of chip selects (default is 1). - Up to 3 native chip selects are supported: - 0: MSIOF_SYNC - 1: MSIOF_SS1 - 2: MSIOF_SS2 - Hardware limitations related to chip selects: - - Native chip selects are always deasserted in - between transfers that are part of the same - message. Use cs-gpios to work around this. - - All slaves using native chip selects must use the - same spi-cs-high configuration. Use cs-gpios to - work around this. - - When using GPIO chip selects, at least one native - chip select must be left unused, as it will be - driven anyway. -- dmas : Must contain a list of two references to DMA - specifiers, one for transmission, and one for - reception. -- dma-names : Must contain a list of two DMA names, "tx" and "rx". -- spi-slave : Empty property indicating the SPI controller is used - in slave mode. -- renesas,dtdl : delay sync signal (setup) in transmit mode. - Must contain one of the following values: - 0 (no bit delay) - 50 (0.5-clock-cycle delay) - 100 (1-clock-cycle delay) - 150 (1.5-clock-cycle delay) - 200 (2-clock-cycle delay) - -- renesas,syncdl : delay sync signal (hold) in transmit mode. - Must contain one of the following values: - 0 (no bit delay) - 50 (0.5-clock-cycle delay) - 100 (1-clock-cycle delay) - 150 (1.5-clock-cycle delay) - 200 (2-clock-cycle delay) - 300 (3-clock-cycle delay) - -Optional properties, deprecated for soctype-specific bindings: -- renesas,tx-fifo-size : Overrides the default tx fifo size given in words - (default is 64) -- renesas,rx-fifo-size : Overrides the default rx fifo size given in words - (default is 64) - -Pinctrl properties might be needed, too. 
See -Documentation/devicetree/bindings/pinctrl/renesas,*. - -Example: - - msiof0: spi@e6e20000 { - compatible = "renesas,msiof-r8a7791", - "renesas,rcar-gen2-msiof"; - reg = <0 0xe6e20000 0 0x0064>; - interrupts = <0 156 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&mstp0_clks R8A7791_CLK_MSIOF0>; - dmas = <&dmac0 0x51>, <&dmac0 0x52>; - dma-names = "tx", "rx"; - #address-cells = <1>; - #size-cells = <0>; - }; diff --git a/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.txt b/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.txt index f54c8c36395e..3ed08ee9feba 100644 --- a/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.txt +++ b/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.txt @@ -16,7 +16,8 @@ Required properties: Optional properties: - clock-names : Contains the names of the clocks: "ssi_clk", for the core clock used to generate the external SPI clock. - "pclk", the interface clock, required for register access. + "pclk", the interface clock, required for register access. If a clock domain + used to enable this clock then it should be named "pclk_clkdomain". - cs-gpios : Specifies the gpio pins to be used for chipselects. - num-cs : The number of chipselects. If omitted, this will default to 4. - reg-io-width : The I/O register width (in bytes) implemented by this diff --git a/Documentation/devicetree/bindings/spi/spi-sifive.txt b/Documentation/devicetree/bindings/spi/spi-sifive.txt deleted file mode 100644 index 3f5c6e438972..000000000000 --- a/Documentation/devicetree/bindings/spi/spi-sifive.txt +++ /dev/null @@ -1,37 +0,0 @@ -SiFive SPI controller Device Tree Bindings ------------------------------------------- - -Required properties: -- compatible : Should be "sifive,<chip>-spi" and "sifive,spi<version>". - Supported compatible strings are: - "sifive,fu540-c000-spi" for the SiFive SPI v0 as integrated - onto the SiFive FU540 chip, and "sifive,spi0" for the SiFive - SPI v0 IP block with no chip integration tweaks. 
- Please refer to sifive-blocks-ip-versioning.txt for details -- reg : Physical base address and size of SPI registers map - A second (optional) range can indicate memory mapped flash -- interrupts : Must contain one entry -- interrupt-parent : Must be core interrupt controller -- clocks : Must reference the frequency given to the controller -- #address-cells : Must be '1', indicating which CS to use -- #size-cells : Must be '0' - -Optional properties: -- sifive,fifo-depth : Depth of hardware queues; defaults to 8 -- sifive,max-bits-per-word : Maximum bits per word; defaults to 8 - -SPI RTL that corresponds to the IP block version numbers can be found here: -https://github.com/sifive/sifive-blocks/tree/master/src/main/scala/devices/spi - -Example: - spi: spi@10040000 { - compatible = "sifive,fu540-c000-spi", "sifive,spi0"; - reg = <0x0 0x10040000 0x0 0x1000 0x0 0x20000000 0x0 0x10000000>; - interrupt-parent = <&plic>; - interrupts = <51>; - clocks = <&tlclk>; - #address-cells = <1>; - #size-cells = <0>; - sifive,fifo-depth = <8>; - sifive,max-bits-per-word = <8>; - }; diff --git a/Documentation/devicetree/bindings/spi/spi-sifive.yaml b/Documentation/devicetree/bindings/spi/spi-sifive.yaml new file mode 100644 index 000000000000..140e4351a19f --- /dev/null +++ b/Documentation/devicetree/bindings/spi/spi-sifive.yaml @@ -0,0 +1,86 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/spi-sifive.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: SiFive SPI controller + +maintainers: + - Pragnesh Patel <pragnesh.patel@sifive.com> + - Paul Walmsley <paul.walmsley@sifive.com> + - Palmer Dabbelt <palmer@sifive.com> + +allOf: + - $ref: "spi-controller.yaml#" + +properties: + compatible: + items: + - const: sifive,fu540-c000-spi + - const: sifive,spi0 + + description: + Should be "sifive,<chip>-spi" and "sifive,spi<version>". 
+ Supported compatible strings are - + "sifive,fu540-c000-spi" for the SiFive SPI v0 as integrated + onto the SiFive FU540 chip, and "sifive,spi0" for the SiFive + SPI v0 IP block with no chip integration tweaks. + Please refer to sifive-blocks-ip-versioning.txt for details + + SPI RTL that corresponds to the IP block version numbers can be found here - + https://github.com/sifive/sifive-blocks/tree/master/src/main/scala/devices/spi + + reg: + maxItems: 1 + + description: + Physical base address and size of SPI registers map + A second (optional) range can indicate memory mapped flash + + interrupts: + maxItems: 1 + + clocks: + maxItems: 1 + + description: + Must reference the frequency given to the controller + + sifive,fifo-depth: + description: + Depth of hardware queues; defaults to 8 + allOf: + - $ref: "/schemas/types.yaml#/definitions/uint32" + - enum: [ 8 ] + - default: 8 + + sifive,max-bits-per-word: + description: + Maximum bits per word; defaults to 8 + allOf: + - $ref: "/schemas/types.yaml#/definitions/uint32" + - enum: [ 0, 1, 2, 3, 4, 5, 6, 7, 8 ] + - default: 8 + +required: + - compatible + - reg + - interrupts + - clocks + +examples: + - | + spi: spi@10040000 { + compatible = "sifive,fu540-c000-spi", "sifive,spi0"; + reg = <0x0 0x10040000 0x0 0x1000 0x0 0x20000000 0x0 0x10000000>; + interrupt-parent = <&plic>; + interrupts = <51>; + clocks = <&tlclk>; + #address-cells = <1>; + #size-cells = <0>; + sifive,fifo-depth = <8>; + sifive,max-bits-per-word = <8>; + }; + +... diff --git a/Documentation/devicetree/bindings/spi/spi-stm32-qspi.txt b/Documentation/devicetree/bindings/spi/spi-stm32-qspi.txt deleted file mode 100644 index bfc038b9478d..000000000000 --- a/Documentation/devicetree/bindings/spi/spi-stm32-qspi.txt +++ /dev/null @@ -1,47 +0,0 @@ -* STMicroelectronics Quad Serial Peripheral Interface(QSPI) - -Required properties: -- compatible: should be "st,stm32f469-qspi" -- reg: the first contains the register location and length. 
- the second contains the memory mapping address and length -- reg-names: should contain the reg names "qspi" "qspi_mm" -- interrupts: should contain the interrupt for the device -- clocks: the phandle of the clock needed by the QSPI controller -- A pinctrl must be defined to set pins in mode of operation for QSPI transfer - -Optional properties: -- resets: must contain the phandle to the reset controller. - -A spi flash (NOR/NAND) must be a child of spi node and could have some -properties. Also see jedec,spi-nor.txt. - -Required properties: -- reg: chip-Select number (QSPI controller may connect 2 flashes) -- spi-max-frequency: max frequency of spi bus - -Optional properties: -- spi-rx-bus-width: see ./spi-bus.txt for the description -- dmas: DMA specifiers for tx and rx dma. See the DMA client binding, -Documentation/devicetree/bindings/dma/dma.txt. -- dma-names: DMA request names should include "tx" and "rx" if present. - -Example: - -qspi: spi@a0001000 { - compatible = "st,stm32f469-qspi"; - reg = <0xa0001000 0x1000>, <0x90000000 0x10000000>; - reg-names = "qspi", "qspi_mm"; - interrupts = <91>; - resets = <&rcc STM32F4_AHB3_RESET(QSPI)>; - clocks = <&rcc 0 STM32F4_AHB3_CLOCK(QSPI)>; - pinctrl-names = "default"; - pinctrl-0 = <&pinctrl_qspi0>; - - flash@0 { - compatible = "jedec,spi-nor"; - reg = <0>; - spi-rx-bus-width = <4>; - spi-max-frequency = <108000000>; - ... - }; -}; diff --git a/Documentation/devicetree/bindings/spi/spi-xilinx.txt b/Documentation/devicetree/bindings/spi/spi-xilinx.txt index dc924a5f71db..5f4ed3e5c994 100644 --- a/Documentation/devicetree/bindings/spi/spi-xilinx.txt +++ b/Documentation/devicetree/bindings/spi/spi-xilinx.txt @@ -8,7 +8,8 @@ Required properties: number. Optional properties: -- xlnx,num-ss-bits : Number of chip selects used. +- xlnx,num-ss-bits : Number of chip selects used. +- xlnx,num-transfer-bits : Number of bits per transfer. 
This will be 8 if not specified Example: axi_quad_spi@41e00000 { @@ -17,5 +18,6 @@ Example: interrupts = <0 31 1>; reg = <0x41e00000 0x10000>; xlnx,num-ss-bits = <0x1>; + xlnx,num-transfer-bits = <32>; }; diff --git a/Documentation/devicetree/bindings/spi/st,stm32-qspi.yaml b/Documentation/devicetree/bindings/spi/st,stm32-qspi.yaml new file mode 100644 index 000000000000..3665a5fe6b7f --- /dev/null +++ b/Documentation/devicetree/bindings/spi/st,stm32-qspi.yaml @@ -0,0 +1,83 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/st,stm32-qspi.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: STMicroelectronics STM32 Quad Serial Peripheral Interface (QSPI) bindings + +maintainers: + - Christophe Kerello <christophe.kerello@st.com> + - Patrice Chotard <patrice.chotard@st.com> + +allOf: + - $ref: "spi-controller.yaml#" + +properties: + compatible: + const: st,stm32f469-qspi + + reg: + items: + - description: registers + - description: memory mapping + + reg-names: + items: + - const: qspi + - const: qspi_mm + + clocks: + maxItems: 1 + + interrupts: + maxItems: 1 + + resets: + maxItems: 1 + + dmas: + items: + - description: tx DMA channel + - description: rx DMA channel + + dma-names: + items: + - const: tx + - const: rx + +required: + - compatible + - reg + - reg-names + - clocks + - interrupts + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/stm32mp1-clks.h> + #include <dt-bindings/reset/stm32mp1-resets.h> + spi@58003000 { + compatible = "st,stm32f469-qspi"; + reg = <0x58003000 0x1000>, <0x70000000 0x10000000>; + reg-names = "qspi", "qspi_mm"; + interrupts = <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH>; + dmas = <&mdma1 22 0x10 0x100002 0x0 0x0>, + <&mdma1 22 0x10 0x100008 0x0 0x0>; + dma-names = "tx", "rx"; + clocks = <&rcc QSPI_K>; + resets = <&rcc QSPI_R>; + + #address-cells = <1>; + #size-cells = <0>; + + flash@0 { + compatible = 
"jedec,spi-nor"; + reg = <0>; + spi-rx-bus-width = <4>; + spi-max-frequency = <108000000>; + }; + }; + +... diff --git a/Documentation/devicetree/bindings/trivial-devices.yaml b/Documentation/devicetree/bindings/trivial-devices.yaml index 870ac52d2225..765fd1c170df 100644 --- a/Documentation/devicetree/bindings/trivial-devices.yaml +++ b/Documentation/devicetree/bindings/trivial-devices.yaml @@ -114,6 +114,18 @@ properties: - isil,isl68137 # 5 Bit Programmable, Pulse-Width Modulator - maxim,ds1050 + # 10-bit 8 channels 300ks/s SPI ADC with temperature sensor + - maxim,max1027 + # 10-bit 12 channels 300ks/s SPI ADC with temperature sensor + - maxim,max1029 + # 10-bit 16 channels 300ks/s SPI ADC with temperature sensor + - maxim,max1031 + # 12-bit 8 channels 300ks/s SPI ADC with temperature sensor + - maxim,max1227 + # 12-bit 12 channels 300ks/s SPI ADC with temperature sensor + - maxim,max1229 + # 12-bit 16 channels 300ks/s SPI ADC with temperature sensor + - maxim,max1231 # Low-Power, 4-/12-Channel, 2-Wire Serial, 12-Bit ADCs - maxim,max1237 # PECI-to-I2C translator for PECI-to-SMBus/I2C protocol conversion diff --git a/Documentation/devicetree/bindings/usb/renesas,usb3-peri.txt b/Documentation/devicetree/bindings/usb/renesas,usb3-peri.txt deleted file mode 100644 index 35039e720515..000000000000 --- a/Documentation/devicetree/bindings/usb/renesas,usb3-peri.txt +++ /dev/null @@ -1,41 +0,0 @@ -Renesas Electronics USB3.0 Peripheral driver - -Required properties: - - compatible: Must contain one of the following: - - "renesas,r8a774a1-usb3-peri" - - "renesas,r8a774c0-usb3-peri" - - "renesas,r8a7795-usb3-peri" - - "renesas,r8a7796-usb3-peri" - - "renesas,r8a77965-usb3-peri" - - "renesas,r8a77990-usb3-peri" - - "renesas,rcar-gen3-usb3-peri" for a generic R-Car Gen3 or RZ/G2 - compatible device - - When compatible with the generic version, nodes must list the - SoC-specific version corresponding to the platform first - followed by the generic version. 
- - - reg: Base address and length of the register for the USB3.0 Peripheral - - interrupts: Interrupt specifier for the USB3.0 Peripheral - - clocks: clock phandle and specifier pair - -Optional properties: - - phys: phandle + phy specifier pair - - phy-names: must be "usb" - -Example of R-Car H3 ES1.x: - usb3_peri0: usb@ee020000 { - compatible = "renesas,r8a7795-usb3-peri", - "renesas,rcar-gen3-usb3-peri"; - reg = <0 0xee020000 0 0x400>; - interrupts = <GIC_SPI 104 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&cpg CPG_MOD 328>; - }; - - usb3_peri1: usb@ee060000 { - compatible = "renesas,r8a7795-usb3-peri", - "renesas,rcar-gen3-usb3-peri"; - reg = <0 0xee060000 0 0x400>; - interrupts = <GIC_SPI 100 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&cpg CPG_MOD 327>; - }; diff --git a/Documentation/devicetree/bindings/usb/renesas,usb3-peri.yaml b/Documentation/devicetree/bindings/usb/renesas,usb3-peri.yaml new file mode 100644 index 000000000000..92d8631b9aa6 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/renesas,usb3-peri.yaml @@ -0,0 +1,86 @@ +# SPDX-License-Identifier: GPL-2.0-only +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/usb/renesas,usb3-peri.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Renesas USB 3.0 Peripheral controller + +maintainers: + - Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> + +properties: + compatible: + items: + - enum: + - renesas,r8a774a1-usb3-peri # RZ/G2M + - renesas,r8a774b1-usb3-peri # RZ/G2N + - renesas,r8a774c0-usb3-peri # RZ/G2E + - renesas,r8a7795-usb3-peri # R-Car H3 + - renesas,r8a7796-usb3-peri # R-Car M3-W + - renesas,r8a77965-usb3-peri # R-Car M3-N + - renesas,r8a77990-usb3-peri # R-Car E3 + - const: renesas,rcar-gen3-usb3-peri + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + maxItems: 1 + + phys: + maxItems: 1 + + phy-names: + const: usb + + power-domains: + maxItems: 1 + + resets: + maxItems: 1 + + usb-role-switch: + $ref: /schemas/types.yaml#/definitions/flag + description: Support 
role switch. + + companion: + $ref: /schemas/types.yaml#/definitions/phandle + description: phandle of a companion. + + port: + description: | + any connector to the data bus of this controller should be modelled + using the OF graph bindings specified, if the "usb-role-switch" + property is used. + +required: + - compatible + - interrupts + - clocks + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/r8a774c0-cpg-mssr.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/power/r8a774c0-sysc.h> + + usb3_peri0: usb@ee020000 { + compatible = "renesas,r8a774c0-usb3-peri", "renesas,rcar-gen3-usb3-peri"; + reg = <0 0xee020000 0 0x400>; + interrupts = <GIC_SPI 104 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cpg CPG_MOD 328>; + companion = <&xhci0>; + usb-role-switch; + + port { + usb3_role_switch: endpoint { + remote-endpoint = <&hd3ss3220_ep>; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/usb/renesas,usbhs.txt b/Documentation/devicetree/bindings/usb/renesas,usbhs.txt deleted file mode 100644 index e39255ea6e4f..000000000000 --- a/Documentation/devicetree/bindings/usb/renesas,usbhs.txt +++ /dev/null @@ -1,57 +0,0 @@ -Renesas Electronics USBHS driver - -Required properties: - - compatible: Must contain one or more of the following: - - - "renesas,usbhs-r8a7743" for r8a7743 (RZ/G1M) compatible device - - "renesas,usbhs-r8a7744" for r8a7744 (RZ/G1N) compatible device - - "renesas,usbhs-r8a7745" for r8a7745 (RZ/G1E) compatible device - - "renesas,usbhs-r8a77470" for r8a77470 (RZ/G1C) compatible device - - "renesas,usbhs-r8a774a1" for r8a774a1 (RZ/G2M) compatible device - - "renesas,usbhs-r8a774c0" for r8a774c0 (RZ/G2E) compatible device - - "renesas,usbhs-r8a7790" for r8a7790 (R-Car H2) compatible device - - "renesas,usbhs-r8a7791" for r8a7791 (R-Car M2-W) compatible device - - "renesas,usbhs-r8a7792" for r8a7792 (R-Car V2H) compatible device - - "renesas,usbhs-r8a7793" for r8a7793 (R-Car M2-N) compatible device 
- - "renesas,usbhs-r8a7794" for r8a7794 (R-Car E2) compatible device - - "renesas,usbhs-r8a7795" for r8a7795 (R-Car H3) compatible device - - "renesas,usbhs-r8a7796" for r8a7796 (R-Car M3-W) compatible device - - "renesas,usbhs-r8a77965" for r8a77965 (R-Car M3-N) compatible device - - "renesas,usbhs-r8a77990" for r8a77990 (R-Car E3) compatible device - - "renesas,usbhs-r8a77995" for r8a77995 (R-Car D3) compatible device - - "renesas,usbhs-r7s72100" for r7s72100 (RZ/A1) compatible device - - "renesas,usbhs-r7s9210" for r7s9210 (RZ/A2) compatible device - - "renesas,rcar-gen2-usbhs" for R-Car Gen2 or RZ/G1 compatible devices - - "renesas,rcar-gen3-usbhs" for R-Car Gen3 or RZ/G2 compatible devices - - "renesas,rza1-usbhs" for RZ/A1 compatible device - - "renesas,rza2-usbhs" for RZ/A2 compatible device - - When compatible with the generic version, nodes must list the - SoC-specific version corresponding to the platform first followed - by the generic version. - - - reg: Base address and length of the register for the USBHS - - interrupts: Interrupt specifier for the USBHS - - clocks: A list of phandle + clock specifier pairs. - - In case of "renesas,rcar-gen3-usbhs", two clocks are required. - First clock should be peripheral and second one should be host. - - In case of except above, one clock is required. First clock - should be peripheral. - -Optional properties: - - renesas,buswait: Integer to use BUSWAIT register - - renesas,enable-gpio: A gpio specifier to check GPIO determining if USB - function should be enabled - - phys: phandle + phy specifier pair - - phy-names: must be "usb" - - dmas: Must contain a list of references to DMA specifiers. - - dma-names : named "ch%d", where %d is the channel number ranging from zero - to the number of channels (DnFIFOs) minus one. 
- -Example: - usbhs: usb@e6590000 { - compatible = "renesas,usbhs-r8a7790", "renesas,rcar-gen2-usbhs"; - reg = <0 0xe6590000 0 0x100>; - interrupts = <0 107 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&mstp7_clks R8A7790_CLK_HSUSB>; - }; diff --git a/Documentation/devicetree/bindings/usb/renesas,usbhs.yaml b/Documentation/devicetree/bindings/usb/renesas,usbhs.yaml new file mode 100644 index 000000000000..469affa872d3 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/renesas,usbhs.yaml @@ -0,0 +1,126 @@ +# SPDX-License-Identifier: GPL-2.0-only +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/usb/renesas,usbhs.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Renesas USBHS (HS-USB) controller + +maintainers: + - Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> + +properties: + compatible: + oneOf: + - items: + - const: renesas,usbhs-r7s72100 # RZ/A1 + - const: renesas,rza1-usbhs + + - items: + - const: renesas,usbhs-r7s9210 # RZ/A2 + - const: renesas,rza2-usbhs + + - items: + - enum: + - renesas,usbhs-r8a7743 # RZ/G1M + - renesas,usbhs-r8a7744 # RZ/G1N + - renesas,usbhs-r8a7745 # RZ/G1E + - renesas,usbhs-r8a77470 # RZ/G1C + - renesas,usbhs-r8a7790 # R-Car H2 + - renesas,usbhs-r8a7791 # R-Car M2-W + - renesas,usbhs-r8a7792 # R-Car V2H + - renesas,usbhs-r8a7793 # R-Car M2-N + - renesas,usbhs-r8a7794 # R-Car E2 + - const: renesas,rcar-gen2-usbhs + + - items: + - enum: + - renesas,usbhs-r8a774a1 # RZ/G2M + - renesas,usbhs-r8a774b1 # RZ/G2N + - renesas,usbhs-r8a774c0 # RZ/G2E + - renesas,usbhs-r8a7795 # R-Car H3 + - renesas,usbhs-r8a7796 # R-Car M3-W + - renesas,usbhs-r8a77965 # R-Car M3-N + - renesas,usbhs-r8a77990 # R-Car E3 + - renesas,usbhs-r8a77995 # R-Car D3 + - const: renesas,rcar-gen3-usbhs + + reg: + maxItems: 1 + + clocks: + minItems: 1 + maxItems: 3 + items: + - description: USB 2.0 host + - description: USB 2.0 peripheral + - description: USB 2.0 clock selector + + interrupts: + maxItems: 1 + + renesas,buswait: + $ref: 
/schemas/types.yaml#/definitions/uint32 + description: | + Integer to use BUSWAIT register. + + renesas,enable-gpio: + description: | + gpio specifier to check GPIO determining if USB function should be + enabled. + + phys: + maxItems: 1 + items: + - description: phandle + phy specifier pair. + + phy-names: + maxItems: 1 + items: + - const: usb + + dmas: + minItems: 2 + maxItems: 4 + + dma-names: + minItems: 2 + maxItems: 4 + items: + - const: ch0 + - const: ch1 + - const: ch2 + - const: ch3 + + dr_mode: true + + power-domains: + maxItems: 1 + + resets: + minItems: 1 + maxItems: 2 + items: + - description: USB 2.0 host + - description: USB 2.0 peripheral + +required: + - compatible + - reg + - clocks + - interrupts + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/r8a7790-cpg-mssr.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/power/r8a7790-sysc.h> + + usbhs: usb@e6590000 { + compatible = "renesas,usbhs-r8a7790", "renesas,rcar-gen2-usbhs"; + reg = <0 0xe6590000 0 0x100>; + interrupts = <GIC_SPI 107 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cpg CPG_MOD 704>; + }; diff --git a/Documentation/devicetree/bindings/usb/richtek,rt1711h.txt b/Documentation/devicetree/bindings/usb/richtek,rt1711h.txt index d4cf53c071d9..e3fc57e605ed 100644 --- a/Documentation/devicetree/bindings/usb/richtek,rt1711h.txt +++ b/Documentation/devicetree/bindings/usb/richtek,rt1711h.txt @@ -6,10 +6,39 @@ Required properties: - interrupts : <a b> where a is the interrupt number and b represents an encoding of the sense and level information for the interrupt. 
+Required sub-node: +- connector: The "usb-c-connector" attached to the tcpci chip, the bindings + of connector node are specified in + Documentation/devicetree/bindings/connector/usb-connector.txt + Example : rt1711h@4e { compatible = "richtek,rt1711h"; reg = <0x4e>; interrupt-parent = <&gpio26>; interrupts = <0 IRQ_TYPE_LEVEL_LOW>; + + usb_con: connector { + compatible = "usb-c-connector"; + label = "USB-C"; + data-role = "dual"; + power-role = "dual"; + try-power-role = "sink"; + source-pdos = <PDO_FIXED(5000, 2000, PDO_FIXED_USB_COMM)>; + sink-pdos = <PDO_FIXED(5000, 2000, PDO_FIXED_USB_COMM) + PDO_VAR(5000, 12000, 2000)>; + op-sink-microwatt = <10000000>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@1 { + reg = <1>; + usb_con_ss: endpoint { + remote-endpoint = <&usb3_data_ss>; + }; + }; + }; + }; }; diff --git a/Documentation/devicetree/bindings/usb/ti,hd3ss3220.txt b/Documentation/devicetree/bindings/usb/ti,hd3ss3220.txt new file mode 100644 index 000000000000..25780e945b15 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/ti,hd3ss3220.txt @@ -0,0 +1,38 @@ +TI HD3SS3220 TypeC DRP Port Controller. + +Required properties: + - compatible: Must be "ti,hd3ss3220". + - reg: I2C slave address, must be 0x47 or 0x67 based on ADDR pin. + - interrupts: An interrupt specifier. + +Required sub-node: + - connector: The "usb-c-connector" attached to the hd3ss3220 chip. 
The + bindings of the connector node are specified in: + + Documentation/devicetree/bindings/connector/usb-connector.txt + +Example: +hd3ss3220@47 { + compatible = "ti,hd3ss3220"; + reg = <0x47>; + interrupt-parent = <&gpio6>; + interrupts = <3 IRQ_TYPE_LEVEL_LOW>; + + connector { + compatible = "usb-c-connector"; + label = "USB-C"; + data-role = "dual"; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@1 { + reg = <1>; + hd3ss3220_ep: endpoint { + remote-endpoint = <&usb3_role_switch>; + }; + }; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/usb/ti,j721e-usb.yaml b/Documentation/devicetree/bindings/usb/ti,j721e-usb.yaml new file mode 100644 index 000000000000..5f5264b2e9ad --- /dev/null +++ b/Documentation/devicetree/bindings/usb/ti,j721e-usb.yaml @@ -0,0 +1,86 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/usb/ti,j721e-usb.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Bindings for the TI wrapper module for the Cadence USBSS-DRD controller + +maintainers: + - Roger Quadros <rogerq@ti.com> + +properties: + compatible: + items: + - const: ti,j721e-usb + + reg: + description: module registers + + power-domains: + description: + PM domain provider node and an args specifier containing + the USB device id value. See, + Documentation/devicetree/bindings/soc/ti/sci-pm-domain.txt + + clocks: + description: Clock phandles to usb2_refclk and lpm_clk + minItems: 2 + maxItems: 2 + + clock-names: + items: + - const: ref + - const: lpm + + ti,usb2-only: + description: + If present, it restricts the controller to USB2.0 mode of + operation. Must be present if USB3 PHY is not available + for USB. + type: boolean + + ti,vbus-divider: + description: + Should be present if USB VBUS line is connected to the + VBUS pin of the SoC via a 1/3 voltage divider. 
+ type: boolean + +required: + - compatible + - reg + - power-domains + - clocks + - clock-names + +examples: + - | + #include <dt-bindings/soc/ti,sci_pm_domain.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + cdns_usb@4104000 { + compatible = "ti,j721e-usb"; + reg = <0x00 0x4104000 0x00 0x100>; + power-domains = <&k3_pds 288 TI_SCI_PD_EXCLUSIVE>; + clocks = <&k3_clks 288 15>, <&k3_clks 288 3>; + clock-names = "ref", "lpm"; + assigned-clocks = <&k3_clks 288 15>; /* USB2_REFCLK */ + assigned-clock-parents = <&k3_clks 288 16>; /* HFOSC0 */ + #address-cells = <2>; + #size-cells = <2>; + + usb@6000000 { + compatible = "cdns,usb3"; + reg = <0x00 0x6000000 0x00 0x10000>, + <0x00 0x6010000 0x00 0x10000>, + <0x00 0x6020000 0x00 0x10000>; + reg-names = "otg", "xhci", "dev"; + interrupts = <GIC_SPI 96 IRQ_TYPE_LEVEL_HIGH>, /* irq.0 */ + <GIC_SPI 102 IRQ_TYPE_LEVEL_HIGH>, /* irq.6 */ + <GIC_SPI 120 IRQ_TYPE_LEVEL_HIGH>; /* otgirq.0 */ + interrupt-names = "host", + "peripheral", + "otg"; + maximum-speed = "super-speed"; + dr_mode = "otg"; + }; + }; diff --git a/Documentation/devicetree/bindings/usb/usb-xhci.txt b/Documentation/devicetree/bindings/usb/usb-xhci.txt index b49b819571f9..3f378951d624 100644 --- a/Documentation/devicetree/bindings/usb/usb-xhci.txt +++ b/Documentation/devicetree/bindings/usb/usb-xhci.txt @@ -10,6 +10,7 @@ Required properties: - "renesas,xhci-r8a7743" for r8a7743 SoC - "renesas,xhci-r8a7744" for r8a7744 SoC - "renesas,xhci-r8a774a1" for r8a774a1 SoC + - "renesas,xhci-r8a774b1" for r8a774b1 SoC - "renesas,xhci-r8a774c0" for r8a774c0 SoC - "renesas,xhci-r8a7790" for r8a7790 SoC - "renesas,xhci-r8a7791" for r8a7791 SoC diff --git a/Documentation/devicetree/bindings/usb/usb251xb.txt b/Documentation/devicetree/bindings/usb/usb251xb.txt index 17915f64b8ee..1a934eab175e 100644 --- a/Documentation/devicetree/bindings/usb/usb251xb.txt +++ b/Documentation/devicetree/bindings/usb/usb251xb.txt @@ -7,11 +7,12 @@ Required properties : - compatible : 
Should be "microchip,usb251xb" or one of the specific types: "microchip,usb2512b", "microchip,usb2512bi", "microchip,usb2513b", "microchip,usb2513bi", "microchip,usb2514b", "microchip,usb2514bi", - "microchip,usb2517", "microchip,usb2517i" + "microchip,usb2517", "microchip,usb2517i", "microchip,usb2422" - reg : I2C address on the selected bus (default is <0x2C>) Optional properties : - reset-gpios : Should specify the gpio for hub reset + - vdd-supply : Should specify the phandle to the regulator supplying vdd - skip-config : Skip Hub configuration, but only send the USB-Attach command - vendor-id : Set USB Vendor ID of the hub (16 bit, default is 0x0424) - product-id : Set USB Product ID of the hub (16 bit, default depends on type) diff --git a/Documentation/devicetree/bindings/vendor-prefixes.yaml b/Documentation/devicetree/bindings/vendor-prefixes.yaml index 967e78c5ec0a..fd6fa07c45b8 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.yaml +++ b/Documentation/devicetree/bindings/vendor-prefixes.yaml @@ -16,7 +16,7 @@ properties: {} patternProperties: # Prefixes which are not vendors, but followed the pattern # DO NOT ADD NEW PROPERTIES TO THIS LIST - "^(at25|devbus|dmacap|dsa|exynos|gpio-fan|gpio|gpmc|hdmi|i2c-gpio),.*": true + "^(at25|devbus|dmacap|dsa|exynos|fsi[ab]|gpio-fan|gpio|gpmc|hdmi|i2c-gpio),.*": true "^(keypad|m25p|max8952|max8997|max8998|mpmc),.*": true "^(pinctrl-single|#pinctrl-single|PowerPC),.*": true "^(pl022|pxa-mmc|rcar_sound|rotary-encoder|s5m8767|sdhci),.*": true @@ -343,6 +343,8 @@ patternProperties: description: Freescale Semiconductor "^fujitsu,.*": description: Fujitsu Ltd. 
+ "^gardena,.*": + description: GARDENA GmbH "^gateworks,.*": description: Gateworks Corporation "^gcw,.*": diff --git a/Documentation/driver-api/device_link.rst b/Documentation/driver-api/device_link.rst index 1b5020ec6517..bc2d89af88ce 100644 --- a/Documentation/driver-api/device_link.rst +++ b/Documentation/driver-api/device_link.rst @@ -281,7 +281,8 @@ State machine :c:func:`driver_bound()`.) * Before a consumer device is probed, presence of supplier drivers is - verified by checking that links to suppliers are in ``DL_STATE_AVAILABLE`` + verified by checking the consumer device is not in the wait_for_suppliers + list and by checking that links to suppliers are in ``DL_STATE_AVAILABLE`` state. The state of the links is updated to ``DL_STATE_CONSUMER_PROBE``. (Call to :c:func:`device_links_check_suppliers()` from :c:func:`really_probe()`.) diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index b541e97c7ab1..c78db28519f7 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -118,13 +118,13 @@ Kernel Functions and Structures Reference Reservation Objects ------------------- -.. kernel-doc:: drivers/dma-buf/reservation.c +.. kernel-doc:: drivers/dma-buf/dma-resv.c :doc: Reservation Object Overview -.. kernel-doc:: drivers/dma-buf/reservation.c +.. kernel-doc:: drivers/dma-buf/dma-resv.c :export: -.. kernel-doc:: include/linux/reservation.h +.. 
kernel-doc:: include/linux/dma-resv.h :internal: DMA Fences diff --git a/Documentation/driver-api/driver-model/devres.rst b/Documentation/driver-api/driver-model/devres.rst index a100bef54952..4ab193319d8c 100644 --- a/Documentation/driver-api/driver-model/devres.rst +++ b/Documentation/driver-api/driver-model/devres.rst @@ -316,6 +316,10 @@ IOMAP devm_ioremap_nocache() devm_ioremap_wc() devm_ioremap_resource() : checks resource, requests memory region, ioremaps + devm_ioremap_resource_wc() + devm_platform_ioremap_resource() : calls devm_ioremap_resource() for platform device + devm_platform_ioremap_resource_wc() + devm_platform_ioremap_resource_byname() devm_iounmap() pcim_iomap() pcim_iomap_regions() : do request_region() and iomap() on multiple BARs diff --git a/Documentation/driver-api/driver-model/driver.rst b/Documentation/driver-api/driver-model/driver.rst index 11d281506a04..baa6a85c8287 100644 --- a/Documentation/driver-api/driver-model/driver.rst +++ b/Documentation/driver-api/driver-model/driver.rst @@ -169,6 +169,49 @@ A driver's probe() may return a negative errno value to indicate that the driver did not bind to this device, in which case it should have released all resources it allocated:: + void (*sync_state)(struct device *dev); + +sync_state is called only once for a device. It's called when all the consumer +devices of the device have successfully probed. The list of consumers of the +device is obtained by looking at the device links connecting that device to its +consumer devices. + +The first attempt to call sync_state() is made during late_initcall_sync() to +give firmware and drivers time to link devices to each other. During the first +attempt at calling sync_state(), if all the consumers of the device at that +point in time have already probed successfully, sync_state() is called right +away. 
If there are no consumers of the device during the first attempt, that +too is considered as "all consumers of the device have probed" and sync_state() +is called right away. + +If during the first attempt at calling sync_state() for a device, there are +still consumers that haven't probed successfully, the sync_state() call is +postponed and reattempted in the future only when one or more consumers of the +device probe successfully. If during the reattempt, the driver core finds that +there are one or more consumers of the device that haven't probed yet, then +sync_state() call is postponed again. + +A typical use case for sync_state() is to have the kernel cleanly take over +management of devices from the bootloader. For example, if a device is left on +and at a particular hardware configuration by the bootloader, the device's +driver might need to keep the device in the boot configuration until all the +consumers of the device have probed. Once all the consumers of the device have +probed, the device's driver can synchronize the hardware state of the device to +match the aggregated software state requested by all the consumers. Hence the +name sync_state(). + +While obvious examples of resources that can benefit from sync_state() include +resources such as regulator, sync_state() can also be useful for complex +resources like IOMMUs. For example, IOMMUs with multiple consumers (devices +whose addresses are remapped by the IOMMU) might need to keep their mappings +fixed at (or additive to) the boot configuration until all its consumers have +probed. + +While the typical use case for sync_state() is to have the kernel cleanly take +over management of devices from the bootloader, the usage of sync_state() is +not restricted to that. Use it whenever it makes sense to take an action after +all the consumers of a device have probed. + int (*remove) (struct device *dev); remove is called to unbind a driver from a device. 
This may be diff --git a/Documentation/driver-api/generic-counter.rst b/Documentation/driver-api/generic-counter.rst index 8382f01a53e3..e622f8f6e56a 100644 --- a/Documentation/driver-api/generic-counter.rst +++ b/Documentation/driver-api/generic-counter.rst @@ -7,7 +7,7 @@ Generic Counter Interface Introduction ============ -Counter devices are prevalent within a diverse spectrum of industries. +Counter devices are prevalent among a diverse spectrum of industries. The ubiquitous presence of these devices necessitates a common interface and standard of interaction and exposure. This driver API attempts to resolve the issue of duplicate code found among existing counter device @@ -26,23 +26,72 @@ the Generic Counter interface. There are three core components to a counter: -* Count: - Count data for a set of Signals. - * Signal: - Input data that is evaluated by the counter to determine the count - data. + Stream of data to be evaluated by the counter. * Synapse: - The association of a Signal with a respective Count. + Association of a Signal, and evaluation trigger, with a Count. + +* Count: + Accumulation of the effects of connected Synapses. + +SIGNAL +------ +A Signal represents a stream of data. This is the input data that is +evaluated by the counter to determine the count data; e.g. a quadrature +signal output line of a rotary encoder. Not all counter devices provide +user access to the Signal data, so exposure is optional for drivers. + +When the Signal data is available for user access, the Generic Counter +interface provides the following available signal values: + +* SIGNAL_LOW: + Signal line is in a low state. + +* SIGNAL_HIGH: + Signal line is in a high state. + +A Signal may be associated with one or more Counts. + +SYNAPSE +------- +A Synapse represents the association of a Signal with a Count. Signal +data affects respective Count data, and the Synapse represents this +relationship. 
+ +The Synapse action mode specifies the Signal data condition that +triggers the respective Count's count function evaluation to update the +count data. The Generic Counter interface provides the following +available action modes: + +* None: + Signal does not trigger the count function. In Pulse-Direction count + function mode, this Signal is evaluated as Direction. + +* Rising Edge: + Low state transitions to high state. + +* Falling Edge: + High state transitions to low state. + +* Both Edges: + Any state transition. + +A counter is defined as a set of input signals associated with count +data that are generated by the evaluation of the state of the associated +input signals as defined by the respective count functions. Within the +context of the Generic Counter interface, a counter consists of Counts +each associated with a set of Signals, whose respective Synapse +instances represent the count function update conditions for the +associated Counts. + +A Synapse associates one Signal with one Count. COUNT ----- -A Count represents the count data for a set of Signals. The Generic -Counter interface provides the following available count data types: - -* COUNT_POSITION: - Unsigned integer value representing position. +A Count represents the accumulation of the effects of connected +Synapses; i.e. the count data for a set of Signals. The Generic +Counter interface represents the count data as a natural number. A Count has a count function mode which represents the update behavior for the count data. The Generic Counter interface provides the following @@ -86,60 +135,7 @@ available count function modes: Any state transition on either quadrature pair signals updates the respective count. Quadrature encoding determines the direction. -A Count has a set of one or more associated Signals. - -SIGNAL ------- -A Signal represents a counter input data; this is the input data that is -evaluated by the counter to determine the count data; e.g. 
a quadrature -signal output line of a rotary encoder. Not all counter devices provide -user access to the Signal data. - -The Generic Counter interface provides the following available signal -data types for when the Signal data is available for user access: - -* SIGNAL_LEVEL: - Signal line state level. The following states are possible: - - - SIGNAL_LEVEL_LOW: - Signal line is in a low state. - - - SIGNAL_LEVEL_HIGH: - Signal line is in a high state. - -A Signal may be associated with one or more Counts. - -SYNAPSE -------- -A Synapse represents the association of a Signal with a respective -Count. Signal data affects respective Count data, and the Synapse -represents this relationship. - -The Synapse action mode specifies the Signal data condition which -triggers the respective Count's count function evaluation to update the -count data. The Generic Counter interface provides the following -available action modes: - -* None: - Signal does not trigger the count function. In Pulse-Direction count - function mode, this Signal is evaluated as Direction. - -* Rising Edge: - Low state transitions to high state. - -* Falling Edge: - High state transitions to low state. - -* Both Edges: - Any state transition. - -A counter is defined as a set of input signals associated with count -data that are generated by the evaluation of the state of the associated -input signals as defined by the respective count functions. Within the -context of the Generic Counter interface, a counter consists of Counts -each associated with a set of Signals, whose respective Synapse -instances represent the count function update conditions for the -associated Counts. +A Count has a set of one or more associated Synapses. Paradigm ======== @@ -286,10 +282,36 @@ if device memory-managed registration is desired. Extension sysfs attributes can be created for auxiliary functionality and data by passing in defined counter_device_ext, counter_count_ext, and counter_signal_ext structures. 
In these cases, the -counter_device_ext structure is used for global configuration of the -respective Counter device, while the counter_count_ext and -counter_signal_ext structures allow for auxiliary exposure and -configuration of a specific Count or Signal respectively. +counter_device_ext structure is used for global/miscellaneous exposure +and configuration of the respective Counter device, while the +counter_count_ext and counter_signal_ext structures allow for auxiliary +exposure and configuration of a specific Count or Signal respectively. + +Determining the type of extension to create is a matter of scope. + +* Signal extensions are attributes that expose information/control + specific to a Signal. These types of attributes will exist under a + Signal's directory in sysfs. + + For example, if you have an invert feature for a Signal, you can have + a Signal extension called "invert" that toggles that feature: + /sys/bus/counter/devices/counterX/signalY/invert + +* Count extensions are attributes that expose information/control + specific to a Count. These type of attributes will exist under a + Count's directory in sysfs. + + For example, if you want to pause/unpause a Count from updating, you + can have a Count extension called "enable" that toggles such: + /sys/bus/counter/devices/counterX/countY/enable + +* Device extensions are attributes that expose information/control + non-specific to a particular Count or Signal. This is where you would + put your global features or other miscellanous functionality. 
+ + For example, if your device has an overtemp sensor, you can report the + chip overheated via a device extension called "error_overtemp": + /sys/bus/counter/devices/counterX/error_overtemp Architecture ============ diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 38e638abe3eb..d0efa35052e3 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -26,6 +26,7 @@ available subsections can be seen below. device_link component message-based + infiniband sound frame-buffer regulator diff --git a/Documentation/driver-api/infiniband.rst b/Documentation/driver-api/infiniband.rst new file mode 100644 index 000000000000..1a3116f32ff0 --- /dev/null +++ b/Documentation/driver-api/infiniband.rst @@ -0,0 +1,127 @@ +=========================================== +InfiniBand and Remote DMA (RDMA) Interfaces +=========================================== + +Introduction and Overview +========================= + +TBD + +InfiniBand core interfaces +========================== + +.. kernel-doc:: drivers/infiniband/core/iwpm_util.h + :internal: + +.. kernel-doc:: drivers/infiniband/core/cq.c + :export: + +.. kernel-doc:: drivers/infiniband/core/cm.c + :export: + +.. kernel-doc:: drivers/infiniband/core/rw.c + :export: + +.. kernel-doc:: drivers/infiniband/core/device.c + :export: + +.. kernel-doc:: drivers/infiniband/core/verbs.c + :export: + +.. kernel-doc:: drivers/infiniband/core/packer.c + :export: + +.. kernel-doc:: drivers/infiniband/core/sa_query.c + :export: + +.. kernel-doc:: drivers/infiniband/core/ud_header.c + :export: + +.. kernel-doc:: drivers/infiniband/core/fmr_pool.c + :export: + +.. kernel-doc:: drivers/infiniband/core/umem.c + :export: + +.. kernel-doc:: drivers/infiniband/core/umem_odp.c + :export: + +RDMA Verbs transport library +============================ + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/mr.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/rc.c + :export: + +.. 
kernel-doc:: drivers/infiniband/sw/rdmavt/ah.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/vt.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/cq.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/qp.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/mcast.c + :export: + +Upper Layer Protocols +===================== + +iSCSI Extensions for RDMA (iSER) +-------------------------------- + +.. kernel-doc:: drivers/infiniband/ulp/iser/iscsi_iser.h + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/iser/iscsi_iser.c + :functions: iscsi_iser_pdu_alloc iser_initialize_task_headers \ + iscsi_iser_task_init iscsi_iser_mtask_xmit iscsi_iser_task_xmit \ + iscsi_iser_cleanup_task iscsi_iser_check_protection \ + iscsi_iser_conn_create iscsi_iser_conn_bind \ + iscsi_iser_conn_start iscsi_iser_conn_stop \ + iscsi_iser_session_destroy iscsi_iser_session_create \ + iscsi_iser_set_param iscsi_iser_ep_connect iscsi_iser_ep_poll \ + iscsi_iser_ep_disconnect + +.. kernel-doc:: drivers/infiniband/ulp/iser/iser_initiator.c + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/iser/iser_verbs.c + :internal: + +Omni-Path (OPA) Virtual NIC support +----------------------------------- + +.. kernel-doc:: drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c + :internal: + +InfiniBand SCSI RDMA protocol target support +-------------------------------------------- + +.. kernel-doc:: drivers/infiniband/ulp/srpt/ib_srpt.h + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/srpt/ib_srpt.c + :internal: + +iSCSI Extensions for RDMA (iSER) target support +----------------------------------------------- + +.. 
kernel-doc:: drivers/infiniband/ulp/isert/ib_isert.c + :internal: + diff --git a/Documentation/driver-api/libata.rst b/Documentation/driver-api/libata.rst index 70e180e6b93d..207f0d24de69 100644 --- a/Documentation/driver-api/libata.rst +++ b/Documentation/driver-api/libata.rst @@ -250,23 +250,23 @@ High-level taskfile hooks :: - void (*qc_prep) (struct ata_queued_cmd *qc); + enum ata_completion_errors (*qc_prep) (struct ata_queued_cmd *qc); int (*qc_issue) (struct ata_queued_cmd *qc); -Higher-level hooks, these two hooks can potentially supercede several of +Higher-level hooks, these two hooks can potentially supersede several of the above taskfile/DMA engine hooks. ``->qc_prep`` is called after the buffers have been DMA-mapped, and is typically used to populate the -hardware's DMA scatter-gather table. Most drivers use the standard -:c:func:`ata_qc_prep` helper function, but more advanced drivers roll their -own. +hardware's DMA scatter-gather table. Some drivers use the standard +:c:func:`ata_bmdma_qc_prep` and :c:func:`ata_bmdma_dumb_qc_prep` helper +functions, but more advanced drivers roll their own. ``->qc_issue`` is used to make a command active, once the hardware and S/G tables have been prepared. IDE BMDMA drivers use the helper function -:c:func:`ata_qc_issue_prot` for taskfile protocol-based dispatch. More +:c:func:`ata_sff_qc_issue` for taskfile protocol-based dispatch. More advanced drivers implement their own ``->qc_issue``. -:c:func:`ata_qc_issue_prot` calls ``->tf_load()``, ``->bmdma_setup()``, and +:c:func:`ata_sff_qc_issue` calls ``->sff_tf_load()``, ``->bmdma_setup()``, and ``->bmdma_start()`` as necessary to initiate a transfer. 
Exception and probe handling (EH) diff --git a/Documentation/driver-api/nvmem.rst b/Documentation/driver-api/nvmem.rst index d9d958d5c824..287e86819640 100644 --- a/Documentation/driver-api/nvmem.rst +++ b/Documentation/driver-api/nvmem.rst @@ -129,6 +129,8 @@ To facilitate such consumers NVMEM framework provides below apis:: struct nvmem_device *nvmem_device_get(struct device *dev, const char *name); struct nvmem_device *devm_nvmem_device_get(struct device *dev, const char *name); + struct nvmem_device *nvmem_device_find(void *data, + int (*match)(struct device *dev, const void *data)); void nvmem_device_put(struct nvmem_device *nvmem); int nvmem_device_read(struct nvmem_device *nvmem, unsigned int offset, size_t bytes, void *buf); diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.txt index 9e27c843d00e..dc497b96fa4f 100644 --- a/Documentation/filesystems/debugfs.txt +++ b/Documentation/filesystems/debugfs.txt @@ -68,41 +68,49 @@ actually necessary; the debugfs code provides a number of helper functions for simple situations. Files containing a single integer value can be created with any of: - struct dentry *debugfs_create_u8(const char *name, umode_t mode, - struct dentry *parent, u8 *value); - struct dentry *debugfs_create_u16(const char *name, umode_t mode, - struct dentry *parent, u16 *value); + void debugfs_create_u8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + void debugfs_create_u16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); struct dentry *debugfs_create_u32(const char *name, umode_t mode, struct dentry *parent, u32 *value); - struct dentry *debugfs_create_u64(const char *name, umode_t mode, - struct dentry *parent, u64 *value); + void debugfs_create_u64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); These files support both reading and writing the given value; if a specific file should not be written to, simply set the mode bits accordingly. 
The values in these files are in decimal; if hexadecimal is more appropriate, the following functions can be used instead: - struct dentry *debugfs_create_x8(const char *name, umode_t mode, - struct dentry *parent, u8 *value); - struct dentry *debugfs_create_x16(const char *name, umode_t mode, - struct dentry *parent, u16 *value); - struct dentry *debugfs_create_x32(const char *name, umode_t mode, - struct dentry *parent, u32 *value); - struct dentry *debugfs_create_x64(const char *name, umode_t mode, - struct dentry *parent, u64 *value); + void debugfs_create_x8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + void debugfs_create_x16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + void debugfs_create_x32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + void debugfs_create_x64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); These functions are useful as long as the developer knows the size of the value to be exported. Some types can have different widths on different -architectures, though, complicating the situation somewhat. There is a -function meant to help out in one special case: +architectures, though, complicating the situation somewhat. There are +functions meant to help out in such special cases: - struct dentry *debugfs_create_size_t(const char *name, umode_t mode, - struct dentry *parent, - size_t *value); + void debugfs_create_size_t(const char *name, umode_t mode, + struct dentry *parent, size_t *value); As might be expected, this function will create a debugfs file to represent a variable of type size_t. 
+Similarly, there are helpers for variables of type unsigned long, in decimal +and hexadecimal: + + struct dentry *debugfs_create_ulong(const char *name, umode_t mode, + struct dentry *parent, + unsigned long *value); + void debugfs_create_xul(const char *name, umode_t mode, + struct dentry *parent, unsigned long *value); + Boolean values can be placed in debugfs with: struct dentry *debugfs_create_bool(const char *name, umode_t mode, @@ -114,8 +122,8 @@ lower-case values, or 1 or 0. Any other input will be silently ignored. Also, atomic_t values can be placed in debugfs with: - struct dentry *debugfs_create_atomic_t(const char *name, umode_t mode, - struct dentry *parent, atomic_t *value) + void debugfs_create_atomic_t(const char *name, umode_t mode, + struct dentry *parent, atomic_t *value) A read of this file will get atomic_t values, and a write of this file will set atomic_t values. diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index 8a0700af9596..471a511c7508 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -256,13 +256,8 @@ alternative master keys or to support rotating master keys. Instead, the master keys may be wrapped in userspace, e.g. as is done by the `fscrypt <https://github.com/google/fscrypt>`_ tool. -Including the inode number in the IVs was considered. However, it was -rejected as it would have prevented ext4 filesystems from being -resized, and by itself still wouldn't have been sufficient to prevent -the same key from being directly reused for both XTS and CTS-CBC. - -DIRECT_KEY and per-mode keys ----------------------------- +DIRECT_KEY policies +------------------- The Adiantum encryption mode (see `Encryption modes and usage`_) is suitable for both contents and filenames encryption, and it accepts @@ -285,6 +280,21 @@ IV. Moreover: key derived using the KDF. Users may use the same master key for other v2 encryption policies. 
+IV_INO_LBLK_64 policies +----------------------- + +When FSCRYPT_POLICY_FLAG_IV_INO_LBLK_64 is set in the fscrypt policy, +the encryption keys are derived from the master key, encryption mode +number, and filesystem UUID. This normally results in all files +protected by the same master key sharing a single contents encryption +key and a single filenames encryption key. To still encrypt different +files' data differently, inode numbers are included in the IVs. +Consequently, shrinking the filesystem may not be allowed. + +This format is optimized for use with inline encryption hardware +compliant with the UFS or eMMC standards, which support only 64 IV +bits per I/O request and may have only a small number of keyslots. + Key identifiers --------------- @@ -308,8 +318,9 @@ If unsure, you should use the (AES-256-XTS, AES-256-CTS-CBC) pair. AES-128-CBC was added only for low-powered embedded devices with crypto accelerators such as CAAM or CESA that do not support XTS. To -use AES-128-CBC, CONFIG_CRYPTO_SHA256 (or another SHA-256 -implementation) must be enabled so that ESSIV can be used. +use AES-128-CBC, CONFIG_CRYPTO_ESSIV and CONFIG_CRYPTO_SHA256 (or +another SHA-256 implementation) must be enabled so that ESSIV can be +used. Adiantum is a (primarily) stream cipher-based mode that is fast even on CPUs without dedicated crypto instructions. It's also a true @@ -341,10 +352,16 @@ a little endian number, except that: is encrypted with AES-256 where the AES-256 key is the SHA-256 hash of the file's data encryption key. -- In the "direct key" configuration (FSCRYPT_POLICY_FLAG_DIRECT_KEY - set in the fscrypt_policy), the file's nonce is also appended to the - IV. Currently this is only allowed with the Adiantum encryption - mode. +- With `DIRECT_KEY policies`_, the file's nonce is appended to the IV. + Currently this is only allowed with the Adiantum encryption mode. 
+ +- With `IV_INO_LBLK_64 policies`_, the logical block number is limited + to 32 bits and is placed in bits 0-31 of the IV. The inode number + (which is also limited to 32 bits) is placed in bits 32-63. + +Note that because file logical block numbers are included in the IVs, +filesystems must enforce that blocks are never shifted around within +encrypted files, e.g. via "collapse range" or "insert range". Filenames encryption -------------------- @@ -354,10 +371,10 @@ the requirements to retain support for efficient directory lookups and filenames of up to 255 bytes, the same IV is used for every filename in a directory. -However, each encrypted directory still uses a unique key; or -alternatively (for the "direct key" configuration) has the file's -nonce included in the IVs. Thus, IV reuse is limited to within a -single directory. +However, each encrypted directory still uses a unique key, or +alternatively has the file's nonce (for `DIRECT_KEY policies`_) or +inode number (for `IV_INO_LBLK_64 policies`_) included in the IVs. +Thus, IV reuse is limited to within a single directory. With CTS-CBC, the IV reuse means that when the plaintext filenames share a common prefix at least as long as the cipher block size (16 @@ -431,12 +448,15 @@ This structure must be initialized as follows: (1) for ``contents_encryption_mode`` and FSCRYPT_MODE_AES_256_CTS (4) for ``filenames_encryption_mode``. -- ``flags`` must contain a value from ``<linux/fscrypt.h>`` which - identifies the amount of NUL-padding to use when encrypting - filenames. If unsure, use FSCRYPT_POLICY_FLAGS_PAD_32 (0x3). - Additionally, if the encryption modes are both - FSCRYPT_MODE_ADIANTUM, this can contain - FSCRYPT_POLICY_FLAG_DIRECT_KEY; see `DIRECT_KEY and per-mode keys`_. +- ``flags`` contains optional flags from ``<linux/fscrypt.h>``: + + - FSCRYPT_POLICY_FLAGS_PAD_*: The amount of NUL padding to use when + encrypting filenames. If unsure, use FSCRYPT_POLICY_FLAGS_PAD_32 + (0x3). 
+ - FSCRYPT_POLICY_FLAG_DIRECT_KEY: See `DIRECT_KEY policies`_. + - FSCRYPT_POLICY_FLAG_IV_INO_LBLK_64: See `IV_INO_LBLK_64 + policies`_. This is mutually exclusive with DIRECT_KEY and is not + supported on v1 policies. - For v2 encryption policies, ``__reserved`` must be zeroed. @@ -1089,7 +1109,7 @@ policy structs (see `Setting an encryption policy`_), except that the context structs also contain a nonce. The nonce is randomly generated by the kernel and is used as KDF input or as a tweak to cause different files to be encrypted differently; see `Per-file keys`_ and -`DIRECT_KEY and per-mode keys`_. +`DIRECT_KEY policies`_. Data path changes ----------------- diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst index 42a0b6dd9e0b..a95536b6443c 100644 --- a/Documentation/filesystems/fsverity.rst +++ b/Documentation/filesystems/fsverity.rst @@ -226,6 +226,14 @@ To do so, check for FS_VERITY_FL (0x00100000) in the returned flags. The verity flag is not settable via FS_IOC_SETFLAGS. You must use FS_IOC_ENABLE_VERITY instead, since parameters must be provided. +statx +----- + +Since Linux v5.5, the statx() system call sets STATX_ATTR_VERITY if +the file has fs-verity enabled. This can perform better than +FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require +opening the file, and opening verity files can be expensive. + Accessing verity files ====================== @@ -398,7 +406,7 @@ pages have been read into the pagecache. (See `Verifying data`_.) ext4 ---- -ext4 supports fs-verity since Linux TODO and e2fsprogs v1.45.2. +ext4 supports fs-verity since Linux v5.4 and e2fsprogs v1.45.2. To create verity files on an ext4 filesystem, the filesystem must have been formatted with ``-O verity`` or had ``tune2fs -O verity`` run on @@ -434,7 +442,7 @@ also only supports extent-based files. f2fs ---- -f2fs supports fs-verity since Linux TODO and f2fs-tools v1.11.0. 
+f2fs supports fs-verity since Linux v5.4 and f2fs-tools v1.11.0. To create verity files on an f2fs filesystem, the filesystem must have been formatted with ``-O verity``. diff --git a/Documentation/firmware-guide/acpi/namespace.rst b/Documentation/firmware-guide/acpi/namespace.rst index 835521baeb89..3eb763d6656d 100644 --- a/Documentation/firmware-guide/acpi/namespace.rst +++ b/Documentation/firmware-guide/acpi/namespace.rst @@ -261,7 +261,7 @@ Description Tables contain information used for the creation of the struct acpi_device objects represented by the given row (xSDT means DSDT or SSDT). -The forth column of the above table indicates the 'bus_id' generation +The fourth column of the above table indicates the 'bus_id' generation rule of the struct acpi_device object: _HID: diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst index 6fa483fc823e..094fc8aacd8e 100644 --- a/Documentation/fpga/dfl.rst +++ b/Documentation/fpga/dfl.rst @@ -108,6 +108,16 @@ More functions are exposed through sysfs error reporting sysfs interfaces allow user to read errors detected by the hardware, and clear the logged errors. + Power management (dfl_fme_power hwmon) + power management hwmon sysfs interfaces allow user to read power management + information (power consumption, thresholds, threshold status, limits, etc.) + and configure power thresholds for different throttling levels. + + Thermal management (dfl_fme_thermal hwmon) + thermal management hwmon sysfs interfaces allow user to read thermal + management information (current temperature, thresholds, threshold status, + etc.). + FIU - PORT ========== diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst index 5acdd1842ea2..0efede580039 100644 --- a/Documentation/gpu/amdgpu.rst +++ b/Documentation/gpu/amdgpu.rst @@ -79,16 +79,71 @@ AMDGPU XGMI Support .. 
kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c :internal: -AMDGPU RAS debugfs control interface -==================================== +AMDGPU RAS Support +================== + +The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and +debugfs (for error injection). + +RAS debugfs/sysfs Control and Error Injection Interfaces +-------------------------------------------------------- .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c :doc: AMDGPU RAS debugfs control interface +RAS Reboot Behavior for Unrecoverable Errors +-------------------------------------------------------- + +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c + :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors + +RAS Error Count sysfs Interface +------------------------------- + +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c + :doc: AMDGPU RAS sysfs Error Count Interface + +RAS EEPROM debugfs Interface +---------------------------- + +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c + :doc: AMDGPU RAS debugfs EEPROM table reset interface + +RAS VRAM Bad Pages sysfs Interface +---------------------------------- + +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c + :doc: AMDGPU RAS sysfs gpu_vram_bad_pages Interface .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c :internal: +Sample Code +----------- +Sample code for testing error injection can be found here: +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c + +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU. +There are four sets of tests: + +RAS Basic Test + +The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files +are present. + +RAS Query Test + +This test checks the RAS availability and enablement status for each supported IP block as well as +the error counts. + +RAS Inject Test + +This test injects errors for each IP. 
+ +RAS Disable Test + +This test tests disabling of RAS features for each IP block. + GPU Power/Thermal Controls and Monitoring ========================================= @@ -130,11 +185,11 @@ pp_od_clk_voltage .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c :doc: pp_od_clk_voltage -pp_dpm_sclk pp_dpm_mclk pp_dpm_pcie -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +pp_dpm_* +~~~~~~~~ .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c - :doc: pp_dpm_sclk pp_dpm_mclk pp_dpm_pcie + :doc: pp_dpm_sclk pp_dpm_mclk pp_dpm_socclk pp_dpm_fclk pp_dpm_dcefclk pp_dpm_pcie pp_power_profile_mode ~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/gpu/drm-kms-helpers.rst b/Documentation/gpu/drm-kms-helpers.rst index 3868008db8a9..9668a7fe2408 100644 --- a/Documentation/gpu/drm-kms-helpers.rst +++ b/Documentation/gpu/drm-kms-helpers.rst @@ -77,9 +77,6 @@ Atomic State Reset and Initialization Atomic State Helper Reference ----------------------------- -.. kernel-doc:: include/drm/drm_atomic_state_helper.h - :internal: - .. kernel-doc:: drivers/gpu/drm/drm_atomic_state_helper.c :export: diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst index b664f054c259..59619296c84b 100644 --- a/Documentation/gpu/drm-mm.rst +++ b/Documentation/gpu/drm-mm.rst @@ -400,16 +400,13 @@ GEM VRAM Helper Functions Reference .. kernel-doc:: drivers/gpu/drm/drm_gem_vram_helper.c :export: -VRAM MM Helper Functions Reference ----------------------------------- +GEM TTM Helper Functions Reference +----------------------------------- -.. kernel-doc:: drivers/gpu/drm/drm_vram_mm_helper.c +.. kernel-doc:: drivers/gpu/drm/drm_gem_ttm_helper.c :doc: overview -.. kernel-doc:: include/drm/drm_vram_mm_helper.h - :internal: - -.. kernel-doc:: drivers/gpu/drm/drm_vram_mm_helper.c +.. 
kernel-doc:: drivers/gpu/drm/drm_gem_ttm_helper.c :export: VMA Offset Manager diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst index 3415255ad3dc..d0947c5c4ab8 100644 --- a/Documentation/gpu/i915.rst +++ b/Documentation/gpu/i915.rst @@ -246,6 +246,15 @@ Display PLLs .. kernel-doc:: drivers/gpu/drm/i915/display/intel_dpll_mgr.h :internal: +Display State Buffer +-------------------- + +.. kernel-doc:: drivers/gpu/drm/i915/display/intel_dsb.c + :doc: DSB + +.. kernel-doc:: drivers/gpu/drm/i915/display/intel_dsb.c + :internal: + Memory Management and Command Submission ======================================== @@ -358,15 +367,6 @@ Batchbuffer Parsing .. kernel-doc:: drivers/gpu/drm/i915/i915_cmd_parser.c :internal: -Batchbuffer Pools ------------------ - -.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_batch_pool.c - :doc: batch pool - -.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_batch_pool.c - :internal: - User Batchbuffer Execution -------------------------- @@ -415,32 +415,53 @@ Object Tiling IOCTLs .. kernel-doc:: drivers/gpu/drm/i915/gem/i915_gem_tiling.c :doc: buffer object tiling +Microcontrollers +================ + +Starting from gen9, three microcontrollers are available on the HW: the +graphics microcontroller (GuC), the HEVC/H.265 microcontroller (HuC) and the +display microcontroller (DMC). The driver is responsible for loading the +firmwares on the microcontrollers; the GuC and HuC firmwares are transferred +to WOPCM using the DMA engine, while the DMC firmware is written through MMIO. + WOPCM -===== +----- WOPCM Layout ------------- +~~~~~~~~~~~~ .. kernel-doc:: drivers/gpu/drm/i915/intel_wopcm.c :doc: WOPCM Layout GuC -=== +--- -Firmware Layout -------------------- +.. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_guc.c + :doc: GuC + +GuC Firmware Layout +~~~~~~~~~~~~~~~~~~~ .. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_uc_fw_abi.h :doc: Firmware Layout +GuC Memory Management +~~~~~~~~~~~~~~~~~~~~~ + +.. 
kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_guc.c + :doc: GuC Memory Management +.. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_guc.c + :functions: intel_guc_allocate_vma + + GuC-specific firmware loader ----------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c :internal: GuC-based command submission ----------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c :doc: GuC-based command submission @@ -448,11 +469,26 @@ GuC-based command submission .. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c :internal: -GuC Address Space ------------------ +HuC +--- +.. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_huc.c + :doc: HuC +.. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_huc.c + :functions: intel_huc_auth -.. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_guc.c - :doc: GuC Address Space +HuC Memory Management +~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/gpu/drm/i915/gt/uc/intel_huc.c + :doc: HuC Memory Management + +HuC Firmware Layout +~~~~~~~~~~~~~~~~~~~ +The HuC FW layout is the same as the GuC one, see `GuC Firmware Layout`_ + +DMC +--- +See `CSR firmware support for DMC`_ Tracing ======= @@ -514,9 +550,9 @@ i915 Perf Stream This section covers the stream-semantics-agnostic structures and functions for representing an i915 perf stream FD and associated file operations. -.. kernel-doc:: drivers/gpu/drm/i915/i915_drv.h +.. kernel-doc:: drivers/gpu/drm/i915/i915_perf_types.h :functions: i915_perf_stream -.. kernel-doc:: drivers/gpu/drm/i915/i915_drv.h +.. kernel-doc:: drivers/gpu/drm/i915/i915_perf_types.h :functions: i915_perf_stream_ops .. kernel-doc:: drivers/gpu/drm/i915/i915_perf.c @@ -541,7 +577,7 @@ for representing an i915 perf stream FD and associated file operations. i915 Perf Observation Architecture Stream ----------------------------------------- -.. kernel-doc:: drivers/gpu/drm/i915/i915_drv.h +.. 
kernel-doc:: drivers/gpu/drm/i915/i915_perf_types.h :functions: i915_oa_ops .. kernel-doc:: drivers/gpu/drm/i915/i915_perf.c diff --git a/Documentation/gpu/mcde.rst b/Documentation/gpu/mcde.rst index c69e977defda..dd43dde379e0 100644 --- a/Documentation/gpu/mcde.rst +++ b/Documentation/gpu/mcde.rst @@ -5,4 +5,4 @@ ======================================================= .. kernel-doc:: drivers/gpu/drm/mcde/mcde_drv.c - :doc: ST-Ericsson MCDE DRM Driver + :doc: ST-Ericsson MCDE Driver diff --git a/Documentation/gpu/todo.rst b/Documentation/gpu/todo.rst index 32787acff0a8..6792fa9b6b6b 100644 --- a/Documentation/gpu/todo.rst +++ b/Documentation/gpu/todo.rst @@ -7,6 +7,22 @@ TODO list This section contains a list of smaller janitorial tasks in the kernel DRM graphics subsystem useful as newbie projects. Or for slow rainy days. +Difficulty +---------- + +To make it easier, tasks are categorized into different levels: + +Starter: Good tasks to get started with the DRM subsystem. + +Intermediate: Tasks which need some experience with working in the DRM +subsystem, or some specific GPU/display graphics knowledge. For debugging issues +it's good to have the relevant hardware (or a virtual driver set up) available +for testing. + +Advanced: Tricky tasks that need a fairly good understanding of the DRM subsystem +and graphics topics. Generally need the relevant hardware for development and +testing. + Subsystem-wide refactorings =========================== @@ -20,6 +36,8 @@ implementations), and then remove it. Contact: Daniel Vetter, respective driver maintainers +Level: Intermediate + Convert existing KMS drivers to atomic modesetting -------------------------------------------------- @@ -38,6 +56,8 @@ do by directly using the new atomic helper driver callbacks. Contact: Daniel Vetter, respective driver maintainers +Level: Advanced + Clean up the clipped coordination confusion around planes --------------------------------------------------------- @@ -50,6 +70,8 @@ helpers.
Contact: Ville Syrjälä, Daniel Vetter, driver maintainers +Level: Advanced + Convert early atomic drivers to async commit helpers ---------------------------------------------------- @@ -63,6 +85,8 @@ events for atomic commits correctly. But fixing these bugs is good anyway. Contact: Daniel Vetter, respective driver maintainers +Level: Advanced + Fallout from atomic KMS ----------------------- @@ -91,6 +115,8 @@ interfaces to fix these issues: Contact: Daniel Vetter +Level: Intermediate + Get rid of dev->struct_mutex from GEM drivers --------------------------------------------- @@ -114,6 +140,8 @@ fine-grained per-buffer object and per-context lockings scheme. Currently only t Contact: Daniel Vetter, respective driver maintainers +Level: Advanced + Convert instances of dev_info/dev_err/dev_warn to their DRM_DEV_* equivalent ---------------------------------------------------------------------------- @@ -129,6 +157,8 @@ are better. Contact: Sean Paul, Maintainer of the driver you plan to convert +Level: Starter + Convert drivers to use simple modeset suspend/resume ---------------------------------------------------- @@ -139,6 +169,8 @@ of the atomic suspend/resume code in older atomic modeset drivers. Contact: Maintainer of the driver you plan to convert +Level: Intermediate + Convert drivers to use drm_fb_helper_fbdev_setup/teardown() ----------------------------------------------------------- @@ -157,6 +189,8 @@ probably use drm_fb_helper_fbdev_teardown(). Contact: Maintainer of the driver you plan to convert +Level: Intermediate + Clean up mmap forwarding ------------------------ @@ -166,14 +200,16 @@ There's drm_gem_prime_mmap() for this now, but still needs to be rolled out. Contact: Daniel Vetter +Level: Intermediate + Generic fbdev defio support --------------------------- The defio support code in the fbdev core has some very specific requirements, -which means drivers need to have a special framebuffer for fbdev. 
Which prevents -us from using the generic fbdev emulation code everywhere. The main issue is -that it uses some fields in struct page itself, which breaks shmem gem objects -(and other things). +which means drivers need to have a special framebuffer for fbdev. The main +issue is that it uses some fields in struct page itself, which breaks shmem +gem objects (and other things). To support defio, affected drivers require +the use of a shadow buffer, which may add CPU and memory overhead. Possible solution would be to write our own defio mmap code in the drm fbdev emulation. It would need to fully wrap the existing mmap ops, forwarding @@ -196,6 +232,8 @@ Might be good to also have some igt testcases for this. Contact: Daniel Vetter, Noralf Tronnes +Level: Advanced + idr_init_base() --------------- @@ -206,6 +244,8 @@ efficient. Contact: Daniel Vetter +Level: Starter + struct drm_gem_object_funcs --------------------------- @@ -216,6 +256,8 @@ We also need a 2nd version of the CMA define that doesn't require the vmapping to be present (different hook for prime importing). Plus this needs to be rolled out to all drivers using their own implementations, too. +Level: Intermediate + Use DRM_MODESET_LOCK_ALL_* helpers instead of boilerplate --------------------------------------------------------- @@ -231,6 +273,8 @@ As a reference, take a look at the conversions already completed in drm core. Contact: Sean Paul, respective driver maintainers +Level: Starter + Rename CMA helpers to DMA helpers --------------------------------- @@ -241,6 +285,9 @@ no one knows what that means) since underneath they just use dma_alloc_coherent. 
Contact: Laurent Pinchart, Daniel Vetter +Level: Intermediate (mostly because it is a huge task without good partial +milestones, not that technically challenging in itself) + Convert direct mode.vrefresh accesses to use drm_mode_vrefresh() ---------------------------------------------------------------- @@ -259,6 +306,8 @@ drm_display_mode to avoid future use. Contact: Sean Paul +Level: Starter + Remove drm_display_mode.hsync ----------------------------- @@ -269,6 +318,8 @@ it to use drm_mode_hsync() instead. Contact: Sean Paul +Level: Starter + drm_fb_helper tasks ------------------- @@ -284,20 +335,24 @@ drm_fb_helper tasks removed: drm_fb_helper_single_add_all_connectors(), drm_fb_helper_add_one_connector() and drm_fb_helper_remove_one_connector(). -Core refactorings -================= +Level: Intermediate -Clean up the DRM header mess ----------------------------- +connector register/unregister fixes +----------------------------------- -The DRM subsystem originally had only one huge global header, ``drmP.h``. This -is now split up, but many source files still include it. The remaining part of -the cleanup work here is to replace any ``#include <drm/drmP.h>`` by only the -headers needed (and fixing up any missing pre-declarations in the headers). +- For most connectors it's a no-op to call drm_connector_register/unregister + directly from driver code, drm_dev_register/unregister take care of this + already. We can remove all of them. -In the end no .c file should need to include ``drmP.h`` anymore. +- For dp drivers it's a bit more of a mess, since we need the connector to be + registered when calling drm_dp_aux_register. Fix this by instead calling + drm_dp_aux_init, and moving the actual registering into a late_register + callback as recommended in the kerneldoc.
-Contact: Daniel Vetter +Level: Intermediate + +Core refactorings +================= Make panic handling work ------------------------ @@ -338,6 +393,8 @@ This is a really varied tasks with lots of little bits and pieces: Contact: Daniel Vetter +Level: Advanced + Clean up the debugfs support ---------------------------- @@ -367,6 +424,8 @@ There's a bunch of issues with it: Contact: Daniel Vetter +Level: Intermediate + KMS cleanups ------------ @@ -382,6 +441,8 @@ Some of these date from the very introduction of KMS in 2008 ... end, for which we could add drm_*_cleanup_kfree(). And then there's the (for historical reasons) misnamed drm_primary_helper_destroy() function. +Level: Intermediate + Better Testing ============== @@ -390,6 +451,8 @@ Enable trinity for DRM And fix up the fallout. Should be really interesting ... +Level: Advanced + Make KMS tests in i-g-t generic ------------------------------- @@ -403,6 +466,8 @@ converting things over. For modeset tests we also first need a bit of infrastructure to use dumb buffers for untiled buffers, to be able to run all the non-i915 specific modeset tests. +Level: Advanced + Extend virtual test driver (VKMS) --------------------------------- @@ -412,6 +477,8 @@ fit the available time. Contact: Daniel Vetter +Level: See details + Backlight Refactoring --------------------- @@ -425,6 +492,8 @@ Plan to fix this: Contact: Daniel Vetter +Level: Intermediate + Driver Specific =============== @@ -438,13 +507,6 @@ See drivers/gpu/drm/amd/display/TODO for tasks. Contact: Harry Wentland, Alex Deucher -i915 ----- - -- Our early/late pm callbacks could be removed in favour of using - device_link_add to model the dependency between i915 and snd_had. See - https://dri.freedesktop.org/docs/drm/driver-api/device_link.html - Bootsplash ========== @@ -460,5 +522,36 @@ for fbdev. 
Contact: Sam Ravnborg +Level: Advanced + Outside DRM =========== + +Convert fbdev drivers to DRM +---------------------------- + +There are plenty of fbdev drivers for older hardware. Some hardware has +become obsolete, but some still provides good(-enough) framebuffers. The +drivers that are still useful should be converted to DRM and afterwards +removed from fbdev. + +Very simple fbdev drivers can best be converted by starting with a new +DRM driver. Simple KMS helpers and SHMEM should be able to handle any +existing hardware. The new driver's call-back functions are filled from +existing fbdev code. + +More complex fbdev drivers can be refactored step-by-step into a DRM +driver with the help of the DRM fbconv helpers. [1] These helpers provide +the transition layer between the DRM core infrastructure and the fbdev +driver interface. Create a new DRM driver on top of the fbconv helpers, +copy over the fbdev driver, and hook it up to the DRM code. Examples for +several fbdev drivers are available at [1] and a tutorial of this process +is available at [2]. The result is a primitive DRM driver that can run X11 +and Weston.
+ + - [1] https://gitlab.freedesktop.org/tzimmermann/linux/tree/fbconv + - [2] https://gitlab.freedesktop.org/tzimmermann/linux/blob/fbconv/drivers/gpu/drm/drm_fbconv_helper.c + +Contact: Thomas Zimmermann <tzimmermann@suse.de> + +Level: Advanced diff --git a/Documentation/hwmon/bel-pfe.rst b/Documentation/hwmon/bel-pfe.rst new file mode 100644 index 000000000000..4b4a7d67854c --- /dev/null +++ b/Documentation/hwmon/bel-pfe.rst @@ -0,0 +1,112 @@ +Kernel driver bel-pfe +====================== + +Supported chips: + + * BEL PFE1100 + + Prefixes: 'pfe1100' + + Addresses scanned: - + + Datasheet: https://www.belfuse.com/resources/datasheets/powersolutions/ds-bps-pfe1100-12-054xa.pdf + + * BEL PFE3000 + + Prefixes: 'pfe3000' + + Addresses scanned: - + + Datasheet: https://www.belfuse.com/resources/datasheets/powersolutions/ds-bps-pfe3000-series.pdf + +Author: Tao Ren <rentao.bupt@gmail.com> + + +Description +----------- + +This driver supports hardware monitoring for below power supply devices +which support PMBus Protocol: + + * BEL PFE1100 + + 1100 Watt AC to DC power-factor-corrected (PFC) power supply. + PMBus Communication Manual is not publicly available. + + * BEL PFE3000 + + 3000 Watt AC/DC power-factor-corrected (PFC) and DC-DC power supply. + PMBus Communication Manual is not publicly available. + +The driver is a client driver to the core PMBus driver. Please see +Documentation/hwmon/pmbus.rst for details on PMBus client drivers. + + +Usage Notes +----------- + +This driver does not auto-detect devices. You will have to instantiate the +devices explicitly. Please see Documentation/i2c/instantiating-devices.rst for +details. + +Example: the following will load the driver for an PFE3000 at address 0x20 +on I2C bus #1:: + + $ modprobe bel-pfe + $ echo pfe3000 0x20 > /sys/bus/i2c/devices/i2c-1/new_device + + +Platform data support +--------------------- + +The driver supports standard PMBus driver platform data. 
+ + +Sysfs entries +------------- + +======================= ======================================================= +curr1_label "iin" +curr1_input Measured input current +curr1_max Input current max value +curr1_max_alarm Input current max alarm + +curr[2-3]_label "iout[1-2]" +curr[2-3]_input Measured output current +curr[2-3]_max Output current max value +curr[2-3]_max_alarm Output current max alarm + +fan[1-2]_input Fan 1 and 2 speed in RPM +fan1_target Set fan speed reference for both fans + +in1_label "vin" +in1_input Measured input voltage +in1_crit Input voltage critical max value +in1_crit_alarm Input voltage critical max alarm +in1_lcrit Input voltage critical min value +in1_lcrit_alarm Input voltage critical min alarm +in1_max Input voltage max value +in1_max_alarm Input voltage max alarm + +in2_label "vcap" +in2_input Hold up capacitor voltage + +in[3-8]_label "vout[1-3,5-7]" +in[3-8]_input Measured output voltage +in[3-4]_alarm vout[1-2] output voltage alarm + +power[1-2]_label "pin[1-2]" +power[1-2]_input Measured input power +power[1-2]_alarm Input power high alarm + +power[3-4]_label "pout[1-2]" +power[3-4]_input Measured output power + +temp[1-3]_input Measured temperature +temp[1-3]_alarm Temperature alarm +======================= ======================================================= + +.. note:: + + - curr3, fan2, vout[2-7], vcap, pin2, pout2 and temp3 attributes only + exist for PFE3000. diff --git a/Documentation/hwmon/dell-smm-hwmon.rst b/Documentation/hwmon/dell-smm-hwmon.rst new file mode 100644 index 000000000000..3bf77a5df995 --- /dev/null +++ b/Documentation/hwmon/dell-smm-hwmon.rst @@ -0,0 +1,164 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +.. 
include:: <isonum.txt> + +Kernel driver dell-smm-hwmon +============================ + +:Copyright: |copy| 2002-2005 Massimo Dal Zotto <dz@debian.org> +:Copyright: |copy| 2019 Giovanni Mascellani <gio@debian.org> + +Description +----------- + +On many Dell laptops the System Management Mode (SMM) BIOS can be +queried for the status of fans and temperature sensors. Userspace +utilities like ``sensors`` can be used to return the readings. The +userspace suite `i8kutils`__ can also be used to read the sensors and +automatically adjust fan speed (please note that it currently uses +the deprecated ``/proc/i8k`` interface). + + __ https://github.com/vitorafsr/i8kutils + +``sysfs`` interface +------------------- + +Temperature sensors and fans can be queried and set via the standard +``hwmon`` interface on ``sysfs``, under the directory +``/sys/class/hwmon/hwmonX`` for some value of ``X`` (search for the +``X`` such that ``/sys/class/hwmon/hwmonX/name`` has content +``dell_smm``). A number of other attributes can be read or written: + +=============================== ======= ======================================= +Name Perm Description +=============================== ======= ======================================= +fan[1-3]_input RO Fan speed in RPM. +fan[1-3]_label RO Fan label. +pwm[1-3] RW Control the fan PWM duty-cycle. +pwm1_enable WO Enable or disable automatic BIOS fan + control (not supported on all laptops, + see below for details). +temp[1-10]_input RO Temperature reading in milli-degrees + Celsius. +temp[1-10]_label RO Temperature sensor label. +=============================== ======= ======================================= + +Disabling automatic BIOS fan control +------------------------------------ + +On some laptops the BIOS automatically sets fan speed every few +seconds. Therefore the fan speed set by means of this driver is quickly +overwritten.
+ +There is experimental support for disabling automatic BIOS fan +control, at least on laptops where the corresponding SMM command is +known, by writing the value ``1`` in the attribute ``pwm1_enable`` +(writing ``2`` enables automatic BIOS control again). Even if you have +more than one fan, all of them are set to either enabled or disabled +automatic fan control at the same time and, notwithstanding the name, +``pwm1_enable`` sets automatic control for all fans. + +If ``pwm1_enable`` is not available, then it means that SMM codes for +enabling and disabling automatic BIOS fan control are not whitelisted +for your hardware. It is possible that codes that work for other +laptops actually work for yours as well, or that you have to discover +new codes. + +Check the list ``i8k_whitelist_fan_control`` in file +``drivers/hwmon/dell-smm-hwmon.c`` in the kernel tree: as a first +attempt you can try to add your machine and use an already-known code +pair. If, after recompiling the kernel, you see that ``pwm1_enable`` +is present and works (i.e., you can manually control the fan speed), +then please submit your finding as a kernel patch, so that other users +can benefit from it. Please see +:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` +for information on submitting patches. + +If no known code works on your machine, you need to resort to do some +probing, because unfortunately Dell does not publish datasheets for +its SMM. You can experiment with the code in `this repository`__ to +probe the BIOS on your machine and discover the appropriate codes. + + __ https://github.com/clopez/dellfan/ + +Again, when you find new codes, we'd be happy to have your patches! + +Module parameters +----------------- + +* force:bool + Force loading without checking for supported + models. (default: 0) + +* ignore_dmi:bool + Continue probing hardware even if DMI data does not + match. 
(default: 0) + +* restricted:bool + Allow fan control only to processes with the + ``CAP_SYS_ADMIN`` capability set or processes run + as root when using the legacy ``/proc/i8k`` + interface. In this case normal users will be able + to read temperature and fan status but not to + control the fan. If your notebook is shared with + other users and you don't trust them you may want + to use this option. (default: 1, only available + with ``CONFIG_I8K``) + +* power_status:bool + Report AC status in ``/proc/i8k``. (default: 0, + only available with ``CONFIG_I8K``) + +* fan_mult:uint + Factor to multiply fan speed with. (default: + autodetect) + +* fan_max:uint + Maximum configurable fan speed. (default: + autodetect) + +Legacy ``/proc`` interface +-------------------------- + +.. warning:: This interface is obsolete and deprecated and should not + be used in new applications. This interface is only + available when the kernel is compiled with option + ``CONFIG_I8K``. + +The information provided by the kernel driver can be accessed by +simply reading the ``/proc/i8k`` file. For example:: + + $ cat /proc/i8k + 1.0 A17 2J59L02 52 2 1 8040 6420 1 2 + +The fields read from ``/proc/i8k`` are:: + + 1.0 A17 2J59L02 52 2 1 8040 6420 1 2 + | | | | | | | | | | + | | | | | | | | | +------- 10. buttons status + | | | | | | | | +--------- 9. AC status + | | | | | | | +-------------- 8. fan0 RPM + | | | | | | +------------------- 7. fan1 RPM + | | | | | +--------------------- 6. fan0 status + | | | | +----------------------- 5. fan1 status + | | | +-------------------------- 4. temp0 reading (Celsius) + | | +---------------------------------- 3. Dell service tag (later known as 'serial number') + | +-------------------------------------- 2. BIOS version + +------------------------------------------ 1. /proc/i8k format version + +A negative value, for example -22, indicates that the BIOS doesn't +return the corresponding information. This is normal on some +models/BIOSes.
+ +For performance reasons the ``/proc/i8k`` doesn't report by default +the AC status since this SMM call takes a long time to execute and is +not really needed. If you want to see the AC status in ``/proc/i8k`` +you must explicitly enable this option by passing the +``power_status=1`` parameter to insmod. If AC status is not +available -1 is printed instead. + +The driver also provides an ioctl interface which can be used to +obtain the same information and to control the fan status. The ioctl +interface can be accessed from C programs or from shell using the +i8kctl utility. See the source file of ``i8kutils`` for more +information on how to use the ioctl interface. diff --git a/Documentation/hwmon/ina3221.rst b/Documentation/hwmon/ina3221.rst index f6007ae8f4e2..297f7323b441 100644 --- a/Documentation/hwmon/ina3221.rst +++ b/Documentation/hwmon/ina3221.rst @@ -41,6 +41,18 @@ curr[123]_max Warning alert current(mA) setting, activates the average is above this value. curr[123]_max_alarm Warning alert current limit exceeded in[456]_input Shunt voltage(uV) for channels 1, 2, and 3 respectively +in7_input Sum of shunt voltage(uV) channels +in7_label Channel label for sum of shunt voltage +curr4_input Sum of current(mA) measurement channels, + (only available when all channels use the same resistor + value for their shunt resistors) +curr4_crit Critical alert current(mA) setting for sum of current + measurements, activates the corresponding alarm + when the respective current is above this value + (only effective when all channels use the same resistor + value for their shunt resistors) +curr4_crit_alarm Critical alert current limit exceeded for sum of + current measurements. samples Number of samples using in the averaging mode.
Supports the list of number of samples: diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst index 230ad59b462b..43cc605741ea 100644 --- a/Documentation/hwmon/index.rst +++ b/Documentation/hwmon/index.rst @@ -41,9 +41,11 @@ Hardware Monitoring Kernel Drivers asb100 asc7621 aspeed-pwm-tacho + bel-pfe coretemp da9052 da9055 + dell-smm-hwmon dme1737 ds1621 ds620 @@ -90,6 +92,7 @@ Hardware Monitoring Kernel Drivers lm95245 lochnagar ltc2945 + ltc2947 ltc2978 ltc2990 ltc3815 @@ -153,6 +156,7 @@ Hardware Monitoring Kernel Drivers tmp108 tmp401 tmp421 + tmp513 tps40422 twl4030-madc-hwmon ucd9000 diff --git a/Documentation/hwmon/ltc2947.rst b/Documentation/hwmon/ltc2947.rst new file mode 100644 index 000000000000..419fc84fe934 --- /dev/null +++ b/Documentation/hwmon/ltc2947.rst @@ -0,0 +1,100 @@ +Kernel drivers ltc2947-i2c and ltc2947-spi +========================================== + +Supported chips: + + * Analog Devices LTC2947 + + Prefix: 'ltc2947' + + Addresses scanned: - + + Datasheet: + + https://www.analog.com/media/en/technical-documentation/data-sheets/LTC2947.pdf + +Author: Nuno Sá <nuno.sa@analog.com> + +Description +___________ + +The LTC2947 is a high precision power and energy monitor that measures current, +voltage, power, temperature, charge and energy. The device supports both SPI +and I2C depending on the chip configuration. +The device also measures accumulated quantities such as energy. It has two banks of +registers to read/set energy related values. These banks can be configured +independently to have setups like: energy1 accumulates always and energy2 only +accumulates if current is positive (to check battery charging efficiency for +example). The device also supports a GPIO pin that can be configured as output +to control a fan as a function of measured temperature. Then, the GPIO becomes +active as soon as a temperature reading is higher than a defined threshold.
The +temp2 channel is used to control this thresholds and to read the respective +alarms. + +Sysfs entries +_____________ + +The following attributes are supported. Limits are read-write, reset_history +is write-only and all the other attributes are read-only. + +======================= ========================================== +in0_input VP-VM voltage (mV). +in0_min Undervoltage threshold +in0_max Overvoltage threshold +in0_lowest Lowest measured voltage +in0_highest Highest measured voltage +in0_reset_history Write 1 to reset in1 history +in0_min_alarm Undervoltage alarm +in0_max_alarm Overvoltage alarm +in0_label Channel label (VP-VM) + +in1_input DVCC voltage (mV) +in1_min Undervoltage threshold +in1_max Overvoltage threshold +in1_lowest Lowest measured voltage +in1_highest Highest measured voltage +in1_reset_history Write 1 to reset in2 history +in1_min_alarm Undervoltage alarm +in1_max_alarm Overvoltage alarm +in1_label Channel label (DVCC) + +curr1_input IP-IM Sense current (mA) +curr1_min Undercurrent threshold +curr1_max Overcurrent threshold +curr1_lowest Lowest measured current +curr1_highest Highest measured current +curr1_reset_history Write 1 to reset curr1 history +curr1_min_alarm Undercurrent alarm +curr1_max_alarm Overcurrent alarm +curr1_label Channel label (IP-IM) + +power1_input Power (in uW) +power1_min Low power threshold +power1_max High power threshold +power1_input_lowest Historical minimum power use +power1_input_highest Historical maximum power use +power1_reset_history Write 1 to reset power1 history +power1_min_alarm Low power alarm +power1_max_alarm High power alarm +power1_label Channel label (Power) + +temp1_input Chip Temperature (in milliC) +temp1_min Low temperature threshold +temp1_max High temperature threshold +temp1_input_lowest Historical minimum temperature use +temp1_input_highest Historical maximum temperature use +temp1_reset_history Write 1 to reset temp1 history +temp1_min_alarm Low temperature alarm +temp1_max_alarm 
High temperature alarm +temp1_label Channel label (Ambient) + +temp2_min Low temperature threshold for fan control +temp2_max High temperature threshold for fan control +temp2_min_alarm Low temperature fan control alarm +temp2_max_alarm High temperature fan control alarm +temp2_label Channel label (TEMPFAN) + +energy1_input Measured energy over time (in microJoule) + +energy2_input Measured energy over time (in microJoule) +======================= ========================================== diff --git a/Documentation/hwmon/tmp513.rst b/Documentation/hwmon/tmp513.rst new file mode 100644 index 000000000000..6c8fae4b1a75 --- /dev/null +++ b/Documentation/hwmon/tmp513.rst @@ -0,0 +1,103 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Kernel driver tmp513 +==================== + +Supported chips: + + * Texas Instruments TMP512 + + Prefix: 'tmp512' + + Datasheet: http://www.ti.com/lit/ds/symlink/tmp512.pdf + + * Texas Instruments TMP513 + + Prefix: 'tmp513' + + Datasheet: http://www.ti.com/lit/ds/symlink/tmp513.pdf + +Authors: + + Eric Tremblay <etremblay@distech-controls.com> + +Description +----------- + +This driver implements support for Texas Instruments TMP512 and TMP513. +The TMP512 (dual-channel) and TMP513 (triple-channel) are system monitors +that include remote sensors, a local temperature sensor, and a high-side current +shunt monitor. These system monitors have the capability of measuring remote +temperatures, on-chip temperatures, and system voltage/power/current +consumption. + +The temperatures are measured in degrees Celsius with a range of +-40 to + 125 degrees with a resolution of 0.0625 degree C. + +For the hysteresis value, only the first channel is writable. Writing to it +will affect all other values since all channels share the same +hysteresis value. The hysteresis is in degrees Celsius with a range of +0 to 127.5 degrees with a resolution of 0.5 degree.
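The hysteresis range and resolution stated above can be illustrated with a small userspace model. This is only a sketch, not driver code: the function name, the use of millidegrees, and the round-to-nearest behaviour are assumptions made for illustration, and the way the driver maps a sysfs write to the hardware hysteresis is not covered here.

```python
HYST_STEP_MC = 500       # 0.5 degree C resolution, expressed in millidegrees
HYST_MAX_MC = 127_500    # 127.5 degrees C upper bound, in millidegrees


def quantize_hysteresis(requested_mc: int) -> int:
    """Clamp a requested hysteresis to 0..127.5 C and snap it to 0.5 C steps.

    Rounding to the nearest step is an assumption of this sketch.
    """
    clamped = max(0, min(requested_mc, HYST_MAX_MC))
    return (clamped + HYST_STEP_MC // 2) // HYST_STEP_MC * HYST_STEP_MC


# All channels share one hysteresis value, so a single result applies to each.
shared = quantize_hysteresis(1234)
print({f"temp{n}": shared for n in range(1, 5)})
```

Because the value is shared, a request that is out of range or between steps changes what every channel reports, not only the first one.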
+ +The driver exports the temperature values via the following sysfs files: + +**temp[1-4]_input** + +**temp[1-4]_crit** + +**temp[1-4]_crit_alarm** + +**temp[1-4]_crit_hyst** + +The driver reads the shunt voltage from the chip and converts it to current. +The readable range depends on the "ti,pga-gain" property (defaults to 8) and the +shunt resistor value. The value resolution will be equal to 10uV/Rshunt. + +The driver exports the shunt current values via the following sysfs files: + +**curr1_input** + +**curr1_lcrit** + +**curr1_lcrit_alarm** + +**curr1_crit** + +**curr1_crit_alarm** + +The bus voltage is read from the chip with a resolution of 4mV. The chip +can be configured for two different ranges (32V or 16V) using the +ti,bus-range-microvolt property in the devicetree. + +The driver exports the bus voltage values via the following sysfs files: + +**in0_input** + +**in0_lcrit** + +**in0_lcrit_alarm** + +**in0_crit** + +**in0_crit_alarm** + +The bus power and bus current range and resolution depend on the calibration +register value. Those values are calculated by the hardware using these +formulas: + +Current = (ShuntVoltage * CalibrationRegister) / 4096 +Power = (Current * BusVoltage) / 5000 + +The driver exports the bus current and bus power values via the following +sysfs files: + +**curr2_input** + +**power1_input** + +**power1_crit** + +**power1_crit_alarm** + +The calibration process follows the procedure of the datasheet (without overflow) +and depends on the shunt resistor value and the pga_gain value. diff --git a/Documentation/index.rst b/Documentation/index.rst index b843e313d2f2..2ceab197246f 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -135,6 +135,14 @@ needed). mic/index scheduler/index +Architecture-agnostic documentation +----------------------------------- + +.. 
toctree:: + :maxdepth: 2 + + asm-annotations + Architecture-specific documentation ----------------------------------- diff --git a/Documentation/ioctl/ioctl-number.rst b/Documentation/ioctl/ioctl-number.rst index bef79cd4c6b4..4ef86433bd67 100644 --- a/Documentation/ioctl/ioctl-number.rst +++ b/Documentation/ioctl/ioctl-number.rst @@ -233,6 +233,7 @@ Code Seq# Include File Comments 'f' 00-0F fs/ext4/ext4.h conflict! 'f' 00-0F linux/fs.h conflict! 'f' 00-0F fs/ocfs2/ocfs2_fs.h conflict! +'f' 13-27 linux/fscrypt.h 'f' 81-8F linux/fsverity.h 'g' 00-0F linux/usb/gadgetfs.h 'g' 20-2F linux/usb/g_printer.h diff --git a/Documentation/livepatch/index.rst b/Documentation/livepatch/index.rst index 17674a9e21b2..525944063be7 100644 --- a/Documentation/livepatch/index.rst +++ b/Documentation/livepatch/index.rst @@ -12,6 +12,7 @@ Kernel Livepatching cumulative-patches module-elf-format shadow-vars + system-state .. only:: subproject and html diff --git a/Documentation/livepatch/system-state.rst b/Documentation/livepatch/system-state.rst new file mode 100644 index 000000000000..c6d127c2d9aa --- /dev/null +++ b/Documentation/livepatch/system-state.rst @@ -0,0 +1,167 @@ +==================== +System State Changes +==================== + +Some users are really reluctant to reboot a system. This brings the need +to provide more livepatches and maintain some compatibility between them. + +Maintaining more livepatches is much easier with cumulative livepatches. +Each new livepatch completely replaces any older one. It can keep, +add, and even remove fixes. And it is typically safe to replace any version +of the livepatch with any other one thanks to the atomic replace feature. + +The problems might come with shadow variables and callbacks. They might +change the system behavior or state so that it is no longer safe to +go back and use an older livepatch or the original kernel code. 
Also +any new livepatch must be able to detect what changes have already been +done by the already installed livepatches. + +This is where the livepatch system state tracking becomes useful. It +allows you to: + + - store data needed to manipulate and restore the system state + + - define compatibility between livepatches using a change id + and version + + +1. Livepatch system state API +============================= + +The state of the system might get modified either by several livepatch callbacks +or by the newly used code. Also it must be possible to find changes done by +already installed livepatches. + +Each modified state is described by struct klp_state, see +include/linux/livepatch.h. + +Each livepatch defines an array of struct klp_state. It lists +all states that the livepatch modifies. + +The livepatch author must define the following two fields for each +struct klp_state: + + - *id* + + - Non-zero number used to identify the affected system state. + + - *version* + + - Number describing the variant of the system state change that + is supported by the given livepatch. + +The state can be manipulated using two functions: + + - *klp_get_state(patch, id)* + + - Get struct klp_state associated with the given livepatch + and state id. + + - *klp_get_prev_state(id)* + + - Get struct klp_state associated with the given feature id and + already installed livepatches. + +2. Livepatch compatibility +========================== + +The system state version is used to prevent loading incompatible livepatches. +The check is done when the livepatch is enabled. The rules are: + + - Any completely new system state modification is allowed. + + - System state modifications with the same or higher version are allowed + for already modified system states. + + - Cumulative livepatches must handle all system state modifications from + already installed livepatches. + + - Non-cumulative livepatches are allowed to touch already modified + system states. + +3. 
Supported scenarios +====================== + +Livepatches have their life-cycle and the same is true for the system +state changes. Every compatible livepatch has to support the following +scenarios: + + - Modify the system state when the livepatch gets enabled and the state + has not been already modified by livepatches that are being + replaced. + + - Take over or update the system state modification when it has already + been done by a livepatch that is being replaced. + + - Restore the original state when the livepatch is disabled. + + - Restore the previous state when the transition is reverted. + It might be the original system state or the state modification + done by livepatches that were being replaced. + + - Remove any already made changes when an error occurs and the livepatch + cannot get enabled. + +4. Expected usage +================= + +System states are usually modified by livepatch callbacks. The expected +role of each callback is as follows: + +*pre_patch()* + + - Allocate *state->data* when necessary. The allocation might fail + and *pre_patch()* is the only callback that could stop loading + of the livepatch. The allocation is not needed when the data + are already provided by previously installed livepatches. + + - Do any other preparatory action that is needed by + the new code even before the transition gets finished. + For example, initialize *state->data*. + + The system state itself is typically modified in *post_patch()* + when the entire system is able to handle it. + + - Clean up its own mess in case of error. It might be done by + custom code or by calling *post_unpatch()* explicitly. + +*post_patch()* + + - Copy *state->data* from the previous livepatch when they are + compatible. + + - Do the actual system state modification. Eventually allow + the new code to use it. + + - Make sure that *state->data* has all necessary information. + + - Free *state->data* from replaced livepatches when it is + no longer needed.
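The division of work between *pre_patch()* and *post_patch()* listed above can be modelled in userspace. The following is a rough sketch, not kernel code: the `State` class only stands in for struct klp_state, and `system`, `"setting"`, `"saved"` and `MODIFIED_VALUE` are hypothetical names invented for this illustration. In a real livepatch the previous state would be looked up with *klp_get_prev_state()*.

```python
from dataclasses import dataclass
from typing import Optional

MODIFIED_VALUE = 7  # hypothetical value the livepatch wants the setting to have


@dataclass
class State:
    """Userspace stand-in for struct klp_state; not the kernel definition."""
    id: int
    version: int
    data: Optional[dict] = None


def pre_patch(new: State, prev: Optional[State]) -> None:
    """Allocate state data only if no installed livepatch already provides it."""
    if prev is None or prev.data is None:
        new.data = {"saved": None}


def post_patch(new: State, prev: Optional[State], system: dict) -> None:
    """Take over an existing modification, or perform it for the first time."""
    if prev is not None and prev.data is not None:
        # The state was already modified: reuse the data of the replaced patch.
        new.data = prev.data
    else:
        # First modification: remember the original value, then change it.
        new.data["saved"] = system["setting"]
        system["setting"] = MODIFIED_VALUE


system = {"setting": 4}

first = State(id=1, version=1)
pre_patch(first, None)
post_patch(first, None, system)

second = State(id=1, version=2)  # cumulative livepatch replacing `first`
pre_patch(second, first)
post_patch(second, first, system)

print(system["setting"], second.data)
```

The point of the model is that the second livepatch never saves the already modified value as if it were the original one; it inherits the data recorded by the livepatch it replaces.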
+ +*pre_unpatch()* + + - Prevent the code added by the livepatch from relying on the system + state change. + + - Revert the system state modification. + +*post_unpatch()* + + - Distinguish transition reverse and livepatch disabling by + checking *klp_get_prev_state()*. + + - In case of transition reverse, restore the previous system + state. It might mean doing nothing. + + - Remove any setting or data that is no longer needed. + +.. note:: + + *pre_unpatch()* typically does symmetric operations to *post_patch()*, + except that it is called only when the livepatch is being disabled. + Therefore it does not need to care about any previously installed + livepatch. + + *post_unpatch()* typically does symmetric operations to *pre_patch()*. + It might be called also during the transition reverse. Therefore it + has to handle the state of the previously installed livepatches. diff --git a/Documentation/media/cec.h.rst.exceptions b/Documentation/media/cec.h.rst.exceptions index 014816d04b9e..d83790ccac8e 100644 --- a/Documentation/media/cec.h.rst.exceptions +++ b/Documentation/media/cec.h.rst.exceptions @@ -335,6 +335,95 @@ ignore define CEC_OP_MENU_STATE_DEACTIVATED ignore define CEC_MSG_USER_CONTROL_PRESSED +ignore define CEC_OP_UI_CMD_SELECT +ignore define CEC_OP_UI_CMD_UP +ignore define CEC_OP_UI_CMD_DOWN +ignore define CEC_OP_UI_CMD_LEFT +ignore define CEC_OP_UI_CMD_RIGHT +ignore define CEC_OP_UI_CMD_RIGHT_UP +ignore define CEC_OP_UI_CMD_RIGHT_DOWN +ignore define CEC_OP_UI_CMD_LEFT_UP +ignore define CEC_OP_UI_CMD_LEFT_DOWN +ignore define CEC_OP_UI_CMD_DEVICE_ROOT_MENU +ignore define CEC_OP_UI_CMD_DEVICE_SETUP_MENU +ignore define CEC_OP_UI_CMD_CONTENTS_MENU +ignore define CEC_OP_UI_CMD_FAVORITE_MENU +ignore define CEC_OP_UI_CMD_BACK +ignore define CEC_OP_UI_CMD_MEDIA_TOP_MENU +ignore define CEC_OP_UI_CMD_MEDIA_CONTEXT_SENSITIVE_MENU +ignore define CEC_OP_UI_CMD_NUMBER_ENTRY_MODE +ignore define CEC_OP_UI_CMD_NUMBER_11 +ignore define CEC_OP_UI_CMD_NUMBER_12 +ignore define 
CEC_OP_UI_CMD_NUMBER_0_OR_NUMBER_10 +ignore define CEC_OP_UI_CMD_NUMBER_1 +ignore define CEC_OP_UI_CMD_NUMBER_2 +ignore define CEC_OP_UI_CMD_NUMBER_3 +ignore define CEC_OP_UI_CMD_NUMBER_4 +ignore define CEC_OP_UI_CMD_NUMBER_5 +ignore define CEC_OP_UI_CMD_NUMBER_6 +ignore define CEC_OP_UI_CMD_NUMBER_7 +ignore define CEC_OP_UI_CMD_NUMBER_8 +ignore define CEC_OP_UI_CMD_NUMBER_9 +ignore define CEC_OP_UI_CMD_DOT +ignore define CEC_OP_UI_CMD_ENTER +ignore define CEC_OP_UI_CMD_CLEAR +ignore define CEC_OP_UI_CMD_NEXT_FAVORITE +ignore define CEC_OP_UI_CMD_CHANNEL_UP +ignore define CEC_OP_UI_CMD_CHANNEL_DOWN +ignore define CEC_OP_UI_CMD_PREVIOUS_CHANNEL +ignore define CEC_OP_UI_CMD_SOUND_SELECT +ignore define CEC_OP_UI_CMD_INPUT_SELECT +ignore define CEC_OP_UI_CMD_DISPLAY_INFORMATION +ignore define CEC_OP_UI_CMD_HELP +ignore define CEC_OP_UI_CMD_PAGE_UP +ignore define CEC_OP_UI_CMD_PAGE_DOWN +ignore define CEC_OP_UI_CMD_POWER +ignore define CEC_OP_UI_CMD_VOLUME_UP +ignore define CEC_OP_UI_CMD_VOLUME_DOWN +ignore define CEC_OP_UI_CMD_MUTE +ignore define CEC_OP_UI_CMD_PLAY +ignore define CEC_OP_UI_CMD_STOP +ignore define CEC_OP_UI_CMD_PAUSE +ignore define CEC_OP_UI_CMD_RECORD +ignore define CEC_OP_UI_CMD_REWIND +ignore define CEC_OP_UI_CMD_FAST_FORWARD +ignore define CEC_OP_UI_CMD_EJECT +ignore define CEC_OP_UI_CMD_SKIP_FORWARD +ignore define CEC_OP_UI_CMD_SKIP_BACKWARD +ignore define CEC_OP_UI_CMD_STOP_RECORD +ignore define CEC_OP_UI_CMD_PAUSE_RECORD +ignore define CEC_OP_UI_CMD_ANGLE +ignore define CEC_OP_UI_CMD_SUB_PICTURE +ignore define CEC_OP_UI_CMD_VIDEO_ON_DEMAND +ignore define CEC_OP_UI_CMD_ELECTRONIC_PROGRAM_GUIDE +ignore define CEC_OP_UI_CMD_TIMER_PROGRAMMING +ignore define CEC_OP_UI_CMD_INITIAL_CONFIGURATION +ignore define CEC_OP_UI_CMD_SELECT_BROADCAST_TYPE +ignore define CEC_OP_UI_CMD_SELECT_SOUND_PRESENTATION +ignore define CEC_OP_UI_CMD_AUDIO_DESCRIPTION +ignore define CEC_OP_UI_CMD_INTERNET +ignore define CEC_OP_UI_CMD_3D_MODE +ignore define 
CEC_OP_UI_CMD_PLAY_FUNCTION +ignore define CEC_OP_UI_CMD_PAUSE_PLAY_FUNCTION +ignore define CEC_OP_UI_CMD_RECORD_FUNCTION +ignore define CEC_OP_UI_CMD_PAUSE_RECORD_FUNCTION +ignore define CEC_OP_UI_CMD_STOP_FUNCTION +ignore define CEC_OP_UI_CMD_MUTE_FUNCTION +ignore define CEC_OP_UI_CMD_RESTORE_VOLUME_FUNCTION +ignore define CEC_OP_UI_CMD_TUNE_FUNCTION +ignore define CEC_OP_UI_CMD_SELECT_MEDIA_FUNCTION +ignore define CEC_OP_UI_CMD_SELECT_AV_INPUT_FUNCTION +ignore define CEC_OP_UI_CMD_SELECT_AUDIO_INPUT_FUNCTION +ignore define CEC_OP_UI_CMD_POWER_TOGGLE_FUNCTION +ignore define CEC_OP_UI_CMD_POWER_OFF_FUNCTION +ignore define CEC_OP_UI_CMD_POWER_ON_FUNCTION +ignore define CEC_OP_UI_CMD_F1_BLUE +ignore define CEC_OP_UI_CMD_F2_RED +ignore define CEC_OP_UI_CMD_F3_GREEN +ignore define CEC_OP_UI_CMD_F4_YELLOW +ignore define CEC_OP_UI_CMD_F5 +ignore define CEC_OP_UI_CMD_DATA + ignore define CEC_OP_UI_BCAST_TYPE_TOGGLE_ALL ignore define CEC_OP_UI_BCAST_TYPE_TOGGLE_DIG_ANA ignore define CEC_OP_UI_BCAST_TYPE_ANALOGUE diff --git a/Documentation/media/kapi/v4l2-controls.rst b/Documentation/media/kapi/v4l2-controls.rst index ebe2a55908be..b20800cae3f2 100644 --- a/Documentation/media/kapi/v4l2-controls.rst +++ b/Documentation/media/kapi/v4l2-controls.rst @@ -140,6 +140,15 @@ Menu controls with a driver specific menu are added by calling const struct v4l2_ctrl_ops *ops, u32 id, s32 max, s32 skip_mask, s32 def, const char * const *qmenu); +Standard compound controls can be added by calling +:c:func:`v4l2_ctrl_new_std_compound`: + +.. 
code-block:: c + + struct v4l2_ctrl *v4l2_ctrl_new_std_compound(struct v4l2_ctrl_handler *hdl, + const struct v4l2_ctrl_ops *ops, u32 id, + const union v4l2_ctrl_ptr p_def); + Integer menu controls with a driver specific menu can be added by calling :c:func:`v4l2_ctrl_new_int_menu`: diff --git a/Documentation/media/uapi/cec/cec-funcs.rst b/Documentation/media/uapi/cec/cec-funcs.rst index 620590b168c9..dc6da9c639a8 100644 --- a/Documentation/media/uapi/cec/cec-funcs.rst +++ b/Documentation/media/uapi/cec/cec-funcs.rst @@ -24,6 +24,7 @@ Function Reference cec-ioc-adap-g-caps cec-ioc-adap-g-log-addrs cec-ioc-adap-g-phys-addr + cec-ioc-adap-g-conn-info cec-ioc-dqevent cec-ioc-g-mode cec-ioc-receive diff --git a/Documentation/media/uapi/cec/cec-ioc-adap-g-caps.rst b/Documentation/media/uapi/cec/cec-ioc-adap-g-caps.rst index 0c44f31a9b59..76761a98c312 100644 --- a/Documentation/media/uapi/cec/cec-ioc-adap-g-caps.rst +++ b/Documentation/media/uapi/cec/cec-ioc-adap-g-caps.rst @@ -135,8 +135,12 @@ returns the information to the application. The ioctl never fails. - The CEC hardware can monitor CEC pin changes from low to high voltage and vice versa. When in pin monitoring mode the application will receive ``CEC_EVENT_PIN_CEC_LOW`` and ``CEC_EVENT_PIN_CEC_HIGH`` events. + * .. _`CEC-CAP-CONNECTOR-INFO`: - + - ``CEC_CAP_CONNECTOR_INFO`` + - 0x00000100 + - If this capability is set, then :ref:`CEC_ADAP_G_CONNECTOR_INFO` can + be used. Return Value ============ diff --git a/Documentation/media/uapi/cec/cec-ioc-adap-g-conn-info.rst b/Documentation/media/uapi/cec/cec-ioc-adap-g-conn-info.rst new file mode 100644 index 000000000000..a21659d55c6b --- /dev/null +++ b/Documentation/media/uapi/cec/cec-ioc-adap-g-conn-info.rst @@ -0,0 +1,105 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. +.. Copyright 2019 Google LLC +.. +.. 
_CEC_ADAP_G_CONNECTOR_INFO: + +******************************* +ioctl CEC_ADAP_G_CONNECTOR_INFO +******************************* + +Name +==== + +CEC_ADAP_G_CONNECTOR_INFO - Query HDMI connector information + +Synopsis +======== + +.. c:function:: int ioctl( int fd, CEC_ADAP_G_CONNECTOR_INFO, struct cec_connector_info *argp ) + :name: CEC_ADAP_G_CONNECTOR_INFO + +Arguments +========= + +``fd`` + File descriptor returned by :c:func:`open() <cec-open>`. + +``argp`` + + +Description +=========== + +Using this ioctl an application can learn which HDMI connector this CEC +device corresponds to. While calling this ioctl the application should +provide a pointer to a cec_connector_info struct which will be populated +by the kernel with the info provided by the adapter's driver. This ioctl +is only available if the ``CEC_CAP_CONNECTOR_INFO`` capability is set. + +.. tabularcolumns:: |p{1.0cm}|p{4.4cm}|p{2.5cm}|p{9.6cm}| + +.. c:type:: cec_connector_info + +.. flat-table:: struct cec_connector_info + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 1 8 + + * - __u32 + - ``type`` + - The type of connector this adapter is associated with. + * - union + - ``(anonymous)`` + - + * - + - ``struct cec_drm_connector_info`` + - drm + - :ref:`cec-drm-connector-info` + + +.. tabularcolumns:: |p{4.4cm}|p{2.5cm}|p{10.6cm}| + +.. _connector-type: + +.. flat-table:: Connector types + :header-rows: 0 + :stub-columns: 0 + :widths: 3 1 8 + + * .. _`CEC-CONNECTOR-TYPE-NO-CONNECTOR`: + + - ``CEC_CONNECTOR_TYPE_NO_CONNECTOR`` + - 0 + - No connector is associated with the adapter/the information is not + provided by the driver. + * .. _`CEC-CONNECTOR-TYPE-DRM`: + + - ``CEC_CONNECTOR_TYPE_DRM`` + - 1 + - Indicates that a DRM connector is associated with this adapter. + Information about the connector can be found in + :ref:`cec-drm-connector-info`. + +.. tabularcolumns:: |p{4.4cm}|p{2.5cm}|p{10.6cm}| + +.. c:type:: cec_drm_connector_info + +.. _cec-drm-connector-info: + +.. 
flat-table:: struct cec_drm_connector_info + :header-rows: 0 + :stub-columns: 0 + :widths: 3 1 8 + + * .. _`CEC-DRM-CONNECTOR-TYPE-CARD-NO`: + + - __u32 + - ``card_no`` + - DRM card number: the number from a card's path, e.g. 0 in case of + /dev/card0. + * .. _`CEC-DRM-CONNECTOR-TYPE-CONNECTOR_ID`: + + - __u32 + - ``connector_id`` + - DRM connector ID. diff --git a/Documentation/media/uapi/cec/cec-ioc-dqevent.rst b/Documentation/media/uapi/cec/cec-ioc-dqevent.rst index 46a1c99a595e..5e21b1fbfc01 100644 --- a/Documentation/media/uapi/cec/cec-ioc-dqevent.rst +++ b/Documentation/media/uapi/cec/cec-ioc-dqevent.rst @@ -70,6 +70,14 @@ it is guaranteed that the state did change in between the two events. addresses are claimed or if ``phys_addr`` is ``CEC_PHYS_ADDR_INVALID``. If bit 15 is set (``1 << CEC_LOG_ADDR_UNREGISTERED``) then this device has the unregistered logical address. In that case all other bits are 0. + * - __u16 + - ``have_conn_info`` + - If non-zero, then HDMI connector information is available. + This field is only valid if ``CEC_CAP_CONNECTOR_INFO`` is set. If that + capability is set and ``have_conn_info`` is zero, then that indicates + that the HDMI connector device is not instantiated, either because + the HDMI driver is still configuring the device or because the HDMI + device was unbound. .. c:type:: cec_event_lost_msgs diff --git a/Documentation/media/uapi/mediactl/request-api.rst b/Documentation/media/uapi/mediactl/request-api.rst index a74c82d95609..01abe8103bdd 100644 --- a/Documentation/media/uapi/mediactl/request-api.rst +++ b/Documentation/media/uapi/mediactl/request-api.rst @@ -53,8 +53,8 @@ with different configurations in advance, knowing that the configuration will be applied when needed to get the expected result. Configuration values at the time of request completion are also available for reading. 
-Usage -===== +General Usage +------------- The Request API extends the Media Controller API and cooperates with subsystem-specific APIs to support request usage. At the Media Controller diff --git a/Documentation/media/uapi/v4l/biblio.rst b/Documentation/media/uapi/v4l/biblio.rst index ad2ff258afa8..8095f57d3d75 100644 --- a/Documentation/media/uapi/v4l/biblio.rst +++ b/Documentation/media/uapi/v4l/biblio.rst @@ -131,6 +131,15 @@ ITU-T Rec. H.264 Specification (04/2017 Edition) :author: International Telecommunication Union (http://www.itu.ch) +.. _hevc: + +ITU H.265/HEVC +============== + +:title: ITU-T Rec. H.265 | ISO/IEC 23008-2 "High Efficiency Video Coding" + +:author: International Telecommunication Union (http://www.itu.ch), International Organisation for Standardisation (http://www.iso.ch) + .. _jfif: JFIF diff --git a/Documentation/media/uapi/v4l/buffer.rst b/Documentation/media/uapi/v4l/buffer.rst index 1cbd9cde57f3..9149b57728e5 100644 --- a/Documentation/media/uapi/v4l/buffer.rst +++ b/Documentation/media/uapi/v4l/buffer.rst @@ -607,6 +607,19 @@ Buffer Flags applications shall use this flag for output buffers if the data in this buffer has not been created by the CPU but by some DMA-capable unit, in which case caches have not been used. + * .. _`V4L2-BUF-FLAG-M2M-HOLD-CAPTURE-BUF`: + + - ``V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF`` + - 0x00000200 + - Only valid if ``V4L2_BUF_CAP_SUPPORTS_M2M_HOLD_CAPTURE_BUF`` is + set. It is typically used with stateless decoders where multiple + output buffers each decode to a slice of the decoded frame. + Applications can set this flag when queueing the output buffer + to prevent the driver from dequeueing the capture buffer after + the output buffer has been decoded (i.e. the capture buffer is + 'held'). If the timestamp of this output buffer differs from that + of the previous output buffer, then that indicates the start of a + new frame and the previously held capture buffer is dequeued. * .. 
_`V4L2-BUF-FLAG-LAST`: - ``V4L2_BUF_FLAG_LAST`` diff --git a/Documentation/media/uapi/v4l/dev-mem2mem.rst b/Documentation/media/uapi/v4l/dev-mem2mem.rst index caa05f5f6380..70953958cee6 100644 --- a/Documentation/media/uapi/v4l/dev-mem2mem.rst +++ b/Documentation/media/uapi/v4l/dev-mem2mem.rst @@ -46,3 +46,4 @@ devices are given in the following sections. :maxdepth: 1 dev-decoder + dev-stateless-decoder diff --git a/Documentation/media/uapi/v4l/dev-stateless-decoder.rst b/Documentation/media/uapi/v4l/dev-stateless-decoder.rst new file mode 100644 index 000000000000..4a26646eeec5 --- /dev/null +++ b/Documentation/media/uapi/v4l/dev-stateless-decoder.rst @@ -0,0 +1,424 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _stateless_decoder: + +************************************************** +Memory-to-memory Stateless Video Decoder Interface +************************************************** + +A stateless decoder is a decoder that works without retaining any kind of state +between processed frames. This means that each frame is decoded independently +of any previous and future frames, and that the client is responsible for +maintaining the decoding state and providing it to the decoder with each +decoding request. This is in contrast to the stateful video decoder interface, +where the hardware and driver maintain the decoding state and all the client +has to do is to provide the raw encoded stream and dequeue decoded frames in +display order. + +This section describes how user-space ("the client") is expected to communicate +with stateless decoders in order to successfully decode an encoded stream. +Compared to stateful codecs, the decoder/client sequence is simpler, but the +cost of this simplicity is extra complexity in the client which is responsible +for maintaining a consistent decoding state. + +Stateless decoders make use of the :ref:`media-request-api`. 
A stateless +decoder must expose the ``V4L2_BUF_CAP_SUPPORTS_REQUESTS`` capability on its +``OUTPUT`` queue when :c:func:`VIDIOC_REQBUFS` or :c:func:`VIDIOC_CREATE_BUFS` +are invoked. + +Depending on the encoded formats supported by the decoder, a single decoded +frame may be the result of several decode requests (for instance, H.264 streams +with multiple slices per frame). Decoders that support such formats must also +expose the ``V4L2_BUF_CAP_SUPPORTS_M2M_HOLD_CAPTURE_BUF`` capability on their +``OUTPUT`` queue. + +Querying capabilities +===================== + +1. To enumerate the set of coded formats supported by the decoder, the client + calls :c:func:`VIDIOC_ENUM_FMT` on the ``OUTPUT`` queue. + + * The driver must always return the full set of supported ``OUTPUT`` formats, + irrespective of the format currently set on the ``CAPTURE`` queue. + + * Simultaneously, the driver must restrain the set of values returned by + codec-specific capability controls (such as H.264 profiles) to the set + actually supported by the hardware. + +2. To enumerate the set of supported raw formats, the client calls + :c:func:`VIDIOC_ENUM_FMT` on the ``CAPTURE`` queue. + + * The driver must return only the formats supported for the format currently + active on the ``OUTPUT`` queue. + + * Depending on the currently set ``OUTPUT`` format, the set of supported raw + formats may depend on the value of some codec-dependent controls. + The client is responsible for making sure that these controls are set + before querying the ``CAPTURE`` queue. Failure to do so will result in the + default values for these controls being used, and a returned set of formats + that may not be usable for the media the client is trying to decode. + +3. The client may use :c:func:`VIDIOC_ENUM_FRAMESIZES` to detect supported + resolutions for a given format, passing desired pixel format in + :c:type:`v4l2_frmsizeenum`'s ``pixel_format``. + +4. 
Supported profiles and levels for the current ``OUTPUT`` format, if + applicable, may be queried using their respective controls via + :c:func:`VIDIOC_QUERYCTRL`. + +Initialization +============== + +1. Set the coded format on the ``OUTPUT`` queue via :c:func:`VIDIOC_S_FMT`. + + * **Required fields:** + + ``type`` + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``. + + ``pixelformat`` + a coded pixel format. + + ``width``, ``height`` + coded width and height parsed from the stream. + + other fields + follow standard semantics. + + .. note:: + + Changing the ``OUTPUT`` format may change the currently set ``CAPTURE`` + format. The driver will derive a new ``CAPTURE`` format from the + ``OUTPUT`` format being set, including resolution, colorimetry + parameters, etc. If the client needs a specific ``CAPTURE`` format, + it must adjust it afterwards. + +2. Call :c:func:`VIDIOC_S_EXT_CTRLS` to set all the controls (parsed headers, + etc.) required by the ``OUTPUT`` format to enumerate the ``CAPTURE`` formats. + +3. Call :c:func:`VIDIOC_G_FMT` for ``CAPTURE`` queue to get the format for the + destination buffers parsed/decoded from the bytestream. + + * **Required fields:** + + ``type`` + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``. + + * **Returned fields:** + + ``width``, ``height`` + frame buffer resolution for the decoded frames. + + ``pixelformat`` + pixel format for decoded frames. + + ``num_planes`` (for _MPLANE ``type`` only) + number of planes for pixelformat. + + ``sizeimage``, ``bytesperline`` + as per standard semantics; matching frame buffer format. + + .. note:: + + The value of ``pixelformat`` may be any pixel format supported for the + ``OUTPUT`` format, based on the hardware capabilities. It is suggested + that the driver chooses the preferred/optimal format for the current + configuration. For example, a YUV format may be preferred over an RGB + format, if an additional conversion step would be required for RGB. + +4. 
*[optional]* Enumerate ``CAPTURE`` formats via :c:func:`VIDIOC_ENUM_FMT` on + the ``CAPTURE`` queue. The client may use this ioctl to discover which + alternative raw formats are supported for the current ``OUTPUT`` format and + select one of them via :c:func:`VIDIOC_S_FMT`. + + .. note:: + + The driver will return only formats supported for the currently selected + ``OUTPUT`` format and currently set controls, even if more formats may be + supported by the decoder in general. + + For example, a decoder may support YUV and RGB formats for + resolutions 1920x1088 and lower, but only YUV for higher resolutions (due + to hardware limitations). After setting a resolution of 1920x1088 or lower + as the ``OUTPUT`` format, :c:func:`VIDIOC_ENUM_FMT` may return a set of + YUV and RGB pixel formats, but after setting a resolution higher than + 1920x1088, the driver will not return RGB pixel formats, since they are + unsupported for this resolution. + +5. *[optional]* Choose a different ``CAPTURE`` format than suggested via + :c:func:`VIDIOC_S_FMT` on ``CAPTURE`` queue. It is possible for the client to + choose a different format than selected/suggested by the driver in + :c:func:`VIDIOC_G_FMT`. + + * **Required fields:** + + ``type`` + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``. + + ``pixelformat`` + a raw pixel format. + + ``width``, ``height`` + frame buffer resolution of the decoded stream; typically unchanged from + what was returned with :c:func:`VIDIOC_G_FMT`, but it may be different + if the hardware supports composition and/or scaling. + + After performing this step, the client must perform step 3 again in order + to obtain up-to-date information about the buffers size and layout. + +6. Allocate source (bytestream) buffers via :c:func:`VIDIOC_REQBUFS` on + ``OUTPUT`` queue. + + * **Required fields:** + + ``count`` + requested number of buffers to allocate; greater than zero. + + ``type`` + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``. 
+ + ``memory`` + follows standard semantics. + + * **Return fields:** + + ``count`` + actual number of buffers allocated. + + * If required, the driver will adjust ``count`` to be equal or bigger to the + minimum of required number of ``OUTPUT`` buffers for the given format and + requested count. The client must check this value after the ioctl returns + to get the actual number of buffers allocated. + +7. Allocate destination (raw format) buffers via :c:func:`VIDIOC_REQBUFS` on the + ``CAPTURE`` queue. + + * **Required fields:** + + ``count`` + requested number of buffers to allocate; greater than zero. The client + is responsible for deducing the minimum number of buffers required + for the stream to be properly decoded (taking e.g. reference frames + into account) and pass an equal or bigger number. + + ``type`` + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``. + + ``memory`` + follows standard semantics. ``V4L2_MEMORY_USERPTR`` is not supported + for ``CAPTURE`` buffers. + + * **Return fields:** + + ``count`` + adjusted to allocated number of buffers, in case the codec requires + more buffers than requested. + + * The driver must adjust count to the minimum of required number of + ``CAPTURE`` buffers for the current format, stream configuration and + requested count. The client must check this value after the ioctl + returns to get the number of buffers allocated. + +8. Allocate requests (likely one per ``OUTPUT`` buffer) via + :c:func:`MEDIA_IOC_REQUEST_ALLOC` on the media device. + +9. Start streaming on both ``OUTPUT`` and ``CAPTURE`` queues via + :c:func:`VIDIOC_STREAMON`. + +Decoding +======== + +For each frame, the client is responsible for submitting at least one request to +which the following is attached: + +* The amount of encoded data expected by the codec for its current + configuration, as a buffer submitted to the ``OUTPUT`` queue. 
Typically, this + corresponds to one frame worth of encoded data, but some formats may allow (or + require) different amounts per unit. +* All the metadata needed to decode the submitted encoded data, in the form of + controls relevant to the format being decoded. + +The amount of data and contents of the source ``OUTPUT`` buffer, as well as the +controls that must be set on the request, depend on the active coded pixel +format and might be affected by codec-specific extended controls, as stated in +documentation of each format. + +If there is a possibility that the decoded frame will require one or more +decode requests after the current one in order to be produced, then the client +must set the ``V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF`` flag on the ``OUTPUT`` +buffer. This will result in the (potentially partially) decoded ``CAPTURE`` +buffer not being made available for dequeueing, and reused for the next decode +request if the timestamp of the next ``OUTPUT`` buffer has not changed. + +A typical frame would thus be decoded using the following sequence: + +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bytestream data for + the decoding request, using :c:func:`VIDIOC_QBUF`. + + * **Required fields:** + + ``index`` + index of the buffer being queued. + + ``type`` + type of the buffer. + + ``bytesused`` + number of bytes taken by the encoded data frame in the buffer. + + ``flags`` + the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set. Additionally, if + we are not sure that the current decode request is the last one needed + to produce a fully decoded frame, then + ``V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF`` must also be set. + + ``request_fd`` + must be set to the file descriptor of the decoding request. + + ``timestamp`` + must be set to a unique value per frame. This value will be propagated + into the decoded frame's buffer and can also be used to use this frame + as the reference of another. 
If using multiple decode requests per + frame, then the timestamps of all the ``OUTPUT`` buffers for a given + frame must be identical. If the timestamp changes, then the currently + held ``CAPTURE`` buffer will be made available for dequeuing and the + current request will work on a new ``CAPTURE`` buffer. + +2. Set the codec-specific controls for the decoding request, using + :c:func:`VIDIOC_S_EXT_CTRLS`. + + * **Required fields:** + + ``which`` + must be ``V4L2_CTRL_WHICH_REQUEST_VAL``. + + ``request_fd`` + must be set to the file descriptor of the decoding request. + + other fields + other fields are set as usual when setting controls. The ``controls`` + array must contain all the codec-specific controls required to decode + a frame. + + .. note:: + + It is possible to specify the controls in different invocations of + :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as + long as ``request_fd`` and ``which`` are properly set. The controls state + at the moment of request submission is the one that will be considered. + + .. note:: + + The order in which steps 1 and 2 take place is interchangeable. + +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the + request FD. + + If the request is submitted without an ``OUTPUT`` buffer, or if some of the + required controls are missing from the request, then + :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one + ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``. + :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no + ``CAPTURE`` buffer will be produced for this request. + +``CAPTURE`` buffers must not be part of the request, and are queued +independently. They are returned in decode order (i.e. the same order as coded +frames were submitted to the ``OUTPUT`` queue). + +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. 
If a decoded reference frame has an +error, then all following decoded frames that refer to it also have the +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to +produce (likely corrupted) frames. + +Buffer management while decoding +================================ +Contrary to stateful decoders, a stateless decoder does not perform any kind of +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be +used by the client for as long as they are not queued again. "Used" here +encompasses using the buffer for compositing or display. + +A dequeued capture buffer can also be used as the reference frame of another +buffer. + +A frame is specified as reference by converting its timestamp into nanoseconds, +and storing it into the relevant member of a codec-dependent control structure. +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that +conversion. The timestamp of a frame can be used to reference it as soon as all +its units of encoded data are successfully submitted to the ``OUTPUT`` queue. + +A decoded buffer containing a reference frame must not be reused as a decoding +target until all the frames referencing it have been decoded. The safest way to +achieve this is to refrain from queueing a reference buffer until all the +decoded frames referencing it have been dequeued. However, if the driver can +guarantee that buffers queued to the ``CAPTURE`` queue are processed in queued +order, then user-space can take advantage of this guarantee and queue a +reference buffer when the following conditions are met: + +1. All the requests for frames affected by the reference frame have been + queued, and + +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded + referencing frames have been queued. + +When queuing a decoding request, the driver will increase the reference count of +all the resources associated with reference frames. This means that the client +can e.g. 
close the DMABUF file descriptors of reference frame buffers if it +won't need them afterwards. + +Seeking +======= +In order to seek, the client just needs to submit requests using input buffers +corresponding to the new stream position. It must however be aware that +resolution may have changed and follow the dynamic resolution change sequence in +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS +for H.264) may have changed and the client is responsible for making sure that a +valid state is sent to the decoder. + +The client is then free to ignore any returned ``CAPTURE`` buffer that comes +from the pre-seek position. + +Pausing +======= + +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT`` +queue. Without source bytestream data, there is no data to process and the codec +will remain idle. + +Dynamic resolution change +========================= + +If the client detects a resolution change in the stream, it will need to perform +the initialization sequence again with the new resolution: + +1. If the last submitted request resulted in a ``CAPTURE`` buffer being + held by the use of the ``V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF`` flag, then the + last frame is not available on the ``CAPTURE`` queue. In this case, a + ``V4L2_DEC_CMD_FLUSH`` command shall be sent. This will make the driver + dequeue the held ``CAPTURE`` buffer. + +2. Wait until all submitted requests have completed and dequeue the + corresponding output buffers. + +3. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE`` + queues. + +4. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the + ``CAPTURE`` queue with a buffer count of zero. + +5. Perform the initialization sequence again (minus the allocation of + ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue. + Note that due to resolution constraints, a different format may need to be + picked on the ``CAPTURE`` queue. 
+ +Drain +===== + +If the last submitted request resulted in a ``CAPTURE`` buffer being +held by the use of the ``V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF`` flag, then the +last frame is not available on the ``CAPTURE`` queue. In this case, a +``V4L2_DEC_CMD_FLUSH`` command shall be sent. This will make the driver +dequeue the held ``CAPTURE`` buffer. + +After that, in order to drain the stream on a stateless decoder, the client +just needs to wait until all the submitted requests are completed. diff --git a/Documentation/media/uapi/v4l/ext-ctrls-codec.rst b/Documentation/media/uapi/v4l/ext-ctrls-codec.rst index bc5dd8e76567..28313c0f4e7c 100644 --- a/Documentation/media/uapi/v4l/ext-ctrls-codec.rst +++ b/Documentation/media/uapi/v4l/ext-ctrls-codec.rst @@ -1713,10 +1713,14 @@ enum v4l2_mpeg_video_h264_hierarchical_coding_type - * - __u8 - ``scaling_list_4x4[6][16]`` - - + - Scaling matrix after applying the inverse scanning process. + Expected list order is Intra Y, Intra Cb, Intra Cr, Inter Y, + Inter Cb, Inter Cr. * - __u8 - ``scaling_list_8x8[6][64]`` - - + - Scaling matrix after applying the inverse scanning process. + Expected list order is Intra Y, Inter Y, Intra Cb, Inter Cb, + Intra Cr, Inter Cr. ``V4L2_CID_MPEG_VIDEO_H264_SLICE_PARAMS (struct)`` Specifies the slice parameters (as extracted from the bitstream) @@ -1796,7 +1800,7 @@ enum v4l2_mpeg_video_h264_hierarchical_coding_type - - * - __u32 - ``dec_ref_pic_marking_bit_size`` - - + - Size in bits of the dec_ref_pic_marking() syntax element. * - __u32 - ``pic_order_cnt_bit_size`` - @@ -1820,10 +1824,12 @@ enum v4l2_mpeg_video_h264_hierarchical_coding_type - - * - __u8 - ``num_ref_idx_l0_active_minus1`` - - + - If num_ref_idx_active_override_flag is not set, this field must be + set to the value of num_ref_idx_l0_default_active_minus1. 
* - __u8 - ``num_ref_idx_l1_active_minus1`` - - + - If num_ref_idx_active_override_flag is not set, this field must be + set to the value of num_ref_idx_l1_default_active_minus1. * - __u32 - ``slice_group_change_cycle`` - @@ -1983,9 +1989,9 @@ enum v4l2_mpeg_video_h264_hierarchical_coding_type - - ``reference_ts`` - Timestamp of the V4L2 capture buffer to use as reference, used with B-coded and P-coded frames. The timestamp refers to the - ``timestamp`` field in struct :c:type:`v4l2_buffer`. Use the - :c:func:`v4l2_timeval_to_ns()` function to convert the struct - :c:type:`timeval` in struct :c:type:`v4l2_buffer` to a __u64. + ``timestamp`` field in struct :c:type:`v4l2_buffer`. Use the + :c:func:`v4l2_timeval_to_ns()` function to convert the struct + :c:type:`timeval` in struct :c:type:`v4l2_buffer` to a __u64. * - __u16 - ``frame_num`` - @@ -3693,3 +3699,550 @@ enum v4l2_mpeg_video_hevc_size_of_length_field - Indicates whether to generate SPS and PPS at every IDR. Setting it to 0 disables generating SPS and PPS at every IDR. Setting it to one enables generating SPS and PPS at every IDR. + +.. _v4l2-mpeg-hevc: + +``V4L2_CID_MPEG_VIDEO_HEVC_SPS (struct)`` + Specifies the Sequence Parameter Set fields (as extracted from the + bitstream) for the associated HEVC slice data. + These bitstream parameters are defined according to :ref:`hevc`. + They are described in section 7.4.3.2 "Sequence parameter set RBSP + semantics" of the specification. + +.. c:type:: v4l2_ctrl_hevc_sps + +.. cssclass:: longtable + +.. 
flat-table:: struct v4l2_ctrl_hevc_sps + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - __u16 + - ``pic_width_in_luma_samples`` + - + * - __u16 + - ``pic_height_in_luma_samples`` + - + * - __u8 + - ``bit_depth_luma_minus8`` + - + * - __u8 + - ``bit_depth_chroma_minus8`` + - + * - __u8 + - ``log2_max_pic_order_cnt_lsb_minus4`` + - + * - __u8 + - ``sps_max_dec_pic_buffering_minus1`` + - + * - __u8 + - ``sps_max_num_reorder_pics`` + - + * - __u8 + - ``sps_max_latency_increase_plus1`` + - + * - __u8 + - ``log2_min_luma_coding_block_size_minus3`` + - + * - __u8 + - ``log2_diff_max_min_luma_coding_block_size`` + - + * - __u8 + - ``log2_min_luma_transform_block_size_minus2`` + - + * - __u8 + - ``log2_diff_max_min_luma_transform_block_size`` + - + * - __u8 + - ``max_transform_hierarchy_depth_inter`` + - + * - __u8 + - ``max_transform_hierarchy_depth_intra`` + - + * - __u8 + - ``pcm_sample_bit_depth_luma_minus1`` + - + * - __u8 + - ``pcm_sample_bit_depth_chroma_minus1`` + - + * - __u8 + - ``log2_min_pcm_luma_coding_block_size_minus3`` + - + * - __u8 + - ``log2_diff_max_min_pcm_luma_coding_block_size`` + - + * - __u8 + - ``num_short_term_ref_pic_sets`` + - + * - __u8 + - ``num_long_term_ref_pics_sps`` + - + * - __u8 + - ``chroma_format_idc`` + - + * - __u64 + - ``flags`` + - See :ref:`Sequence Parameter Set Flags <hevc_sps_flags>` + +.. _hevc_sps_flags: + +``Sequence Parameter Set Flags`` + +.. cssclass:: longtable + +.. 
flat-table:: + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - ``V4L2_HEVC_SPS_FLAG_SEPARATE_COLOUR_PLANE`` + - 0x00000001 + - + * - ``V4L2_HEVC_SPS_FLAG_SCALING_LIST_ENABLED`` + - 0x00000002 + - + * - ``V4L2_HEVC_SPS_FLAG_AMP_ENABLED`` + - 0x00000004 + - + * - ``V4L2_HEVC_SPS_FLAG_SAMPLE_ADAPTIVE_OFFSET`` + - 0x00000008 + - + * - ``V4L2_HEVC_SPS_FLAG_PCM_ENABLED`` + - 0x00000010 + - + * - ``V4L2_HEVC_SPS_FLAG_PCM_LOOP_FILTER_DISABLED`` + - 0x00000020 + - + * - ``V4L2_HEVC_SPS_FLAG_LONG_TERM_REF_PICS_PRESENT`` + - 0x00000040 + - + * - ``V4L2_HEVC_SPS_FLAG_SPS_TEMPORAL_MVP_ENABLED`` + - 0x00000080 + - + * - ``V4L2_HEVC_SPS_FLAG_STRONG_INTRA_SMOOTHING_ENABLED`` + - 0x00000100 + - + +``V4L2_CID_MPEG_VIDEO_HEVC_PPS (struct)`` + Specifies the Picture Parameter Set fields (as extracted from the + bitstream) for the associated HEVC slice data. + These bitstream parameters are defined according to :ref:`hevc`. + They are described in section 7.4.3.3 "Picture parameter set RBSP + semantics" of the specification. + +.. c:type:: v4l2_ctrl_hevc_pps + +.. cssclass:: longtable + +.. flat-table:: struct v4l2_ctrl_hevc_pps + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - __u8 + - ``num_extra_slice_header_bits`` + - + * - __s8 + - ``init_qp_minus26`` + - + * - __u8 + - ``diff_cu_qp_delta_depth`` + - + * - __s8 + - ``pps_cb_qp_offset`` + - + * - __s8 + - ``pps_cr_qp_offset`` + - + * - __u8 + - ``num_tile_columns_minus1`` + - + * - __u8 + - ``num_tile_rows_minus1`` + - + * - __u8 + - ``column_width_minus1[20]`` + - + * - __u8 + - ``row_height_minus1[22]`` + - + * - __s8 + - ``pps_beta_offset_div2`` + - + * - __s8 + - ``pps_tc_offset_div2`` + - + * - __u8 + - ``log2_parallel_merge_level_minus2`` + - + * - __u8 + - ``padding[4]`` + - Applications and drivers must set this to zero. + * - __u64 + - ``flags`` + - See :ref:`Picture Parameter Set Flags <hevc_pps_flags>` + +.. _hevc_pps_flags: + +``Picture Parameter Set Flags`` + +.. cssclass:: longtable + +.. 
flat-table:: + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - ``V4L2_HEVC_PPS_FLAG_DEPENDENT_SLICE_SEGMENT`` + - 0x00000001 + - + * - ``V4L2_HEVC_PPS_FLAG_OUTPUT_FLAG_PRESENT`` + - 0x00000002 + - + * - ``V4L2_HEVC_PPS_FLAG_SIGN_DATA_HIDING_ENABLED`` + - 0x00000004 + - + * - ``V4L2_HEVC_PPS_FLAG_CABAC_INIT_PRESENT`` + - 0x00000008 + - + * - ``V4L2_HEVC_PPS_FLAG_CONSTRAINED_INTRA_PRED`` + - 0x00000010 + - + * - ``V4L2_HEVC_PPS_FLAG_TRANSFORM_SKIP_ENABLED`` + - 0x00000020 + - + * - ``V4L2_HEVC_PPS_FLAG_CU_QP_DELTA_ENABLED`` + - 0x00000040 + - + * - ``V4L2_HEVC_PPS_FLAG_PPS_SLICE_CHROMA_QP_OFFSETS_PRESENT`` + - 0x00000080 + - + * - ``V4L2_HEVC_PPS_FLAG_WEIGHTED_PRED`` + - 0x00000100 + - + * - ``V4L2_HEVC_PPS_FLAG_WEIGHTED_BIPRED`` + - 0x00000200 + - + * - ``V4L2_HEVC_PPS_FLAG_TRANSQUANT_BYPASS_ENABLED`` + - 0x00000400 + - + * - ``V4L2_HEVC_PPS_FLAG_TILES_ENABLED`` + - 0x00000800 + - + * - ``V4L2_HEVC_PPS_FLAG_ENTROPY_CODING_SYNC_ENABLED`` + - 0x00001000 + - + * - ``V4L2_HEVC_PPS_FLAG_LOOP_FILTER_ACROSS_TILES_ENABLED`` + - 0x00002000 + - + * - ``V4L2_HEVC_PPS_FLAG_PPS_LOOP_FILTER_ACROSS_SLICES_ENABLED`` + - 0x00004000 + - + * - ``V4L2_HEVC_PPS_FLAG_DEBLOCKING_FILTER_OVERRIDE_ENABLED`` + - 0x00008000 + - + * - ``V4L2_HEVC_PPS_FLAG_PPS_DISABLE_DEBLOCKING_FILTER`` + - 0x00010000 + - + * - ``V4L2_HEVC_PPS_FLAG_LISTS_MODIFICATION_PRESENT`` + - 0x00020000 + - + * - ``V4L2_HEVC_PPS_FLAG_SLICE_SEGMENT_HEADER_EXTENSION_PRESENT`` + - 0x00040000 + - + +``V4L2_CID_MPEG_VIDEO_HEVC_SLICE_PARAMS (struct)`` + Specifies various slice-specific parameters, especially from the NAL unit + header, general slice segment header and weighted prediction parameter + parts of the bitstream. + These bitstream parameters are defined according to :ref:`hevc`. + They are described in section 7.4.7 "General slice segment header + semantics" of the specification. + +.. c:type:: v4l2_ctrl_hevc_slice_params + +.. cssclass:: longtable + +.. 
flat-table:: struct v4l2_ctrl_hevc_slice_params + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - __u32 + - ``bit_size`` + - Size (in bits) of the current slice data. + * - __u32 + - ``data_bit_offset`` + - Offset (in bits) to the video data in the current slice data. + * - __u8 + - ``nal_unit_type`` + - + * - __u8 + - ``nuh_temporal_id_plus1`` + - + * - __u8 + - ``slice_type`` + - + (V4L2_HEVC_SLICE_TYPE_I, V4L2_HEVC_SLICE_TYPE_P or + V4L2_HEVC_SLICE_TYPE_B). + * - __u8 + - ``colour_plane_id`` + - + * - __u16 + - ``slice_pic_order_cnt`` + - + * - __u8 + - ``num_ref_idx_l0_active_minus1`` + - + * - __u8 + - ``num_ref_idx_l1_active_minus1`` + - + * - __u8 + - ``collocated_ref_idx`` + - + * - __u8 + - ``five_minus_max_num_merge_cand`` + - + * - __s8 + - ``slice_qp_delta`` + - + * - __s8 + - ``slice_cb_qp_offset`` + - + * - __s8 + - ``slice_cr_qp_offset`` + - + * - __s8 + - ``slice_act_y_qp_offset`` + - + * - __s8 + - ``slice_act_cb_qp_offset`` + - + * - __s8 + - ``slice_act_cr_qp_offset`` + - + * - __s8 + - ``slice_beta_offset_div2`` + - + * - __s8 + - ``slice_tc_offset_div2`` + - + * - __u8 + - ``pic_struct`` + - + * - __u8 + - ``num_active_dpb_entries`` + - The number of entries in ``dpb``. + * - __u8 + - ``ref_idx_l0[V4L2_HEVC_DPB_ENTRIES_NUM_MAX]`` + - The list of L0 reference elements as indices in the DPB. + * - __u8 + - ``ref_idx_l1[V4L2_HEVC_DPB_ENTRIES_NUM_MAX]`` + - The list of L1 reference elements as indices in the DPB. + * - __u8 + - ``num_rps_poc_st_curr_before`` + - The number of reference pictures in the short-term set that come before + the current frame. + * - __u8 + - ``num_rps_poc_st_curr_after`` + - The number of reference pictures in the short-term set that come after + the current frame. + * - __u8 + - ``num_rps_poc_lt_curr`` + - The number of reference pictures in the long-term set. + * - __u8 + - ``padding[7]`` + - Applications and drivers must set this to zero. 
+ * - struct :c:type:`v4l2_hevc_dpb_entry` + - ``dpb[V4L2_HEVC_DPB_ENTRIES_NUM_MAX]`` + - The decoded picture buffer, for meta-data about reference frames. + * - struct :c:type:`v4l2_hevc_pred_weight_table` + - ``pred_weight_table`` + - The prediction weight coefficients for inter-picture prediction. + * - __u64 + - ``flags`` + - See :ref:`Slice Parameters Flags <hevc_slice_params_flags>` + +.. _hevc_slice_params_flags: + +``Slice Parameters Flags`` + +.. cssclass:: longtable + +.. flat-table:: + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_SLICE_SAO_LUMA`` + - 0x00000001 + - + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_SLICE_SAO_CHROMA`` + - 0x00000002 + - + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_SLICE_TEMPORAL_MVP_ENABLED`` + - 0x00000004 + - + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_MVD_L1_ZERO`` + - 0x00000008 + - + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_CABAC_INIT`` + - 0x00000010 + - + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_COLLOCATED_FROM_L0`` + - 0x00000020 + - + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_USE_INTEGER_MV`` + - 0x00000040 + - + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_SLICE_DEBLOCKING_FILTER_DISABLED`` + - 0x00000080 + - + * - ``V4L2_HEVC_SLICE_PARAMS_FLAG_SLICE_LOOP_FILTER_ACROSS_SLICES_ENABLED`` + - 0x00000100 + - + +.. c:type:: v4l2_hevc_dpb_entry + +.. cssclass:: longtable + +.. flat-table:: struct v4l2_hevc_dpb_entry + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - __u64 + - ``timestamp`` + - Timestamp of the V4L2 capture buffer to use as reference, used + with B-coded and P-coded frames. The timestamp refers to the + ``timestamp`` field in struct :c:type:`v4l2_buffer`. Use the + :c:func:`v4l2_timeval_to_ns()` function to convert the struct + :c:type:`timeval` in struct :c:type:`v4l2_buffer` to a __u64. 
+ * - __u8 + - ``rps`` + - The reference set for the reference frame + (V4L2_HEVC_DPB_ENTRY_RPS_ST_CURR_BEFORE, + V4L2_HEVC_DPB_ENTRY_RPS_ST_CURR_AFTER or + V4L2_HEVC_DPB_ENTRY_RPS_LT_CURR) + * - __u8 + - ``field_pic`` + - Whether the reference is a field picture or a frame. + * - __u16 + - ``pic_order_cnt[2]`` + - The picture order count of the reference. Only the first element of the + array is used for frame pictures, while the first element identifies the + top field and the second the bottom field in field-coded pictures. + * - __u8 + - ``padding[2]`` + - Applications and drivers must set this to zero. + +.. c:type:: v4l2_hevc_pred_weight_table + +.. cssclass:: longtable + +.. flat-table:: struct v4l2_hevc_pred_weight_table + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - __u8 + - ``luma_log2_weight_denom`` + - + * - __s8 + - ``delta_chroma_log2_weight_denom`` + - + * - __s8 + - ``delta_luma_weight_l0[V4L2_HEVC_DPB_ENTRIES_NUM_MAX]`` + - + * - __s8 + - ``luma_offset_l0[V4L2_HEVC_DPB_ENTRIES_NUM_MAX]`` + - + * - __s8 + - ``delta_chroma_weight_l0[V4L2_HEVC_DPB_ENTRIES_NUM_MAX][2]`` + - + * - __s8 + - ``chroma_offset_l0[V4L2_HEVC_DPB_ENTRIES_NUM_MAX][2]`` + - + * - __s8 + - ``delta_luma_weight_l1[V4L2_HEVC_DPB_ENTRIES_NUM_MAX]`` + - + * - __s8 + - ``luma_offset_l1[V4L2_HEVC_DPB_ENTRIES_NUM_MAX]`` + - + * - __s8 + - ``delta_chroma_weight_l1[V4L2_HEVC_DPB_ENTRIES_NUM_MAX][2]`` + - + * - __s8 + - ``chroma_offset_l1[V4L2_HEVC_DPB_ENTRIES_NUM_MAX][2]`` + - + * - __u8 + - ``padding[6]`` + - Applications and drivers must set this to zero. + +``V4L2_CID_MPEG_VIDEO_HEVC_DECODE_MODE (enum)`` + Specifies the decoding mode to use. Currently exposes slice-based and + frame-based decoding but new modes might be added later on. + This control is used as a modifier for V4L2_PIX_FMT_HEVC_SLICE + pixel format. Applications that support V4L2_PIX_FMT_HEVC_SLICE + are required to set this control in order to specify the decoding mode + that is expected for the buffer. 
+ Drivers may expose a single or multiple decoding modes, depending + on what they can support. + + .. note:: + + This menu control is not yet part of the public kernel API and + it is expected to change. + +.. c:type:: v4l2_mpeg_video_hevc_decode_mode + +.. cssclass:: longtable + +.. flat-table:: + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - ``V4L2_MPEG_VIDEO_HEVC_DECODE_MODE_SLICE_BASED`` + - 0 + - Decoding is done at the slice granularity. + The OUTPUT buffer must contain a single slice. + * - ``V4L2_MPEG_VIDEO_HEVC_DECODE_MODE_FRAME_BASED`` + - 1 + - Decoding is done at the frame granularity. + The OUTPUT buffer must contain all slices needed to decode the + frame. The OUTPUT buffer must also contain both fields. + +``V4L2_CID_MPEG_VIDEO_HEVC_START_CODE (enum)`` + Specifies the HEVC slice start code expected for each slice. + This control is used as a modifier for V4L2_PIX_FMT_HEVC_SLICE + pixel format. Applications that support V4L2_PIX_FMT_HEVC_SLICE + are required to set this control in order to specify the start code + that is expected for the buffer. + Drivers may expose a single or multiple start codes, depending + on what they can support. + + .. note:: + + This menu control is not yet part of the public kernel API and + it is expected to change. + +.. c:type:: v4l2_mpeg_video_hevc_start_code + +.. cssclass:: longtable + +.. flat-table:: + :header-rows: 0 + :stub-columns: 0 + :widths: 1 1 2 + + * - ``V4L2_MPEG_VIDEO_HEVC_START_CODE_NONE`` + - 0 + - Selecting this value specifies that HEVC slices are passed + to the driver without any start code. + * - ``V4L2_MPEG_VIDEO_HEVC_START_CODE_ANNEX_B`` + - 1 + - Selecting this value specifies that HEVC slices are expected + to be prefixed by Annex B start codes. According to :ref:`hevc` + valid start codes can be 3-bytes 0x000001 or 4-bytes 0x00000001. 
diff --git a/Documentation/media/uapi/v4l/ext-ctrls-flash.rst b/Documentation/media/uapi/v4l/ext-ctrls-flash.rst index eff056b17167..b9a6b08fbf32 100644 --- a/Documentation/media/uapi/v4l/ext-ctrls-flash.rst +++ b/Documentation/media/uapi/v4l/ext-ctrls-flash.rst @@ -98,7 +98,7 @@ Flash Control IDs V4L2_CID_FLASH_STROBE control. * - ``V4L2_FLASH_STROBE_SOURCE_EXTERNAL`` - The flash strobe is triggered by an external source. Typically - this is a sensor, which makes it possible to synchronises the + this is a sensor, which makes it possible to synchronise the flash strobe start to exposure start. diff --git a/Documentation/media/uapi/v4l/ext-ctrls-image-source.rst b/Documentation/media/uapi/v4l/ext-ctrls-image-source.rst index 2c3ab5796d76..2d3e2b83d6dd 100644 --- a/Documentation/media/uapi/v4l/ext-ctrls-image-source.rst +++ b/Documentation/media/uapi/v4l/ext-ctrls-image-source.rst @@ -55,3 +55,13 @@ Image Source Control IDs ``V4L2_CID_TEST_PATTERN_GREENB (integer)`` Test pattern green (next to blue) colour component. + +``V4L2_CID_UNIT_CELL_SIZE (struct)`` + This control returns the unit cell size in nanometers. The struct + :c:type:`v4l2_area` provides the width and the height in separate + fields to take into consideration asymmetric pixels. + This control does not take into consideration any possible hardware + binning. + The unit cell consists of the whole area of the pixel, sensitive and + non-sensitive. + This control is required for automatic calibration of sensors/cameras. diff --git a/Documentation/media/uapi/v4l/meta-formats.rst b/Documentation/media/uapi/v4l/meta-formats.rst index b10ca9ee3968..74c8659ee9d6 100644 --- a/Documentation/media/uapi/v4l/meta-formats.rst +++ b/Documentation/media/uapi/v4l/meta-formats.rst @@ -24,3 +24,4 @@ These formats are used for the :ref:`metadata` interface only. 
pixfmt-meta-uvc pixfmt-meta-vsp1-hgo pixfmt-meta-vsp1-hgt + pixfmt-meta-vivid diff --git a/Documentation/media/uapi/v4l/pixfmt-compressed.rst b/Documentation/media/uapi/v4l/pixfmt-compressed.rst index 292fdc116c77..561bda112809 100644 --- a/Documentation/media/uapi/v4l/pixfmt-compressed.rst +++ b/Documentation/media/uapi/v4l/pixfmt-compressed.rst @@ -61,10 +61,10 @@ Compressed Formats - ``V4L2_PIX_FMT_H264_SLICE`` - 'S264' - - H264 parsed slice data, without the start code and as - extracted from the H264 bitstream. This format is adapted for - stateless video decoders that implement an H264 pipeline - (using the :ref:`mem2mem` and :ref:`media-request-api`). + - H264 parsed slice data, including slice headers, either with or + without the start code, as extracted from the H264 bitstream. + This format is adapted for stateless video decoders that implement an + H264 pipeline (using the :ref:`mem2mem` and :ref:`media-request-api`). This pixelformat has two modifiers that must be set at least once through the ``V4L2_CID_MPEG_VIDEO_H264_DECODE_MODE`` and ``V4L2_CID_MPEG_VIDEO_H264_START_CODE`` controls. @@ -80,6 +80,10 @@ Compressed Formats appropriate number of macroblocks to decode a full corresponding frame to the matching capture buffer. + The syntax for this format is documented in :ref:`h264`, section + 7.3.2.8 "Slice layer without partitioning RBSP syntax" and the following + sections. + .. note:: This format is not yet part of the public kernel API and it @@ -188,6 +192,29 @@ Compressed Formats If :ref:`VIDIOC_ENUM_FMT` reports ``V4L2_FMT_FLAG_CONTINUOUS_BYTESTREAM`` then the decoder has no requirements since it can parse all the information from the raw bytestream. + * .. _V4L2-PIX-FMT-HEVC-SLICE: + + - ``V4L2_PIX_FMT_HEVC_SLICE`` + - 'S265' + - HEVC parsed slice data, as extracted from the HEVC bitstream. + This format is adapted for stateless video decoders that implement a + HEVC pipeline (using the :ref:`mem2mem` and :ref:`media-request-api`). 
+ This pixelformat has two modifiers that must be set at least once + through the ``V4L2_CID_MPEG_VIDEO_HEVC_DECODE_MODE`` + and ``V4L2_CID_MPEG_VIDEO_HEVC_START_CODE`` controls. + Metadata associated with the frame to decode is required to be passed + through the following controls: + * ``V4L2_CID_MPEG_VIDEO_HEVC_SPS`` + * ``V4L2_CID_MPEG_VIDEO_HEVC_PPS`` + * ``V4L2_CID_MPEG_VIDEO_HEVC_SLICE_PARAMS`` + See the :ref:`associated Codec Control IDs <v4l2-mpeg-hevc>`. + Buffers associated with this pixel format must contain the appropriate + number of macroblocks to decode a full corresponding frame. + + .. note:: + + This format is not yet part of the public kernel API and it + is expected to change. * .. _V4L2-PIX-FMT-FWHT: - ``V4L2_PIX_FMT_FWHT`` diff --git a/Documentation/media/uapi/v4l/pixfmt-meta-vivid.rst b/Documentation/media/uapi/v4l/pixfmt-meta-vivid.rst new file mode 100644 index 000000000000..eed20eaefe24 --- /dev/null +++ b/Documentation/media/uapi/v4l/pixfmt-meta-vivid.rst @@ -0,0 +1,60 @@ +.. This file is dual-licensed: you can use it either under the terms +.. of the GPL 2.0 or the GFDL 1.1+ license, at your option. Note that this +.. dual licensing only applies to this file, and not this project as a +.. whole. +.. +.. a) This file is free software; you can redistribute it and/or +.. modify it under the terms of the GNU General Public License as +.. published by the Free Software Foundation version 2 of +.. the License. +.. +.. This file is distributed in the hope that it will be useful, +.. but WITHOUT ANY WARRANTY; without even the implied warranty of +.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.. GNU General Public License for more details. +.. +.. Or, alternatively, +.. +.. b) Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +..
Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections + +.. _v4l2-meta-fmt-vivid: + +******************************* +V4L2_META_FMT_VIVID ('VIVD') +******************************* + +VIVID Metadata Format + + +Description +=========== + +This describes the metadata format used by the vivid driver. + +It sets Brightness, Saturation, Contrast and Hue, each of which maps to the +corresponding control of the vivid driver with respect to the range and default values. + +It contains the following fields: + +.. flat-table:: VIVID Metadata + :widths: 1 4 + :header-rows: 1 + :stub-columns: 0 + + * - Field + - Description + * - u16 brightness; + - Image brightness. The value is in the range 0 to 255, with a default value of 128. + * - u16 contrast; + - Image contrast. The value is in the range 0 to 255, with a default value of 128. + * - u16 saturation; + - Image color saturation. The value is in the range 0 to 255, with a default value of 128. + * - s16 hue; + - Image color balance. The value is in the range -128 to 128, with a default value of 0. diff --git a/Documentation/media/uapi/v4l/v4l2-selection-targets.rst b/Documentation/media/uapi/v4l/v4l2-selection-targets.rst index f74f239b0510..aae0c0013eb1 100644 --- a/Documentation/media/uapi/v4l/v4l2-selection-targets.rst +++ b/Documentation/media/uapi/v4l/v4l2-selection-targets.rst @@ -38,8 +38,10 @@ of the two interfaces they are used. * - ``V4L2_SEL_TGT_CROP_DEFAULT`` - 0x0001 - Suggested cropping rectangle that covers the "whole picture". + This includes only active pixels and excludes other non-active + pixels such as black pixels. + - Yes - Yes - - No * - ``V4L2_SEL_TGT_CROP_BOUNDS`` - 0x0002 - Bounds of the crop rectangle.
All valid crop rectangles fit inside diff --git a/Documentation/media/uapi/v4l/vidioc-decoder-cmd.rst b/Documentation/media/uapi/v4l/vidioc-decoder-cmd.rst index 57f0066f4cff..f1a504836f31 100644 --- a/Documentation/media/uapi/v4l/vidioc-decoder-cmd.rst +++ b/Documentation/media/uapi/v4l/vidioc-decoder-cmd.rst @@ -208,7 +208,15 @@ introduced in Linux 3.3. They are, however, mandatory for stateful mem2mem decod been started yet, the driver will return an ``EPERM`` error code. When the decoder is already running, this command does nothing. No flags are defined for this command. - + * - ``V4L2_DEC_CMD_FLUSH`` + - 4 + - Flush any held capture buffers. Only valid for stateless decoders. + This command is typically used when the application has reached the + end of the stream and the last output buffer had the + ``V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF`` flag set. That flag would prevent + dequeueing the capture buffer containing the last decoded frame, + so this command can be used to explicitly flush that final decoded + frame. This command does nothing if there are no held capture buffers. Return Value ============ diff --git a/Documentation/media/uapi/v4l/vidioc-g-ext-ctrls.rst b/Documentation/media/uapi/v4l/vidioc-g-ext-ctrls.rst index 13dc1a986249..271cac18afbb 100644 --- a/Documentation/media/uapi/v4l/vidioc-g-ext-ctrls.rst +++ b/Documentation/media/uapi/v4l/vidioc-g-ext-ctrls.rst @@ -199,6 +199,11 @@ still cause this situation. - A pointer to a matrix control of unsigned 32-bit values. Valid if this control is of type ``V4L2_CTRL_TYPE_U32``. * - + - :c:type:`v4l2_area` * + - ``p_area`` + - A pointer to a struct :c:type:`v4l2_area`. Valid if this control is + of type ``V4L2_CTRL_TYPE_AREA``.
+ * - - void * - ``ptr`` - A pointer to a compound type which can be an N-dimensional array diff --git a/Documentation/media/uapi/v4l/vidioc-g-fbuf.rst b/Documentation/media/uapi/v4l/vidioc-g-fbuf.rst index 7b6179627803..2d197e6bba8f 100644 --- a/Documentation/media/uapi/v4l/vidioc-g-fbuf.rst +++ b/Documentation/media/uapi/v4l/vidioc-g-fbuf.rst @@ -63,7 +63,7 @@ EINVAL error code when overlays are not supported. To set the parameters for a *Video Output Overlay*, applications must initialize the ``flags`` field of a struct -struct :c:type:`v4l2_framebuffer`. Since the framebuffer is +:c:type:`v4l2_framebuffer`. Since the framebuffer is implemented on the TV card all other parameters are determined by the driver. When an application calls :ref:`VIDIOC_S_FBUF <VIDIOC_G_FBUF>` with a pointer to this structure, the driver prepares for the overlay and returns the diff --git a/Documentation/media/uapi/v4l/vidioc-queryctrl.rst b/Documentation/media/uapi/v4l/vidioc-queryctrl.rst index a3d56ffbf4cc..6690928e657b 100644 --- a/Documentation/media/uapi/v4l/vidioc-queryctrl.rst +++ b/Documentation/media/uapi/v4l/vidioc-queryctrl.rst @@ -443,6 +443,12 @@ See also the examples in :ref:`control`. - n/a - A struct :c:type:`v4l2_ctrl_mpeg2_quantization`, containing MPEG-2 quantization matrices for stateless video decoders. + * - ``V4L2_CTRL_TYPE_AREA`` + - n/a + - n/a + - n/a + - A struct :c:type:`v4l2_area`, containing the width and the height + of a rectangular area. Units depend on the use case. * - ``V4L2_CTRL_TYPE_H264_SPS`` - n/a - n/a @@ -473,6 +479,24 @@ See also the examples in :ref:`control`. - n/a - A struct :c:type:`v4l2_ctrl_h264_decode_params`, containing H264 decode parameters for stateless video decoders. + * - ``V4L2_CTRL_TYPE_HEVC_SPS`` + - n/a + - n/a + - n/a + - A struct :c:type:`v4l2_ctrl_hevc_sps`, containing HEVC Sequence + Parameter Set for stateless video decoders. 
+ * - ``V4L2_CTRL_TYPE_HEVC_PPS`` + - n/a + - n/a + - n/a + - A struct :c:type:`v4l2_ctrl_hevc_pps`, containing HEVC Picture + Parameter Set for stateless video decoders. + * - ``V4L2_CTRL_TYPE_HEVC_SLICE_PARAMS`` + - n/a + - n/a + - n/a + - A struct :c:type:`v4l2_ctrl_hevc_slice_params`, containing HEVC + slice parameters for stateless video decoders. .. tabularcolumns:: |p{6.6cm}|p{2.2cm}|p{8.7cm}| diff --git a/Documentation/media/uapi/v4l/vidioc-reqbufs.rst b/Documentation/media/uapi/v4l/vidioc-reqbufs.rst index d7faef10e39b..d0c643db477a 100644 --- a/Documentation/media/uapi/v4l/vidioc-reqbufs.rst +++ b/Documentation/media/uapi/v4l/vidioc-reqbufs.rst @@ -125,6 +125,7 @@ aborting or finishing any DMA in progress, an implicit .. _V4L2-BUF-CAP-SUPPORTS-DMABUF: .. _V4L2-BUF-CAP-SUPPORTS-REQUESTS: .. _V4L2-BUF-CAP-SUPPORTS-ORPHANED-BUFS: +.. _V4L2-BUF-CAP-SUPPORTS-M2M-HOLD-CAPTURE-BUF: .. cssclass:: longtable @@ -150,6 +151,11 @@ aborting or finishing any DMA in progress, an implicit - The kernel allows calling :ref:`VIDIOC_REQBUFS` while buffers are still mapped or exported via DMABUF. These orphaned buffers will be freed when they are unmapped or when the exported DMABUF fds are closed. + * - ``V4L2_BUF_CAP_SUPPORTS_M2M_HOLD_CAPTURE_BUF`` + - 0x00000020 + - Only valid for stateless decoders. If set, then userspace can set the + ``V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF`` flag to hold off on returning the + capture buffer until the OUTPUT timestamp changes. Return Value ============ diff --git a/Documentation/media/v4l-drivers/imx.rst b/Documentation/media/v4l-drivers/imx.rst index 1d7eb8c7bd5c..1246573c1019 100644 --- a/Documentation/media/v4l-drivers/imx.rst +++ b/Documentation/media/v4l-drivers/imx.rst @@ -515,10 +515,10 @@ Streaming can then begin independently on the capture device nodes be used to select any supported YUV pixelformat on the capture device nodes, including planar. 
-SabreAuto with ADV7180 decoder ------------------------------- +i.MX6Q SabreAuto with ADV7180 decoder +------------------------------------- -On the SabreAuto, an on-board ADV7180 SD decoder is connected to the +On the i.MX6Q SabreAuto, an on-board ADV7180 SD decoder is connected to the parallel bus input on the internal video mux to IPU1 CSI0. The following example configures a pipeline to capture from the ADV7180 @@ -547,8 +547,6 @@ This example configures a pipeline to capture from the ADV7180 video decoder, assuming PAL 720x576 input signals, with Motion Compensated de-interlacing. The adv7180 must output sequential or alternating fields (field type 'seq-tb' for PAL, or 'alternate'). -$outputfmt can be any format supported by the ipu1_ic_prpvf entity -at its output pad: .. code-block:: none @@ -565,11 +563,70 @@ at its output pad: media-ctl -V "'ipu1_csi0':1 [fmt:AYUV32/720x576]" media-ctl -V "'ipu1_vdic':2 [fmt:AYUV32/720x576 field:none]" media-ctl -V "'ipu1_ic_prp':2 [fmt:AYUV32/720x576 field:none]" - media-ctl -V "'ipu1_ic_prpvf':1 [fmt:$outputfmt field:none]" + media-ctl -V "'ipu1_ic_prpvf':1 [fmt:AYUV32/720x576 field:none]" + # Configure "ipu1_ic_prpvf capture" interface (assumed at /dev/video2) + v4l2-ctl -d2 --set-fmt-video=field=none + +Streaming can then begin on /dev/video2. The v4l2-ctl tool can also be +used to select any supported YUV pixelformat on /dev/video2. + +This platform accepts Composite Video analog inputs to the ADV7180 on +Ain1 (connector J42). + +i.MX6DL SabreAuto with ADV7180 decoder +-------------------------------------- + +On the i.MX6DL SabreAuto, an on-board ADV7180 SD decoder is connected to the +parallel bus input on the internal video mux to IPU1 CSI0. + +The following example configures a pipeline to capture from the ADV7180 +video decoder, assuming NTSC 720x480 input signals, using simple +interweave (unconverted and without motion compensation). 
The adv7180 +must output sequential or alternating fields (field type 'seq-bt' for +NTSC, or 'alternate'): + +.. code-block:: none + + # Setup links + media-ctl -l "'adv7180 4-0021':0 -> 'ipu1_csi0_mux':4[1]" + media-ctl -l "'ipu1_csi0_mux':5 -> 'ipu1_csi0':0[1]" + media-ctl -l "'ipu1_csi0':2 -> 'ipu1_csi0 capture':0[1]" + # Configure pads + media-ctl -V "'adv7180 4-0021':0 [fmt:UYVY2X8/720x480 field:seq-bt]" + media-ctl -V "'ipu1_csi0_mux':5 [fmt:UYVY2X8/720x480]" + media-ctl -V "'ipu1_csi0':2 [fmt:AYUV32/720x480]" + # Configure "ipu1_csi0 capture" interface (assumed at /dev/video0) + v4l2-ctl -d0 --set-fmt-video=field=interlaced_bt + +Streaming can then begin on /dev/video0. The v4l2-ctl tool can also be +used to select any supported YUV pixelformat on /dev/video0. + +This example configures a pipeline to capture from the ADV7180 +video decoder, assuming PAL 720x576 input signals, with Motion +Compensated de-interlacing. The adv7180 must output sequential or +alternating fields (field type 'seq-tb' for PAL, or 'alternate'). + +.. 
code-block:: none + + # Setup links + media-ctl -l "'adv7180 4-0021':0 -> 'ipu1_csi0_mux':4[1]" + media-ctl -l "'ipu1_csi0_mux':5 -> 'ipu1_csi0':0[1]" + media-ctl -l "'ipu1_csi0':1 -> 'ipu1_vdic':0[1]" + media-ctl -l "'ipu1_vdic':2 -> 'ipu1_ic_prp':0[1]" + media-ctl -l "'ipu1_ic_prp':2 -> 'ipu1_ic_prpvf':0[1]" + media-ctl -l "'ipu1_ic_prpvf':1 -> 'ipu1_ic_prpvf capture':0[1]" + # Configure pads + media-ctl -V "'adv7180 4-0021':0 [fmt:UYVY2X8/720x576 field:seq-tb]" + media-ctl -V "'ipu1_csi0_mux':5 [fmt:UYVY2X8/720x576]" + media-ctl -V "'ipu1_csi0':1 [fmt:AYUV32/720x576]" + media-ctl -V "'ipu1_vdic':2 [fmt:AYUV32/720x576 field:none]" + media-ctl -V "'ipu1_ic_prp':2 [fmt:AYUV32/720x576 field:none]" + media-ctl -V "'ipu1_ic_prpvf':1 [fmt:AYUV32/720x576 field:none]" + # Configure "ipu1_ic_prpvf capture" interface (assumed at /dev/video2) + v4l2-ctl -d2 --set-fmt-video=field=none -Streaming can then begin on the capture device node at -"ipu1_ic_prpvf capture". The v4l2-ctl tool can be used to select any -supported YUV or RGB pixelformat on the capture device node. +Streaming can then begin on /dev/video2. The v4l2-ctl tool can also be +used to select any supported YUV pixelformat on /dev/video2. This platform accepts Composite Video analog inputs to the ADV7180 on Ain1 (connector J42). diff --git a/Documentation/media/v4l-drivers/ipu3.rst b/Documentation/media/v4l-drivers/ipu3.rst index c9f780404eee..e4904ab44e60 100644 --- a/Documentation/media/v4l-drivers/ipu3.rst +++ b/Documentation/media/v4l-drivers/ipu3.rst @@ -265,19 +265,56 @@ below. yavta -w "0x009819A1 1" /dev/v4l-subdev7 -RAW Bayer frames go through the following ImgU pipeline HW blocks to have the +Certain hardware blocks in the ImgU pipeline can change the frame resolution by +cropping or scaling. These hardware blocks include the Input Feeder (IF), Bayer Down +Scaler (BDS) and Geometric Distortion Correction (GDC).
+There is also another block which can change the frame resolution, the YUV Scaler, but it is +only applicable to the secondary output. + +RAW Bayer frames go through these ImgU pipeline hardware blocks and the final processed image output to the DDR memory. -RAW Bayer frame -> Input Feeder -> Bayer Down Scaling (BDS) -> Geometric -Distortion Correction (GDC) -> DDR +.. kernel-figure:: ipu3_rcb.svg + :alt: ipu3 resolution blocks image -The ImgU V4L2 subdev has to be configured with the supported resolutions in all -the above HW blocks, for a given input resolution. + IPU3 resolution change hardware blocks + +**Input Feeder** + +The Input Feeder gets the Bayer frame data from the sensor. It can crop +lines and columns from the frame and then stores the pixels in the device's internal +pixel buffer, where they are ready to be read out by the following blocks. + +**Bayer Down Scaler** + +The Bayer Down Scaler performs image scaling in the Bayer domain. The +downscale factor can be configured from 1X to 1/4X in each axis, in +steps of 0.03125 (1/32). +**Geometric Distortion Correction** + +Geometric Distortion Correction is used to perform correction of distortions +and image filtering. It needs some extra filter and envelope padding pixels to +work, so the input resolution of GDC should be larger than the output +resolution. + +**YUV Scaler** + +The YUV Scaler is similar to BDS, but it mainly performs image downscaling in the +YUV domain. It supports up to 1/12X downscaling, but it cannot be applied +to the main output. + +The ImgU V4L2 subdev has to be configured with the supported resolutions in all +the above hardware blocks, for a given input resolution. For a given supported resolution for an input frame, the Input Feeder, Bayer -Down Scaling and GDC blocks should be configured with the supported resolutions. -This information can be obtained by looking at the following IPU3 ImgU -configuration table.
+Down Scaler and GDC blocks should be configured with the supported resolutions +as each hardware block has its own alignment requirement. + +You must configure the output resolution of the hardware blocks smartly to meet +the hardware requirement along with keeping the maximum field of view. +The intermediate resolutions can be generated by specific tool and this +information can be obtained by looking at the following IPU3 ImgU configuration +table. https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/master diff --git a/Documentation/media/v4l-drivers/ipu3_rcb.svg b/Documentation/media/v4l-drivers/ipu3_rcb.svg new file mode 100644 index 000000000000..d878421b42a0 --- /dev/null +++ b/Documentation/media/v4l-drivers/ipu3_rcb.svg @@ -0,0 +1,331 @@ +<?xml version="1.0" encoding="UTF-8"?> +<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="774pt" height="152pt" viewBox="0 0 774 152" version="1.1"> +<defs> +<g> +<symbol overflow="visible" id="glyph0-0"> +<path style="stroke:none;" d="M 1 0 L 1 -15 L 9 -15 L 9 0 Z M 8 -1 L 8 -14 L 2 -14 L 2 -1 Z M 8 -1 "/> +</symbol> +<symbol overflow="visible" id="glyph0-1"> +<path style="stroke:none;" d="M 4.6875 -1.15625 C 5.519531 -1.15625 6.15625 -1.316406 6.59375 -1.640625 C 7.039062 -1.960938 7.265625 -2.441406 7.265625 -3.078125 C 7.265625 -3.460938 7.179688 -3.789062 7.015625 -4.0625 C 6.859375 -4.34375 6.644531 -4.582031 6.375 -4.78125 C 6.113281 -4.988281 5.816406 -5.171875 5.484375 -5.328125 C 5.148438 -5.484375 4.804688 -5.628906 4.453125 -5.765625 C 4.054688 -5.921875 3.675781 -6.097656 3.3125 -6.296875 C 2.945312 -6.492188 2.617188 -6.726562 2.328125 -7 C 2.046875 -7.269531 1.820312 -7.582031 1.65625 -7.9375 C 1.488281 -8.300781 1.40625 -8.726562 1.40625 -9.21875 C 1.40625 -10.300781 1.742188 -11.144531 2.421875 -11.75 C 3.097656 -12.351562 4.046875 -12.65625 5.265625 -12.65625 C 5.597656 -12.65625 5.925781 -12.628906 6.25 -12.578125 C 6.570312 -12.535156 6.875 
-12.476562 7.15625 -12.40625 C 7.4375 -12.34375 7.6875 -12.265625 7.90625 -12.171875 C 8.125 -12.085938 8.300781 -12 8.4375 -11.90625 L 7.921875 -10.515625 C 7.648438 -10.679688 7.28125 -10.84375 6.8125 -11 C 6.351562 -11.15625 5.835938 -11.234375 5.265625 -11.234375 C 4.660156 -11.234375 4.140625 -11.082031 3.703125 -10.78125 C 3.265625 -10.488281 3.046875 -10.039062 3.046875 -9.4375 C 3.046875 -9.09375 3.109375 -8.800781 3.234375 -8.5625 C 3.359375 -8.320312 3.53125 -8.109375 3.75 -7.921875 C 3.96875 -7.742188 4.222656 -7.582031 4.515625 -7.4375 C 4.804688 -7.289062 5.128906 -7.144531 5.484375 -7 C 5.984375 -6.789062 6.441406 -6.578125 6.859375 -6.359375 C 7.285156 -6.148438 7.648438 -5.894531 7.953125 -5.59375 C 8.253906 -5.300781 8.488281 -4.953125 8.65625 -4.546875 C 8.820312 -4.148438 8.90625 -3.664062 8.90625 -3.09375 C 8.90625 -2.019531 8.539062 -1.191406 7.8125 -0.609375 C 7.082031 -0.0234375 6.039062 0.265625 4.6875 0.265625 C 4.238281 0.265625 3.820312 0.234375 3.4375 0.171875 C 3.050781 0.109375 2.707031 0.03125 2.40625 -0.0625 C 2.101562 -0.15625 1.835938 -0.25 1.609375 -0.34375 C 1.390625 -0.4375 1.21875 -0.519531 1.09375 -0.59375 L 1.59375 -1.953125 C 1.863281 -1.804688 2.257812 -1.632812 2.78125 -1.4375 C 3.300781 -1.25 3.9375 -1.15625 4.6875 -1.15625 Z M 4.6875 -1.15625 "/> +</symbol> +<symbol overflow="visible" id="glyph0-2"> +<path style="stroke:none;" d="M 5.1875 -9.5 C 6.4375 -9.5 7.398438 -9.109375 8.078125 -8.328125 C 8.753906 -7.546875 9.09375 -6.363281 9.09375 -4.78125 L 9.09375 -4.203125 L 2.453125 -4.203125 C 2.523438 -3.242188 2.84375 -2.515625 3.40625 -2.015625 C 3.976562 -1.515625 4.773438 -1.265625 5.796875 -1.265625 C 6.390625 -1.265625 6.890625 -1.3125 7.296875 -1.40625 C 7.710938 -1.5 8.023438 -1.597656 8.234375 -1.703125 L 8.453125 -0.296875 C 8.253906 -0.191406 7.894531 -0.0820312 7.375 0.03125 C 6.851562 0.15625 6.269531 0.21875 5.625 0.21875 C 4.820312 0.21875 4.113281 0.0976562 3.5 -0.140625 C 2.894531 -0.390625 2.394531 
-0.726562 2 -1.15625 C 1.601562 -1.582031 1.300781 -2.09375 1.09375 -2.6875 C 0.894531 -3.28125 0.796875 -3.925781 0.796875 -4.625 C 0.796875 -5.445312 0.921875 -6.164062 1.171875 -6.78125 C 1.429688 -7.394531 1.765625 -7.898438 2.171875 -8.296875 C 2.585938 -8.703125 3.054688 -9.003906 3.578125 -9.203125 C 4.097656 -9.398438 4.632812 -9.5 5.1875 -9.5 Z M 7.421875 -5.546875 C 7.421875 -6.328125 7.210938 -6.945312 6.796875 -7.40625 C 6.390625 -7.863281 5.84375 -8.09375 5.15625 -8.09375 C 4.769531 -8.09375 4.421875 -8.019531 4.109375 -7.875 C 3.796875 -7.726562 3.523438 -7.535156 3.296875 -7.296875 C 3.066406 -7.054688 2.882812 -6.78125 2.75 -6.46875 C 2.625 -6.164062 2.539062 -5.859375 2.5 -5.546875 Z M 7.421875 -5.546875 "/> +</symbol> +<symbol overflow="visible" id="glyph0-3"> +<path style="stroke:none;" d="M 1.421875 -9.015625 C 2.015625 -9.160156 2.609375 -9.273438 3.203125 -9.359375 C 3.796875 -9.441406 4.351562 -9.484375 4.875 -9.484375 C 6.113281 -9.484375 7.050781 -9.160156 7.6875 -8.515625 C 8.320312 -7.878906 8.640625 -6.851562 8.640625 -5.4375 L 8.640625 0 L 7 0 L 7 -5.140625 C 7 -5.742188 6.945312 -6.226562 6.84375 -6.59375 C 6.738281 -6.96875 6.585938 -7.257812 6.390625 -7.46875 C 6.191406 -7.675781 5.957031 -7.816406 5.6875 -7.890625 C 5.414062 -7.972656 5.117188 -8.015625 4.796875 -8.015625 C 4.535156 -8.015625 4.253906 -8 3.953125 -7.96875 C 3.648438 -7.9375 3.359375 -7.894531 3.078125 -7.84375 L 3.078125 0 L 1.421875 0 Z M 1.421875 -9.015625 "/> +</symbol> +<symbol overflow="visible" id="glyph0-4"> +<path style="stroke:none;" d="M 7.015625 -2.3125 C 7.015625 -2.644531 6.878906 -2.914062 6.609375 -3.125 C 6.335938 -3.34375 6 -3.53125 5.59375 -3.6875 C 5.1875 -3.851562 4.742188 -4.015625 4.265625 -4.171875 C 3.785156 -4.328125 3.335938 -4.515625 2.921875 -4.734375 C 2.515625 -4.960938 2.175781 -5.242188 1.90625 -5.578125 C 1.632812 -5.910156 1.5 -6.34375 1.5 -6.875 C 1.5 -7.625 1.800781 -8.25 2.40625 -8.75 C 3.007812 -9.25 3.960938 -9.5 5.265625 -9.5 
C 5.765625 -9.5 6.285156 -9.460938 6.828125 -9.390625 C 7.367188 -9.316406 7.832031 -9.21875 8.21875 -9.09375 L 7.921875 -7.625 C 7.816406 -7.675781 7.671875 -7.726562 7.484375 -7.78125 C 7.296875 -7.84375 7.082031 -7.894531 6.84375 -7.9375 C 6.601562 -7.988281 6.34375 -8.023438 6.0625 -8.046875 C 5.789062 -8.078125 5.53125 -8.09375 5.28125 -8.09375 C 3.84375 -8.09375 3.125 -7.703125 3.125 -6.921875 C 3.125 -6.640625 3.257812 -6.398438 3.53125 -6.203125 C 3.800781 -6.015625 4.144531 -5.835938 4.5625 -5.671875 C 4.976562 -5.515625 5.425781 -5.351562 5.90625 -5.1875 C 6.382812 -5.019531 6.828125 -4.816406 7.234375 -4.578125 C 7.648438 -4.335938 7.992188 -4.046875 8.265625 -3.703125 C 8.546875 -3.367188 8.6875 -2.941406 8.6875 -2.421875 C 8.6875 -1.578125 8.359375 -0.925781 7.703125 -0.46875 C 7.046875 -0.0078125 6.007812 0.21875 4.59375 0.21875 C 3.957031 0.21875 3.375 0.164062 2.84375 0.0625 C 2.3125 -0.0390625 1.800781 -0.203125 1.3125 -0.421875 L 1.640625 -1.921875 C 2.109375 -1.703125 2.597656 -1.523438 3.109375 -1.390625 C 3.617188 -1.253906 4.171875 -1.1875 4.765625 -1.1875 C 6.265625 -1.1875 7.015625 -1.5625 7.015625 -2.3125 Z M 7.015625 -2.3125 "/> +</symbol> +<symbol overflow="visible" id="glyph0-5"> +<path style="stroke:none;" d="M 9.203125 -4.640625 C 9.203125 -3.910156 9.097656 -3.25 8.890625 -2.65625 C 8.679688 -2.0625 8.390625 -1.550781 8.015625 -1.125 C 7.640625 -0.695312 7.191406 -0.363281 6.671875 -0.125 C 6.160156 0.101562 5.597656 0.21875 4.984375 0.21875 C 4.378906 0.21875 3.820312 0.101562 3.3125 -0.125 C 2.800781 -0.363281 2.359375 -0.695312 1.984375 -1.125 C 1.609375 -1.550781 1.316406 -2.0625 1.109375 -2.65625 C 0.898438 -3.25 0.796875 -3.910156 0.796875 -4.640625 C 0.796875 -5.367188 0.898438 -6.035156 1.109375 -6.640625 C 1.316406 -7.242188 1.609375 -7.753906 1.984375 -8.171875 C 2.359375 -8.585938 2.800781 -8.910156 3.3125 -9.140625 C 3.820312 -9.378906 4.378906 -9.5 4.984375 -9.5 C 5.597656 -9.5 6.160156 -9.378906 6.671875 -9.140625 C 
7.191406 -8.910156 7.640625 -8.585938 8.015625 -8.171875 C 8.390625 -7.753906 8.679688 -7.242188 8.890625 -6.640625 C 9.097656 -6.035156 9.203125 -5.367188 9.203125 -4.640625 Z M 7.5 -4.640625 C 7.5 -5.691406 7.269531 -6.519531 6.8125 -7.125 C 6.363281 -7.738281 5.753906 -8.046875 4.984375 -8.046875 C 4.222656 -8.046875 3.617188 -7.738281 3.171875 -7.125 C 2.722656 -6.519531 2.5 -5.691406 2.5 -4.640625 C 2.5 -3.597656 2.722656 -2.773438 3.171875 -2.171875 C 3.617188 -1.566406 4.222656 -1.265625 4.984375 -1.265625 C 5.753906 -1.265625 6.363281 -1.566406 6.8125 -2.171875 C 7.269531 -2.773438 7.5 -3.597656 7.5 -4.640625 Z M 7.5 -4.640625 "/> +</symbol> +<symbol overflow="visible" id="glyph0-6"> +<path style="stroke:none;" d="M 2.140625 0 L 2.140625 -8.78125 C 3.503906 -9.25 4.878906 -9.484375 6.265625 -9.484375 C 6.691406 -9.484375 7.097656 -9.460938 7.484375 -9.421875 C 7.867188 -9.390625 8.296875 -9.320312 8.765625 -9.21875 L 8.453125 -7.765625 C 8.023438 -7.878906 7.648438 -7.953125 7.328125 -7.984375 C 7.003906 -8.023438 6.648438 -8.046875 6.265625 -8.046875 C 5.453125 -8.046875 4.625 -7.929688 3.78125 -7.703125 L 3.78125 0 Z M 2.140625 0 "/> +</symbol> +<symbol overflow="visible" id="glyph0-7"> +<path style="stroke:none;" d="M 5.8125 -10.984375 L 5.8125 -1.40625 L 8.21875 -1.40625 L 8.21875 0 L 1.78125 0 L 1.78125 -1.40625 L 4.1875 -1.40625 L 4.1875 -10.984375 L 1.78125 -10.984375 L 1.78125 -12.375 L 8.21875 -12.375 L 8.21875 -10.984375 Z M 5.8125 -10.984375 "/> +</symbol> +<symbol overflow="visible" id="glyph0-8"> +<path style="stroke:none;" d="M 1.8125 0 L 1.8125 -12.375 L 8.84375 -12.375 L 8.84375 -10.984375 L 3.453125 -10.984375 L 3.453125 -7.125 L 8.203125 -7.125 L 8.203125 -5.734375 L 3.453125 -5.734375 L 3.453125 0 Z M 1.8125 0 "/> +</symbol> +<symbol overflow="visible" id="glyph0-9"> +<path style="stroke:none;" d="M 4.078125 0.09375 C 3.878906 0.09375 3.644531 0.0859375 3.375 0.078125 C 3.113281 0.0664062 2.847656 0.0507812 2.578125 0.03125 C 2.316406 
0.0078125 2.050781 -0.0195312 1.78125 -0.0625 C 1.507812 -0.101562 1.273438 -0.148438 1.078125 -0.203125 L 1.078125 -12.203125 C 1.273438 -12.253906 1.503906 -12.300781 1.765625 -12.34375 C 2.023438 -12.382812 2.289062 -12.410156 2.5625 -12.421875 C 2.84375 -12.441406 3.113281 -12.457031 3.375 -12.46875 C 3.632812 -12.488281 3.867188 -12.5 4.078125 -12.5 C 4.691406 -12.5 5.265625 -12.445312 5.796875 -12.34375 C 6.328125 -12.238281 6.789062 -12.054688 7.1875 -11.796875 C 7.582031 -11.546875 7.890625 -11.210938 8.109375 -10.796875 C 8.328125 -10.390625 8.4375 -9.878906 8.4375 -9.265625 C 8.4375 -8.960938 8.390625 -8.675781 8.296875 -8.40625 C 8.203125 -8.132812 8.070312 -7.878906 7.90625 -7.640625 C 7.738281 -7.398438 7.546875 -7.1875 7.328125 -7 C 7.109375 -6.820312 6.875 -6.6875 6.625 -6.59375 C 7.300781 -6.40625 7.867188 -6.0625 8.328125 -5.5625 C 8.785156 -5.0625 9.015625 -4.414062 9.015625 -3.625 C 9.015625 -2.394531 8.617188 -1.46875 7.828125 -0.84375 C 7.046875 -0.21875 5.796875 0.09375 4.078125 0.09375 Z M 2.71875 -5.78125 L 2.71875 -1.359375 C 2.75 -1.347656 2.898438 -1.332031 3.171875 -1.3125 C 3.441406 -1.289062 3.785156 -1.28125 4.203125 -1.28125 C 4.609375 -1.28125 5 -1.3125 5.375 -1.375 C 5.757812 -1.445312 6.097656 -1.570312 6.390625 -1.75 C 6.691406 -1.925781 6.929688 -2.160156 7.109375 -2.453125 C 7.285156 -2.753906 7.375 -3.132812 7.375 -3.59375 C 7.375 -4.007812 7.289062 -4.359375 7.125 -4.640625 C 6.957031 -4.921875 6.738281 -5.144531 6.46875 -5.3125 C 6.195312 -5.476562 5.878906 -5.597656 5.515625 -5.671875 C 5.160156 -5.742188 4.789062 -5.78125 4.40625 -5.78125 Z M 2.71875 -7.140625 L 4.015625 -7.140625 C 4.347656 -7.140625 4.679688 -7.171875 5.015625 -7.234375 C 5.347656 -7.304688 5.644531 -7.414062 5.90625 -7.5625 C 6.175781 -7.707031 6.390625 -7.90625 6.546875 -8.15625 C 6.710938 -8.414062 6.796875 -8.738281 6.796875 -9.125 C 6.796875 -9.476562 6.722656 -9.78125 6.578125 -10.03125 C 6.429688 -10.289062 6.238281 -10.5 6 -10.65625 C 5.757812 
-10.820312 5.484375 -10.9375 5.171875 -11 C 4.859375 -11.0625 4.53125 -11.09375 4.1875 -11.09375 C 3.832031 -11.09375 3.523438 -11.085938 3.265625 -11.078125 C 3.003906 -11.078125 2.820312 -11.066406 2.71875 -11.046875 Z M 2.71875 -7.140625 "/> +</symbol> +<symbol overflow="visible" id="glyph0-10"> +<path style="stroke:none;" d="M 9.203125 -6.203125 C 9.203125 -5.054688 9.054688 -4.082031 8.765625 -3.28125 C 8.484375 -2.476562 8.09375 -1.828125 7.59375 -1.328125 C 7.09375 -0.828125 6.5 -0.460938 5.8125 -0.234375 C 5.125 -0.015625 4.378906 0.09375 3.578125 0.09375 C 2.753906 0.09375 1.921875 -0.00390625 1.078125 -0.203125 L 1.078125 -12.203125 C 1.921875 -12.398438 2.753906 -12.5 3.578125 -12.5 C 4.378906 -12.5 5.125 -12.382812 5.8125 -12.15625 C 6.5 -11.925781 7.09375 -11.554688 7.59375 -11.046875 C 8.09375 -10.546875 8.484375 -9.894531 8.765625 -9.09375 C 9.054688 -8.300781 9.203125 -7.335938 9.203125 -6.203125 Z M 2.71875 -1.375 C 3.050781 -1.332031 3.390625 -1.3125 3.734375 -1.3125 C 4.335938 -1.3125 4.875 -1.398438 5.34375 -1.578125 C 5.8125 -1.765625 6.203125 -2.054688 6.515625 -2.453125 C 6.835938 -2.847656 7.082031 -3.351562 7.25 -3.96875 C 7.425781 -4.59375 7.515625 -5.335938 7.515625 -6.203125 C 7.515625 -7.878906 7.191406 -9.109375 6.546875 -9.890625 C 5.898438 -10.679688 4.945312 -11.078125 3.6875 -11.078125 C 3.507812 -11.078125 3.335938 -11.070312 3.171875 -11.0625 C 3.003906 -11.0625 2.851562 -11.046875 2.71875 -11.015625 Z M 2.71875 -1.375 "/> +</symbol> +<symbol overflow="visible" id="glyph0-11"> +<path style="stroke:none;" d="M 7.453125 -6.09375 L 9.09375 -6.09375 L 9.09375 -0.296875 C 8.84375 -0.203125 8.4375 -0.0859375 7.875 0.046875 C 7.320312 0.191406 6.664062 0.265625 5.90625 0.265625 C 5.15625 0.265625 4.472656 0.125 3.859375 -0.15625 C 3.242188 -0.445312 2.71875 -0.863281 2.28125 -1.40625 C 1.851562 -1.957031 1.519531 -2.632812 1.28125 -3.4375 C 1.039062 -4.25 0.921875 -5.171875 0.921875 -6.203125 C 0.921875 -7.242188 1.050781 -8.160156 
1.3125 -8.953125 C 1.582031 -9.753906 1.945312 -10.425781 2.40625 -10.96875 C 2.863281 -11.519531 3.398438 -11.9375 4.015625 -12.21875 C 4.628906 -12.507812 5.289062 -12.65625 6 -12.65625 C 6.457031 -12.65625 6.859375 -12.617188 7.203125 -12.546875 C 7.546875 -12.484375 7.835938 -12.40625 8.078125 -12.3125 C 8.328125 -12.226562 8.53125 -12.132812 8.6875 -12.03125 C 8.851562 -11.925781 8.976562 -11.847656 9.0625 -11.796875 L 8.515625 -10.421875 C 8.210938 -10.660156 7.847656 -10.851562 7.421875 -11 C 7.003906 -11.15625 6.5625 -11.234375 6.09375 -11.234375 C 5.59375 -11.234375 5.125 -11.113281 4.6875 -10.875 C 4.257812 -10.632812 3.890625 -10.296875 3.578125 -9.859375 C 3.273438 -9.421875 3.035156 -8.890625 2.859375 -8.265625 C 2.679688 -7.648438 2.59375 -6.960938 2.59375 -6.203125 C 2.59375 -5.453125 2.671875 -4.769531 2.828125 -4.15625 C 2.984375 -3.539062 3.207031 -3.015625 3.5 -2.578125 C 3.789062 -2.140625 4.148438 -1.796875 4.578125 -1.546875 C 5.015625 -1.304688 5.515625 -1.1875 6.078125 -1.1875 C 6.460938 -1.1875 6.757812 -1.210938 6.96875 -1.265625 C 7.1875 -1.316406 7.347656 -1.367188 7.453125 -1.421875 Z M 7.453125 -6.09375 "/> +</symbol> +<symbol overflow="visible" id="glyph0-12"> +<path style="stroke:none;" d="M 9.203125 -0.515625 C 8.734375 -0.253906 8.234375 -0.0625 7.703125 0.0625 C 7.179688 0.195312 6.617188 0.265625 6.015625 0.265625 C 5.285156 0.265625 4.609375 0.132812 3.984375 -0.125 C 3.367188 -0.382812 2.832031 -0.773438 2.375 -1.296875 C 1.925781 -1.828125 1.570312 -2.5 1.3125 -3.3125 C 1.050781 -4.132812 0.921875 -5.097656 0.921875 -6.203125 C 0.921875 -7.253906 1.054688 -8.179688 1.328125 -8.984375 C 1.597656 -9.785156 1.96875 -10.457031 2.4375 -11 C 2.90625 -11.539062 3.453125 -11.953125 4.078125 -12.234375 C 4.703125 -12.515625 5.367188 -12.65625 6.078125 -12.65625 C 6.566406 -12.65625 7.066406 -12.585938 7.578125 -12.453125 C 8.097656 -12.328125 8.601562 -12.109375 9.09375 -11.796875 L 8.625 -10.4375 C 7.738281 -10.945312 6.910156 
-11.203125 6.140625 -11.203125 C 5.585938 -11.203125 5.09375 -11.082031 4.65625 -10.84375 C 4.226562 -10.613281 3.859375 -10.28125 3.546875 -9.84375 C 3.242188 -9.40625 3.007812 -8.878906 2.84375 -8.265625 C 2.675781 -7.648438 2.59375 -6.960938 2.59375 -6.203125 C 2.59375 -5.347656 2.679688 -4.609375 2.859375 -3.984375 C 3.046875 -3.359375 3.296875 -2.835938 3.609375 -2.421875 C 3.929688 -2.003906 4.316406 -1.695312 4.765625 -1.5 C 5.210938 -1.300781 5.695312 -1.203125 6.21875 -1.203125 C 6.601562 -1.203125 7.007812 -1.25 7.4375 -1.34375 C 7.863281 -1.445312 8.304688 -1.625 8.765625 -1.875 Z M 9.203125 -0.515625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-0"> +<path style="stroke:none;" d="M 0.59375 0 L 0.59375 -9 L 5.40625 -9 L 5.40625 0 Z M 4.796875 -0.59375 L 4.796875 -8.40625 L 1.203125 -8.40625 L 1.203125 -0.59375 Z M 4.796875 -0.59375 "/> +</symbol> +<symbol overflow="visible" id="glyph1-1"> +<path style="stroke:none;" d="M 2.515625 0 L 2.515625 -2.765625 C 2.023438 -3.554688 1.582031 -4.332031 1.1875 -5.09375 C 0.789062 -5.851562 0.445312 -6.628906 0.15625 -7.421875 L 1.265625 -7.421875 C 1.492188 -6.753906 1.757812 -6.113281 2.0625 -5.5 C 2.363281 -4.882812 2.6875 -4.253906 3.03125 -3.609375 C 3.394531 -4.285156 3.71875 -4.929688 4 -5.546875 C 4.28125 -6.160156 4.539062 -6.785156 4.78125 -7.421875 L 5.859375 -7.421875 C 5.554688 -6.640625 5.207031 -5.875 4.8125 -5.125 C 4.414062 -4.382812 3.976562 -3.601562 3.5 -2.78125 L 3.5 0 Z M 2.515625 0 "/> +</symbol> +<symbol overflow="visible" id="glyph1-2"> +<path style="stroke:none;" d="M 3 0.15625 C 2.5625 0.15625 2.1875 0.09375 1.875 -0.03125 C 1.570312 -0.164062 1.320312 -0.347656 1.125 -0.578125 C 0.9375 -0.804688 0.796875 -1.085938 0.703125 -1.421875 C 0.617188 -1.765625 0.578125 -2.144531 0.578125 -2.5625 L 0.578125 -7.421875 L 1.5625 -7.421875 L 1.5625 -2.65625 C 1.5625 -2.28125 1.59375 -1.96875 1.65625 -1.71875 C 1.726562 -1.46875 1.828125 -1.265625 1.953125 -1.109375 C 2.078125 -0.960938 
2.222656 -0.859375 2.390625 -0.796875 C 2.566406 -0.734375 2.769531 -0.703125 3 -0.703125 C 3.226562 -0.703125 3.425781 -0.734375 3.59375 -0.796875 C 3.769531 -0.859375 3.921875 -0.960938 4.046875 -1.109375 C 4.171875 -1.265625 4.265625 -1.46875 4.328125 -1.71875 C 4.398438 -1.96875 4.4375 -2.28125 4.4375 -2.65625 L 4.4375 -7.421875 L 5.421875 -7.421875 L 5.421875 -2.5625 C 5.421875 -2.144531 5.375 -1.765625 5.28125 -1.421875 C 5.195312 -1.085938 5.054688 -0.804688 4.859375 -0.578125 C 4.671875 -0.347656 4.421875 -0.164062 4.109375 -0.03125 C 3.804688 0.09375 3.4375 0.15625 3 0.15625 Z M 3 0.15625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-3"> +<path style="stroke:none;" d="M 1.21875 -7.421875 C 1.320312 -6.921875 1.445312 -6.375 1.59375 -5.78125 C 1.738281 -5.1875 1.890625 -4.585938 2.046875 -3.984375 C 2.210938 -3.390625 2.378906 -2.820312 2.546875 -2.28125 C 2.722656 -1.738281 2.882812 -1.265625 3.03125 -0.859375 C 3.15625 -1.265625 3.300781 -1.742188 3.46875 -2.296875 C 3.644531 -2.847656 3.816406 -3.421875 3.984375 -4.015625 C 4.148438 -4.609375 4.304688 -5.203125 4.453125 -5.796875 C 4.609375 -6.390625 4.734375 -6.929688 4.828125 -7.421875 L 5.859375 -7.421875 C 5.796875 -7.109375 5.691406 -6.679688 5.546875 -6.140625 C 5.398438 -5.597656 5.226562 -4.992188 5.03125 -4.328125 C 4.832031 -3.660156 4.609375 -2.953125 4.359375 -2.203125 C 4.117188 -1.453125 3.863281 -0.71875 3.59375 0 L 2.375 0 C 2.125 -0.71875 1.878906 -1.445312 1.640625 -2.1875 C 1.410156 -2.9375 1.195312 -3.644531 1 -4.3125 C 0.800781 -4.976562 0.628906 -5.582031 0.484375 -6.125 C 0.335938 -6.675781 0.226562 -7.109375 0.15625 -7.421875 Z M 1.21875 -7.421875 "/> +</symbol> +<symbol overflow="visible" id="glyph1-4"> +<path style="stroke:none;" d=""/> +</symbol> +<symbol overflow="visible" id="glyph1-5"> +<path style="stroke:none;" d="M 5.515625 -3.71875 C 5.515625 -3.03125 5.425781 -2.445312 5.25 -1.96875 C 5.082031 -1.488281 4.847656 -1.097656 4.546875 -0.796875 C 4.253906 -0.492188 
3.898438 -0.273438 3.484375 -0.140625 C 3.078125 -0.00390625 2.628906 0.0625 2.140625 0.0625 C 1.648438 0.0625 1.148438 0 0.640625 -0.125 L 0.640625 -7.3125 C 1.148438 -7.4375 1.648438 -7.5 2.140625 -7.5 C 2.628906 -7.5 3.078125 -7.429688 3.484375 -7.296875 C 3.898438 -7.160156 4.253906 -6.941406 4.546875 -6.640625 C 4.847656 -6.335938 5.082031 -5.941406 5.25 -5.453125 C 5.425781 -4.972656 5.515625 -4.394531 5.515625 -3.71875 Z M 1.625 -0.828125 C 1.832031 -0.804688 2.039062 -0.796875 2.25 -0.796875 C 2.601562 -0.796875 2.921875 -0.847656 3.203125 -0.953125 C 3.484375 -1.054688 3.71875 -1.226562 3.90625 -1.46875 C 4.101562 -1.707031 4.253906 -2.007812 4.359375 -2.375 C 4.460938 -2.75 4.515625 -3.195312 4.515625 -3.71875 C 4.515625 -4.726562 4.316406 -5.46875 3.921875 -5.9375 C 3.535156 -6.40625 2.960938 -6.640625 2.203125 -6.640625 C 2.097656 -6.640625 1.992188 -6.640625 1.890625 -6.640625 C 1.796875 -6.640625 1.707031 -6.628906 1.625 -6.609375 Z M 1.625 -0.828125 "/> +</symbol> +<symbol overflow="visible" id="glyph1-6"> +<path style="stroke:none;" d="M 5.515625 -2.78125 C 5.515625 -2.34375 5.453125 -1.945312 5.328125 -1.59375 C 5.203125 -1.238281 5.023438 -0.929688 4.796875 -0.671875 C 4.578125 -0.410156 4.3125 -0.210938 4 -0.078125 C 3.695312 0.0546875 3.359375 0.125 2.984375 0.125 C 2.628906 0.125 2.296875 0.0546875 1.984375 -0.078125 C 1.679688 -0.210938 1.414062 -0.410156 1.1875 -0.671875 C 0.96875 -0.929688 0.796875 -1.238281 0.671875 -1.59375 C 0.546875 -1.945312 0.484375 -2.34375 0.484375 -2.78125 C 0.484375 -3.21875 0.546875 -3.617188 0.671875 -3.984375 C 0.796875 -4.347656 0.96875 -4.65625 1.1875 -4.90625 C 1.414062 -5.15625 1.679688 -5.347656 1.984375 -5.484375 C 2.296875 -5.628906 2.628906 -5.703125 2.984375 -5.703125 C 3.359375 -5.703125 3.695312 -5.628906 4 -5.484375 C 4.3125 -5.347656 4.578125 -5.15625 4.796875 -4.90625 C 5.023438 -4.65625 5.203125 -4.347656 5.328125 -3.984375 C 5.453125 -3.617188 5.515625 -3.21875 5.515625 -2.78125 Z M 4.5 -2.78125 
C 4.5 -3.414062 4.363281 -3.914062 4.09375 -4.28125 C 3.820312 -4.644531 3.453125 -4.828125 2.984375 -4.828125 C 2.523438 -4.828125 2.160156 -4.644531 1.890625 -4.28125 C 1.628906 -3.914062 1.5 -3.414062 1.5 -2.78125 C 1.5 -2.15625 1.628906 -1.660156 1.890625 -1.296875 C 2.160156 -0.929688 2.523438 -0.75 2.984375 -0.75 C 3.453125 -0.75 3.820312 -0.929688 4.09375 -1.296875 C 4.363281 -1.660156 4.5 -2.15625 4.5 -2.78125 Z M 4.5 -2.78125 "/> +</symbol> +<symbol overflow="visible" id="glyph1-7"> +<path style="stroke:none;" d="M 4.109375 0 C 3.992188 -0.269531 3.890625 -0.515625 3.796875 -0.734375 C 3.710938 -0.960938 3.628906 -1.1875 3.546875 -1.40625 C 3.460938 -1.632812 3.378906 -1.867188 3.296875 -2.109375 C 3.210938 -2.359375 3.113281 -2.640625 3 -2.953125 C 2.882812 -2.640625 2.78125 -2.359375 2.6875 -2.109375 C 2.601562 -1.867188 2.519531 -1.632812 2.4375 -1.40625 C 2.351562 -1.1875 2.265625 -0.960938 2.171875 -0.734375 C 2.085938 -0.515625 1.984375 -0.269531 1.859375 0 L 1.109375 0 C 0.890625 -0.976562 0.707031 -1.953125 0.5625 -2.921875 C 0.414062 -3.890625 0.304688 -4.769531 0.234375 -5.5625 L 1.15625 -5.5625 C 1.1875 -5.25 1.210938 -4.941406 1.234375 -4.640625 C 1.265625 -4.347656 1.300781 -4.035156 1.34375 -3.703125 C 1.382812 -3.378906 1.429688 -3.023438 1.484375 -2.640625 C 1.535156 -2.253906 1.59375 -1.820312 1.65625 -1.34375 C 1.78125 -1.664062 1.882812 -1.945312 1.96875 -2.1875 C 2.0625 -2.425781 2.144531 -2.648438 2.21875 -2.859375 C 2.289062 -3.078125 2.359375 -3.296875 2.421875 -3.515625 C 2.492188 -3.742188 2.570312 -4 2.65625 -4.28125 L 3.390625 -4.28125 C 3.472656 -4 3.546875 -3.742188 3.609375 -3.515625 C 3.671875 -3.296875 3.738281 -3.078125 3.8125 -2.859375 C 3.882812 -2.648438 3.957031 -2.425781 4.03125 -2.1875 C 4.113281 -1.945312 4.21875 -1.671875 4.34375 -1.359375 C 4.414062 -1.796875 4.476562 -2.203125 4.53125 -2.578125 C 4.59375 -2.953125 4.640625 -3.304688 4.671875 -3.640625 C 4.710938 -3.972656 4.75 -4.296875 4.78125 -4.609375 C 
4.820312 -4.921875 4.851562 -5.238281 4.875 -5.5625 L 5.765625 -5.5625 C 5.734375 -5.164062 5.6875 -4.738281 5.625 -4.28125 C 5.570312 -3.820312 5.503906 -3.351562 5.421875 -2.875 C 5.335938 -2.394531 5.25 -1.910156 5.15625 -1.421875 C 5.0625 -0.929688 4.960938 -0.457031 4.859375 0 Z M 4.109375 0 "/> +</symbol> +<symbol overflow="visible" id="glyph1-8"> +<path style="stroke:none;" d="M 0.859375 -5.40625 C 1.210938 -5.5 1.566406 -5.566406 1.921875 -5.609375 C 2.273438 -5.660156 2.609375 -5.6875 2.921875 -5.6875 C 3.671875 -5.6875 4.234375 -5.492188 4.609375 -5.109375 C 4.992188 -4.722656 5.1875 -4.109375 5.1875 -3.265625 L 5.1875 0 L 4.203125 0 L 4.203125 -3.078125 C 4.203125 -3.441406 4.171875 -3.734375 4.109375 -3.953125 C 4.046875 -4.179688 3.953125 -4.359375 3.828125 -4.484375 C 3.710938 -4.609375 3.570312 -4.691406 3.40625 -4.734375 C 3.25 -4.785156 3.070312 -4.8125 2.875 -4.8125 C 2.71875 -4.8125 2.546875 -4.800781 2.359375 -4.78125 C 2.179688 -4.757812 2.007812 -4.734375 1.84375 -4.703125 L 1.84375 0 L 0.859375 0 Z M 0.859375 -5.40625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-9"> +<path style="stroke:none;" d="M 4.21875 -1.390625 C 4.21875 -1.585938 4.132812 -1.75 3.96875 -1.875 C 3.800781 -2.007812 3.59375 -2.125 3.34375 -2.21875 C 3.101562 -2.3125 2.835938 -2.40625 2.546875 -2.5 C 2.265625 -2.59375 2 -2.707031 1.75 -2.84375 C 1.507812 -2.976562 1.304688 -3.144531 1.140625 -3.34375 C 0.984375 -3.539062 0.90625 -3.800781 0.90625 -4.125 C 0.90625 -4.570312 1.082031 -4.945312 1.4375 -5.25 C 1.800781 -5.550781 2.375 -5.703125 3.15625 -5.703125 C 3.457031 -5.703125 3.769531 -5.675781 4.09375 -5.625 C 4.414062 -5.582031 4.695312 -5.523438 4.9375 -5.453125 L 4.75 -4.578125 C 4.6875 -4.609375 4.597656 -4.640625 4.484375 -4.671875 C 4.367188 -4.710938 4.238281 -4.742188 4.09375 -4.765625 C 3.957031 -4.796875 3.804688 -4.816406 3.640625 -4.828125 C 3.472656 -4.847656 3.316406 -4.859375 3.171875 -4.859375 C 2.304688 -4.859375 1.875 -4.625 1.875 -4.15625 C 
1.875 -3.988281 1.953125 -3.84375 2.109375 -3.71875 C 2.273438 -3.601562 2.484375 -3.5 2.734375 -3.40625 C 2.984375 -3.3125 3.25 -3.210938 3.53125 -3.109375 C 3.820312 -3.015625 4.09375 -2.894531 4.34375 -2.75 C 4.59375 -2.601562 4.796875 -2.425781 4.953125 -2.21875 C 5.117188 -2.019531 5.203125 -1.765625 5.203125 -1.453125 C 5.203125 -0.953125 5.003906 -0.5625 4.609375 -0.28125 C 4.222656 -0.0078125 3.609375 0.125 2.765625 0.125 C 2.378906 0.125 2.023438 0.09375 1.703125 0.03125 C 1.378906 -0.03125 1.078125 -0.125 0.796875 -0.25 L 0.984375 -1.15625 C 1.265625 -1.019531 1.554688 -0.910156 1.859375 -0.828125 C 2.171875 -0.742188 2.503906 -0.703125 2.859375 -0.703125 C 3.765625 -0.703125 4.21875 -0.929688 4.21875 -1.390625 Z M 4.21875 -1.390625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-10"> +<path style="stroke:none;" d="M 0.59375 -2.765625 C 0.59375 -3.273438 0.671875 -3.710938 0.828125 -4.078125 C 0.984375 -4.441406 1.203125 -4.742188 1.484375 -4.984375 C 1.765625 -5.234375 2.09375 -5.414062 2.46875 -5.53125 C 2.84375 -5.644531 3.238281 -5.703125 3.65625 -5.703125 C 3.925781 -5.703125 4.195312 -5.679688 4.46875 -5.640625 C 4.738281 -5.609375 5.023438 -5.546875 5.328125 -5.453125 L 5.09375 -4.59375 C 4.832031 -4.6875 4.59375 -4.75 4.375 -4.78125 C 4.15625 -4.8125 3.929688 -4.828125 3.703125 -4.828125 C 3.421875 -4.828125 3.148438 -4.785156 2.890625 -4.703125 C 2.640625 -4.628906 2.414062 -4.507812 2.21875 -4.34375 C 2.03125 -4.1875 1.878906 -3.976562 1.765625 -3.71875 C 1.660156 -3.457031 1.609375 -3.140625 1.609375 -2.765625 C 1.609375 -2.421875 1.660156 -2.117188 1.765625 -1.859375 C 1.867188 -1.609375 2.015625 -1.398438 2.203125 -1.234375 C 2.390625 -1.078125 2.613281 -0.957031 2.875 -0.875 C 3.144531 -0.789062 3.4375 -0.75 3.75 -0.75 C 4.007812 -0.75 4.253906 -0.765625 4.484375 -0.796875 C 4.722656 -0.828125 4.984375 -0.890625 5.265625 -0.984375 L 5.40625 -0.15625 C 5.125 -0.0507812 4.835938 0.0195312 4.546875 0.0625 C 4.265625 0.101562 3.957031 
0.125 3.625 0.125 C 3.175781 0.125 2.765625 0.0664062 2.390625 -0.046875 C 2.023438 -0.171875 1.707031 -0.351562 1.4375 -0.59375 C 1.164062 -0.832031 0.957031 -1.132812 0.8125 -1.5 C 0.664062 -1.863281 0.59375 -2.285156 0.59375 -2.765625 Z M 0.59375 -2.765625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-11"> +<path style="stroke:none;" d="M 3.0625 -0.703125 C 3.3125 -0.703125 3.53125 -0.707031 3.71875 -0.71875 C 3.914062 -0.738281 4.082031 -0.765625 4.21875 -0.796875 L 4.21875 -2.453125 C 4.082031 -2.492188 3.925781 -2.523438 3.75 -2.546875 C 3.570312 -2.566406 3.382812 -2.578125 3.1875 -2.578125 C 3 -2.578125 2.816406 -2.5625 2.640625 -2.53125 C 2.460938 -2.507812 2.304688 -2.460938 2.171875 -2.390625 C 2.035156 -2.316406 1.921875 -2.222656 1.828125 -2.109375 C 1.742188 -1.992188 1.703125 -1.847656 1.703125 -1.671875 C 1.703125 -1.304688 1.820312 -1.050781 2.0625 -0.90625 C 2.3125 -0.769531 2.644531 -0.703125 3.0625 -0.703125 Z M 2.96875 -5.703125 C 3.382812 -5.703125 3.734375 -5.648438 4.015625 -5.546875 C 4.296875 -5.441406 4.523438 -5.296875 4.703125 -5.109375 C 4.878906 -4.929688 5.003906 -4.707031 5.078125 -4.4375 C 5.148438 -4.175781 5.1875 -3.890625 5.1875 -3.578125 L 5.1875 -0.09375 C 4.957031 -0.0507812 4.648438 -0.00390625 4.265625 0.046875 C 3.890625 0.0976562 3.5 0.125 3.09375 0.125 C 2.789062 0.125 2.492188 0.0976562 2.203125 0.046875 C 1.921875 -0.00390625 1.664062 -0.09375 1.4375 -0.21875 C 1.21875 -0.351562 1.039062 -0.535156 0.90625 -0.765625 C 0.769531 -0.992188 0.703125 -1.289062 0.703125 -1.65625 C 0.703125 -1.976562 0.769531 -2.25 0.90625 -2.46875 C 1.039062 -2.6875 1.21875 -2.863281 1.4375 -3 C 1.664062 -3.132812 1.921875 -3.234375 2.203125 -3.296875 C 2.484375 -3.359375 2.769531 -3.390625 3.0625 -3.390625 C 3.445312 -3.390625 3.832031 -3.34375 4.21875 -3.25 L 4.21875 -3.53125 C 4.21875 -3.695312 4.195312 -3.859375 4.15625 -4.015625 C 4.125 -4.171875 4.054688 -4.3125 3.953125 -4.4375 C 3.847656 -4.5625 3.707031 -4.660156 3.53125 
-4.734375 C 3.363281 -4.816406 3.144531 -4.859375 2.875 -4.859375 C 2.53125 -4.859375 2.226562 -4.832031 1.96875 -4.78125 C 1.71875 -4.738281 1.523438 -4.691406 1.390625 -4.640625 L 1.265625 -5.453125 C 1.398438 -5.523438 1.625 -5.582031 1.9375 -5.625 C 2.257812 -5.675781 2.601562 -5.703125 2.96875 -5.703125 Z M 2.96875 -5.703125 "/> +</symbol> +<symbol overflow="visible" id="glyph1-12"> +<path style="stroke:none;" d="M 4.0625 0.125 C 3.707031 0.125 3.410156 0.078125 3.171875 -0.015625 C 2.941406 -0.109375 2.757812 -0.25 2.625 -0.4375 C 2.488281 -0.632812 2.390625 -0.875 2.328125 -1.15625 C 2.273438 -1.4375 2.25 -1.765625 2.25 -2.140625 L 2.25 -7.421875 L 0.640625 -7.421875 L 0.640625 -8.25 L 3.234375 -8.25 L 3.234375 -2.140625 C 3.234375 -1.867188 3.25 -1.644531 3.28125 -1.46875 C 3.320312 -1.289062 3.378906 -1.144531 3.453125 -1.03125 C 3.535156 -0.925781 3.628906 -0.851562 3.734375 -0.8125 C 3.847656 -0.769531 3.984375 -0.75 4.140625 -0.75 C 4.367188 -0.75 4.582031 -0.773438 4.78125 -0.828125 C 4.988281 -0.890625 5.144531 -0.953125 5.25 -1.015625 L 5.40625 -0.1875 C 5.351562 -0.15625 5.28125 -0.117188 5.1875 -0.078125 C 5.101562 -0.046875 5 -0.015625 4.875 0.015625 C 4.757812 0.046875 4.628906 0.0703125 4.484375 0.09375 C 4.347656 0.113281 4.207031 0.125 4.0625 0.125 Z M 4.0625 0.125 "/> +</symbol> +<symbol overflow="visible" id="glyph1-13"> +<path style="stroke:none;" d="M 2.515625 -6.4375 C 2.304688 -6.4375 2.125 -6.503906 1.96875 -6.640625 C 1.8125 -6.785156 1.734375 -6.984375 1.734375 -7.234375 C 1.734375 -7.484375 1.8125 -7.679688 1.96875 -7.828125 C 2.125 -7.972656 2.304688 -8.046875 2.515625 -8.046875 C 2.722656 -8.046875 2.898438 -7.972656 3.046875 -7.828125 C 3.203125 -7.679688 3.28125 -7.484375 3.28125 -7.234375 C 3.28125 -6.984375 3.203125 -6.785156 3.046875 -6.640625 C 2.898438 -6.503906 2.722656 -6.4375 2.515625 -6.4375 Z M 2.25 -4.734375 L 0.640625 -4.734375 L 0.640625 -5.5625 L 3.234375 -5.5625 L 3.234375 -2.140625 C 3.234375 -1.585938 3.3125 
-1.21875 3.46875 -1.03125 C 3.625 -0.84375 3.851562 -0.75 4.15625 -0.75 C 4.382812 -0.75 4.597656 -0.773438 4.796875 -0.828125 C 4.992188 -0.890625 5.144531 -0.953125 5.25 -1.015625 L 5.40625 -0.1875 C 5.351562 -0.15625 5.28125 -0.117188 5.1875 -0.078125 C 5.101562 -0.046875 5.003906 -0.015625 4.890625 0.015625 C 4.773438 0.046875 4.644531 0.0703125 4.5 0.09375 C 4.363281 0.113281 4.21875 0.125 4.0625 0.125 C 3.71875 0.125 3.425781 0.078125 3.1875 -0.015625 C 2.957031 -0.109375 2.769531 -0.25 2.625 -0.4375 C 2.488281 -0.632812 2.390625 -0.875 2.328125 -1.15625 C 2.273438 -1.4375 2.25 -1.765625 2.25 -2.140625 Z M 2.25 -4.734375 "/> +</symbol> +<symbol overflow="visible" id="glyph1-14"> +<path style="stroke:none;" d="M 4.15625 -0.515625 C 4.039062 -0.453125 3.863281 -0.382812 3.625 -0.3125 C 3.394531 -0.238281 3.128906 -0.203125 2.828125 -0.203125 C 2.503906 -0.203125 2.195312 -0.253906 1.90625 -0.359375 C 1.625 -0.472656 1.378906 -0.640625 1.171875 -0.859375 C 0.960938 -1.078125 0.796875 -1.351562 0.671875 -1.6875 C 0.546875 -2.03125 0.484375 -2.4375 0.484375 -2.90625 C 0.484375 -3.3125 0.539062 -3.679688 0.65625 -4.015625 C 0.769531 -4.359375 0.9375 -4.65625 1.15625 -4.90625 C 1.375 -5.15625 1.644531 -5.347656 1.96875 -5.484375 C 2.289062 -5.628906 2.65625 -5.703125 3.0625 -5.703125 C 3.539062 -5.703125 3.945312 -5.664062 4.28125 -5.59375 C 4.625 -5.53125 4.910156 -5.46875 5.140625 -5.40625 L 5.140625 -0.4375 C 5.140625 0.425781 4.921875 1.050781 4.484375 1.4375 C 4.054688 1.820312 3.398438 2.015625 2.515625 2.015625 C 2.160156 2.015625 1.835938 1.984375 1.546875 1.921875 C 1.253906 1.867188 0.992188 1.804688 0.765625 1.734375 L 0.953125 0.859375 C 1.160156 0.941406 1.394531 1.007812 1.65625 1.0625 C 1.925781 1.125 2.222656 1.15625 2.546875 1.15625 C 3.117188 1.15625 3.53125 1.035156 3.78125 0.796875 C 4.03125 0.566406 4.15625 0.191406 4.15625 -0.328125 Z M 4.15625 -4.6875 C 4.0625 -4.71875 3.925781 -4.75 3.75 -4.78125 C 3.582031 -4.8125 3.359375 -4.828125 3.078125 
-4.828125 C 2.554688 -4.828125 2.160156 -4.648438 1.890625 -4.296875 C 1.628906 -3.941406 1.5 -3.472656 1.5 -2.890625 C 1.5 -2.566406 1.535156 -2.289062 1.609375 -2.0625 C 1.691406 -1.84375 1.796875 -1.65625 1.921875 -1.5 C 2.054688 -1.351562 2.207031 -1.242188 2.375 -1.171875 C 2.539062 -1.109375 2.722656 -1.078125 2.921875 -1.078125 C 3.160156 -1.078125 3.390625 -1.113281 3.609375 -1.1875 C 3.835938 -1.257812 4.019531 -1.34375 4.15625 -1.4375 Z M 4.15625 -4.6875 "/> +</symbol> +<symbol overflow="visible" id="glyph1-15"> +<path style="stroke:none;" d="M 2.8125 -0.703125 C 3.3125 -0.703125 3.691406 -0.796875 3.953125 -0.984375 C 4.222656 -1.171875 4.359375 -1.457031 4.359375 -1.84375 C 4.359375 -2.082031 4.304688 -2.28125 4.203125 -2.4375 C 4.109375 -2.601562 3.984375 -2.75 3.828125 -2.875 C 3.671875 -3 3.488281 -3.109375 3.28125 -3.203125 C 3.082031 -3.296875 2.878906 -3.378906 2.671875 -3.453125 C 2.429688 -3.546875 2.203125 -3.648438 1.984375 -3.765625 C 1.765625 -3.890625 1.566406 -4.03125 1.390625 -4.1875 C 1.222656 -4.351562 1.085938 -4.546875 0.984375 -4.765625 C 0.890625 -4.984375 0.84375 -5.238281 0.84375 -5.53125 C 0.84375 -6.175781 1.046875 -6.679688 1.453125 -7.046875 C 1.859375 -7.410156 2.425781 -7.59375 3.15625 -7.59375 C 3.351562 -7.59375 3.550781 -7.578125 3.75 -7.546875 C 3.945312 -7.523438 4.128906 -7.492188 4.296875 -7.453125 C 4.460938 -7.410156 4.609375 -7.359375 4.734375 -7.296875 C 4.867188 -7.242188 4.976562 -7.191406 5.0625 -7.140625 L 4.75 -6.3125 C 4.59375 -6.40625 4.375 -6.5 4.09375 -6.59375 C 3.8125 -6.695312 3.5 -6.75 3.15625 -6.75 C 2.789062 -6.75 2.476562 -6.65625 2.21875 -6.46875 C 1.957031 -6.289062 1.828125 -6.019531 1.828125 -5.65625 C 1.828125 -5.457031 1.863281 -5.285156 1.9375 -5.140625 C 2.007812 -4.992188 2.113281 -4.863281 2.25 -4.75 C 2.382812 -4.644531 2.535156 -4.546875 2.703125 -4.453125 C 2.878906 -4.367188 3.070312 -4.285156 3.28125 -4.203125 C 3.59375 -4.078125 3.875 -3.945312 4.125 -3.8125 C 4.375 -3.6875 4.585938 
-3.535156 4.765625 -3.359375 C 4.953125 -3.179688 5.09375 -2.972656 5.1875 -2.734375 C 5.289062 -2.492188 5.34375 -2.203125 5.34375 -1.859375 C 5.34375 -1.210938 5.125 -0.710938 4.6875 -0.359375 C 4.25 -0.015625 3.625 0.15625 2.8125 0.15625 C 2.539062 0.15625 2.289062 0.132812 2.0625 0.09375 C 1.832031 0.0625 1.625 0.0195312 1.4375 -0.03125 C 1.257812 -0.09375 1.101562 -0.148438 0.96875 -0.203125 C 0.832031 -0.253906 0.726562 -0.304688 0.65625 -0.359375 L 0.953125 -1.171875 C 1.117188 -1.085938 1.359375 -0.988281 1.671875 -0.875 C 1.984375 -0.757812 2.363281 -0.703125 2.8125 -0.703125 Z M 2.8125 -0.703125 "/> +</symbol> +<symbol overflow="visible" id="glyph1-16"> +<path style="stroke:none;" d="M 3.109375 -5.703125 C 3.859375 -5.703125 4.4375 -5.46875 4.84375 -5 C 5.25 -4.53125 5.453125 -3.820312 5.453125 -2.875 L 5.453125 -2.515625 L 1.46875 -2.515625 C 1.507812 -1.941406 1.703125 -1.503906 2.046875 -1.203125 C 2.390625 -0.898438 2.867188 -0.75 3.484375 -0.75 C 3.835938 -0.75 4.132812 -0.773438 4.375 -0.828125 C 4.625 -0.890625 4.8125 -0.953125 4.9375 -1.015625 L 5.078125 -0.1875 C 4.953125 -0.113281 4.734375 -0.046875 4.421875 0.015625 C 4.109375 0.0859375 3.757812 0.125 3.375 0.125 C 2.894531 0.125 2.472656 0.0507812 2.109375 -0.09375 C 1.742188 -0.238281 1.441406 -0.4375 1.203125 -0.6875 C 0.960938 -0.945312 0.78125 -1.253906 0.65625 -1.609375 C 0.539062 -1.960938 0.484375 -2.347656 0.484375 -2.765625 C 0.484375 -3.265625 0.554688 -3.695312 0.703125 -4.0625 C 0.859375 -4.4375 1.0625 -4.742188 1.3125 -4.984375 C 1.5625 -5.222656 1.835938 -5.398438 2.140625 -5.515625 C 2.453125 -5.640625 2.773438 -5.703125 3.109375 -5.703125 Z M 4.453125 -3.328125 C 4.453125 -3.796875 4.328125 -4.164062 4.078125 -4.4375 C 3.828125 -4.71875 3.5 -4.859375 3.09375 -4.859375 C 2.863281 -4.859375 2.65625 -4.8125 2.46875 -4.71875 C 2.28125 -4.632812 2.117188 -4.519531 1.984375 -4.375 C 1.847656 -4.226562 1.738281 -4.0625 1.65625 -3.875 C 1.570312 -3.695312 1.519531 -3.515625 1.5 
-3.328125 Z M 4.453125 -3.328125 "/> +</symbol> +<symbol overflow="visible" id="glyph1-17"> +<path style="stroke:none;" d="M 4.15625 -4.390625 C 4.039062 -4.492188 3.875 -4.59375 3.65625 -4.6875 C 3.445312 -4.78125 3.222656 -4.828125 2.984375 -4.828125 C 2.722656 -4.828125 2.5 -4.773438 2.3125 -4.671875 C 2.125 -4.566406 1.96875 -4.421875 1.84375 -4.234375 C 1.726562 -4.054688 1.640625 -3.84375 1.578125 -3.59375 C 1.523438 -3.34375 1.5 -3.070312 1.5 -2.78125 C 1.5 -2.132812 1.648438 -1.632812 1.953125 -1.28125 C 2.253906 -0.925781 2.648438 -0.75 3.140625 -0.75 C 3.390625 -0.75 3.597656 -0.757812 3.765625 -0.78125 C 3.941406 -0.8125 4.070312 -0.835938 4.15625 -0.859375 Z M 4.15625 -8.140625 L 5.140625 -8.3125 L 5.140625 -0.15625 C 4.929688 -0.09375 4.65625 -0.03125 4.3125 0.03125 C 3.976562 0.09375 3.585938 0.125 3.140625 0.125 C 2.742188 0.125 2.378906 0.0546875 2.046875 -0.078125 C 1.722656 -0.210938 1.441406 -0.40625 1.203125 -0.65625 C 0.972656 -0.90625 0.796875 -1.207031 0.671875 -1.5625 C 0.546875 -1.925781 0.484375 -2.332031 0.484375 -2.78125 C 0.484375 -3.21875 0.535156 -3.613281 0.640625 -3.96875 C 0.742188 -4.320312 0.898438 -4.625 1.109375 -4.875 C 1.316406 -5.132812 1.566406 -5.335938 1.859375 -5.484375 C 2.148438 -5.628906 2.488281 -5.703125 2.875 -5.703125 C 3.164062 -5.703125 3.421875 -5.664062 3.640625 -5.59375 C 3.867188 -5.519531 4.039062 -5.441406 4.15625 -5.359375 Z M 4.15625 -8.140625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-18"> +<path style="stroke:none;" d="M 1.28125 0 L 1.28125 -5.265625 C 2.101562 -5.546875 2.925781 -5.6875 3.75 -5.6875 C 4.007812 -5.6875 4.253906 -5.675781 4.484375 -5.65625 C 4.722656 -5.632812 4.976562 -5.59375 5.25 -5.53125 L 5.078125 -4.65625 C 4.816406 -4.726562 4.585938 -4.773438 4.390625 -4.796875 C 4.203125 -4.816406 3.988281 -4.828125 3.75 -4.828125 C 3.269531 -4.828125 2.773438 -4.757812 2.265625 -4.625 L 2.265625 0 Z M 1.28125 0 "/> +</symbol> +<symbol overflow="visible" id="glyph1-19"> +<path 
style="stroke:none;" d="M 0.609375 1.046875 C 0.679688 1.085938 0.78125 1.117188 0.90625 1.140625 C 1.039062 1.160156 1.164062 1.171875 1.28125 1.171875 C 1.675781 1.171875 1.984375 1.082031 2.203125 0.90625 C 2.421875 0.738281 2.625 0.460938 2.8125 0.078125 C 2.363281 -0.773438 1.941406 -1.6875 1.546875 -2.65625 C 1.148438 -3.625 0.828125 -4.59375 0.578125 -5.5625 L 1.65625 -5.5625 C 1.738281 -5.25 1.832031 -4.90625 1.9375 -4.53125 C 2.039062 -4.15625 2.160156 -3.769531 2.296875 -3.375 C 2.441406 -2.988281 2.585938 -2.597656 2.734375 -2.203125 C 2.890625 -1.804688 3.054688 -1.425781 3.234375 -1.0625 C 3.367188 -1.4375 3.488281 -1.800781 3.59375 -2.15625 C 3.707031 -2.519531 3.8125 -2.882812 3.90625 -3.25 C 4.007812 -3.613281 4.109375 -3.984375 4.203125 -4.359375 C 4.296875 -4.742188 4.394531 -5.144531 4.5 -5.5625 L 5.53125 -5.5625 C 5.269531 -4.53125 4.984375 -3.523438 4.671875 -2.546875 C 4.359375 -1.566406 4.019531 -0.660156 3.65625 0.171875 C 3.519531 0.484375 3.375 0.753906 3.21875 0.984375 C 3.0625 1.222656 2.890625 1.414062 2.703125 1.5625 C 2.523438 1.71875 2.316406 1.832031 2.078125 1.90625 C 1.847656 1.976562 1.585938 2.015625 1.296875 2.015625 C 1.140625 2.015625 0.972656 1.992188 0.796875 1.953125 C 0.617188 1.910156 0.5 1.875 0.4375 1.84375 Z M 0.609375 1.046875 "/> +</symbol> +<symbol overflow="visible" id="glyph1-20"> +<path style="stroke:none;" d="M 0.34375 -3.71875 C 0.34375 -4.382812 0.40625 -4.960938 0.53125 -5.453125 C 0.664062 -5.941406 0.847656 -6.34375 1.078125 -6.65625 C 1.304688 -6.96875 1.582031 -7.203125 1.90625 -7.359375 C 2.238281 -7.515625 2.601562 -7.59375 3 -7.59375 C 3.394531 -7.59375 3.753906 -7.515625 4.078125 -7.359375 C 4.410156 -7.203125 4.691406 -6.96875 4.921875 -6.65625 C 5.148438 -6.34375 5.328125 -5.941406 5.453125 -5.453125 C 5.585938 -4.960938 5.65625 -4.382812 5.65625 -3.71875 C 5.65625 -3.050781 5.585938 -2.472656 5.453125 -1.984375 C 5.328125 -1.503906 5.148438 -1.101562 4.921875 -0.78125 C 4.691406 -0.457031 4.410156 
-0.21875 4.078125 -0.0625 C 3.753906 0.0820312 3.394531 0.15625 3 0.15625 C 2.601562 0.15625 2.238281 0.0820312 1.90625 -0.0625 C 1.582031 -0.21875 1.304688 -0.457031 1.078125 -0.78125 C 0.847656 -1.101562 0.664062 -1.503906 0.53125 -1.984375 C 0.40625 -2.472656 0.34375 -3.050781 0.34375 -3.71875 Z M 1.359375 -3.71875 C 1.359375 -2.738281 1.488281 -1.988281 1.75 -1.46875 C 2.007812 -0.957031 2.414062 -0.703125 2.96875 -0.703125 C 3.53125 -0.703125 3.953125 -0.957031 4.234375 -1.46875 C 4.515625 -1.988281 4.65625 -2.738281 4.65625 -3.71875 C 4.65625 -4.695312 4.515625 -5.445312 4.234375 -5.96875 C 3.953125 -6.488281 3.53125 -6.75 2.96875 -6.75 C 2.414062 -6.75 2.007812 -6.488281 1.75 -5.96875 C 1.488281 -5.445312 1.359375 -4.695312 1.359375 -3.71875 Z M 1.359375 -3.71875 "/> +</symbol> +<symbol overflow="visible" id="glyph1-21"> +<path style="stroke:none;" d="M 5.140625 -0.15625 C 4.929688 -0.101562 4.644531 -0.046875 4.28125 0.015625 C 3.925781 0.0859375 3.507812 0.125 3.03125 0.125 C 2.613281 0.125 2.265625 0.0625 1.984375 -0.0625 C 1.703125 -0.1875 1.472656 -0.363281 1.296875 -0.59375 C 1.117188 -0.820312 0.992188 -1.09375 0.921875 -1.40625 C 0.847656 -1.71875 0.8125 -2.0625 0.8125 -2.4375 L 0.8125 -5.5625 L 1.796875 -5.5625 L 1.796875 -2.65625 C 1.796875 -1.96875 1.894531 -1.476562 2.09375 -1.1875 C 2.300781 -0.894531 2.644531 -0.75 3.125 -0.75 C 3.226562 -0.75 3.332031 -0.753906 3.4375 -0.765625 C 3.550781 -0.773438 3.65625 -0.785156 3.75 -0.796875 C 3.851562 -0.804688 3.9375 -0.816406 4 -0.828125 C 4.070312 -0.847656 4.125 -0.859375 4.15625 -0.859375 L 4.15625 -5.5625 L 5.140625 -5.5625 Z M 5.140625 -0.15625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-22"> +<path style="stroke:none;" d="M 2.921875 -5.5625 L 5.265625 -5.5625 L 5.265625 -4.734375 L 2.921875 -4.734375 L 2.921875 -2.140625 C 2.921875 -1.867188 2.9375 -1.644531 2.96875 -1.46875 C 3.007812 -1.289062 3.078125 -1.144531 3.171875 -1.03125 C 3.265625 -0.925781 3.382812 -0.851562 3.53125 
-0.8125 C 3.675781 -0.769531 3.851562 -0.75 4.0625 -0.75 C 4.34375 -0.75 4.570312 -0.773438 4.75 -0.828125 C 4.925781 -0.878906 5.09375 -0.941406 5.25 -1.015625 L 5.40625 -0.1875 C 5.289062 -0.132812 5.109375 -0.0703125 4.859375 0 C 4.617188 0.0820312 4.316406 0.125 3.953125 0.125 C 3.546875 0.125 3.207031 0.078125 2.9375 -0.015625 C 2.675781 -0.109375 2.46875 -0.25 2.3125 -0.4375 C 2.164062 -0.632812 2.066406 -0.875 2.015625 -1.15625 C 1.960938 -1.4375 1.9375 -1.765625 1.9375 -2.140625 L 1.9375 -4.734375 L 0.75 -4.734375 L 0.75 -5.5625 L 1.9375 -5.5625 L 1.9375 -7.125 L 2.921875 -7.296875 Z M 2.921875 -5.5625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-23"> +<path style="stroke:none;" d="M 4.5 -2.765625 C 4.5 -3.421875 4.347656 -3.925781 4.046875 -4.28125 C 3.742188 -4.632812 3.347656 -4.8125 2.859375 -4.8125 C 2.585938 -4.8125 2.375 -4.796875 2.21875 -4.765625 C 2.0625 -4.742188 1.9375 -4.71875 1.84375 -4.6875 L 1.84375 -1.171875 C 1.957031 -1.066406 2.117188 -0.96875 2.328125 -0.875 C 2.546875 -0.789062 2.773438 -0.75 3.015625 -0.75 C 3.273438 -0.75 3.5 -0.800781 3.6875 -0.90625 C 3.875 -1.007812 4.023438 -1.148438 4.140625 -1.328125 C 4.265625 -1.515625 4.351562 -1.726562 4.40625 -1.96875 C 4.46875 -2.21875 4.5 -2.484375 4.5 -2.765625 Z M 5.515625 -2.765625 C 5.515625 -2.347656 5.460938 -1.957031 5.359375 -1.59375 C 5.253906 -1.238281 5.097656 -0.929688 4.890625 -0.671875 C 4.679688 -0.421875 4.429688 -0.222656 4.140625 -0.078125 C 3.847656 0.0546875 3.507812 0.125 3.125 0.125 C 2.832031 0.125 2.570312 0.0859375 2.34375 0.015625 C 2.125 -0.046875 1.957031 -0.125 1.84375 -0.21875 L 1.84375 1.984375 L 0.859375 1.984375 L 0.859375 -5.40625 C 1.066406 -5.46875 1.34375 -5.53125 1.6875 -5.59375 C 2.03125 -5.65625 2.421875 -5.6875 2.859375 -5.6875 C 3.253906 -5.6875 3.613281 -5.617188 3.9375 -5.484375 C 4.269531 -5.347656 4.550781 -5.15625 4.78125 -4.90625 C 5.019531 -4.65625 5.203125 -4.347656 5.328125 -3.984375 C 5.453125 -3.617188 5.515625 -3.210938 
5.515625 -2.765625 Z M 5.515625 -2.765625 "/> +</symbol> +<symbol overflow="visible" id="glyph1-24"> +<path style="stroke:none;" d="M 3.015625 -3.734375 L 4.15625 -7.421875 L 5.09375 -7.421875 C 5.25 -6.253906 5.359375 -5.054688 5.421875 -3.828125 C 5.492188 -2.609375 5.554688 -1.332031 5.609375 0 L 4.65625 0 C 4.644531 -0.425781 4.632812 -0.890625 4.625 -1.390625 C 4.625 -1.898438 4.613281 -2.421875 4.59375 -2.953125 C 4.582031 -3.492188 4.570312 -4.039062 4.5625 -4.59375 C 4.550781 -5.144531 4.539062 -5.679688 4.53125 -6.203125 L 3.4375 -2.8125 L 2.578125 -2.8125 L 1.46875 -6.203125 C 1.46875 -5.679688 1.457031 -5.144531 1.4375 -4.59375 C 1.425781 -4.050781 1.414062 -3.507812 1.40625 -2.96875 C 1.394531 -2.425781 1.382812 -1.898438 1.375 -1.390625 C 1.363281 -0.890625 1.351562 -0.425781 1.34375 0 L 0.390625 0 C 0.410156 -0.601562 0.4375 -1.222656 0.46875 -1.859375 C 0.5 -2.503906 0.535156 -3.144531 0.578125 -3.78125 C 0.617188 -4.414062 0.671875 -5.039062 0.734375 -5.65625 C 0.796875 -6.269531 0.863281 -6.859375 0.9375 -7.421875 L 1.84375 -7.421875 Z M 3.015625 -3.734375 "/> +</symbol> +<symbol overflow="visible" id="glyph2-0"> +<path style="stroke:none;" d="M 0.640625 2.296875 L 0.640625 -9.171875 L 7.140625 -9.171875 L 7.140625 2.296875 Z M 1.375 1.578125 L 6.421875 1.578125 L 6.421875 -8.4375 L 1.375 -8.4375 Z M 1.375 1.578125 "/> +</symbol> +<symbol overflow="visible" id="glyph2-1"> +<path style="stroke:none;" d="M 6.34375 -6.84375 L 6.34375 -5.75 C 6.007812 -5.925781 5.675781 -6.0625 5.34375 -6.15625 C 5.007812 -6.25 4.675781 -6.296875 4.34375 -6.296875 C 3.582031 -6.296875 2.992188 -6.050781 2.578125 -5.5625 C 2.160156 -5.082031 1.953125 -4.410156 1.953125 -3.546875 C 1.953125 -2.679688 2.160156 -2.007812 2.578125 -1.53125 C 2.992188 -1.050781 3.582031 -0.8125 4.34375 -0.8125 C 4.675781 -0.8125 5.007812 -0.851562 5.34375 -0.9375 C 5.675781 -1.03125 6.007812 -1.171875 6.34375 -1.359375 L 6.34375 -0.265625 C 6.019531 -0.117188 5.679688 -0.0078125 5.328125 
0.0625 C 4.984375 0.144531 4.613281 0.1875 4.21875 0.1875 C 3.144531 0.1875 2.289062 -0.144531 1.65625 -0.8125 C 1.03125 -1.488281 0.71875 -2.398438 0.71875 -3.546875 C 0.71875 -4.703125 1.035156 -5.613281 1.671875 -6.28125 C 2.304688 -6.945312 3.179688 -7.28125 4.296875 -7.28125 C 4.648438 -7.28125 5 -7.242188 5.34375 -7.171875 C 5.6875 -7.097656 6.019531 -6.988281 6.34375 -6.84375 Z M 6.34375 -6.84375 "/> +</symbol> +<symbol overflow="visible" id="glyph2-2"> +<path style="stroke:none;" d="M 5.34375 -6.015625 C 5.207031 -6.085938 5.0625 -6.140625 4.90625 -6.171875 C 4.757812 -6.210938 4.59375 -6.234375 4.40625 -6.234375 C 3.75 -6.234375 3.242188 -6.019531 2.890625 -5.59375 C 2.535156 -5.164062 2.359375 -4.550781 2.359375 -3.75 L 2.359375 0 L 1.1875 0 L 1.1875 -7.109375 L 2.359375 -7.109375 L 2.359375 -6 C 2.597656 -6.4375 2.914062 -6.757812 3.3125 -6.96875 C 3.707031 -7.175781 4.1875 -7.28125 4.75 -7.28125 C 4.832031 -7.28125 4.921875 -7.273438 5.015625 -7.265625 C 5.109375 -7.253906 5.21875 -7.238281 5.34375 -7.21875 Z M 5.34375 -6.015625 "/> +</symbol> +<symbol overflow="visible" id="glyph2-3"> +<path style="stroke:none;" d="M 3.984375 -6.296875 C 3.359375 -6.296875 2.863281 -6.050781 2.5 -5.5625 C 2.132812 -5.070312 1.953125 -4.398438 1.953125 -3.546875 C 1.953125 -2.691406 2.128906 -2.019531 2.484375 -1.53125 C 2.847656 -1.050781 3.347656 -0.8125 3.984375 -0.8125 C 4.597656 -0.8125 5.085938 -1.054688 5.453125 -1.546875 C 5.816406 -2.035156 6 -2.703125 6 -3.546875 C 6 -4.390625 5.816406 -5.054688 5.453125 -5.546875 C 5.085938 -6.046875 4.597656 -6.296875 3.984375 -6.296875 Z M 3.984375 -7.28125 C 4.992188 -7.28125 5.789062 -6.945312 6.375 -6.28125 C 6.957031 -5.625 7.25 -4.710938 7.25 -3.546875 C 7.25 -2.378906 6.957031 -1.460938 6.375 -0.796875 C 5.789062 -0.140625 4.992188 0.1875 3.984375 0.1875 C 2.960938 0.1875 2.160156 -0.140625 1.578125 -0.796875 C 1.003906 -1.460938 0.71875 -2.378906 0.71875 -3.546875 C 0.71875 -4.710938 1.003906 -5.625 1.578125 -6.28125 
C 2.160156 -6.945312 2.960938 -7.28125 3.984375 -7.28125 Z M 3.984375 -7.28125 "/> +</symbol> +<symbol overflow="visible" id="glyph2-4"> +<path style="stroke:none;" d="M 2.359375 -1.0625 L 2.359375 2.703125 L 1.1875 2.703125 L 1.1875 -7.109375 L 2.359375 -7.109375 L 2.359375 -6.03125 C 2.597656 -6.457031 2.90625 -6.769531 3.28125 -6.96875 C 3.65625 -7.175781 4.101562 -7.28125 4.625 -7.28125 C 5.488281 -7.28125 6.191406 -6.9375 6.734375 -6.25 C 7.273438 -5.5625 7.546875 -4.660156 7.546875 -3.546875 C 7.546875 -2.429688 7.273438 -1.53125 6.734375 -0.84375 C 6.191406 -0.15625 5.488281 0.1875 4.625 0.1875 C 4.101562 0.1875 3.65625 0.0820312 3.28125 -0.125 C 2.90625 -0.332031 2.597656 -0.644531 2.359375 -1.0625 Z M 6.328125 -3.546875 C 6.328125 -4.410156 6.148438 -5.082031 5.796875 -5.5625 C 5.441406 -6.050781 4.957031 -6.296875 4.34375 -6.296875 C 3.726562 -6.296875 3.242188 -6.050781 2.890625 -5.5625 C 2.535156 -5.082031 2.359375 -4.410156 2.359375 -3.546875 C 2.359375 -2.691406 2.535156 -2.019531 2.890625 -1.53125 C 3.242188 -1.039062 3.726562 -0.796875 4.34375 -0.796875 C 4.957031 -0.796875 5.441406 -1.039062 5.796875 -1.53125 C 6.148438 -2.019531 6.328125 -2.691406 6.328125 -3.546875 Z M 6.328125 -3.546875 "/> +</symbol> +<symbol overflow="visible" id="glyph2-5"> +<path style="stroke:none;" d="M 5.75 -6.90625 L 5.75 -5.796875 C 5.425781 -5.960938 5.085938 -6.085938 4.734375 -6.171875 C 4.378906 -6.253906 4.007812 -6.296875 3.625 -6.296875 C 3.039062 -6.296875 2.601562 -6.207031 2.3125 -6.03125 C 2.03125 -5.851562 1.890625 -5.585938 1.890625 -5.234375 C 1.890625 -4.960938 1.988281 -4.75 2.1875 -4.59375 C 2.394531 -4.445312 2.816406 -4.300781 3.453125 -4.15625 L 3.84375 -4.0625 C 4.675781 -3.882812 5.265625 -3.632812 5.609375 -3.3125 C 5.960938 -2.988281 6.140625 -2.539062 6.140625 -1.96875 C 6.140625 -1.300781 5.878906 -0.773438 5.359375 -0.390625 C 4.835938 -0.00390625 4.117188 0.1875 3.203125 0.1875 C 2.816406 0.1875 2.414062 0.148438 2 0.078125 C 1.59375 
0.00390625 1.160156 -0.109375 0.703125 -0.265625 L 0.703125 -1.46875 C 1.140625 -1.238281 1.566406 -1.066406 1.984375 -0.953125 C 2.398438 -0.847656 2.8125 -0.796875 3.21875 -0.796875 C 3.769531 -0.796875 4.191406 -0.890625 4.484375 -1.078125 C 4.785156 -1.265625 4.9375 -1.53125 4.9375 -1.875 C 4.9375 -2.1875 4.828125 -2.425781 4.609375 -2.59375 C 4.398438 -2.769531 3.9375 -2.9375 3.21875 -3.09375 L 2.8125 -3.1875 C 2.082031 -3.34375 1.554688 -3.578125 1.234375 -3.890625 C 0.910156 -4.203125 0.75 -4.632812 0.75 -5.1875 C 0.75 -5.851562 0.984375 -6.367188 1.453125 -6.734375 C 1.929688 -7.097656 2.609375 -7.28125 3.484375 -7.28125 C 3.910156 -7.28125 4.316406 -7.25 4.703125 -7.1875 C 5.085938 -7.125 5.4375 -7.03125 5.75 -6.90625 Z M 5.75 -6.90625 "/> +</symbol> +<symbol overflow="visible" id="glyph2-6"> +<path style="stroke:none;" d="M 4.453125 -3.578125 C 3.515625 -3.578125 2.863281 -3.46875 2.5 -3.25 C 2.132812 -3.03125 1.953125 -2.660156 1.953125 -2.140625 C 1.953125 -1.734375 2.085938 -1.40625 2.359375 -1.15625 C 2.628906 -0.914062 3 -0.796875 3.46875 -0.796875 C 4.113281 -0.796875 4.632812 -1.023438 5.03125 -1.484375 C 5.425781 -1.941406 5.625 -2.550781 5.625 -3.3125 L 5.625 -3.578125 Z M 6.78125 -4.0625 L 6.78125 0 L 5.625 0 L 5.625 -1.078125 C 5.351562 -0.648438 5.019531 -0.332031 4.625 -0.125 C 4.226562 0.0820312 3.738281 0.1875 3.15625 0.1875 C 2.425781 0.1875 1.847656 -0.015625 1.421875 -0.421875 C 0.992188 -0.835938 0.78125 -1.382812 0.78125 -2.0625 C 0.78125 -2.863281 1.046875 -3.46875 1.578125 -3.875 C 2.117188 -4.28125 2.921875 -4.484375 3.984375 -4.484375 L 5.625 -4.484375 L 5.625 -4.609375 C 5.625 -5.140625 5.445312 -5.550781 5.09375 -5.84375 C 4.738281 -6.144531 4.238281 -6.296875 3.59375 -6.296875 C 3.1875 -6.296875 2.789062 -6.242188 2.40625 -6.140625 C 2.019531 -6.046875 1.648438 -5.898438 1.296875 -5.703125 L 1.296875 -6.78125 C 1.722656 -6.945312 2.140625 -7.070312 2.546875 -7.15625 C 2.953125 -7.238281 3.34375 -7.28125 3.71875 -7.28125 C 4.75 
-7.28125 5.515625 -7.015625 6.015625 -6.484375 C 6.523438 -5.953125 6.78125 -5.144531 6.78125 -4.0625 Z M 6.78125 -4.0625 "/> +</symbol> +<symbol overflow="visible" id="glyph2-7"> +<path style="stroke:none;" d="M 1.21875 -9.875 L 2.390625 -9.875 L 2.390625 0 L 1.21875 0 Z M 1.21875 -9.875 "/> +</symbol> +<symbol overflow="visible" id="glyph2-8"> +<path style="stroke:none;" d="M 7.3125 -3.84375 L 7.3125 -3.28125 L 1.9375 -3.28125 C 1.988281 -2.46875 2.226562 -1.851562 2.65625 -1.4375 C 3.09375 -1.019531 3.695312 -0.8125 4.46875 -0.8125 C 4.914062 -0.8125 5.347656 -0.863281 5.765625 -0.96875 C 6.191406 -1.082031 6.613281 -1.25 7.03125 -1.46875 L 7.03125 -0.359375 C 6.613281 -0.179688 6.179688 -0.046875 5.734375 0.046875 C 5.296875 0.140625 4.851562 0.1875 4.40625 0.1875 C 3.269531 0.1875 2.367188 -0.140625 1.703125 -0.796875 C 1.046875 -1.460938 0.71875 -2.359375 0.71875 -3.484375 C 0.71875 -4.648438 1.03125 -5.570312 1.65625 -6.25 C 2.289062 -6.9375 3.140625 -7.28125 4.203125 -7.28125 C 5.160156 -7.28125 5.914062 -6.972656 6.46875 -6.359375 C 7.03125 -5.742188 7.3125 -4.90625 7.3125 -3.84375 Z M 6.140625 -4.1875 C 6.128906 -4.820312 5.945312 -5.332031 5.59375 -5.71875 C 5.25 -6.101562 4.789062 -6.296875 4.21875 -6.296875 C 3.5625 -6.296875 3.035156 -6.109375 2.640625 -5.734375 C 2.253906 -5.367188 2.03125 -4.851562 1.96875 -4.1875 Z M 6.140625 -4.1875 "/> +</symbol> +<symbol overflow="visible" id="glyph2-9"> +<path style="stroke:none;" d="M 6.328125 -3.546875 C 6.328125 -4.410156 6.148438 -5.082031 5.796875 -5.5625 C 5.441406 -6.050781 4.957031 -6.296875 4.34375 -6.296875 C 3.726562 -6.296875 3.242188 -6.050781 2.890625 -5.5625 C 2.535156 -5.082031 2.359375 -4.410156 2.359375 -3.546875 C 2.359375 -2.691406 2.535156 -2.019531 2.890625 -1.53125 C 3.242188 -1.039062 3.726562 -0.796875 4.34375 -0.796875 C 4.957031 -0.796875 5.441406 -1.039062 5.796875 -1.53125 C 6.148438 -2.019531 6.328125 -2.691406 6.328125 -3.546875 Z M 2.359375 -6.03125 C 2.597656 -6.457031 2.90625 
-6.769531 3.28125 -6.96875 C 3.65625 -7.175781 4.101562 -7.28125 4.625 -7.28125 C 5.488281 -7.28125 6.191406 -6.9375 6.734375 -6.25 C 7.273438 -5.5625 7.546875 -4.660156 7.546875 -3.546875 C 7.546875 -2.429688 7.273438 -1.53125 6.734375 -0.84375 C 6.191406 -0.15625 5.488281 0.1875 4.625 0.1875 C 4.101562 0.1875 3.65625 0.0820312 3.28125 -0.125 C 2.90625 -0.332031 2.597656 -0.644531 2.359375 -1.0625 L 2.359375 0 L 1.1875 0 L 1.1875 -9.875 L 2.359375 -9.875 Z M 2.359375 -6.03125 "/> +</symbol> +<symbol overflow="visible" id="glyph2-10"> +<path style="stroke:none;" d="M 1.21875 -7.109375 L 2.390625 -7.109375 L 2.390625 0 L 1.21875 0 Z M 1.21875 -9.875 L 2.390625 -9.875 L 2.390625 -8.390625 L 1.21875 -8.390625 Z M 1.21875 -9.875 "/> +</symbol> +<symbol overflow="visible" id="glyph2-11"> +<path style="stroke:none;" d="M 7.140625 -4.296875 L 7.140625 0 L 5.96875 0 L 5.96875 -4.25 C 5.96875 -4.925781 5.835938 -5.429688 5.578125 -5.765625 C 5.316406 -6.097656 4.921875 -6.265625 4.390625 -6.265625 C 3.765625 -6.265625 3.269531 -6.0625 2.90625 -5.65625 C 2.539062 -5.257812 2.359375 -4.710938 2.359375 -4.015625 L 2.359375 0 L 1.1875 0 L 1.1875 -7.109375 L 2.359375 -7.109375 L 2.359375 -6 C 2.640625 -6.425781 2.96875 -6.742188 3.34375 -6.953125 C 3.71875 -7.171875 4.15625 -7.28125 4.65625 -7.28125 C 5.46875 -7.28125 6.082031 -7.023438 6.5 -6.515625 C 6.925781 -6.015625 7.140625 -5.273438 7.140625 -4.296875 Z M 7.140625 -4.296875 "/> +</symbol> +<symbol overflow="visible" id="glyph2-12"> +<path style="stroke:none;" d="M 5.90625 -3.640625 C 5.90625 -4.484375 5.726562 -5.132812 5.375 -5.59375 C 5.03125 -6.0625 4.539062 -6.296875 3.90625 -6.296875 C 3.28125 -6.296875 2.789062 -6.0625 2.4375 -5.59375 C 2.09375 -5.132812 1.921875 -4.484375 1.921875 -3.640625 C 1.921875 -2.796875 2.09375 -2.140625 2.4375 -1.671875 C 2.789062 -1.210938 3.28125 -0.984375 3.90625 -0.984375 C 4.539062 -0.984375 5.03125 -1.210938 5.375 -1.671875 C 5.726562 -2.140625 5.90625 -2.796875 5.90625 -3.640625 Z M 
7.078125 -0.875 C 7.078125 0.332031 6.804688 1.226562 6.265625 1.8125 C 5.722656 2.40625 4.898438 2.703125 3.796875 2.703125 C 3.390625 2.703125 3.003906 2.671875 2.640625 2.609375 C 2.273438 2.546875 1.921875 2.453125 1.578125 2.328125 L 1.578125 1.1875 C 1.921875 1.375 2.257812 1.507812 2.59375 1.59375 C 2.925781 1.6875 3.265625 1.734375 3.609375 1.734375 C 4.378906 1.734375 4.953125 1.535156 5.328125 1.140625 C 5.710938 0.742188 5.90625 0.140625 5.90625 -0.671875 L 5.90625 -1.25 C 5.664062 -0.832031 5.351562 -0.519531 4.96875 -0.3125 C 4.59375 -0.101562 4.144531 0 3.625 0 C 2.75 0 2.046875 -0.332031 1.515625 -1 C 0.984375 -1.664062 0.71875 -2.546875 0.71875 -3.640625 C 0.71875 -4.734375 0.984375 -5.613281 1.515625 -6.28125 C 2.046875 -6.945312 2.75 -7.28125 3.625 -7.28125 C 4.144531 -7.28125 4.59375 -7.175781 4.96875 -6.96875 C 5.351562 -6.757812 5.664062 -6.445312 5.90625 -6.03125 L 5.90625 -7.109375 L 7.078125 -7.109375 Z M 7.078125 -0.875 "/> +</symbol> +<symbol overflow="visible" id="glyph2-13"> +<path style="stroke:none;" d="M 5.125 -8.609375 C 4.1875 -8.609375 3.441406 -8.257812 2.890625 -7.5625 C 2.347656 -6.875 2.078125 -5.929688 2.078125 -4.734375 C 2.078125 -3.535156 2.347656 -2.585938 2.890625 -1.890625 C 3.441406 -1.203125 4.1875 -0.859375 5.125 -0.859375 C 6.050781 -0.859375 6.785156 -1.203125 7.328125 -1.890625 C 7.878906 -2.585938 8.15625 -3.535156 8.15625 -4.734375 C 8.15625 -5.929688 7.878906 -6.875 7.328125 -7.5625 C 6.785156 -8.257812 6.050781 -8.609375 5.125 -8.609375 Z M 5.125 -9.65625 C 6.445312 -9.65625 7.503906 -9.207031 8.296875 -8.3125 C 9.097656 -7.414062 9.5 -6.222656 9.5 -4.734375 C 9.5 -3.234375 9.097656 -2.035156 8.296875 -1.140625 C 7.503906 -0.253906 6.445312 0.1875 5.125 0.1875 C 3.789062 0.1875 2.722656 -0.253906 1.921875 -1.140625 C 1.128906 -2.035156 0.734375 -3.234375 0.734375 -4.734375 C 0.734375 -6.222656 1.128906 -7.414062 1.921875 -8.3125 C 2.722656 -9.207031 3.789062 -9.65625 5.125 -9.65625 Z M 5.125 -9.65625 "/> 
+</symbol> +<symbol overflow="visible" id="glyph2-14"> +<path style="stroke:none;" d="M 1.109375 -2.8125 L 1.109375 -7.109375 L 2.265625 -7.109375 L 2.265625 -2.84375 C 2.265625 -2.175781 2.394531 -1.671875 2.65625 -1.328125 C 2.925781 -0.992188 3.320312 -0.828125 3.84375 -0.828125 C 4.476562 -0.828125 4.976562 -1.023438 5.34375 -1.421875 C 5.707031 -1.828125 5.890625 -2.378906 5.890625 -3.078125 L 5.890625 -7.109375 L 7.0625 -7.109375 L 7.0625 0 L 5.890625 0 L 5.890625 -1.09375 C 5.609375 -0.65625 5.28125 -0.332031 4.90625 -0.125 C 4.53125 0.0820312 4.09375 0.1875 3.59375 0.1875 C 2.78125 0.1875 2.160156 -0.0664062 1.734375 -0.578125 C 1.316406 -1.085938 1.109375 -1.832031 1.109375 -2.8125 Z M 4.046875 -7.28125 Z M 4.046875 -7.28125 "/> +</symbol> +<symbol overflow="visible" id="glyph2-15"> +<path style="stroke:none;" d="M 2.375 -9.125 L 2.375 -7.109375 L 4.78125 -7.109375 L 4.78125 -6.203125 L 2.375 -6.203125 L 2.375 -2.34375 C 2.375 -1.757812 2.453125 -1.382812 2.609375 -1.21875 C 2.773438 -1.0625 3.101562 -0.984375 3.59375 -0.984375 L 4.78125 -0.984375 L 4.78125 0 L 3.59375 0 C 2.6875 0 2.0625 -0.164062 1.71875 -0.5 C 1.375 -0.84375 1.203125 -1.457031 1.203125 -2.34375 L 1.203125 -6.203125 L 0.34375 -6.203125 L 0.34375 -7.109375 L 1.203125 -7.109375 L 1.203125 -9.125 Z M 2.375 -9.125 "/> +</symbol> +<symbol overflow="visible" id="glyph2-16"> +<path style="stroke:none;" d=""/> +</symbol> +<symbol overflow="visible" id="glyph2-17"> +<path style="stroke:none;" d="M 1.28125 -9.484375 L 6.71875 -9.484375 L 6.71875 -8.390625 L 2.5625 -8.390625 L 2.5625 -5.609375 L 6.3125 -5.609375 L 6.3125 -4.53125 L 2.5625 -4.53125 L 2.5625 0 L 1.28125 0 Z M 1.28125 -9.484375 "/> +</symbol> +<symbol overflow="visible" id="glyph2-18"> +<path style="stroke:none;" d="M 6.765625 -5.75 C 7.054688 -6.269531 7.40625 -6.65625 7.8125 -6.90625 C 8.21875 -7.15625 8.695312 -7.28125 9.25 -7.28125 C 9.988281 -7.28125 10.554688 -7.019531 10.953125 -6.5 C 11.359375 -5.976562 11.5625 -5.242188 
11.5625 -4.296875 L 11.5625 0 L 10.390625 0 L 10.390625 -4.25 C 10.390625 -4.9375 10.265625 -5.441406 10.015625 -5.765625 C 9.773438 -6.097656 9.410156 -6.265625 8.921875 -6.265625 C 8.316406 -6.265625 7.835938 -6.0625 7.484375 -5.65625 C 7.128906 -5.257812 6.953125 -4.710938 6.953125 -4.015625 L 6.953125 0 L 5.78125 0 L 5.78125 -4.25 C 5.78125 -4.9375 5.660156 -5.441406 5.421875 -5.765625 C 5.179688 -6.097656 4.804688 -6.265625 4.296875 -6.265625 C 3.703125 -6.265625 3.226562 -6.0625 2.875 -5.65625 C 2.53125 -5.25 2.359375 -4.703125 2.359375 -4.015625 L 2.359375 0 L 1.1875 0 L 1.1875 -7.109375 L 2.359375 -7.109375 L 2.359375 -6 C 2.617188 -6.4375 2.9375 -6.757812 3.3125 -6.96875 C 3.6875 -7.175781 4.128906 -7.28125 4.640625 -7.28125 C 5.160156 -7.28125 5.597656 -7.148438 5.953125 -6.890625 C 6.316406 -6.628906 6.585938 -6.25 6.765625 -5.75 Z M 6.765625 -5.75 "/> +</symbol> +<symbol overflow="visible" id="glyph2-19"> +<path style="stroke:none;" d="M 6.953125 -9.171875 L 6.953125 -7.921875 C 6.472656 -8.148438 6.015625 -8.320312 5.578125 -8.4375 C 5.148438 -8.550781 4.734375 -8.609375 4.328125 -8.609375 C 3.628906 -8.609375 3.085938 -8.472656 2.703125 -8.203125 C 2.328125 -7.929688 2.140625 -7.546875 2.140625 -7.046875 C 2.140625 -6.628906 2.265625 -6.3125 2.515625 -6.09375 C 2.773438 -5.882812 3.253906 -5.710938 3.953125 -5.578125 L 4.734375 -5.421875 C 5.679688 -5.234375 6.382812 -4.910156 6.84375 -4.453125 C 7.300781 -3.992188 7.53125 -3.378906 7.53125 -2.609375 C 7.53125 -1.691406 7.222656 -0.992188 6.609375 -0.515625 C 5.992188 -0.046875 5.085938 0.1875 3.890625 0.1875 C 3.441406 0.1875 2.960938 0.132812 2.453125 0.03125 C 1.953125 -0.0703125 1.429688 -0.222656 0.890625 -0.421875 L 0.890625 -1.734375 C 1.410156 -1.441406 1.921875 -1.222656 2.421875 -1.078125 C 2.921875 -0.929688 3.410156 -0.859375 3.890625 -0.859375 C 4.628906 -0.859375 5.195312 -1 5.59375 -1.28125 C 5.988281 -1.570312 6.1875 -1.984375 6.1875 -2.515625 C 6.1875 -2.984375 6.039062 -3.347656 5.75 
-3.609375 C 5.46875 -3.867188 5.003906 -4.066406 4.359375 -4.203125 L 3.578125 -4.359375 C 2.617188 -4.546875 1.925781 -4.84375 1.5 -5.25 C 1.070312 -5.65625 0.859375 -6.21875 0.859375 -6.9375 C 0.859375 -7.78125 1.148438 -8.441406 1.734375 -8.921875 C 2.328125 -9.410156 3.144531 -9.65625 4.1875 -9.65625 C 4.625 -9.65625 5.070312 -9.613281 5.53125 -9.53125 C 6 -9.445312 6.472656 -9.328125 6.953125 -9.171875 Z M 6.953125 -9.171875 "/> +</symbol> +<symbol overflow="visible" id="glyph2-20"> +<path style="stroke:none;" d="M 4.1875 0.65625 C 3.851562 1.507812 3.53125 2.0625 3.21875 2.3125 C 2.90625 2.570312 2.488281 2.703125 1.96875 2.703125 L 1.03125 2.703125 L 1.03125 1.734375 L 1.71875 1.734375 C 2.039062 1.734375 2.289062 1.65625 2.46875 1.5 C 2.644531 1.34375 2.835938 0.984375 3.046875 0.421875 L 3.265625 -0.109375 L 0.390625 -7.109375 L 1.625 -7.109375 L 3.84375 -1.546875 L 6.0625 -7.109375 L 7.3125 -7.109375 Z M 4.1875 0.65625 "/> +</symbol> +</g> +</defs> +<g id="surface268880"> +<rect x="0" y="0" width="774" height="152" style="fill:rgb(100%,100%,100%);fill-opacity:1;stroke:none;"/> +<path style="fill-rule:evenodd;fill:rgb(100%,100%,100%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 21.75297 10.408118 L 26.433829 10.408118 L 26.433829 12.281165 L 21.75297 12.281165 Z M 21.75297 10.408118 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(100%,100%,100%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 29.079728 10.51222 L 32.829728 10.51222 L 32.829728 12.149915 L 29.079728 12.149915 Z M 29.079728 10.51222 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph0-1" x="20.171875" y="57.705621"/> + <use xlink:href="#glyph0-2" x="30.171875" y="57.705621"/> + 
<use xlink:href="#glyph0-3" x="40.171875" y="57.705621"/> + <use xlink:href="#glyph0-4" x="50.171875" y="57.705621"/> + <use xlink:href="#glyph0-5" x="60.171875" y="57.705621"/> + <use xlink:href="#glyph0-6" x="70.171875" y="57.705621"/> +</g> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph0-7" x="174.203125" y="60.053277"/> + <use xlink:href="#glyph0-8" x="184.203125" y="60.053277"/> +</g> +<path style="fill-rule:evenodd;fill:rgb(100%,100%,100%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 40.925236 10.544446 L 44.675236 10.544446 L 44.675236 12.090345 L 40.925236 12.090345 Z M 40.925236 10.544446 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(100%,100%,100%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 34.883439 10.536634 L 38.633439 10.536634 L 38.633439 12.120032 L 34.883439 12.120032 Z M 34.883439 10.536634 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(100%,100%,100%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 47.084806 10.484876 L 52.045743 10.484876 L 52.045743 12.130774 L 47.084806 12.130774 Z M 47.084806 10.484876 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(100%,100%,100%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 53.980118 10.376868 L 59.866642 10.376868 L 59.866642 12.279603 L 53.980118 12.279603 Z M 53.980118 10.376868 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path 
style="fill-rule:evenodd;fill:rgb(100%,100%,100%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 54.048478 13.825501 L 59.868009 13.825501 L 59.868009 15.490345 L 54.048478 15.490345 Z M 54.048478 13.825501 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill:none;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 26.481876 11.338001 L 28.593009 11.332337 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(0%,0%,0%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 28.968009 11.33136 L 28.468595 11.582728 L 28.593009 11.332337 L 28.467228 11.082728 Z M 28.968009 11.33136 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill:none;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 32.876798 11.329798 L 34.396525 11.328626 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(0%,0%,0%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 34.771525 11.328431 L 34.27172 11.578821 L 34.396525 11.328626 L 34.271329 11.078821 Z M 34.771525 11.328431 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill:none;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 38.633439 11.328431 L 40.438517 11.319642 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(0%,0%,0%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" 
d="M 40.813517 11.317884 L 40.314689 11.570228 L 40.438517 11.319642 L 40.312345 11.070423 Z M 40.813517 11.317884 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill:none;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 44.675236 11.317298 L 46.597892 11.309876 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(0%,0%,0%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 46.972892 11.308313 L 46.473868 11.560267 L 46.597892 11.309876 L 46.471915 11.060267 Z M 46.972892 11.308313 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill:none;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 52.045743 11.307923 L 53.4934 11.323157 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(0%,0%,0%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 53.8684 11.327063 L 53.365861 11.57179 L 53.4934 11.323157 L 53.370939 11.07179 Z M 53.8684 11.327063 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph0-9" x="286.757813" y="60.611871"/> + <use xlink:href="#glyph0-10" x="296.757813" y="60.611871"/> + <use xlink:href="#glyph0-1" x="306.757813" y="60.611871"/> +</g> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph0-11" x="405.660156" y="59.904839"/> + <use xlink:href="#glyph0-10" x="415.660156" y="59.904839"/> + <use xlink:href="#glyph0-12" x="425.660156" y="59.904839"/> +</g> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph1-1" x="511.308594" y="58.064616"/> + <use xlink:href="#glyph1-2" 
x="517.308757" y="58.064616"/> + <use xlink:href="#glyph1-3" x="523.308919" y="58.064616"/> + <use xlink:href="#glyph1-4" x="529.309082" y="58.064616"/> + <use xlink:href="#glyph1-5" x="535.309245" y="58.064616"/> + <use xlink:href="#glyph1-6" x="541.309408" y="58.064616"/> + <use xlink:href="#glyph1-7" x="547.30957" y="58.064616"/> + <use xlink:href="#glyph1-8" x="553.309733" y="58.064616"/> + <use xlink:href="#glyph1-9" x="559.309896" y="58.064616"/> + <use xlink:href="#glyph1-10" x="565.310059" y="58.064616"/> + <use xlink:href="#glyph1-11" x="571.310221" y="58.064616"/> + <use xlink:href="#glyph1-12" x="577.310384" y="58.064616"/> + <use xlink:href="#glyph1-13" x="583.310547" y="58.064616"/> + <use xlink:href="#glyph1-8" x="589.31071" y="58.064616"/> + <use xlink:href="#glyph1-14" x="595.310872" y="58.064616"/> +</g> +<path style="fill:none;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 45.671915 11.342298 L 45.655704 11.342298 L 45.655704 14.657923 L 53.561759 14.657923 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<path style="fill-rule:evenodd;fill:rgb(0%,0%,0%);fill-opacity:1;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-miterlimit:10;" d="M 53.936759 14.657923 L 53.436759 14.907923 L 53.561759 14.657923 L 53.436759 14.407923 Z M 53.936759 14.657923 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph1-15" x="657.078125" y="57.724772"/> + <use xlink:href="#glyph1-16" x="663.078288" y="57.724772"/> + <use xlink:href="#glyph1-10" x="669.078451" y="57.724772"/> + <use xlink:href="#glyph1-6" x="675.078613" y="57.724772"/> + <use xlink:href="#glyph1-8" x="681.078776" y="57.724772"/> + <use xlink:href="#glyph1-17" x="687.078939" y="57.724772"/> + <use xlink:href="#glyph1-11" x="693.079102" y="57.724772"/> + <use xlink:href="#glyph1-18" 
x="699.079264" y="57.724772"/> + <use xlink:href="#glyph1-19" x="705.079427" y="57.724772"/> + <use xlink:href="#glyph1-4" x="711.07959" y="57.724772"/> + <use xlink:href="#glyph1-20" x="717.079753" y="57.724772"/> + <use xlink:href="#glyph1-21" x="723.079915" y="57.724772"/> + <use xlink:href="#glyph1-22" x="729.080078" y="57.724772"/> + <use xlink:href="#glyph1-23" x="735.080241" y="57.724772"/> + <use xlink:href="#glyph1-21" x="741.080404" y="57.724772"/> + <use xlink:href="#glyph1-22" x="747.080566" y="57.724772"/> +</g> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph1-24" x="673.335938" y="124.170085"/> + <use xlink:href="#glyph1-11" x="679.3361" y="124.170085"/> + <use xlink:href="#glyph1-13" x="685.336263" y="124.170085"/> + <use xlink:href="#glyph1-8" x="691.336426" y="124.170085"/> + <use xlink:href="#glyph1-4" x="697.336589" y="124.170085"/> + <use xlink:href="#glyph1-20" x="703.336751" y="124.170085"/> + <use xlink:href="#glyph1-21" x="709.336914" y="124.170085"/> + <use xlink:href="#glyph1-22" x="715.337077" y="124.170085"/> + <use xlink:href="#glyph1-23" x="721.33724" y="124.170085"/> + <use xlink:href="#glyph1-21" x="727.337402" y="124.170085"/> + <use xlink:href="#glyph1-22" x="733.337565" y="124.170085"/> +</g> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph2-1" x="168.71875" y="31.959093"/> + <use xlink:href="#glyph2-2" x="175.866102" y="31.959093"/> + <use xlink:href="#glyph2-3" x="180.92551" y="31.959093"/> + <use xlink:href="#glyph2-4" x="188.879069" y="31.959093"/> +</g> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph2-5" x="288.109375" y="31.681749"/> + <use xlink:href="#glyph2-1" x="294.882378" y="31.681749"/> + <use xlink:href="#glyph2-6" x="302.029731" y="31.681749"/> + <use xlink:href="#glyph2-7" x="309.996039" y="31.681749"/> + <use xlink:href="#glyph2-8" x="313.607964" y="31.681749"/> +</g> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use 
xlink:href="#glyph2-5" x="535.988281" y="33.365343"/> + <use xlink:href="#glyph2-1" x="542.761285" y="33.365343"/> + <use xlink:href="#glyph2-6" x="549.908637" y="33.365343"/> + <use xlink:href="#glyph2-7" x="557.874946" y="33.365343"/> + <use xlink:href="#glyph2-8" x="561.486871" y="33.365343"/> +</g> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph2-9" x="26.695313" y="32.365343"/> + <use xlink:href="#glyph2-10" x="34.947266" y="32.365343"/> + <use xlink:href="#glyph2-11" x="38.559191" y="32.365343"/> + <use xlink:href="#glyph2-11" x="46.798394" y="32.365343"/> + <use xlink:href="#glyph2-10" x="55.037598" y="32.365343"/> + <use xlink:href="#glyph2-11" x="58.649523" y="32.365343"/> + <use xlink:href="#glyph2-12" x="66.888726" y="32.365343"/> +</g> +<path style="fill:none;stroke-width:0.1;stroke-linecap:butt;stroke-linejoin:miter;stroke:rgb(0%,0%,0%);stroke-opacity:1;stroke-dasharray:0.14,0.14;stroke-miterlimit:10;" d="M 45.300431 9.486438 L 60.373478 9.486438 L 60.373478 16.175696 L 45.300431 16.175696 Z M 45.300431 9.486438 " transform="matrix(20,0,0,20,-434.059401,-172.47877)"/> +<g style="fill:rgb(0%,0%,0%);fill-opacity:1;"> + <use xlink:href="#glyph2-13" x="532.003906" y="11.904405"/> + <use xlink:href="#glyph2-14" x="542.236382" y="11.904405"/> + <use xlink:href="#glyph2-15" x="550.475586" y="11.904405"/> + <use xlink:href="#glyph2-4" x="555.5727" y="11.904405"/> + <use xlink:href="#glyph2-14" x="563.824653" y="11.904405"/> + <use xlink:href="#glyph2-15" x="572.063856" y="11.904405"/> + <use xlink:href="#glyph2-16" x="577.16097" y="11.904405"/> + <use xlink:href="#glyph2-17" x="581.293186" y="11.904405"/> + <use xlink:href="#glyph2-3" x="588.307346" y="11.904405"/> + <use xlink:href="#glyph2-2" x="596.260905" y="11.904405"/> + <use xlink:href="#glyph2-18" x="601.377279" y="11.904405"/> + <use xlink:href="#glyph2-6" x="614.040853" y="11.904405"/> + <use xlink:href="#glyph2-15" x="622.007161" y="11.904405"/> + <use 
xlink:href="#glyph2-15" x="627.104275" y="11.904405"/> + <use xlink:href="#glyph2-8" x="632.201389" y="11.904405"/> + <use xlink:href="#glyph2-2" x="640.199436" y="11.904405"/> + <use xlink:href="#glyph2-16" x="645.544217" y="11.904405"/> + <use xlink:href="#glyph2-19" x="649.676432" y="11.904405"/> + <use xlink:href="#glyph2-20" x="657.928385" y="11.904405"/> + <use xlink:href="#glyph2-5" x="665.621799" y="11.904405"/> + <use xlink:href="#glyph2-15" x="672.394803" y="11.904405"/> + <use xlink:href="#glyph2-8" x="677.491916" y="11.904405"/> + <use xlink:href="#glyph2-18" x="685.489963" y="11.904405"/> +</g> +</g> +</svg> diff --git a/Documentation/media/v4l-drivers/vimc.rst b/Documentation/media/v4l-drivers/vimc.rst index 406417680db5..8f5d7f8d83bb 100644 --- a/Documentation/media/v4l-drivers/vimc.rst +++ b/Documentation/media/v4l-drivers/vimc.rst @@ -76,27 +76,19 @@ vimc-capture: * 1 Pad sink * 1 Pad source -Module options ---------------- -Vimc has a few module parameters to configure the driver. You should pass -those arguments to each subdevice, not to the vimc module. For example:: +Module options +-------------- - vimc_subdevice.param=value +Vimc has a module parameter to configure the driver. -* ``vimc_scaler.sca_mult=<unsigned int>`` +* ``sca_mult=<unsigned int>`` Image size multiplier factor to be used to multiply both width and height, so the image size will be ``sca_mult^2`` bigger than the original one. Currently, only supports scaling up (the default value is 3). -* ``vimc_debayer.deb_mean_win_size=<unsigned int>`` - - Window size to calculate the mean. Note: the window size needs to be an - odd number, as the main pixel stays in the center of the window, - otherwise the next odd number is considered (the default value is 3). 
- Source code documentation ------------------------- diff --git a/Documentation/media/videodev2.h.rst.exceptions b/Documentation/media/videodev2.h.rst.exceptions index adeb6b7a15cb..cb6ccf91776e 100644 --- a/Documentation/media/videodev2.h.rst.exceptions +++ b/Documentation/media/videodev2.h.rst.exceptions @@ -141,6 +141,10 @@ replace symbol V4L2_CTRL_TYPE_H264_PPS :c:type:`v4l2_ctrl_type` replace symbol V4L2_CTRL_TYPE_H264_SCALING_MATRIX :c:type:`v4l2_ctrl_type` replace symbol V4L2_CTRL_TYPE_H264_SLICE_PARAMS :c:type:`v4l2_ctrl_type` replace symbol V4L2_CTRL_TYPE_H264_DECODE_PARAMS :c:type:`v4l2_ctrl_type` +replace symbol V4L2_CTRL_TYPE_HEVC_SPS :c:type:`v4l2_ctrl_type` +replace symbol V4L2_CTRL_TYPE_HEVC_PPS :c:type:`v4l2_ctrl_type` +replace symbol V4L2_CTRL_TYPE_HEVC_SLICE_PARAMS :c:type:`v4l2_ctrl_type` +replace symbol V4L2_CTRL_TYPE_AREA :c:type:`v4l2_ctrl_type` # V4L2 capability defines replace define V4L2_CAP_VIDEO_CAPTURE device-capabilities @@ -434,6 +438,7 @@ replace define V4L2_DEC_CMD_START decoder-cmds replace define V4L2_DEC_CMD_STOP decoder-cmds replace define V4L2_DEC_CMD_PAUSE decoder-cmds replace define V4L2_DEC_CMD_RESUME decoder-cmds +replace define V4L2_DEC_CMD_FLUSH decoder-cmds replace define V4L2_DEC_CMD_START_MUTE_AUDIO decoder-cmds replace define V4L2_DEC_CMD_PAUSE_TO_BLACK decoder-cmds diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 83f7ae5fc045..5bc55a4e3bce 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -40,13 +40,13 @@ allocates memory for this UMEM using whatever means it feels is most appropriate (malloc, mmap, huge pages, etc). This memory area is then registered with the kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two rings: the FILL ring and the COMPLETION ring. 
The -fill ring is used by the application to send down addr for the kernel +FILL ring is used by the application to send down addr for the kernel to fill in with RX packet data. References to these frames will then appear in the RX ring once each packet has been received. The -completion ring, on the other hand, contains frame addr that the +COMPLETION ring, on the other hand, contains frame addr that the kernel has transmitted completely and can now be used again by user space, for either TX or RX. Thus, the frame addrs appearing in the -completion ring are addrs that were previously transmitted using the +COMPLETION ring are addrs that were previously transmitted using the TX ring. In summary, the RX and FILL rings are used for the RX path and the TX and COMPLETION rings are used for the TX path. @@ -91,11 +91,16 @@ Concepts ======== In order to use an AF_XDP socket, a number of associated objects need -to be setup. +to be setup. These objects and their options are explained in the +following sections. -Jonathan Corbet has also written an excellent article on LWN, -"Accelerating networking with AF_XDP". It can be found at -https://lwn.net/Articles/750845/. +For an overview on how AF_XDP works, you can also take a look at the +Linux Plumbers paper from 2018 on the subject: +http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do +NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt +at AF_XDP. Nearly everything changed since then. Jonathan Corbet has +also written an excellent article on LWN, "Accelerating networking +with AF_XDP". It can be found at https://lwn.net/Articles/750845/. UMEM ---- @@ -113,22 +118,22 @@ the next socket B can do this by setting the XDP_SHARED_UMEM flag in struct sockaddr_xdp member sxdp_flags, and passing the file descriptor of A to struct sockaddr_xdp member sxdp_shared_umem_fd. 
-The UMEM has two single-producer/single-consumer rings, that are used +The UMEM has two single-producer/single-consumer rings that are used to transfer ownership of UMEM frames between the kernel and the user-space application. Rings ----- -There are a four different kind of rings: Fill, Completion, RX and +There are a four different kind of rings: FILL, COMPLETION, RX and TX. All rings are single-producer/single-consumer, so the user-space application need explicit synchronization of multiple processes/threads are reading/writing to them. -The UMEM uses two rings: Fill and Completion. Each socket associated +The UMEM uses two rings: FILL and COMPLETION. Each socket associated with the UMEM must have an RX queue, TX queue or both. Say, that there is a setup with four sockets (all doing TX and RX). Then there will be -one Fill ring, one Completion ring, four TX rings and four RX rings. +one FILL ring, one COMPLETION ring, four TX rings and four RX rings. The rings are head(producer)/tail(consumer) based rings. A producer writes the data ring at the index pointed out by struct xdp_ring @@ -146,7 +151,7 @@ The size of the rings need to be of size power of two. UMEM Fill Ring ~~~~~~~~~~~~~~ -The Fill ring is used to transfer ownership of UMEM frames from +The FILL ring is used to transfer ownership of UMEM frames from user-space to kernel-space. The UMEM addrs are passed in the ring. As an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has 16 chunks and can pass addrs between 0 and 64k. @@ -164,8 +169,8 @@ chunks mode, then the incoming addr will be left untouched. UMEM Completion Ring ~~~~~~~~~~~~~~~~~~~~ -The Completion Ring is used transfer ownership of UMEM frames from -kernel-space to user-space. Just like the Fill ring, UMEM indicies are +The COMPLETION Ring is used transfer ownership of UMEM frames from +kernel-space to user-space. Just like the FILL ring, UMEM indices are used. 
Frames passed from the kernel to user-space are frames that has been @@ -181,7 +186,7 @@ The RX ring is the receiving side of a socket. Each entry in the ring is a struct xdp_desc descriptor. The descriptor contains UMEM offset (addr) and the length of the data (len). -If no frames have been passed to kernel via the Fill ring, no +If no frames have been passed to kernel via the FILL ring, no descriptors will (or can) appear on the RX ring. The user application consumes struct xdp_desc descriptors from this @@ -199,8 +204,24 @@ be relaxed in the future. The user application produces struct xdp_desc descriptors to this ring. +Libbpf +====== + +Libbpf is a helper library for eBPF and XDP that makes using these +technologies a lot simpler. It also contains specific helper functions +in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It +contains two types of functions: those that can be used to make the +setup of an AF_XDP socket easier and ones that can be used in the data +plane to access the rings safely and quickly. To see an example of how +to use this API, please take a look at the sample application in +samples/bpf/xdpsock_user.c which uses libbpf for both setup and data +plane operations. + +We recommend that you use this library unless you have become a power +user. It will make your program a lot simpler. + XSKMAP / BPF_MAP_TYPE_XSKMAP ----------------------------- +============================ On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that is used in conjunction with bpf_redirect_map() to pass the ingress @@ -216,21 +237,202 @@ queue 17. Only the XDP program executing for eth0 and queue 17 will successfully pass data to the socket. Please refer to the sample application (samples/bpf/) in for an example. +Configuration Flags and Socket Options +====================================== + +These are the various configuration flags that can be used to control +and monitor the behavior of AF_XDP sockets.
+ +XDP_COPY and XDP_ZERO_COPY bind flags +------------------------------------- + +When you bind to a socket, the kernel will first try to use zero-copy +mode. If zero-copy is not supported, it will fall back on using copy +mode, i.e. copying all packets out to user space. But if you would +like to force a certain mode, you can use the following flags. If you +pass the XDP_COPY flag to the bind call, the kernel will force the +socket into copy mode. If it cannot use copy mode, the bind call will +fail with an error. Conversely, the XDP_ZERO_COPY flag will force the +socket into zero-copy mode or fail. + +XDP_SHARED_UMEM bind flag +------------------------- + +This flag enables you to bind multiple sockets to the same UMEM, but +only if they share the same queue id. In this mode, each socket has +its own RX and TX rings, but the UMEM (tied to the first socket +created) only has a single FILL ring and a single COMPLETION +ring. To use this mode, create the first socket and bind it in the normal +way. Create a second socket and create an RX and a TX ring, or at +least one of them, but no FILL or COMPLETION rings as the ones from +the first socket will be used. In the bind call, set the +XDP_SHARED_UMEM option and provide the initial socket's fd in the +sxdp_shared_umem_fd field. You can attach an arbitrary number of extra +sockets this way. + +Which socket will a packet then arrive on? This is decided by the XDP +program. Put all the sockets in the XSKMAP and just indicate which +index in the array you would like to send each packet to. A simple +round-robin example of distributing packets is shown below: + +.. 
code-block:: c + + #include <linux/bpf.h> + #include "bpf_helpers.h" + + #define MAX_SOCKS 16 + + struct { + __uint(type, BPF_MAP_TYPE_XSKMAP); + __uint(max_entries, MAX_SOCKS); + __uint(key_size, sizeof(int)); + __uint(value_size, sizeof(int)); + } xsks_map SEC(".maps"); + + static unsigned int rr; + + SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) + { + rr = (rr + 1) & (MAX_SOCKS - 1); + + return bpf_redirect_map(&xsks_map, rr, XDP_DROP); + } + +Note that since there is only a single set of FILL and COMPLETION +rings, and they are single-producer, single-consumer rings, you need +to make sure that multiple processes or threads do not use these rings +concurrently. There are no synchronization primitives in the +libbpf code that protect multiple users at this point in time. + +Libbpf uses this mode if you create more than one socket tied to the +same UMEM. However, note that you need to supply the +XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the +xsk_socket__create calls and load your own XDP program as there is no +built-in one in libbpf that will route the traffic for you. + +XDP_USE_NEED_WAKEUP bind flag +----------------------------- + +This option adds support for a new flag called need_wakeup that is +present in the FILL ring and the TX ring, the rings for which user +space is a producer. When this option is set in the bind call, the +need_wakeup flag will be set if the kernel needs to be explicitly +woken up by a syscall to continue processing packets. If the flag is +zero, no syscall is needed. + +If the flag is set on the FILL ring, the application needs to call +poll() to be able to continue to receive packets on the RX ring. This +can happen, for example, when the kernel has detected that there are no +more buffers on the FILL ring and no buffers left on the RX HW ring of +the NIC.
In this case, interrupts are turned off as the NIC cannot +receive any packets (as there are no buffers to put them in), and the +need_wakeup flag is set so that user space can put buffers on the +FILL ring and then call poll() so that the kernel driver can put these +buffers on the HW ring and start to receive packets. + +If the flag is set for the TX ring, it means that the application +needs to explicitly notify the kernel to send any packets put on the +TX ring. This can be accomplished either by a poll() call, as in the +RX path, or by calling sendto(). + +An example of how to use this flag can be found in +samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers +would look like this for the TX path: + +.. code-block:: c + + if (xsk_ring_prod__needs_wakeup(&my_tx_ring)) + sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0); + +I.e., only use the syscall if the flag is set. + +We recommend that you always enable this mode as it usually leads to +better performance, especially if you run the application and the +driver on the same core, but also if you use different cores for the +application and the kernel driver, as it reduces the number of +syscalls needed for the TX path. + +XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts +------------------------------------------------------ + +These setsockopts set the number of descriptors that the RX, TX, +FILL, and COMPLETION rings respectively should have. It is mandatory +to set the size of at least one of the RX and TX rings. If you set +both, you will be able to both receive and send traffic from your +application, but if you only want to do one of them, you can save +resources by only setting up one of them. Both the FILL ring and the +COMPLETION ring are mandatory as you need to have a UMEM tied to your +socket.
But if the XDP_SHARED_UMEM flag is used, any socket after the +first one does not have a UMEM and should in that case not have any +FILL or COMPLETION rings created as the ones from the shared UMEM will +be used. Note that the rings are single-producer single-consumer, so +do not try to access them from multiple processes at the same +time. See the XDP_SHARED_UMEM section. + +In libbpf, you can create Rx-only and Tx-only sockets by supplying +NULL to the tx and rx arguments, respectively, to the +xsk_socket__create function. + +If you create a Tx-only socket, we recommend that you do not put any +packets on the FILL ring. If you do this, drivers might think you are +going to receive something when you in fact will not, and this can +negatively impact performance. + +XDP_UMEM_REG setsockopt +----------------------- + +This setsockopt registers a UMEM to a socket. This is the area that +contains all the buffers that packets can reside in. The call takes a +pointer to the beginning of this area and the size of it. Moreover, it +also has a parameter called chunk_size that is the size that the UMEM is +divided into. It can only be 2K or 4K at the moment. If you have a +UMEM area that is 128K and a chunk size of 2K, this means that you +will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM +area and that your largest packet size can be 2K. + +There is also an option to set the headroom of each single buffer in +the UMEM. If you set this to N bytes, it means that the packet will +start N bytes into the buffer leaving the first N bytes for the +application to use. The final option is the flags field, but it will +be dealt with in separate sections for each UMEM flag. + +XDP_STATISTICS getsockopt +------------------------- + +Gets drop statistics of a socket that can be useful for debug +purposes. The supported statistics are shown below: +..
code-block:: c + + struct xdp_statistics { + __u64 rx_dropped; /* Dropped for reasons other than invalid desc */ + __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ + __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ + }; + +XDP_OPTIONS getsockopt +---------------------- + +Gets options from an XDP socket. The only one supported so far is +XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not. + Usage ===== -In order to use AF_XDP sockets there are two parts needed. The +In order to use AF_XDP sockets two parts are needed. The user-space application and the XDP program. For a complete setup and usage example, please refer to the sample application. The user-space side is xdpsock_user.c and the XDP side is part of libbpf. -The XDP code sample included in tools/lib/bpf/xsk.c is the following:: +The XDP code sample included in tools/lib/bpf/xsk.c is the following: + +.. code-block:: c SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) { int index = ctx->rx_queue_index; - // A set entry here means that the correspnding queue_id + // A set entry here means that the corresponding queue_id // has an active AF_XDP socket bound to it. if (bpf_map_lookup_elem(&xsks_map, &index)) return bpf_redirect_map(&xsks_map, index, 0); @@ -238,7 +440,10 @@ The XDP code sample included in tools/lib/bpf/xsk.c is the following:: return XDP_PASS; } -Naive ring dequeue and enqueue could look like this:: +A simple but not so performant ring dequeue and enqueue could look +like this: + +.. code-block:: c // struct xdp_rxtx_ring { // __u32 *producer; @@ -287,17 +492,16 @@ Naive ring dequeue and enqueue could look like this:: return 0; } - -For a more optimized version, please refer to the sample application. +But please use the libbpf functions as they are optimized and ready to +use. They will make your life easier.
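To make the single-producer, single-consumer behaviour of the rings
concrete, here is a minimal stand-alone sketch. It is not the AF_XDP
uapi and not the libbpf implementation: struct spsc_ring, struct desc,
ring_enqueue() and ring_dequeue() are hypothetical names, the ring lives
in ordinary process memory, and C11 atomics stand in for the barriers a
real ring would need. It only illustrates the free-running
producer/consumer indices and the power-of-two mask.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical size; must be a power of two */

_Static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be 2^n");

/* Hypothetical descriptor, loosely modelled on addr/len of struct xdp_desc. */
struct desc {
	uint64_t addr;
	uint32_t len;
};

struct spsc_ring {
	_Atomic uint32_t producer; /* written only by the producer */
	_Atomic uint32_t consumer; /* written only by the consumer */
	struct desc descs[RING_SIZE];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_enqueue(struct spsc_ring *r, const struct desc *d)
{
	assert(r && d);

	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return -1;

	r->descs[prod & (RING_SIZE - 1)] = *d;
	/* Publish the entry only after its contents have been written. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int ring_dequeue(struct spsc_ring *r, struct desc *d)
{
	assert(r && d);

	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (prod == cons)
		return -1;

	*d = r->descs[cons & (RING_SIZE - 1)];
	/* Hand the slot back to the producer after the entry has been read. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return 0;
}
```

Because each index is written by exactly one side, no lock is needed as
long as there is one producer and one consumer. With more than one user
on either side this scheme breaks, which is why the shared FILL and
COMPLETION rings must not be used concurrently.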
Sample application ================== There is a xdpsock benchmarking/test application included that -demonstrates how to use AF_XDP sockets with both private and shared -UMEMs. Say that you would like your UDP traffic from port 4242 to end -up in queue 16, that we will enable AF_XDP on. Here, we use ethtool -for this:: +demonstrates how to use AF_XDP sockets with private UMEMs. Say that +you would like your UDP traffic from port 4242 to end up in queue 16, +that we will enable AF_XDP on. Here, we use ethtool for this:: ethtool -N p3p2 rx-flow-hash udp4 fn ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ @@ -311,13 +515,18 @@ using:: For XDP_SKB mode, use the switch "-S" instead of "-N" and all options can be displayed with "-h", as usual. +This sample application uses libbpf to make the setup and usage of +AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is +really used to make something more advanced, take a look at the libbpf +code in tools/lib/bpf/xsk.[ch]. + FAQ ======= Q: I am not seeing any traffic on the socket. What am I doing wrong? A: When a netdev of a physical NIC is initialized, Linux usually - allocates one Rx and Tx queue pair per core. So on a 8 core system, + allocates one RX and TX queue pair per core. So on a 8 core system, queue ids 0 to 7 will be allocated, one per core. In the AF_XDP bind call or the xsk_socket__create libbpf function call, you specify a specific queue id to bind to and it is only the traffic @@ -343,9 +552,21 @@ A: When a netdev of a physical NIC is initialized, Linux usually sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \ 4242 action 2 - A number of other ways are possible all up to the capabilitites of + A number of other ways are possible all up to the capabilities of the NIC you have. +Q: Can I use the XSKMAP to implement a switch between different umems + in copy mode? + +A: The short answer is no, that is not supported at the moment.
The + XSKMAP can only be used to switch traffic coming in on queue id X + to sockets bound to the same queue id X. The XSKMAP can contain + sockets bound to different queue ids, for example X and Y, but only + traffic coming in from queue id Y can be directed to sockets bound + to the same queue id Y. In zero-copy mode, you should use the + switch, or other distribution mechanism, in your NIC to direct + traffic to the correct queue id and socket. + Credits ======= diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.txt b/Documentation/networking/device_drivers/aquantia/atlantic.txt index d235cbaeccc6..2013fcedc2da 100644 --- a/Documentation/networking/device_drivers/aquantia/atlantic.txt +++ b/Documentation/networking/device_drivers/aquantia/atlantic.txt @@ -1,5 +1,5 @@ -aQuantia AQtion Driver for the aQuantia Multi-Gigabit PCI Express Family of -Ethernet Adapters +Marvell(Aquantia) AQtion Driver for the aQuantia Multi-Gigabit PCI Express +Family of Ethernet Adapters ============================================================================= Contents @@ -325,6 +325,46 @@ Supported ethtool options Example: ethtool -N eth0 flow-type udp4 action 0 loc 32 + UDP GSO hardware offload + --------------------------------- + UDP GSO boosts UDP tx rates by offloading UDP headers allocation + into hardware. A special userspace socket option is required for this, + which can be validated with /kernel/tools/testing/selftests/net/ + + udpgso_bench_tx -u -4 -D 10.0.1.1 -s 6300 -S 100 + + This will send 100-byte UDP packets formed from a single + 6300-byte user buffer.
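As a rough illustration of the userspace side, the sketch below assumes
that the socket option in question is UDP_SEGMENT, the option the udpgso
selftests exercise. The helper names enable_udp_gso() and
udp_gso_segments() are hypothetical and not part of the driver or the
selftests; the fallback define only covers libc headers that do not
provide UDP_SEGMENT.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <sys/socket.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103 /* value from the kernel's linux/udp.h */
#endif

/*
 * Hypothetical helper: ask the kernel to split every buffer sent on
 * this UDP socket into datagrams of at most gso_size payload bytes.
 * Returns the setsockopt() result (0 on success, -1 on error).
 */
static int enable_udp_gso(int fd, int gso_size)
{
	return setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT,
			  &gso_size, sizeof(gso_size));
}

/*
 * Hypothetical helper: number of datagrams one user buffer turns
 * into. The last datagram may be shorter than gso_size.
 */
static size_t udp_gso_segments(size_t buf_len, size_t gso_size)
{
	assert(gso_size > 0);
	return (buf_len + gso_size - 1) / gso_size;
}
```

With the values from the udpgso_bench_tx command above,
udp_gso_segments(6300, 100) gives 63 datagrams for each send call.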
+ + UDP GSO is configured by: + + ethtool -K eth0 tx-udp-segmentation on + + Private flags (testing) + --------------------------------- + + Atlantic driver supports private flags for hardware custom features: + + $ ethtool --show-priv-flags ethX + + Private flags for ethX: + DMASystemLoopback : off + PKTSystemLoopback : off + DMANetworkLoopback : off + PHYInternalLoopback: off + PHYExternalLoopback: off + + Example: + + $ ethtool --set-priv-flags ethX DMASystemLoopback on + + DMASystemLoopback: DMA Host loopback. + PKTSystemLoopback: Packet buffer host loopback. + DMANetworkLoopback: Network side loopback on DMA block. + PHYInternalLoopback: Internal loopback on Phy. + PHYExternalLoopback: External loopback on Phy (with loopback ethernet cable). + + Command Line Parameters ======================= The following command line parameters are available on atlantic driver: @@ -426,7 +466,7 @@ Support If an issue is identified with the released source code on the supported kernel with a supported adapter, email the specific information related -to the issue to support@aquantia.com +to the issue to aqn_support@marvell.com License ======= diff --git a/Documentation/networking/device_drivers/freescale/dpaa.txt b/Documentation/networking/device_drivers/freescale/dpaa.txt index f88194f71c54..b06601ff9200 100644 --- a/Documentation/networking/device_drivers/freescale/dpaa.txt +++ b/Documentation/networking/device_drivers/freescale/dpaa.txt @@ -129,9 +129,9 @@ CONFIG_AQUANTIA_PHY=y DPAA Ethernet Frame Processing ============================== -On Rx, buffers for the incoming frames are retrieved from one of the three -existing buffers pools. The driver initializes and seeds these, each with -buffers of different sizes: 1KB, 2KB and 4KB. +On Rx, buffers for the incoming frames are retrieved from the buffers found +in the dedicated interface buffer pool. The driver initializes and seeds these +with one page buffers. 
On Tx, all transmitted frames are returned to the driver through Tx confirmation frame queues. The driver is then responsible for freeing the @@ -254,7 +254,7 @@ The following statistics are exported for each interface through ethtool: The driver also exports the following information in sysfs: - the FQ IDs for each FQ type - /sys/devices/platform/dpaa-ethernet.0/net/<int>/fqids + /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/fqids - - the IDs of the buffer pools in use - /sys/devices/platform/dpaa-ethernet.0/net/<int>/bpids + - the ID of the buffer pool in use + /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/bpids diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst b/Documentation/networking/device_drivers/freescale/dpaa2/index.rst index 67bd87fe6c53..ee40fcc5ddff 100644 --- a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst +++ b/Documentation/networking/device_drivers/freescale/dpaa2/index.rst @@ -8,3 +8,4 @@ DPAA2 Documentation overview dpio-driver ethernet-driver + mac-phy-support diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst new file mode 100644 index 000000000000..51e6624fb774 --- /dev/null +++ b/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst @@ -0,0 +1,191 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +======================= +DPAA2 MAC / PHY support +======================= + +:Copyright: |copy| 2019 NXP + +Overview +-------- + +The DPAA2 MAC / PHY support consists of a set of APIs that help DPAA2 network +drivers (dpaa2-eth, dpaa2-ethsw) interact with the PHY library.
+ +DPAA2 Software Architecture +--------------------------- + +Among other DPAA2 objects, the fsl-mc bus exports DPNI objects (abstracting a +network interface) and DPMAC objects (abstracting a MAC). The dpaa2-eth driver +probes on the DPNI object and connects to and configures a DPMAC object with +the help of phylink. + +Data connections may be established between a DPNI and a DPMAC, or between two +DPNIs. Depending on the connection type, the netif_carrier_[on/off] is handled +directly by the dpaa2-eth driver or by phylink. + +.. code-block:: none + + Sources of abstracted link state information presented by the MC firmware + + +--------------------------------------+ + +------------+ +---------+ | xgmac_mdio | + | net_device | | phylink |--| +-----+ +-----+ +-----+ +-----+ | + +------------+ +---------+ | | PHY | | PHY | | PHY | | PHY | | + | | | +-----+ +-----+ +-----+ +-----+ | + +------------------------------------+ | External MDIO bus | + | dpaa2-eth | +--------------------------------------+ + +------------------------------------+ + | | Linux + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + | | MC firmware + | /| V + +----------+ / | +----------+ + | | / | | | + | | | | | | + | DPNI |<------| |<------| DPMAC | + | | | | | | + | | \ |<---+ | | + +----------+ \ | | +----------+ + \| | + | + +--------------------------------------+ + | MC firmware polling MAC PCS for link | + | +-----+ +-----+ +-----+ +-----+ | + | | PCS | | PCS | | PCS | | PCS | | + | +-----+ +-----+ +-----+ +-----+ | + | Internal MDIO bus | + +--------------------------------------+ + + +Depending on an MC firmware configuration setting, each MAC may be in one of two modes: + +- DPMAC_LINK_TYPE_FIXED: the link state management is handled exclusively by + the MC firmware by polling the MAC PCS. Without the need to register a + phylink instance, the dpaa2-eth driver will not bind to the connected dpmac + object at all. 
+ +- DPMAC_LINK_TYPE_PHY: The MC firmware is left waiting for link state update + events, but those are in fact passed strictly between the dpaa2-mac (based on + phylink) and its attached net_device driver (dpaa2-eth, dpaa2-ethsw), + effectively bypassing the firmware. + +Implementation +-------------- + +At probe time or when a DPNI's endpoint is dynamically changed, the dpaa2-eth +driver is responsible for finding out whether the peer object is a DPMAC and, if this is the +case, for integrating it with PHYLINK using the dpaa2_mac_connect() API, which +will do the following: + + - look up the device tree for a PHYLINK-compatible OF binding (phy-handle) + - create a PHYLINK instance associated with the received net_device + - connect to the PHY using phylink_of_phy_connect() + +The following phylink_mac_ops callbacks are implemented: + + - .validate() will populate the supported linkmodes with the MAC capabilities + only when the phy_interface_t is RGMII_* (at the moment, this is the only + link type supported by the driver). + + - .mac_config() will configure the MAC in the new configuration using the + dpmac_set_link_state() MC firmware API. + + - .mac_link_up() / .mac_link_down() will update the MAC link using the same + API described above. + +At driver unbind() or when the DPNI object is disconnected from the DPMAC, the +dpaa2-eth driver calls dpaa2_mac_disconnect() which will, in turn, disconnect +from the PHY and destroy the PHYLINK instance. + +In case of a DPNI-DPMAC connection, an 'ip link set dev eth0 up' would start +the following sequence of operations: + +(1) phylink_start() called from .dev_open(). +(2) The .mac_config() and .mac_link_up() callbacks are called by PHYLINK. +(3) In order to configure the HW MAC, the MC Firmware API + dpmac_set_link_state() is called. +(4) The firmware will eventually set up the HW MAC in the new configuration. +(5) A netif_carrier_on() call is made directly from PHYLINK on the associated + net_device.
+(6) The dpaa2-eth driver handles the LINK_STATE_CHANGE irq in order to + enable/disable Rx taildrop based on the pause frame settings. + +.. code-block:: none + + +---------+ +---------+ + | PHYLINK |-------------->| eth0 | + +---------+ (5) +---------+ + (1) ^ | + | | + | v (2) + +-----------------------------------+ + | dpaa2-eth | + +-----------------------------------+ + | ^ (6) + | | + v (3) | + +---------+---------------+---------+ + | DPMAC | | DPNI | + +---------+ +---------+ + | MC Firmware | + +-----------------------------------+ + | + | + v (4) + +-----------------------------------+ + | HW MAC | + +-----------------------------------+ + +In case of a DPNI-DPNI connection, a usual sequence of operations looks like +the following: + +(1) ip link set dev eth0 up +(2) The dpni_enable() MC API called on the associated fsl_mc_device. +(3) ip link set dev eth1 up +(4) The dpni_enable() MC API called on the associated fsl_mc_device. +(5) The LINK_STATE_CHANGED irq is received by both instances of the dpaa2-eth + driver because now the operational link state is up. +(6) The netif_carrier_on() is called on the exported net_device from + link_state_update(). + +.. 
code-block:: none + + +---------+ +---------+ + | eth0 | | eth1 | + +---------+ +---------+ + | ^ ^ | + | | | | + (1) v | (6) (6) | v (3) + +---------+ +---------+ + |dpaa2-eth| |dpaa2-eth| + +---------+ +---------+ + | ^ ^ | + | | | | + (2) v | (5) (5) | v (4) + +---------+---------------+---------+ + | DPNI | | DPNI | + +---------+ +---------+ + | MC Firmware | + +-----------------------------------+ + + +Exported API +------------ + +Any DPAA2 driver that drives endpoints of DPMAC objects should service its +_EVENT_ENDPOINT_CHANGED irq and connect/disconnect from the associated DPMAC +when necessary using the below listed API:: + + - int dpaa2_mac_connect(struct dpaa2_mac *mac); + - void dpaa2_mac_disconnect(struct dpaa2_mac *mac); + +A phylink integration is necessary only when the partner DPMAC is not of TYPE_FIXED. +One can check for this condition using the below API:: + + - bool dpaa2_mac_is_type_fixed(struct fsl_mc_device *dpmac_dev, struct fsl_mc_io *mc_io); + +Before connecting to a MAC, the caller must allocate and populate the +dpaa2_mac structure with the associated net_device, a pointer to the MC portal +to be used and the actual fsl_mc_device structure of the DPMAC. diff --git a/Documentation/networking/device_drivers/intel/e100.rst b/Documentation/networking/device_drivers/intel/e100.rst index 2b9f4887beda..caf023cc88de 100644 --- a/Documentation/networking/device_drivers/intel/e100.rst +++ b/Documentation/networking/device_drivers/intel/e100.rst @@ -1,8 +1,8 @@ ..
SPDX-License-Identifier: GPL-2.0+ -============================================================== -Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters -============================================================== +============================================================= +Linux Base Driver for the Intel(R) PRO/100 Family of Adapters +============================================================= June 1, 2018 @@ -21,7 +21,7 @@ Contents In This Release =============== -This file describes the Linux* Base Driver for the Intel(R) PRO/100 Family of +This file describes the Linux Base Driver for the Intel(R) PRO/100 Family of Adapters. This driver includes support for Itanium(R)2-based systems. For questions related to hardware requirements, refer to the documentation @@ -138,9 +138,9 @@ version 1.6 or later is required for this functionality. The latest release of ethtool can be found from https://www.kernel.org/pub/software/network/ethtool/ -Enabling Wake on LAN* (WoL) ---------------------------- -WoL is provided through the ethtool* utility. For instructions on +Enabling Wake on LAN (WoL) +-------------------------- +WoL is provided through the ethtool utility. For instructions on enabling WoL with ethtool, refer to the ethtool man page. WoL will be enabled on the system during the next shut down or reboot. For this driver version, in order to enable WoL, the e100 driver must be loaded diff --git a/Documentation/networking/device_drivers/intel/e1000.rst b/Documentation/networking/device_drivers/intel/e1000.rst index 956560b6e745..4aaae0f7d6ba 100644 --- a/Documentation/networking/device_drivers/intel/e1000.rst +++ b/Documentation/networking/device_drivers/intel/e1000.rst @@ -1,8 +1,8 @@ .. 
SPDX-License-Identifier: GPL-2.0+ -=========================================================== -Linux* Base Driver for Intel(R) Ethernet Network Connection -=========================================================== +========================================================== +Linux Base Driver for Intel(R) Ethernet Network Connection +========================================================== Intel Gigabit Linux driver. Copyright(c) 1999 - 2013 Intel Corporation. @@ -438,10 +438,10 @@ ethtool The latest release of ethtool can be found from https://www.kernel.org/pub/software/network/ethtool/ -Enabling Wake on LAN* (WoL) ---------------------------- +Enabling Wake on LAN (WoL) +-------------------------- - WoL is configured through the ethtool* utility. + WoL is configured through the ethtool utility. WoL will be enabled on the system during the next shut down or reboot. For this driver version, in order to enable WoL, the e1000 driver must be diff --git a/Documentation/networking/device_drivers/intel/e1000e.rst b/Documentation/networking/device_drivers/intel/e1000e.rst index 01999f05509c..f49cd370e7bf 100644 --- a/Documentation/networking/device_drivers/intel/e1000e.rst +++ b/Documentation/networking/device_drivers/intel/e1000e.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -====================================================== -Linux* Driver for Intel(R) Ethernet Network Connection -====================================================== +===================================================== +Linux Driver for Intel(R) Ethernet Network Connection +===================================================== Intel Gigabit Linux driver. Copyright(c) 2008-2018 Intel Corporation. @@ -338,7 +338,7 @@ and higher cannot be forced. Use the autonegotiation advertising setting to manually set devices for 1 Gbps and higher. Speed, duplex, and autonegotiation advertising are configured through the -ethtool* utility. +ethtool utility. 
Caution: Only experienced network administrators should force speed and duplex or change autonegotiation advertising manually. The settings at the switch must @@ -351,9 +351,9 @@ will not attempt to auto-negotiate with its link partner since those adapters operate only in full duplex and only at their native speed. -Enabling Wake on LAN* (WoL) ---------------------------- -WoL is configured through the ethtool* utility. +Enabling Wake on LAN (WoL) +-------------------------- +WoL is configured through the ethtool utility. WoL will be enabled on the system during the next shut down or reboot. For this driver version, in order to enable WoL, the e1000e driver must be loaded diff --git a/Documentation/networking/device_drivers/intel/fm10k.rst b/Documentation/networking/device_drivers/intel/fm10k.rst index ac3269e34f55..4d279e64e221 100644 --- a/Documentation/networking/device_drivers/intel/fm10k.rst +++ b/Documentation/networking/device_drivers/intel/fm10k.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -============================================================== -Linux* Base Driver for Intel(R) Ethernet Multi-host Controller -============================================================== +============================================================= +Linux Base Driver for Intel(R) Ethernet Multi-host Controller +============================================================= August 20, 2018 Copyright(c) 2015-2018 Intel Corporation. 
@@ -120,8 +120,8 @@ rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 m|v|t|s|d|f|n|r Known Issues/Troubleshooting ============================ -Enabling SR-IOV in a 64-bit Microsoft* Windows Server* 2012/R2 guest OS under Linux KVM ---------------------------------------------------------------------------------------- +Enabling SR-IOV in a 64-bit Microsoft Windows Server 2012/R2 guest OS under Linux KVM +------------------------------------------------------------------------------------- KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. This includes traditional PCIe devices, as well as SR-IOV-capable devices based on the Intel Ethernet Controller XL710. diff --git a/Documentation/networking/device_drivers/intel/i40e.rst b/Documentation/networking/device_drivers/intel/i40e.rst index 848fd388fa6e..8a9b18573688 100644 --- a/Documentation/networking/device_drivers/intel/i40e.rst +++ b/Documentation/networking/device_drivers/intel/i40e.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -================================================================== -Linux* Base Driver for the Intel(R) Ethernet Controller 700 Series -================================================================== +================================================================= +Linux Base Driver for the Intel(R) Ethernet Controller 700 Series +================================================================= Intel 40 Gigabit Linux driver. Copyright(c) 1999-2018 Intel Corporation. @@ -384,7 +384,7 @@ NOTE: You cannot set the speed for devices based on the Intel(R) Ethernet Network Adapter XXV710 based devices. Speed, duplex, and autonegotiation advertising are configured through the -ethtool* utility. +ethtool utility. Caution: Only experienced network administrators should force speed and duplex or change autonegotiation advertising manually. 
The settings at the switch must diff --git a/Documentation/networking/device_drivers/intel/iavf.rst b/Documentation/networking/device_drivers/intel/iavf.rst index cfc08842e32c..84ac7e75f363 100644 --- a/Documentation/networking/device_drivers/intel/iavf.rst +++ b/Documentation/networking/device_drivers/intel/iavf.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -================================================================== -Linux* Base Driver for Intel(R) Ethernet Adaptive Virtual Function -================================================================== +================================================================= +Linux Base Driver for Intel(R) Ethernet Adaptive Virtual Function +================================================================= Intel Ethernet Adaptive Virtual Function Linux driver. Copyright(c) 2013-2018 Intel Corporation. @@ -19,7 +19,7 @@ Contents Overview ======== -This file describes the iavf Linux* Base Driver. This driver was formerly +This file describes the iavf Linux Base Driver. This driver was formerly called i40evf. The iavf driver supports the below mentioned virtual function devices and diff --git a/Documentation/networking/device_drivers/intel/ice.rst b/Documentation/networking/device_drivers/intel/ice.rst index c220aa2711c6..ee43ea57d443 100644 --- a/Documentation/networking/device_drivers/intel/ice.rst +++ b/Documentation/networking/device_drivers/intel/ice.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -=================================================================== -Linux* Base Driver for the Intel(R) Ethernet Connection E800 Series -=================================================================== +================================================================== +Linux Base Driver for the Intel(R) Ethernet Connection E800 Series +================================================================== Intel ice Linux driver. Copyright(c) 2018 Intel Corporation. 
diff --git a/Documentation/networking/device_drivers/intel/igb.rst b/Documentation/networking/device_drivers/intel/igb.rst index fc8cfaa5dcfa..87e560fe5eaa 100644 --- a/Documentation/networking/device_drivers/intel/igb.rst +++ b/Documentation/networking/device_drivers/intel/igb.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -=========================================================== -Linux* Base Driver for Intel(R) Ethernet Network Connection -=========================================================== +========================================================== +Linux Base Driver for Intel(R) Ethernet Network Connection +========================================================== Intel Gigabit Linux driver. Copyright(c) 1999-2018 Intel Corporation. @@ -129,9 +129,9 @@ version is required for this functionality. Download it at: https://www.kernel.org/pub/software/network/ethtool/ -Enabling Wake on LAN* (WoL) ---------------------------- -WoL is configured through the ethtool* utility. +Enabling Wake on LAN (WoL) +-------------------------- +WoL is configured through the ethtool utility. WoL will be enabled on the system during the next shut down or reboot. For this driver version, in order to enable WoL, the igb driver must be loaded diff --git a/Documentation/networking/device_drivers/intel/igbvf.rst b/Documentation/networking/device_drivers/intel/igbvf.rst index 9cddabe8108e..557fc020ef31 100644 --- a/Documentation/networking/device_drivers/intel/igbvf.rst +++ b/Documentation/networking/device_drivers/intel/igbvf.rst @@ -1,8 +1,8 @@ .. 
SPDX-License-Identifier: GPL-2.0+ -============================================================ -Linux* Base Virtual Function Driver for Intel(R) 1G Ethernet -============================================================ +=========================================================== +Linux Base Virtual Function Driver for Intel(R) 1G Ethernet +=========================================================== Intel Gigabit Virtual Function Linux driver. Copyright(c) 1999-2018 Intel Corporation. diff --git a/Documentation/networking/device_drivers/intel/ixgbe.rst b/Documentation/networking/device_drivers/intel/ixgbe.rst index c7d25483fedb..f1d5233e5e51 100644 --- a/Documentation/networking/device_drivers/intel/ixgbe.rst +++ b/Documentation/networking/device_drivers/intel/ixgbe.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -============================================================================= -Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Adapters -============================================================================= +=========================================================================== +Linux Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Adapters +=========================================================================== Intel 10 Gigabit Linux driver. Copyright(c) 1999-2018 Intel Corporation. @@ -519,8 +519,8 @@ The offload is also supported for ixgbe's VFs, but the VF must be set as Known Issues/Troubleshooting ============================ -Enabling SR-IOV in a 64-bit Microsoft* Windows Server* 2012/R2 guest OS ------------------------------------------------------------------------ +Enabling SR-IOV in a 64-bit Microsoft Windows Server 2012/R2 guest OS +--------------------------------------------------------------------- Linux KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. 
This includes traditional PCIe devices, as well as SR-IOV-capable devices based on the Intel Ethernet Controller XL710. diff --git a/Documentation/networking/device_drivers/intel/ixgbevf.rst b/Documentation/networking/device_drivers/intel/ixgbevf.rst index 5d4977360157..76bbde736f21 100644 --- a/Documentation/networking/device_drivers/intel/ixgbevf.rst +++ b/Documentation/networking/device_drivers/intel/ixgbevf.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -============================================================= -Linux* Base Virtual Function Driver for Intel(R) 10G Ethernet -============================================================= +============================================================ +Linux Base Virtual Function Driver for Intel(R) 10G Ethernet +============================================================ Intel 10 Gigabit Virtual Function Linux driver. Copyright(c) 1999-2018 Intel Corporation. diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst index d071c6b49e1f..7599dceba9f1 100644 --- a/Documentation/networking/device_drivers/mellanox/mlx5.rst +++ b/Documentation/networking/device_drivers/mellanox/mlx5.rst @@ -154,6 +154,27 @@ User command examples: values: cmode runtime value smfs +enable_roce: RoCE enablement state +---------------------------------- +RoCE enablement state controls driver support for RoCE traffic. +When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic. + +To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload. 
+ +User command examples: + +- Disable RoCE:: + + $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit + $ devlink dev reload pci/0000:06:00.0 + +- Read RoCE enablement state:: + + $ devlink dev param show pci/0000:06:00.0 name enable_roce + pci/0000:06:00.0: + name enable_roce type generic + values: + cmode driverinit value true Devlink health reporters ======================== diff --git a/Documentation/networking/device_drivers/pensando/ionic.rst b/Documentation/networking/device_drivers/pensando/ionic.rst index 13935896bee6..c17d680cf334 100644 --- a/Documentation/networking/device_drivers/pensando/ionic.rst +++ b/Documentation/networking/device_drivers/pensando/ionic.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0+ -========================================================== -Linux* Driver for the Pensando(R) Ethernet adapter family -========================================================== +======================================================== +Linux Driver for the Pensando(R) Ethernet adapter family +======================================================== Pensando Linux Ethernet driver. 
Copyright(c) 2019 Pensando Systems, Inc diff --git a/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt b/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt new file mode 100644 index 000000000000..5c8cee17fca9 --- /dev/null +++ b/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt @@ -0,0 +1,209 @@ +* Texas Instruments CPSW switchdev based ethernet driver 2.0 + +- Port renaming +On older udev versions renaming of ethX to swXpY will not be automatically +supported. +In order to rename via udev: +ip -d link show dev sw0p1 | grep switchid + +SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \ + ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}" + + +==================== +# Dual mac mode +==================== +- The new (cpsw_new.c) driver is operating in dual-emac mode by default, thus +working as 2 individual network interfaces. Main differences from legacy CPSW +driver are: + - optimized promiscuous mode: The P0_UNI_FLOOD (both ports) is enabled in +addition to ALLMULTI (current port) instead of ALE_BYPASS. +So, ports in promiscuous mode keep the possibility of mcast and vlan filtering, +which provides significant benefits when ports are joined to the same bridge, +but without enabling "switch" mode, or to different bridges. + - learning disabled on ports as it does not make much sense for + segregated ports - no forwarding in HW. + - enabled basic support for devlink.
+ + devlink dev show + platform/48484000.switch + + devlink dev param show + platform/48484000.switch: + name switch_mode type driver-specific + values: + cmode runtime value false + name ale_bypass type driver-specific + values: + cmode runtime value false + +Devlink configuration parameters +==================== +See Documentation/networking/devlink-params-ti-cpsw-switch.txt + +==================== +# Bridging in dual mac mode +==================== +The dual_mac mode requires two vids to be reserved for internal purposes, +which, by default, equal CPSW Port numbers. As a result, the bridge has to be +configured in vlan unaware mode or default_pvid has to be adjusted. + + ip link add name br0 type bridge + ip link set dev br0 type bridge vlan_filtering 0 + echo 0 > /sys/class/net/br0/bridge/default_pvid + ip link set dev sw0p1 master br0 + ip link set dev sw0p2 master br0 + - or - + ip link add name br0 type bridge + ip link set dev br0 type bridge vlan_filtering 0 + echo 100 > /sys/class/net/br0/bridge/default_pvid + ip link set dev br0 type bridge vlan_filtering 1 + ip link set dev sw0p1 master br0 + ip link set dev sw0p2 master br0 + +==================== +# Enabling "switch" +==================== +The Switch mode can be enabled by configuring devlink driver parameter +"switch_mode" to 1/true: + devlink dev param set platform/48484000.switch \ + name switch_mode value 1 cmode runtime + +This can be done regardless of the state of Port's netdev devices - UP/DOWN, but +Port's netdev devices have to be UP before joining the bridge to avoid +overwriting of bridge configuration as CPSW switch driver completely reloads its +configuration when first Port changes its state to UP. + +When both interfaces have joined the bridge, the CPSW switch driver will enable +marking packets with offload_fwd_mark flag unless "ale_bypass=0" + +All configuration is implemented via switchdev API.
+ +==================== +# Bridge setup +==================== + devlink dev param set platform/48484000.switch \ + name switch_mode value 1 cmode runtime + + ip link add name br0 type bridge + ip link set dev br0 type bridge ageing_time 1000 + ip link set dev sw0p1 up + ip link set dev sw0p2 up + ip link set dev sw0p1 master br0 + ip link set dev sw0p2 master br0 + [*] bridge vlan add dev br0 vid 1 pvid untagged self + +[*] if vlan_filtering=1. where default_pvid=1 + +================= +# On/off STP +================= +ip link set dev BRDEV type bridge stp_state 1/0 + +Note. Steps [*] are mandatory. + +==================== +# VLAN configuration +==================== +bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1 + +Note. This step is mandatory for bridge/default_pvid. + +================= +# Add extra VLANs +================= + 1. untagged: + bridge vlan add dev sw0p1 vid 100 pvid untagged master + bridge vlan add dev sw0p2 vid 100 pvid untagged master + bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100 + + 2. 
tagged: + bridge vlan add dev sw0p1 vid 100 master + bridge vlan add dev sw0p2 vid 100 master + bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100 + +==== +FDBs +==== +FDBs are automatically added on the appropriate switch port upon detection + +Manually adding FDBs: +bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100 +bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs + +==== +MDBs +==== +MDBs are automatically added on the appropriate switch port upon detection + +Manually adding MDBs: +bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100 +bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs + +================== +Multicast flooding +================== +CPU port mcast_flooding is always on + +Turning flooding on/off on switch ports: +bridge link set dev sw0p1 mcast_flood on/off + +================== +Access and Trunk port +================== + bridge vlan add dev sw0p1 vid 100 pvid untagged master + bridge vlan add dev sw0p2 vid 100 master + + + bridge vlan add dev br0 vid 100 self + ip link add link br0 name br0.100 type vlan id 100 + + Note. Setting PVID on Bridge device itself works only for + default VLAN (default_pvid). + +===================== + NFS +===================== +The only way for NFS to work is by chrooting to a minimal environment when +switch configuration that will affect connectivity is needed. +Assuming you are booting NFS with eth1 interface (the script is hacky and +it's just there to prove NFS is doable). + +setup.sh: +#!/bin/sh +mkdir proc +mount -t proc none /proc +ifconfig br0 > /dev/null +if [ $?
-ne 0 ]; then + echo "Setting up bridge" + ip link add name br0 type bridge + ip link set dev br0 type bridge ageing_time 1000 + ip link set dev br0 type bridge vlan_filtering 1 + + ip link set eth1 down + ip link set eth1 name sw0p1 + ip link set dev sw0p1 up + ip link set dev sw0p2 up + ip link set dev sw0p2 master br0 + ip link set dev sw0p1 master br0 + bridge vlan add dev br0 vid 1 pvid untagged self + ifconfig sw0p1 0.0.0.0 + udhcpc -i br0 +fi +umount /proc + +run_nfs.sh: +#!/bin/sh +mkdir /tmp/root/bin -p +mkdir /tmp/root/lib -p + +cp -r /lib/ /tmp/root/ +cp -r /bin/ /tmp/root/ +cp /sbin/ip /tmp/root/bin +cp /sbin/bridge /tmp/root/bin +cp /sbin/ifconfig /tmp/root/bin +cp /sbin/udhcpc /tmp/root/bin +cp /path/to/setup.sh /tmp/root/bin +chroot /tmp/root/ busybox sh /bin/setup.sh + +run ./run_nfs.sh diff --git a/Documentation/networking/devlink-params-mlx5.txt b/Documentation/networking/devlink-params-mlx5.txt new file mode 100644 index 000000000000..5071467118bd --- /dev/null +++ b/Documentation/networking/devlink-params-mlx5.txt @@ -0,0 +1,17 @@ +flow_steering_mode [DEVICE, DRIVER-SPECIFIC] + Controls the flow steering mode of the driver. + Two modes are supported: + 1. 'dmfs' - Device managed flow steering. + 2. 'smfs' - Software/Driver managed flow steering. + In DMFS mode, the HW steering entities are created and + managed through the Firmware. + In SMFS mode, the HW steering entities are created and + managed by the driver directly in hardware + without firmware intervention. + Type: String + Configuration mode: runtime + +enable_roce [DEVICE, GENERIC] + Enable handling of RoCE traffic in the device. + Enabled by default.
+ Configuration mode: driverinit diff --git a/Documentation/networking/devlink-params-mv88e6xxx.txt b/Documentation/networking/devlink-params-mv88e6xxx.txt new file mode 100644 index 000000000000..21c4b3556ef2 --- /dev/null +++ b/Documentation/networking/devlink-params-mv88e6xxx.txt @@ -0,0 +1,7 @@ +ATU_hash [DEVICE, DRIVER-SPECIFIC] + Select one of four possible hashing algorithms for + MAC addresses in the Address Translation Unit. + A value of 3 seems to work better than the default of + 1 when many MAC addresses have the same OUI. + Configuration mode: runtime + Type: u8. 0-3 valid. diff --git a/Documentation/networking/devlink-params-ti-cpsw-switch.txt b/Documentation/networking/devlink-params-ti-cpsw-switch.txt new file mode 100644 index 000000000000..4037458499f7 --- /dev/null +++ b/Documentation/networking/devlink-params-ti-cpsw-switch.txt @@ -0,0 +1,10 @@ +ale_bypass [DEVICE, DRIVER-SPECIFIC] + Allows to enable ALE_CONTROL(4).BYPASS mode for debug purposes. + All packets will be sent to the Host port only if enabled. + Type: bool + Configuration mode: runtime + +switch_mode [DEVICE, DRIVER-SPECIFIC] + Enable switch mode + Type: bool + Configuration mode: runtime diff --git a/Documentation/networking/devlink-params.txt b/Documentation/networking/devlink-params.txt index ddba3e9b55b1..04e234e9acc9 100644 --- a/Documentation/networking/devlink-params.txt +++ b/Documentation/networking/devlink-params.txt @@ -65,3 +65,7 @@ reset_dev_on_drv_probe [DEVICE, GENERIC] Reset only if device firmware can be found in the filesystem. Type: u8 + +enable_roce [DEVICE, GENERIC] + Enable handling of RoCE traffic in the device. 
+ Type: Boolean diff --git a/Documentation/networking/devlink-trap.rst b/Documentation/networking/devlink-trap.rst index 8e90a85f3bd5..dc9659ca06fa 100644 --- a/Documentation/networking/devlink-trap.rst +++ b/Documentation/networking/devlink-trap.rst @@ -162,6 +162,67 @@ be added to the following table: - ``drop`` - Traps packets that the device decided to drop because they could not be enqueued to a transmission queue which is full + * - ``non_ip`` + - ``drop`` + - Traps packets that the device decided to drop because they need to + undergo a layer 3 lookup, but are not IP or MPLS packets + * - ``uc_dip_over_mc_dmac`` + - ``drop`` + - Traps packets that the device decided to drop because they need to be + routed and they have a unicast destination IP and a multicast destination + MAC + * - ``dip_is_loopback_address`` + - ``drop`` + - Traps packets that the device decided to drop because they need to be + routed and their destination IP is the loopback address (i.e., 127.0.0.0/8 + and ::1/128) + * - ``sip_is_mc`` + - ``drop`` + - Traps packets that the device decided to drop because they need to be + routed and their source IP is multicast (i.e., 224.0.0.0/4 and ff00::/8) + * - ``sip_is_loopback_address`` + - ``drop`` + - Traps packets that the device decided to drop because they need to be + routed and their source IP is the loopback address (i.e., 127.0.0.0/8 and ::1/128) + * - ``ip_header_corrupted`` + - ``drop`` + - Traps packets that the device decided to drop because they need to be + routed and their IP header is corrupted: wrong checksum, wrong IP version + or too short Internet Header Length (IHL) + * - ``ipv4_sip_is_limited_bc`` + - ``drop`` + - Traps packets that the device decided to drop because they need to be + routed and their source IP is limited broadcast (i.e., 255.255.255.255/32) + * - ``ipv6_mc_dip_reserved_scope`` + - ``drop`` + - Traps IPv6 packets that the device decided to drop because they need to + be routed and their IPv6 multicast
destination IP has a reserved scope + (i.e., ffx0::/16) + * - ``ipv6_mc_dip_interface_local_scope`` + - ``drop`` + - Traps IPv6 packets that the device decided to drop because they need to + be routed and their IPv6 multicast destination IP has an interface-local scope + (i.e., ffx1::/16) + * - ``mtu_value_is_too_small`` + - ``exception`` + - Traps packets that should have been routed by the device, but were bigger + than the MTU of the egress interface + * - ``unresolved_neigh`` + - ``exception`` + - Traps packets that did not have a matching IP neighbour after routing + * - ``mc_reverse_path_forwarding`` + - ``exception`` + - Traps multicast IP packets that failed reverse-path forwarding (RPF) + check during multicast routing + * - ``reject_route`` + - ``exception`` + - Traps packets that hit reject routes (i.e., "unreachable", "prohibit") + * - ``ipv4_lpm_miss`` + - ``exception`` + - Traps unicast IPv4 packets that did not match any route + * - ``ipv6_lpm_miss`` + - ``exception`` + - Traps unicast IPv6 packets that did not match any route Driver-specific Packet Traps ============================ diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 319e5e041f38..c4a328f2d57a 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -770,10 +770,10 @@ Some core changes of the new internal format: callq foo mov %rax,%r13 mov %rbx,%rdi - mov $0x2,%esi - mov $0x3,%edx - mov $0x4,%ecx - mov $0x5,%r8d + mov $0x6,%esi + mov $0x7,%edx + mov $0x8,%ecx + mov $0x9,%r8d callq bar add %r13,%rax mov -0x228(%rbp),%rbx diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index d4dca42910d0..5acab1290e03 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -33,6 +33,7 @@ Contents: scaling tls tls-offload + nfc .. 
only:: subproject and html diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 49e95f438ed7..fd26788e8c96 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -207,8 +207,8 @@ TCP variables: somaxconn - INTEGER Limit of socket listen() backlog, known in userspace as SOMAXCONN. - Defaults to 128. See also tcp_max_syn_backlog for additional tuning - for TCP sockets. + Defaults to 4096. (Was 128 before linux-5.4) + See also tcp_max_syn_backlog for additional tuning for TCP sockets. tcp_abort_on_overflow - BOOLEAN If listening service is too slow to accept new connections, @@ -408,11 +408,14 @@ tcp_max_orphans - INTEGER up to ~64K of unswappable memory. tcp_max_syn_backlog - INTEGER - Maximal number of remembered connection requests, which have not - received an acknowledgment from connecting client. + Maximal number of remembered connection requests (SYN_RECV), + which have not received an acknowledgment from connecting client. + This is a per-listener limit. The minimal value is 128 for low memory machines, and it will increase in proportion to the memory of machine. If server suffers from overload, try increasing this number. + Remember to also check /proc/sys/net/core/somaxconn + A SYN_RECV request socket consumes about 304 bytes of memory. tcp_max_tw_buckets - INTEGER Maximal number of timewait sockets held by system simultaneously. @@ -901,8 +904,9 @@ ip_local_port_range - 2 INTEGERS Defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first, the second the last local port number. - If possible, it is better these numbers have different parity. - (one even and one odd values) + If possible, it is better these numbers have different parity + (one even and one odd value). + Must be greater than or equal to ip_unprivileged_port_start. The default values are 32768 and 60999 respectively. 
ip_local_reserved_ports - list of comma separated ranges @@ -940,8 +944,8 @@ ip_unprivileged_port_start - INTEGER This is a per-namespace sysctl. It defines the first unprivileged port in the network namespace. Privileged ports require root or CAP_NET_BIND_SERVICE in order to bind to them. - To disable all privileged ports, set this to 0. It may not - overlap with the ip_local_reserved_ports range. + To disable all privileged ports, set this to 0. They must not + overlap with the ip_local_port_range. Default: 1024 @@ -2088,6 +2092,28 @@ pf_enable - INTEGER Default: 1 +pf_expose - INTEGER + Unset or enable/disable pf (pf is short for potentially failed) state + exposure. Applications can control the exposure of the PF path state + in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO + sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with + SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info + can be obtained via SCTP_GET_PEER_ADDR_INFO sockopt; when it's enabled, + a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming + SCTP_PF state and a SCTP_PF-state transport info can be obtained via + SCTP_GET_PEER_ADDR_INFO sockopt; when it's disabled, no + SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when + trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO + sockopt. + + 0: Unset pf state exposure, compatible with old applications. + + 1: Disable pf state exposure. + + 2: Enable pf state exposure. + + Default: 0 + addip_noauth_enable - BOOLEAN Dynamic Address Reconfiguration (ADD-IP) requires the use of authentication to protect the operations of adding or removing new @@ -2170,6 +2196,18 @@ pf_retrans - INTEGER Default: 0 +ps_retrans - INTEGER + Primary.Switchover.Max.Retrans (PSMR), it's a tunable parameter coming + from section-5 "Primary Path Switchover" in rfc7829.
The primary path + will be changed to another active path when the path error counter on + the old primary path exceeds PSMR, so that "the SCTP sender is allowed + to continue data transmission on a new working path even when the old + primary destination address becomes active again". Note this feature + is disabled by initializing 'ps_retrans' per netns as 0xffff by default, + and its value can't be less than 'pf_retrans' when changing by sysctl. + + Default: 0xffff + rto_initial - INTEGER The initial round trip timeout value in milliseconds that will be used in calculating round trip times. This is the initial time interval diff --git a/Documentation/networking/nfc.txt b/Documentation/networking/nfc.rst index b24c29bdae27..9aab3a88c9b2 100644 --- a/Documentation/networking/nfc.txt +++ b/Documentation/networking/nfc.rst @@ -1,3 +1,4 @@ +=================== Linux NFC subsystem =================== @@ -8,7 +9,7 @@ This document covers the architecture overview, the device driver interface description and the userspace interface description. Architecture overview ---------------------- +===================== The NFC subsystem is responsible for: - NFC adapters management; @@ -25,33 +26,34 @@ The control operations are available to userspace via generic netlink. The low-level data exchange interface is provided by the new socket family PF_NFC. The NFC_SOCKPROTO_RAW performs raw communication with NFC targets. - - +--------------------------------------+ - | USER SPACE | - +--------------------------------------+ - ^ ^ - | low-level | control - | data exchange | operations - | | - | v - | +-----------+ - | AF_NFC | netlink | - | socket +-----------+ - | raw ^ - | | - v v - +---------+ +-----------+ - | rawsock | <--------> | core | - +---------+ +-----------+ - ^ - | - v - +-----------+ - | driver | - +-----------+ +.. 
code-block:: none + + +--------------------------------------+ + | USER SPACE | + +--------------------------------------+ + ^ ^ + | low-level | control + | data exchange | operations + | | + | v + | +-----------+ + | AF_NFC | netlink | + | socket +-----------+ + | raw ^ + | | + v v + +---------+ +-----------+ + | rawsock | <--------> | core | + +---------+ +-----------+ + ^ + | + v + +-----------+ + | driver | + +-----------+ Device Driver Interface ------------------------ +======================= When registering on the NFC subsystem, the device driver must inform the core of the set of supported NFC protocols and the set of ops callbacks. The ops @@ -64,7 +66,7 @@ callbacks that must be implemented are the following: * data_exchange - send data and receive the response (transceive operation) Userspace interface --------------------- +=================== The userspace interface is divided in control operations and low-level data exchange operation. @@ -82,7 +84,7 @@ The operations are composed by commands and events, all listed below: * NFC_EVENT_DEVICE_ADDED - reports an NFC device addition * NFC_EVENT_DEVICE_REMOVED - reports an NFC device removal * NFC_EVENT_TARGETS_FOUND - reports START_POLL results when 1 or more targets -are found + are found The user must call START_POLL to poll for NFC targets, passing the desired NFC protocols through NFC_ATTR_PROTOCOLS attribute. The device remains in polling @@ -101,14 +103,14 @@ it's closed. LOW-LEVEL DATA EXCHANGE: The userspace must use PF_NFC sockets to perform any data communication with -targets. All NFC sockets use AF_NFC: - -struct sockaddr_nfc { - sa_family_t sa_family; - __u32 dev_idx; - __u32 target_idx; - __u32 nfc_protocol; -}; +targets. 
All NFC sockets use AF_NFC:: + + struct sockaddr_nfc { + sa_family_t sa_family; + __u32 dev_idx; + __u32 target_idx; + __u32 nfc_protocol; + }; To establish a connection with one target, the user must create an NFC_SOCKPROTO_RAW socket and call the 'connect' syscall with the sockaddr_nfc diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index a689966bc4be..cda1c0a0492a 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -352,7 +352,8 @@ Fills the phydev structure with up-to-date information about the current settings in the PHY. :: - int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd); + int phy_ethtool_ksettings_set(struct phy_device *phydev, + const struct ethtool_link_ksettings *cmd); Ethtool convenience functions. :: diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst index 0dd3f748239f..f914e81fd3a6 100644 --- a/Documentation/networking/tls-offload.rst +++ b/Documentation/networking/tls-offload.rst @@ -436,6 +436,10 @@ by the driver: encryption. * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream but did not arrive in the expected order. + * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of + a TLS stream and arrived out-of-order, but skipped the HW offload routine + and went to the regular transmit flow as they were retransmissions of the + connection handshake. * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of a TLS stream dropped, because they arrived out of order and associated record could not be found. diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst index 5bcbf75e2025..8cb2cd4e2a80 100644 --- a/Documentation/networking/tls.rst +++ b/Documentation/networking/tls.rst @@ -213,3 +213,29 @@ A patchset to OpenSSL to use ktls as the record layer is of calling send directly after a handshake using gnutls. 
Since it doesn't implement a full record layer, control messages are not supported. + +Statistics +========== + +TLS implementation exposes the following per-namespace statistics +(``/proc/net/tls_stat``): + +- ``TlsCurrTxSw``, ``TlsCurrRxSw`` - + number of TX and RX sessions currently installed where host handles + cryptography + +- ``TlsCurrTxDevice``, ``TlsCurrRxDevice`` - + number of TX and RX sessions currently installed where NIC handles + cryptography + +- ``TlsTxSw``, ``TlsRxSw`` - + number of TX and RX sessions opened with host cryptography + +- ``TlsTxDevice``, ``TlsRxDevice`` - + number of TX and RX sessions opened with NIC cryptography + +- ``TlsDecryptError`` - + record decryption failed (e.g. due to incorrect authentication tag) + +- ``TlsDeviceRxResync`` - + number of RX resyncs sent to NICs handling cryptography diff --git a/Documentation/power/drivers-testing.rst b/Documentation/power/drivers-testing.rst index e53f1999fc39..d77d2894f9fe 100644 --- a/Documentation/power/drivers-testing.rst +++ b/Documentation/power/drivers-testing.rst @@ -39,9 +39,10 @@ c) Compile the driver directly into the kernel and try the test modes of d) Attempt to hibernate with the driver compiled directly into the kernel in the "reboot", "shutdown" and "platform" modes. -e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.rst, - 2). [As far as the STR tests are concerned, it should not matter whether or - not the driver is built as a module.] +e) Try the test modes of suspend (see: + Documentation/power/basic-pm-debugging.rst, 2). [As far as the STR tests are + concerned, it should not matter whether or not the driver is built as a + module.] f) Attempt to suspend to RAM using the s2ram tool with the driver loaded (see: Documentation/power/basic-pm-debugging.rst, 2). 
diff --git a/Documentation/power/freezing-of-tasks.rst b/Documentation/power/freezing-of-tasks.rst index ef110fe55e82..8bd693399834 100644 --- a/Documentation/power/freezing-of-tasks.rst +++ b/Documentation/power/freezing-of-tasks.rst @@ -215,30 +215,31 @@ VI. Are there any precautions to be taken to prevent freezing failures? Yes, there are. -First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a piece of code -from system-wide sleep such as suspend/hibernation is not encouraged. -If possible, that piece of code must instead hook onto the suspend/hibernation -notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code -(kernel/cpu.c) for an example. - -However, if that is not feasible, and grabbing 'system_transition_mutex' is deemed necessary, -it is strongly discouraged to directly call mutex_[un]lock(&system_transition_mutex) since -that could lead to freezing failures, because if the suspend/hibernate code -successfully acquired the 'system_transition_mutex' lock, and hence that other entity failed -to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE -state. As a consequence, the freezer would not be able to freeze that task, -leading to freezing failure. +First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a +piece of code from system-wide sleep such as suspend/hibernation is not +encouraged. If possible, that piece of code must instead hook onto the +suspend/hibernation notifiers to achieve mutual exclusion. Look at the +CPU-Hotplug code (kernel/cpu.c) for an example. 
+ +However, if that is not feasible, and grabbing 'system_transition_mutex' is +deemed necessary, it is strongly discouraged to directly call +mutex_[un]lock(&system_transition_mutex) since that could lead to freezing +failures, because if the suspend/hibernate code successfully acquired the +'system_transition_mutex' lock, and hence that other entity failed to acquire +the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE state. As a +consequence, the freezer would not be able to freeze that task, leading to +freezing failure. However, the [un]lock_system_sleep() APIs are safe to use in this scenario, since they ask the freezer to skip freezing this task, since it is anyway -"frozen enough" as it is blocked on 'system_transition_mutex', which will be released -only after the entire suspend/hibernation sequence is complete. -So, to summarize, use [un]lock_system_sleep() instead of directly using +"frozen enough" as it is blocked on 'system_transition_mutex', which will be +released only after the entire suspend/hibernation sequence is complete. So, to +summarize, use [un]lock_system_sleep() instead of directly using mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures. V. Miscellaneous ================ /sys/power/pm_freeze_timeout controls how long it will cost at most to freeze -all user space processes or all freezable kernel threads, in unit of millisecond. -The default value is 20000, with range of unsigned integer. +all user space processes or all freezable kernel threads, in unit of +millisecond. The default value is 20000, with range of unsigned integer. diff --git a/Documentation/power/opp.rst b/Documentation/power/opp.rst index 209c7613f5a4..e3cc4f349ea8 100644 --- a/Documentation/power/opp.rst +++ b/Documentation/power/opp.rst @@ -73,19 +73,21 @@ factors. 
Example usage: Thermal management or other exceptional situations where SoC framework might choose to disable a higher frequency OPP to safely continue operations until that OPP could be re-enabled if possible. -OPP library facilitates this concept in it's implementation. The following +OPP library facilitates this concept in its implementation. The following operational functions operate only on available opps: -opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count +opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, +dev_pm_opp_get_opp_count -dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then -be used for dev_pm_opp_enable/disable functions to make an opp available as required. +dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer +which can then be used for dev_pm_opp_enable/disable functions to make an +opp available as required. WARNING: Users of OPP library should refresh their availability count using -get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device, the -exact mechanism to trigger these or the notification mechanism to other -dependent subsystems such as cpufreq are left to the discretion of the SoC -specific framework which uses the OPP library. Similar care needs to be taken -care to refresh the cpufreq table in cases of these operations. +get_opp_count if dev_pm_opp_enable/disable functions are invoked for a +device, the exact mechanism to trigger these or the notification mechanism +to other dependent subsystems such as cpufreq are left to the discretion of +the SoC specific framework which uses the OPP library. Similar care needs +to be taken care to refresh the cpufreq table in cases of these operations. 2. Initial OPP List Registration ================================ @@ -99,11 +101,11 @@ OPPs dynamically using the dev_pm_opp_enable / disable functions. 
dev_pm_opp_add Add a new OPP for a specific domain represented by the device pointer. The OPP is defined using the frequency and voltage. Once added, the OPP - is assumed to be available and control of it's availability can be done - with the dev_pm_opp_enable/disable functions. OPP library internally stores - and manages this information in the opp struct. This function may be - used by SoC framework to define a optimal list as per the demands of - SoC usage environment. + is assumed to be available and control of its availability can be done + with the dev_pm_opp_enable/disable functions. OPP library + internally stores and manages this information in the opp struct. + This function may be used by SoC framework to define a optimal list + as per the demands of SoC usage environment. WARNING: Do not use this function in interrupt context. @@ -354,7 +356,7 @@ struct dev_pm_opp struct device This is used to identify a domain to the OPP layer. The - nature of the device and it's implementation is left to the user of + nature of the device and its implementation is left to the user of OPP library such as the SoC framework. Overall, in a simplistic view, the data structure operations is represented as diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index 0e2ef7429304..51e0a493d284 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -426,12 +426,12 @@ pm->runtime_idle() callback. 2.4. System-Wide Power Transitions ---------------------------------- There are a few different types of system-wide power transitions, described in -Documentation/driver-api/pm/devices.rst. Each of them requires devices to be handled -in a specific way and the PM core executes subsystem-level power management -callbacks for this purpose. They are executed in phases such that each phase -involves executing the same subsystem-level callback for every device belonging -to the given subsystem before the next phase begins. 
These phases always run -after tasks have been frozen. +Documentation/driver-api/pm/devices.rst. Each of them requires devices to be +handled in a specific way and the PM core executes subsystem-level power +management callbacks for this purpose. They are executed in phases such that +each phase involves executing the same subsystem-level callback for every device +belonging to the given subsystem before the next phase begins. These phases +always run after tasks have been frozen. 2.4.1. System Suspend ^^^^^^^^^^^^^^^^^^^^^ @@ -636,12 +636,12 @@ System restore requires a hibernation image to be loaded into memory and the pre-hibernation memory contents to be restored before the pre-hibernation system activity can be resumed. -As described in Documentation/driver-api/pm/devices.rst, the hibernation image is loaded -into memory by a fresh instance of the kernel, called the boot kernel, which in -turn is loaded and run by a boot loader in the usual way. After the boot kernel -has loaded the image, it needs to replace its own code and data with the code -and data of the "hibernated" kernel stored within the image, called the image -kernel. For this purpose all devices are frozen just like before creating +As described in Documentation/driver-api/pm/devices.rst, the hibernation image +is loaded into memory by a fresh instance of the kernel, called the boot kernel, +which in turn is loaded and run by a boot loader in the usual way. After the +boot kernel has loaded the image, it needs to replace its own code and data with +the code and data of the "hibernated" kernel stored within the image, called the +image kernel. For this purpose all devices are frozen just like before creating the image during hibernation, in the prepare, freeze, freeze_noirq @@ -691,8 +691,8 @@ controlling the runtime power management of their devices. 
At the time of this writing there are two ways to define power management callbacks for a PCI device driver, the recommended one, based on using a -dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and the -"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and +dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and +the "legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and .resume() callbacks from struct pci_driver are used. The legacy approach, however, doesn't allow one to define runtime power management callbacks and is not really suitable for any new drivers. Therefore it is not covered by this diff --git a/Documentation/power/pm_qos_interface.rst b/Documentation/power/pm_qos_interface.rst index 3097694fba69..0d62d506caf0 100644 --- a/Documentation/power/pm_qos_interface.rst +++ b/Documentation/power/pm_qos_interface.rst @@ -8,8 +8,8 @@ one of the parameters. Two different PM QoS frameworks are available: 1. PM QoS classes for cpu_dma_latency -2. the per-device PM QoS framework provides the API to manage the per-device latency -constraints and PM QoS flags. +2. The per-device PM QoS framework provides the API to manage the + per-device latency constraints and PM QoS flags. Each parameters have defined units: @@ -47,14 +47,14 @@ void pm_qos_add_request(handle, param_class, target_value): pm_qos API functions. void pm_qos_update_request(handle, new_target_value): - Will update the list element pointed to by the handle with the new target value - and recompute the new aggregated target, calling the notification tree if the - target is changed. + Will update the list element pointed to by the handle with the new target + value and recompute the new aggregated target, calling the notification tree + if the target is changed. void pm_qos_remove_request(handle): - Will remove the element. 
After removal it will update the aggregate target and - call the notification tree if the target was changed as a result of removing - the request. + Will remove the element. After removal it will update the aggregate target + and call the notification tree if the target was changed as a result of + removing the request. int pm_qos_request(param_class): Returns the aggregated value for a given PM QoS class. @@ -167,9 +167,9 @@ int dev_pm_qos_expose_flags(device, value) change the value of the PM_QOS_FLAG_NO_POWER_OFF flag. void dev_pm_qos_hide_flags(device) - Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list - of flags and remove sysfs attribute pm_qos_no_power_off from the device's power - directory. + Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS + list of flags and remove sysfs attribute pm_qos_no_power_off from the device's + power directory. Notification mechanisms: @@ -179,8 +179,8 @@ int dev_pm_qos_add_notifier(device, notifier, type): Adds a notification callback function for the device for a particular request type. - The callback is called when the aggregated value of the device constraints list - is changed. + The callback is called when the aggregated value of the device constraints + list is changed. int dev_pm_qos_remove_notifier(device, notifier, type): Removes the notification callback function for the device. 
diff --git a/Documentation/power/runtime_pm.rst b/Documentation/power/runtime_pm.rst index 2c2ec99b5088..ab8406c84254 100644 --- a/Documentation/power/runtime_pm.rst +++ b/Documentation/power/runtime_pm.rst @@ -268,8 +268,8 @@ defined in include/linux/pm.h: `unsigned int runtime_auto;` - if set, indicates that the user space has allowed the device driver to power manage the device at run time via the /sys/devices/.../power/control - `interface;` it may only be modified with the help of the pm_runtime_allow() - and pm_runtime_forbid() helper functions + `interface;` it may only be modified with the help of the + pm_runtime_allow() and pm_runtime_forbid() helper functions `unsigned int no_callbacks;` - indicates that the device does not use the runtime PM callbacks (see diff --git a/Documentation/power/suspend-and-cpuhotplug.rst b/Documentation/power/suspend-and-cpuhotplug.rst index 7ac8e1f549f4..572d968c5375 100644 --- a/Documentation/power/suspend-and-cpuhotplug.rst +++ b/Documentation/power/suspend-and-cpuhotplug.rst @@ -106,8 +106,8 @@ execution during resume): * Release system_transition_mutex lock. -It is to be noted here that the system_transition_mutex lock is acquired at the very -beginning, when we are just starting out to suspend, and then released only +It is to be noted here that the system_transition_mutex lock is acquired at the +very beginning, when we are just starting out to suspend, and then released only after the entire cycle is complete (i.e., suspend + resume). 
:: @@ -165,7 +165,8 @@ Important files and functions/entry points: - kernel/power/process.c : freeze_processes(), thaw_processes() - kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() -- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() +- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), + [disable|enable]_nonboot_cpus() diff --git a/Documentation/power/swsusp.rst b/Documentation/power/swsusp.rst index d000312f6965..8524f079e05c 100644 --- a/Documentation/power/swsusp.rst +++ b/Documentation/power/swsusp.rst @@ -118,7 +118,8 @@ In a really perfect world:: echo 1 > /proc/acpi/sleep # for standby echo 2 > /proc/acpi/sleep # for suspend to ram - echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power conservative + echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power + # conservative echo 4 > /proc/acpi/sleep # for suspend to disk echo 5 > /proc/acpi/sleep # for shutdown unfriendly the system @@ -192,8 +193,8 @@ Q: A: The freezing of tasks is a mechanism by which user space processes and some - kernel threads are controlled during hibernation or system-wide suspend (on some - architectures). See freezing-of-tasks.txt for details. + kernel threads are controlled during hibernation or system-wide suspend (on + some architectures). See freezing-of-tasks.txt for details. Q: What is the difference between "platform" and "shutdown"? @@ -282,7 +283,8 @@ A: suspend(PMSG_FREEZE): devices are frozen so that they don't interfere with state snapshot - state snapshot: copy of whole used memory is taken with interrupts disabled + state snapshot: copy of whole used memory is taken with interrupts + disabled resume(): devices are woken up so that we can write image to swap @@ -353,8 +355,8 @@ Q: A: Generally, yes, you can. 
However, it requires you to use the "resume=" and - "resume_offset=" kernel command line parameters, so the resume from a swap file - cannot be initiated from an initrd or initramfs image. See + "resume_offset=" kernel command line parameters, so the resume from a swap + file cannot be initiated from an initrd or initramfs image. See swsusp-and-swap-files.txt for details. Q: diff --git a/Documentation/sound/kernel-api/writing-an-alsa-driver.rst b/Documentation/sound/kernel-api/writing-an-alsa-driver.rst index 132f5eb9b530..f169d58ca019 100644 --- a/Documentation/sound/kernel-api/writing-an-alsa-driver.rst +++ b/Documentation/sound/kernel-api/writing-an-alsa-driver.rst @@ -805,6 +805,7 @@ destructor and PCI entries. Example code is shown first, below. return -EBUSY; } chip->irq = pci->irq; + card->sync_irq = chip->irq; /* (2) initialization of the chip hardware */ .... /* (not implemented in this document) */ @@ -965,6 +966,15 @@ usually like the following: return IRQ_HANDLED; } +After requesting the IRQ, you can pass it to the ``card->sync_irq`` +field: +:: + + card->sync_irq = chip->irq; + +This allows the PCM core to automatically perform +:c:func:`synchronize_irq()` at the necessary points, such as ``hw_free``. +See the later section `sync_stop callback`_ for details. Now let's write the corresponding destructor for the resources above. The role of destructor is simple: disable the hardware (if already @@ -1270,21 +1280,23 @@ shows only the skeleton, how to build up the PCM interfaces. /* the hardware-specific codes will be here */ .... return 0; - } /* hw_params callback */ static int snd_mychip_pcm_hw_params(struct snd_pcm_substream *substream, struct snd_pcm_hw_params *hw_params) { - return snd_pcm_lib_malloc_pages(substream, - params_buffer_bytes(hw_params)); + /* the hardware-specific codes will be here */ + ....
+ return 0; } /* hw_free callback */ static int snd_mychip_pcm_hw_free(struct snd_pcm_substream *substream) { - return snd_pcm_lib_free_pages(substream); + /* the hardware-specific codes will be here */ + .... + return 0; } /* prepare callback */ @@ -1339,7 +1351,6 @@ shows only the skeleton, how to build up the PCM interfaces. static struct snd_pcm_ops snd_mychip_playback_ops = { .open = snd_mychip_playback_open, .close = snd_mychip_playback_close, - .ioctl = snd_pcm_lib_ioctl, .hw_params = snd_mychip_pcm_hw_params, .hw_free = snd_mychip_pcm_hw_free, .prepare = snd_mychip_pcm_prepare, @@ -1351,7 +1362,6 @@ shows only the skeleton, how to build up the PCM interfaces. static struct snd_pcm_ops snd_mychip_capture_ops = { .open = snd_mychip_capture_open, .close = snd_mychip_capture_close, - .ioctl = snd_pcm_lib_ioctl, .hw_params = snd_mychip_pcm_hw_params, .hw_free = snd_mychip_pcm_hw_free, .prepare = snd_mychip_pcm_prepare, @@ -1382,9 +1392,9 @@ shows only the skeleton, how to build up the PCM interfaces. &snd_mychip_capture_ops); /* pre-allocation of buffers */ /* NOTE: this may fail */ - snd_pcm_lib_preallocate_pages_for_all(pcm, SNDRV_DMA_TYPE_DEV, - snd_dma_pci_data(chip->pci), - 64*1024, 64*1024); + snd_pcm_set_managed_buffer_all(pcm, SNDRV_DMA_TYPE_DEV, + &chip->pci->dev, + 64*1024, 64*1024); return 0; } @@ -1454,7 +1464,6 @@ The operators are defined typically like this: static struct snd_pcm_ops snd_mychip_playback_ops = { .open = snd_mychip_pcm_open, .close = snd_mychip_pcm_close, - .ioctl = snd_pcm_lib_ioctl, .hw_params = snd_mychip_pcm_hw_params, .hw_free = snd_mychip_pcm_hw_free, .prepare = snd_mychip_pcm_prepare, @@ -1465,13 +1474,14 @@ The operators are defined typically like this: All the callbacks are described in the Operators_ subsection. After setting the operators, you probably will want to pre-allocate the -buffer. For the pre-allocation, simply call the following: +buffer and set up the managed allocation mode. 
+For that, simply call the following: :: - snd_pcm_lib_preallocate_pages_for_all(pcm, SNDRV_DMA_TYPE_DEV, - snd_dma_pci_data(chip->pci), - 64*1024, 64*1024); + snd_pcm_set_managed_buffer_all(pcm, SNDRV_DMA_TYPE_DEV, + &chip->pci->dev, + 64*1024, 64*1024); It will allocate a buffer up to 64kB as default. Buffer management details will be described in the later section `Buffer and Memory @@ -1621,8 +1631,7 @@ For the operators (callbacks) of each sound driver, most of these records are supposed to be read-only. Only the PCM middle-layer changes / updates them. The exceptions are the hardware description (hw) DMA buffer information and the private data. Besides, if you use the -standard buffer allocation method via -:c:func:`snd_pcm_lib_malloc_pages()`, you don't need to set the +standard managed buffer allocation mode, you don't need to set the DMA buffer information by yourself. In the sections below, important records are explained. @@ -1776,8 +1785,8 @@ the physical address of the buffer. This field is specified only when the buffer is a linear buffer. ``dma_bytes`` holds the size of buffer in bytes. ``dma_private`` is used for the ALSA DMA allocator. -If you use a standard ALSA function, -:c:func:`snd_pcm_lib_malloc_pages()`, for allocating the buffer, +If you use either the managed buffer allocation mode or the standard +API function :c:func:`snd_pcm_lib_malloc_pages()` for allocating the buffer, these fields are set by the ALSA middle layer, and you should *not* change them by yourself. You can read them but not write them. On the other hand, if you want to allocate the buffer by yourself, you'll @@ -1911,7 +1920,10 @@ ioctl callback ~~~~~~~~~~~~~~ This is used for any special call to pcm ioctls. But usually you can -pass a generic ioctl callback, :c:func:`snd_pcm_lib_ioctl()`. +leave it as NULL, then PCM core calls the generic ioctl callback +function :c:func:`snd_pcm_lib_ioctl()`. 
If you need to deal with the +unique setup of channel info or reset procedure, you can pass your own +callback function here. hw_params callback ~~~~~~~~~~~~~~~~~~~ @@ -1929,8 +1941,12 @@ Many hardware setups should be done in this callback, including the allocation of buffers. Parameters to be initialized are retrieved by -:c:func:`params_xxx()` macros. To allocate buffer, you can call a -helper function, +:c:func:`params_xxx()` macros. + +When you set up the managed buffer allocation mode for the substream, +a buffer is already allocated before this callback gets +called. Alternatively, you can call a helper function below for +allocating the buffer, too. :: @@ -1964,18 +1980,23 @@ hw_free callback static int snd_xxx_hw_free(struct snd_pcm_substream *substream); This is called to release the resources allocated via -``hw_params``. For example, releasing the buffer via -:c:func:`snd_pcm_lib_malloc_pages()` is done by calling the -following: - -:: - - snd_pcm_lib_free_pages(substream); +``hw_params``. This function is always called before the close callback is called. Also, the callback may be called multiple times, too. Keep track whether the resource was already released. +When you have set up the managed buffer allocation mode for the PCM +substream, the allocated PCM buffer will be automatically released +after this callback gets called. Otherwise you'll have to release the +buffer manually. Typically, when the buffer was allocated from the +pre-allocated pool, you can use the standard API function +:c:func:`snd_pcm_lib_malloc_pages()` like: + +:: + + snd_pcm_lib_free_pages(substream); + prepare callback ~~~~~~~~~~~~~~~~ @@ -2048,6 +2069,37 @@ flag set, and you cannot call functions which may sleep. The triggering the DMA. The other stuff should be initialized ``hw_params`` and ``prepare`` callbacks properly beforehand. 
+sync_stop callback +~~~~~~~~~~~~~~~~~~ + +:: + + static int snd_xxx_sync_stop(struct snd_pcm_substream *substream); + +This callback is optional, and NULL can be passed. It's called after +the PCM core stops the stream, before it changes the stream state via +``prepare``, ``hw_params`` or ``hw_free``. +Since the IRQ handler might be still pending, we need to wait until +the pending task finishes before moving to the next step; otherwise it +might lead to a crash due to resource conflicts or access to the freed +resources. A typical behavior is to call a synchronization function +like :c:func:`synchronize_irq()` here. + +For the majority of drivers that need only a call of +:c:func:`synchronize_irq()`, there is a simpler setup, too. +While keeping the ``sync_stop`` PCM callback NULL, the driver can set +the ``card->sync_irq`` field to store the valid interrupt number after +requesting an IRQ, instead. Then the PCM core will call +:c:func:`synchronize_irq()` with the given IRQ appropriately. + +If the IRQ handler is released at the card destructor, you don't need +to clear ``card->sync_irq``, as the card itself is being released. +So, usually you'll need to add just a single line for assigning +``card->sync_irq`` in the driver code unless the driver re-acquires +the IRQ. When the driver frees and re-acquires the IRQ dynamically +(e.g. for suspend/resume), it needs to clear and re-set +``card->sync_irq`` again appropriately. + pointer callback ~~~~~~~~~~~~~~~~ @@ -2095,10 +2147,12 @@ This callback is atomic as default. page callback ~~~~~~~~~~~~~ -This callback is optional too. This callback is used mainly for -non-contiguous buffers. The mmap calls this callback to get the page -address. Some examples will be explained in the later section `Buffer -and Memory Management`_, too. +This callback is optional too. The mmap calls this callback to get the +page fault address. + +Since the recent changes, you need no special callback any longer for +the standard SG-buffer or vmalloc-buffer.
Hence this callback should +be rarely used. mmap callback ~~~~~~~~~~~~~~ @@ -3512,7 +3566,7 @@ bus). :: snd_pcm_lib_preallocate_pages_for_all(pcm, SNDRV_DMA_TYPE_DEV, - snd_dma_pci_data(pci), size, max); + &pci->dev, size, max); where ``size`` is the byte size to be pre-allocated and the ``max`` is the maximum size to be changed via the ``prealloc`` proc file. The @@ -3523,12 +3577,14 @@ The second argument (type) and the third argument (device pointer) are dependent on the bus. For normal devices, pass the device pointer (typically identical as ``card->dev``) to the third argument with ``SNDRV_DMA_TYPE_DEV`` type. A continuous buffer unrelated to the -bus can be pre-allocated with ``SNDRV_DMA_TYPE_CONTINUOUS`` type and the -``snd_dma_continuous_data(GFP_KERNEL)`` device pointer, where -``GFP_KERNEL`` is the kernel allocation flag to use. For the -scatter-gather buffers, use ``SNDRV_DMA_TYPE_DEV_SG`` with the device -pointer (see the `Non-Contiguous Buffers`_ -section). +bus can be pre-allocated with ``SNDRV_DMA_TYPE_CONTINUOUS`` type. +You can pass NULL to the device pointer in that case, which is the +default mode and implies allocation with the ``GFP_KERNEL`` flag. +If you need a different GFP flag, you can pass it by encoding the flag +into the device pointer via a special macro +:c:func:`snd_dma_continuous_data()`. +For the scatter-gather buffers, use ``SNDRV_DMA_TYPE_DEV_SG`` with the +device pointer (see the `Non-Contiguous Buffers`_ section). Once the buffer is pre-allocated, you can use the allocator in the ``hw_params`` callback: @@ -3539,6 +3595,25 @@ Once the buffer is pre-allocated, you can use the allocator in the Note that you have to pre-allocate to use this function. +Most drivers, though, use the newly introduced "managed +buffer allocation mode" rather than manual allocation and release. +This is done by calling :c:func:`snd_pcm_set_managed_buffer_all()` +instead of :c:func:`snd_pcm_lib_preallocate_pages_for_all()`.
+ +:: + + snd_pcm_set_managed_buffer_all(pcm, SNDRV_DMA_TYPE_DEV, + &pci->dev, size, max); + +where passed arguments are identical in both functions. +The difference in the managed mode is that PCM core will call +:c:func:`snd_pcm_lib_malloc_pages()` internally already before calling +the PCM ``hw_params`` callback, and call :c:func:`snd_pcm_lib_free_pages()` +after the PCM ``hw_free`` callback automatically. So the driver +doesn't have to call these functions explicitly in its callback any +longer. This made many driver code having NULL ``hw_params`` and +``hw_free`` entries. + External Hardware Buffers ------------------------- @@ -3693,20 +3768,26 @@ provides an interface for handling SG-buffers. The API is provided in ``<sound/pcm.h>``. For creating the SG-buffer handler, call -:c:func:`snd_pcm_lib_preallocate_pages()` or -:c:func:`snd_pcm_lib_preallocate_pages_for_all()` with +:c:func:`snd_pcm_set_managed_buffer()` or +:c:func:`snd_pcm_set_managed_buffer_all()` with ``SNDRV_DMA_TYPE_DEV_SG`` in the PCM constructor like other PCI -pre-allocator. You need to pass ``snd_dma_pci_data(pci)``, where pci is +pre-allocator. You need to pass ``&pci->dev``, where pci is the :c:type:`struct pci_dev <pci_dev>` pointer of the chip as -well. The ``struct snd_sg_buf`` instance is created as -``substream->dma_private``. You can cast the pointer like: +well. + +:: + + snd_pcm_set_managed_buffer_all(pcm, SNDRV_DMA_TYPE_DEV_SG, + &pci->dev, size, max); + +The ``struct snd_sg_buf`` instance is created as +``substream->dma_private`` in turn. You can cast the pointer like: :: struct snd_sg_buf *sgbuf = (struct snd_sg_buf *)substream->dma_private; -Then call :c:func:`snd_pcm_lib_malloc_pages()` in the ``hw_params`` -callback as well as in the case of normal PCI buffer. The SG-buffer +Then in :c:func:`snd_pcm_lib_malloc_pages()` call, the common SG-buffer handler will allocate the non-contiguous kernel pages of the given size and map them onto the virtually contiguous memory. 
The virtual pointer is addressed in runtime->dma_area. The physical address @@ -3715,41 +3796,40 @@ physically non-contiguous. The physical address table is set up in ``sgbuf->table``. You can get the physical address at a certain offset via :c:func:`snd_pcm_sgbuf_get_addr()`. -When a SG-handler is used, you need to set -:c:func:`snd_pcm_sgbuf_ops_page()` as the ``page`` callback. (See -`page callback`_ section.) - -To release the data, call :c:func:`snd_pcm_lib_free_pages()` in -the ``hw_free`` callback as usual. +If you need to release the SG-buffer data explicitly, call the +standard API function :c:func:`snd_pcm_lib_free_pages()` as usual. Vmalloc'ed Buffers ------------------ It's possible to use a buffer allocated via :c:func:`vmalloc()`, for -example, for an intermediate buffer. Since the allocated pages are not -contiguous, you need to set the ``page`` callback to obtain the physical -address at every offset. +example, for an intermediate buffer. In the recent version of kernel, +you can simply allocate it via standard +:c:func:`snd_pcm_lib_malloc_pages()` and co after setting up the +buffer preallocation with ``SNDRV_DMA_TYPE_VMALLOC`` type. -The easiest way to achieve it would be to use -:c:func:`snd_pcm_lib_alloc_vmalloc_buffer()` for allocating the buffer -via :c:func:`vmalloc()`, and set :c:func:`snd_pcm_sgbuf_ops_page()` to -the ``page`` callback. At release, you need to call -:c:func:`snd_pcm_lib_free_vmalloc_buffer()`. +:: -If you want to implementation the ``page`` manually, it would be like -this: + snd_pcm_set_managed_buffer_all(pcm, SNDRV_DMA_TYPE_VMALLOC, + NULL, 0, 0); -:: +The NULL is passed to the device pointer argument, which indicates +that the default pages (GFP_KERNEL and GFP_HIGHMEM) will be +allocated. - #include <linux/vmalloc.h> +Also, note that zero is passed to both the size and the max size +arguments here. Since each vmalloc call should succeed at any time, +we don't need to pre-allocate the buffers like other continuous +pages. 
+ - /* get the physical page pointer on the given offset */ - static struct page *mychip_page(struct snd_pcm_substream *substream, - unsigned long offset) - { - void *pageptr = substream->runtime->dma_area + offset; - return vmalloc_to_page(pageptr); - } +If you need a 32-bit DMA allocation, pass the device pointer encoded +by :c:func:`snd_dma_continuous_data()` with the ``GFP_KERNEL|__GFP_DMA32`` +argument. + +:: + + snd_pcm_set_managed_buffer_all(pcm, SNDRV_DMA_TYPE_VMALLOC, + snd_dma_continuous_data(GFP_KERNEL | __GFP_DMA32), 0, 0); Proc Interface ============== diff --git a/Documentation/trace/ftrace-uses.rst b/Documentation/trace/ftrace-uses.rst index 1fbc69894eed..2a05e770618a 100644 --- a/Documentation/trace/ftrace-uses.rst +++ b/Documentation/trace/ftrace-uses.rst @@ -146,7 +146,7 @@ FTRACE_OPS_FL_RECURSION_SAFE itself or any nested functions that those functions call. If this flag is set, it is possible that the callback will also - be called with preemption enabled (when CONFIG_PREEMPT is set), + be called with preemption enabled (when CONFIG_PREEMPTION is set), but this is not guaranteed. FTRACE_OPS_FL_IPMODIFY @@ -170,6 +170,14 @@ FTRACE_OPS_FL_RCU a callback may be executed and RCU synchronization will not protect it. +FTRACE_OPS_FL_PERMANENT + If this is set on any ftrace ops, then the tracing cannot be disabled by + writing 0 to the proc sysctl ftrace_enabled. Equally, a callback with + the flag set cannot be registered if ftrace_enabled is 0. + + Livepatch uses it so as not to lose the function redirection, so the system + stays protected. + Filtering which functions to trace ================================== diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index e3060eedb22d..d2b5657ed33e 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -2976,7 +2976,9 @@ Note, the proc sysctl ftrace_enable is a big on/off switch for the function tracer.
By default it is enabled (when function tracing is enabled in the kernel). If it is disabled, all function tracing is disabled. This includes not only the function tracers for ftrace, but -also for any other uses (perf, kprobes, stack tracing, profiling, etc). +also for any other uses (perf, kprobes, stack tracing, profiling, etc). It +cannot be disabled if there is a callback with FTRACE_OPS_FL_PERMANENT set +registered. Please disable this with care. diff --git a/Documentation/trace/intel_th.rst b/Documentation/trace/intel_th.rst index baa12eb09ef4..70b7126eaeeb 100644 --- a/Documentation/trace/intel_th.rst +++ b/Documentation/trace/intel_th.rst @@ -44,7 +44,8 @@ Documentation/trace/stm.rst for more information on that. MSU can be configured to collect trace data into a system memory buffer, which can later on be read from its device nodes via read() or -mmap() interface. +mmap() interface and directed to a "software sink" driver that will +consume the data and/or relay it further. On the whole, Intel(R) Trace Hub does not require any special userspace software to function; everything can be configured, started @@ -122,3 +123,28 @@ In order to enable the host mode, set the 'host_mode' parameter of the will show up on the intel_th bus. Also, trace configuration and capture controlling attribute groups of the 'gth' device will not be exposed. The 'sth' device will operate as usual. + +Software Sinks +-------------- + +The Memory Storage Unit (MSU) driver provides an in-kernel API for +drivers to register themselves as software sinks for the trace data. +Such drivers can further export the data via other devices, such as +USB device controllers or network cards. 
+ +The API has two main parts:: + - notifying the software sink that a particular window is full, and + "locking" that window, that is, making it unavailable for the trace + collection; when this happens, the MSU driver will automatically + switch to the next window in the buffer if it is unlocked, or stop + the trace capture if it's not; + - tracking the "locked" state of windows and providing a way for the + software sink driver to notify the MSU driver when a window is + unlocked and can be used again to collect trace data. + +An example sink driver, msu-sink, illustrates the implementation of a +software sink. Functionally, it simply unlocks windows as soon as they +are full, keeping the MSU running in a circular buffer mode. Unlike the +"multi" mode, it will fill out all the windows in the buffer as opposed +to just the first one. It can be enabled by writing "sink" to the "mode" +file (assuming msu-sink.ko is loaded). diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt index 4833904d32a5..49183add44e7 100644 --- a/Documentation/virt/kvm/api.txt +++ b/Documentation/virt/kvm/api.txt @@ -1002,12 +1002,18 @@ Specifying exception.has_esr on a system that does not support it will return -EINVAL. Setting anything other than the lower 24bits of exception.serror_esr will return -EINVAL. +It is not possible to read back a pending external abort (injected via +KVM_SET_VCPU_EVENTS or otherwise) because such an exception is always delivered +directly to the virtual CPU. + + struct kvm_vcpu_events { struct { __u8 serror_pending; __u8 serror_has_esr; + __u8 ext_dabt_pending; /* Align it to 8 bytes */ - __u8 pad[6]; + __u8 pad[5]; __u64 serror_esr; } exception; __u32 reserved[12]; @@ -1051,9 +1057,23 @@ contain a valid state and shall be written into the VCPU. ARM/ARM64: +User space may need to inject several types of events to the guest. + Set the pending SError exception state for this VCPU.
It is not possible to 'cancel' an Serror that has been made pending. +If the guest performed an access to I/O memory which could not be handled by +userspace, for example because of missing instruction syndrome decode +information or because there is no device mapped at the accessed IPA, then +userspace can ask the kernel to inject an external abort using the address +from the exiting fault on the VCPU. It is a programming error to set +ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or +KVM_EXIT_ARM_NISV. This feature is only available if the system supports +KVM_CAP_ARM_INJECT_EXT_DABT. This is a helper which provides commonality in +how userspace reports accesses for the above cases to guests, across different +userspace implementations. Nevertheless, userspace can still emulate all Arm +exceptions by manipulating individual registers using the KVM_SET_ONE_REG API. + See KVM_GET_VCPU_EVENTS for the data structure. @@ -2982,6 +3002,9 @@ can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number indicating the number of supported registers. +For ppc, the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability indicates whether +the single-step debug event (KVM_GUESTDBG_SINGLESTEP) is supported. + When debug events exit the main run loop with the reason KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run structure containing architecture specific debug information. @@ -4468,6 +4491,39 @@ Hyper-V SynIC state change. Notification is used to remap SynIC event/message pages and to enable/disable SynIC messages/events processing in userspace. + /* KVM_EXIT_ARM_NISV */ + struct { + __u64 esr_iss; + __u64 fault_ipa; + } arm_nisv; + +Used on arm and arm64 systems. If a guest accesses memory not in a memslot, +KVM will typically return to userspace and ask it to do MMIO emulation on its +behalf. 
However, for certain classes of instructions, no instruction decode +(direction, length of memory access) is provided, and fetching and decoding +the instruction from the VM is overly complicated to live in the kernel. + +Historically, when this situation occurred, KVM would print a warning and kill +the VM. KVM assumed that if the guest accessed non-memslot memory, it was +trying to do I/O, which just couldn't be emulated, and the warning message was +phrased accordingly. However, what happened more often was that a guest bug +caused access outside the guest memory areas which should lead to a more +meaningful warning message and an external abort in the guest, if the access +did not fall within an I/O window. + +Userspace implementations can query for KVM_CAP_ARM_NISV_TO_USER, and enable +this capability at VM creation. Once this is done, these types of errors will +instead return to userspace with KVM_EXIT_ARM_NISV, with the valid bits from +the HSR (arm) and ESR_EL2 (arm64) in the esr_iss field, and the faulting IPA +in the fault_ipa field. Userspace can either fix up the access if it's +actually an I/O access by decoding the instruction from guest memory (if it's +very brave) and continue executing the guest, or it can decide to suspend, +dump, or restart the guest. + +Note that KVM does not skip the faulting instruction as it does for +KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state +if it decides to decode and emulate the instruction. + /* Fix the size of the union. */ char padding[256]; }; diff --git a/Documentation/virt/kvm/arm/pvtime.rst b/Documentation/virt/kvm/arm/pvtime.rst new file mode 100644 index 000000000000..2357dd2d8655 --- /dev/null +++ b/Documentation/virt/kvm/arm/pvtime.rst @@ -0,0 +1,80 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +Paravirtualized time support for arm64 +====================================== + +Arm specification DEN0057/A defines a standard for paravirtualized time +support for AArch64 guests: + +https://developer.arm.com/docs/den0057/a + +KVM/arm64 implements the stolen time part of this specification by providing +some hypervisor service calls to support a paravirtualized guest obtaining a +view of the amount of time stolen from its execution. + +Two new SMCCC compatible hypercalls are defined: + +* PV_TIME_FEATURES: 0xC5000020 +* PV_TIME_ST: 0xC5000021 + +These are only available in the SMC64/HVC64 calling convention as +paravirtualized time is not available to 32 bit Arm guests. The existence of +the PV_TIME_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES +mechanism before calling it. + +PV_TIME_FEATURES + ============= ======== ========== + Function ID: (uint32) 0xC5000020 + PV_call_id: (uint32) The function to query for support. + Currently only PV_TIME_ST is supported. + Return value: (int64) NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant + PV-time feature is supported by the hypervisor. + ============= ======== ========== + +PV_TIME_ST + ============= ======== ========== + Function ID: (uint32) 0xC5000021 + Return value: (int64) IPA of the stolen time data structure for this + VCPU. On failure: + NOT_SUPPORTED (-1) + ============= ======== ========== + +The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory +with inner and outer write back caching attributes, in the inner shareable +domain. A total of 16 bytes from the IPA returned are guaranteed to be +meaningfully filled by the hypervisor (see structure below). + +PV_TIME_ST returns the structure for the calling VCPU.
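The two function IDs above can be read field by field. The following is a minimal sketch, assuming the bit layout defined by the Arm SMC Calling Convention (bit 31 fast call, bit 30 64-bit convention, bits 29:24 owning entity, bits 15:0 function number); that layout comes from the SMCCC specification rather than from this patch, and the helper names are hypothetical, not kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Function IDs quoted from the pvtime text above. */
#define PV_TIME_FEATURES 0xC5000020u
#define PV_TIME_ST       0xC5000021u

/*
 * Hypothetical helpers (not kernel API). The field positions are an
 * assumption taken from the SMC Calling Convention specification.
 */
static inline bool smccc_is_fast_call(uint32_t id)
{
	return (id >> 31) & 1u;		/* bit 31: fast call */
}

static inline bool smccc_is_64bit(uint32_t id)
{
	return (id >> 30) & 1u;		/* bit 30: SMC64/HVC64 convention */
}

static inline uint32_t smccc_owner(uint32_t id)
{
	return (id >> 24) & 0x3fu;	/* bits 29:24: owning entity */
}

static inline uint32_t smccc_func_num(uint32_t id)
{
	return id & 0xffffu;		/* bits 15:0: function number */
}
```

Under that assumed layout both IDs decode as fast, 64-bit calls with owner 5 (the standard hypervisor service range), which matches the statement that they exist only in the SMC64/HVC64 calling convention; the function numbers are 0x20 and 0x21.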
+ +Stolen Time +----------- + +The structure pointed to by the PV_TIME_ST hypercall is as follows: + ++-------------+-------------+-------------+----------------------------+ +| Field | Byte Length | Byte Offset | Description | ++=============+=============+=============+============================+ +| Revision | 4 | 0 | Must be 0 for version 1.0 | ++-------------+-------------+-------------+----------------------------+ +| Attributes | 4 | 4 | Must be 0 | ++-------------+-------------+-------------+----------------------------+ +| Stolen time | 8 | 8 | Stolen time in unsigned | +| | | | nanoseconds indicating how | +| | | | much time this VCPU thread | +| | | | was involuntarily not | +| | | | running on a physical CPU. | ++-------------+-------------+-------------+----------------------------+ + +All values in the structure are stored little-endian. + +The structure will be updated by the hypervisor prior to scheduling a VCPU. It +will be present within a reserved region of the normal memory given to the +guest. The guest should not attempt to write into this memory. There is a +structure per VCPU of the guest. + +It is advisable that one or more 64k pages are set aside for the purpose of +these structures and not used for other purposes, this enables the guest to map +the region using 64k pages and avoids conflicting attributes with other memory. + +For the user space interface see Documentation/virt/kvm/devices/vcpu.txt +section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL". diff --git a/Documentation/virt/kvm/devices/vcpu.txt b/Documentation/virt/kvm/devices/vcpu.txt index 2b5dab16c4f2..6f3bd64a05b0 100644 --- a/Documentation/virt/kvm/devices/vcpu.txt +++ b/Documentation/virt/kvm/devices/vcpu.txt @@ -60,3 +60,17 @@ time to use the number provided for a given timer, overwriting any previously configured values on other VCPUs. Userspace should configure the interrupt numbers on at least one VCPU after creating all VCPUs and before running any VCPUs. + +3. 
GROUP: KVM_ARM_VCPU_PVTIME_CTRL +Architectures: ARM64 + +3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA +Parameters: 64-bit base address +Returns: -ENXIO: Stolen time not implemented + -EEXIST: Base address already set for this VCPU + -EINVAL: Base address not 64 byte aligned + +Specifies the base address of the stolen time structure for this VCPU. The +base address must be 64 byte aligned and exist within a valid guest memory +region. See Documentation/virt/kvm/arm/pvtime.rst for more information +including the layout of the stolen time structure. diff --git a/Documentation/virt/kvm/devices/xics.txt b/Documentation/virt/kvm/devices/xics.txt index 42864935ac5d..423332dda7bc 100644 --- a/Documentation/virt/kvm/devices/xics.txt +++ b/Documentation/virt/kvm/devices/xics.txt @@ -3,9 +3,19 @@ XICS interrupt controller Device type supported: KVM_DEV_TYPE_XICS Groups: - KVM_DEV_XICS_SOURCES + 1. KVM_DEV_XICS_GRP_SOURCES Attributes: One per interrupt source, indexed by the source number. + 2. KVM_DEV_XICS_GRP_CTRL + Attributes: + 2.1 KVM_DEV_XICS_NR_SERVERS (write only) + The kvm_device_attr.addr points to a __u32 value which is the number of + interrupt server numbers (i.e., highest possible vcpu id plus one). + Errors: + -EINVAL: Value greater than KVM_MAX_VCPU_ID. + -EFAULT: Invalid user pointer for attr->addr. + -EBUSY: A vcpu is already connected to the device. + This device emulates the XICS (eXternal Interrupt Controller Specification) defined in PAPR. The XICS has a set of interrupt sources, each identified by a 20-bit source number, and a set of @@ -38,7 +48,7 @@ least-significant end of the word: Each source has 64 bits of state that can be read and written using the KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls, specifying the -KVM_DEV_XICS_SOURCES attribute group, with the attribute number being +KVM_DEV_XICS_GRP_SOURCES attribute group, with the attribute number being the interrupt source number.
The 64 bit state word has the following bitfields, starting from the least-significant end of the word: diff --git a/Documentation/virt/kvm/devices/xive.txt b/Documentation/virt/kvm/devices/xive.txt index 9a24a4525253..f5d1d6b5af61 100644 --- a/Documentation/virt/kvm/devices/xive.txt +++ b/Documentation/virt/kvm/devices/xive.txt @@ -78,6 +78,14 @@ the legacy interrupt mode, referred as XICS (POWER7/8). migrating the VM. Errors: none + 1.3 KVM_DEV_XIVE_NR_SERVERS (write only) + The kvm_device_attr.addr points to a __u32 value which is the number of + interrupt server numbers (ie, highest possible vcpu id plus one). + Errors: + -EINVAL: Value greater than KVM_MAX_VCPU_ID. + -EFAULT: Invalid user pointer for attr->addr. + -EBUSY: A vCPU is already connected to the device. + 2. KVM_DEV_XIVE_GRP_SOURCE (write only) Initializes a new source in the XIVE device and mask it. Attributes: diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst index 08a2f100c0e6..90bb8f5ab384 100644 --- a/Documentation/x86/boot.rst +++ b/Documentation/x86/boot.rst @@ -68,8 +68,25 @@ Protocol 2.12 (Kernel 3.8) Added the xloadflags field and extension fields Protocol 2.13 (Kernel 3.14) Support 32- and 64-bit flags being set in xloadflags to support booting a 64-bit kernel from 32-bit EFI + +Protocol 2.14: BURNT BY INCORRECT COMMIT ae7e1238e68f2a472a125673ab506d49158c1889 + (x86/boot: Add ACPI RSDP address to setup_header) + DO NOT USE!!! ASSUME SAME AS 2.13. + +Protocol 2.15: (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max. ============= ============================================================ +.. note:: + The protocol version number should be changed only if the setup header + is changed. There is no need to update the version number if boot_params + or kernel_info are changed. 
Additionally, it is recommended to use + xloadflags (in this case the protocol version number should not be + updated either) or kernel_info to communicate supported Linux kernel + features to the boot loader. Due to very limited space available in + the original setup header every update to it should be considered + with great care. Starting from the protocol 2.15 the primary way to + communicate things to the boot loader is the kernel_info. + Memory Layout ============= @@ -207,6 +224,7 @@ Offset/Size Proto Name Meaning 0258/8 2.10+ pref_address Preferred loading address 0260/4 2.10+ init_size Linear memory required during initialization 0264/4 2.11+ handover_offset Offset of handover entry point +0268/4 2.15+ kernel_info_offset Offset of the kernel_info =========== ======== ===================== ============================================ .. note:: @@ -809,6 +827,47 @@ Protocol: 2.09+ sure to consider the case where the linked list already contains entries. + The setup_data is a bit awkward to use for extremely large data objects, + both because the setup_data header has to be adjacent to the data object + and because it has a 32-bit length field. However, it is important that + intermediate stages of the boot process have a way to identify which + chunks of memory are occupied by kernel data. + + Thus setup_indirect struct and SETUP_INDIRECT type were introduced in + protocol 2.15. + + struct setup_indirect { + __u32 type; + __u32 reserved; /* Reserved, must be set to zero. */ + __u64 len; + __u64 addr; + }; + + The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be + SETUP_INDIRECT itself since making the setup_indirect a tree structure + could require a lot of stack space in something that needs to parse it + and stack space can be limited in boot contexts. + + Let's give an example how to point to SETUP_E820_EXT data using setup_indirect. 
+ In this case setup_data and setup_indirect will look like this: + + struct setup_data { + __u64 next = 0 or <addr_of_next_setup_data_struct>; + __u32 type = SETUP_INDIRECT; + __u32 len = sizeof(setup_data); + __u8 data[sizeof(setup_indirect)] = struct setup_indirect { + __u32 type = SETUP_INDIRECT | SETUP_E820_EXT; + __u32 reserved = 0; + __u64 len = <len_of_SETUP_E820_EXT_data>; + __u64 addr = <addr_of_SETUP_E820_EXT_data>; + } + } + +.. note:: + SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished + from SETUP_INDIRECT itself. So, this kind of objects cannot be provided + by the bootloaders. + ============ ============ Field name: pref_address Type: read (reloc) @@ -855,6 +914,121 @@ Offset/size: 0x264/4 See EFI HANDOVER PROTOCOL below for more details. +============ ================== +Field name: kernel_info_offset +Type: read +Offset/size: 0x268/4 +Protocol: 2.15+ +============ ================== + + This field is the offset from the beginning of the kernel image to the + kernel_info. The kernel_info structure is embedded in the Linux image + in the uncompressed protected mode region. + + +The kernel_info +=============== + +The relationships between the headers are analogous to the various data +sections: + + setup_header = .data + boot_params/setup_data = .bss + +What is missing from the above list? That's right: + + kernel_info = .rodata + +We have been (ab)using .data for things that could go into .rodata or .bss for +a long time, for lack of alternatives and -- especially early on -- inertia. +Also, the BIOS stub is responsible for creating boot_params, so it isn't +available to a BIOS-based loader (setup_data is, though). + +setup_header is permanently limited to 144 bytes due to the reach of the +2-byte jump field, which doubles as a length field for the structure, combined +with the size of the "hole" in struct boot_params that a protected-mode loader +or the BIOS stub has to copy it into. 
It is currently 119 bytes long, which +leaves us with 25 very precious bytes. This isn't something that can be fixed +without revising the boot protocol entirely, breaking backwards compatibility. + +boot_params proper is limited to 4096 bytes, but can be arbitrarily extended +by adding setup_data entries. It cannot be used to communicate properties of +the kernel image, because it is .bss and has no image-provided content. + +kernel_info solves this by providing an extensible place for information about +the kernel image. It is readonly, because the kernel cannot rely on a +bootloader copying its contents anywhere, but that is OK; if it becomes +necessary it can still contain data items that an enabled bootloader would be +expected to copy into a setup_data chunk. + +All kernel_info data should be part of this structure. Fixed size data have to +be put before kernel_info_var_len_data label. Variable size data have to be put +after kernel_info_var_len_data label. Each chunk of variable size data has to +be prefixed with header/magic and its size, e.g.: + + kernel_info: + .ascii "LToP" /* Header, Linux top (structure). */ + .long kernel_info_var_len_data - kernel_info + .long kernel_info_end - kernel_info + .long 0x01234567 /* Some fixed size data for the bootloaders. */ + kernel_info_var_len_data: + example_struct: /* Some variable size data for the bootloaders. */ + .ascii "0123" /* Header/Magic. */ + .long example_struct_end - example_struct + .ascii "Struct" + .long 0x89012345 + example_struct_end: + example_strings: /* Some variable size data for the bootloaders. */ + .ascii "ABCD" /* Header/Magic. */ + .long example_strings_end - example_strings + .asciz "String_0" + .asciz "String_1" + example_strings_end: + kernel_info_end: + +This way the kernel_info is self-contained blob. + +.. 
note:: + Each variable size data header/magic can be any 4-character string, + without \0 at the end of the string, which does not collide with + existing variable length data headers/magics. + + +Details of the kernel_info Fields +================================= + +============ ======== +Field name: header +Offset/size: 0x0000/4 +============ ======== + + Contains the magic number "LToP" (0x506f544c). + +============ ======== +Field name: size +Offset/size: 0x0004/4 +============ ======== + + This field contains the size of the kernel_info including kernel_info.header. + It does not count kernel_info.kernel_info_var_len_data size. This field should be + used by the bootloaders to detect supported fixed size fields in the kernel_info + and beginning of kernel_info.kernel_info_var_len_data. + +============ ======== +Field name: size_total +Offset/size: 0x0008/4 +============ ======== + + This field contains the size of the kernel_info including kernel_info.header + and kernel_info.kernel_info_var_len_data. + +============ ============== +Field name: setup_type_max +Offset/size: 0x000c/4 +============ ============== + + This field contains maximal allowed type for setup_data and setup_indirect structs. + The Image Checksum ================== diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index af64c4bb4447..a8de2fbc1caa 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -27,6 +27,7 @@ x86-specific Documentation mds microcode resctrl_ui + tsx_async_abort usb-legacy-support i386/index x86_64/index diff --git a/Documentation/x86/tsx_async_abort.rst b/Documentation/x86/tsx_async_abort.rst new file mode 100644 index 000000000000..583ddc185ba2 --- /dev/null +++ b/Documentation/x86/tsx_async_abort.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0 + +TSX Async Abort (TAA) mitigation +================================ + +.. 
_tsx_async_abort: + +Overview +-------- + +TSX Async Abort (TAA) is a side channel attack on internal buffers in some +Intel processors similar to Microarchitectural Data Sampling (MDS). In this +case certain loads may speculatively pass invalid data to dependent operations +when an asynchronous abort condition is pending in a Transactional +Synchronization Extensions (TSX) transaction. This includes loads with no +fault or assist condition. Such loads may speculatively expose stale data from +the same uarch data structures as in MDS, with the same scope of exposure, i.e. +same-thread and cross-thread. This issue affects all current processors that +support TSX. + +Mitigation strategy +------------------- + +a) TSX disable - one of the mitigations is to disable TSX. A new MSR +IA32_TSX_CTRL will be available in future and current processors after +microcode update which can be used to disable TSX. In addition, it +controls the enumeration of the TSX feature bits (RTM and HLE) in CPUID. + +b) Clear CPU buffers - similar to MDS, clearing the CPU buffers mitigates this +vulnerability. More details on this approach can be found in +:ref:`Documentation/admin-guide/hw-vuln/mds.rst <mds>`. + +Kernel internal mitigation modes +-------------------------------- + + ============= ============================================================ + off Mitigation is disabled. Either the CPU is not affected or + tsx_async_abort=off is supplied on the kernel command line. + + tsx disabled Mitigation is enabled. TSX feature is disabled by default at + bootup on processors that support TSX control. + + verw Mitigation is enabled. CPU is affected and MD_CLEAR is + advertised in CPUID. + + ucode needed Mitigation is enabled. CPU is affected and MD_CLEAR is not + advertised in CPUID. That is mainly for virtualization + scenarios where the host has the updated microcode but the + hypervisor does not expose MD_CLEAR in CPUID. It's a best
+ ============= ============================================================ + +If the CPU is affected and the "tsx_async_abort" kernel command line parameter is +not provided then the kernel selects an appropriate mitigation depending on the +status of RTM and MD_CLEAR CPUID bits. + +Below tables indicate the impact of tsx=on|off|auto cmdline options on state of +TAA mitigation, VERW behavior and TSX feature for various combinations of +MSR_IA32_ARCH_CAPABILITIES bits. + +1. "tsx=off" + +========= ========= ============ ============ ============== =================== ====================== +MSR_IA32_ARCH_CAPABILITIES bits Result with cmdline tsx=off +---------------------------------- ------------------------------------------------------------------------- +TAA_NO MDS_NO TSX_CTRL_MSR TSX state VERW can clear TAA mitigation TAA mitigation + after bootup CPU buffers tsx_async_abort=off tsx_async_abort=full +========= ========= ============ ============ ============== =================== ====================== + 0 0 0 HW default Yes Same as MDS Same as MDS + 0 0 1 Invalid case Invalid case Invalid case Invalid case + 0 1 0 HW default No Need ucode update Need ucode update + 0 1 1 Disabled Yes TSX disabled TSX disabled + 1 X 1 Disabled X None needed None needed +========= ========= ============ ============ ============== =================== ====================== + +2. 
"tsx=on" + +========= ========= ============ ============ ============== =================== ====================== +MSR_IA32_ARCH_CAPABILITIES bits Result with cmdline tsx=on +---------------------------------- ------------------------------------------------------------------------- +TAA_NO MDS_NO TSX_CTRL_MSR TSX state VERW can clear TAA mitigation TAA mitigation + after bootup CPU buffers tsx_async_abort=off tsx_async_abort=full +========= ========= ============ ============ ============== =================== ====================== + 0 0 0 HW default Yes Same as MDS Same as MDS + 0 0 1 Invalid case Invalid case Invalid case Invalid case + 0 1 0 HW default No Need ucode update Need ucode update + 0 1 1 Enabled Yes None Same as MDS + 1 X 1 Enabled X None needed None needed +========= ========= ============ ============ ============== =================== ====================== + +3. "tsx=auto" + +========= ========= ============ ============ ============== =================== ====================== +MSR_IA32_ARCH_CAPABILITIES bits Result with cmdline tsx=auto +---------------------------------- ------------------------------------------------------------------------- +TAA_NO MDS_NO TSX_CTRL_MSR TSX state VERW can clear TAA mitigation TAA mitigation + after bootup CPU buffers tsx_async_abort=off tsx_async_abort=full +========= ========= ============ ============ ============== =================== ====================== + 0 0 0 HW default Yes Same as MDS Same as MDS + 0 0 1 Invalid case Invalid case Invalid case Invalid case + 0 1 0 HW default No Need ucode update Need ucode update + 0 1 1 Disabled Yes TSX disabled TSX disabled + 1 X 1 Enabled X None needed None needed +========= ========= ============ ============ ============== =================== ====================== + +In the tables, TSX_CTRL_MSR is a new bit in MSR_IA32_ARCH_CAPABILITIES that +indicates whether MSR_IA32_TSX_CTRL is supported. 
+ +There are two control bits in IA32_TSX_CTRL MSR: + + Bit 0: When set it disables the Restricted Transactional Memory (RTM) + sub-feature of TSX (will force all transactions to abort on the + XBEGIN instruction). + + Bit 1: When set it disables the enumeration of the RTM and HLE feature + (i.e. it will make CPUID(EAX=7).EBX{bit4} and + CPUID(EAX=7).EBX{bit11} read as 0).
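The "TSX state after bootup" column of the three tables and the two control bits above can be combined into a short sketch. This is an illustration only, not the kernel's implementation: the macro, enum and function names are hypothetical, no MSR is actually read or written, and it is an assumption that disabling TSX sets both bits while enabling it clears both.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the text above; the macro names are illustrative. */
#define TSX_CTRL_RTM_DISABLE	(UINT64_C(1) << 0)	/* force RTM transactions to abort */
#define TSX_CTRL_CPUID_CLEAR	(UINT64_C(1) << 1)	/* hide RTM and HLE in CPUID */

enum tsx_cmdline { TSX_OFF, TSX_ON, TSX_AUTO };	/* value of tsx= */

enum tsx_state {
	TSX_HW_DEFAULT,
	TSX_DISABLED,
	TSX_ENABLED,
	TSX_INVALID_CASE,	/* marked "Invalid case" in the tables */
	TSX_NOT_LISTED		/* combination absent from the tables */
};

/* Encodes the "TSX state after bootup" column of the three tables. */
static enum tsx_state tsx_state_after_bootup(bool taa_no, bool mds_no,
					     bool tsx_ctrl_msr,
					     enum tsx_cmdline mode)
{
	if (!tsx_ctrl_msr)
		/* The tables only list TAA_NO=0 rows without the MSR. */
		return taa_no ? TSX_NOT_LISTED : TSX_HW_DEFAULT;
	if (!taa_no && !mds_no)
		return TSX_INVALID_CASE;

	switch (mode) {
	case TSX_OFF:
		return TSX_DISABLED;
	case TSX_ON:
		return TSX_ENABLED;
	case TSX_AUTO:
		return taa_no ? TSX_ENABLED : TSX_DISABLED;
	}
	return TSX_NOT_LISTED;
}

/* Assumption: disabling sets both control bits, enabling clears both. */
static uint64_t tsx_ctrl_apply(uint64_t msr, enum tsx_state state)
{
	const uint64_t both = TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR;

	if (state == TSX_DISABLED)
		return msr | both;
	if (state == TSX_ENABLED)
		return msr & ~both;
	return msr;	/* leave the value untouched in all other cases */
}
```

For instance, a CPU with TAA_NO=0, MDS_NO=1 and TSX_CTRL_MSR=1 booted with tsx=auto maps to the "Disabled" row of table 3, and the sketch would then set both control bits in the value it returns.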