diff options
Diffstat (limited to 'Documentation')
496 files changed, 13991 insertions, 5465 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 708dc4c166e4..2754fe83f0d4 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -64,8 +64,6 @@ auxdisplay/ - misc. LCD driver documentation (cfag12864b, ks0108). backlight/ - directory with info on controlling backlights in flat panel displays -bcache.txt - - Block-layer cache on fast SSDs to improve slow (raid) I/O performance. block/ - info on the Block I/O (BIO) layer. blockdev/ @@ -78,18 +76,10 @@ bus-devices/ - directory with info on TI GPMC (General Purpose Memory Controller) bus-virt-phys-mapping.txt - how to access I/O mapped memory from within device drivers. -cachetlb.txt - - describes the cache/TLB flushing interfaces Linux uses. cdrom/ - directory with information on the CD-ROM drivers that Linux has. cgroup-v1/ - cgroups v1 features, including cpusets and memory controller. -cgroup-v2.txt - - cgroups v2 features, including cpusets and memory controller. -circular-buffers.txt - - how to make use of the existing circular buffer infrastructure -clk.txt - - info on the common clock framework cma/ - Continuous Memory Area (CMA) debugfs interface. conf.py diff --git a/Documentation/ABI/removed/sysfs-bus-nfit b/Documentation/ABI/removed/sysfs-bus-nfit new file mode 100644 index 000000000000..ae8c1ca53828 --- /dev/null +++ b/Documentation/ABI/removed/sysfs-bus-nfit @@ -0,0 +1,17 @@ +What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size +Date: Aug, 2017 +KernelVersion: v4.14 (Removed v4.18) +Contact: linux-nvdimm@lists.01.org +Description: + (RO) Size of a write request to a DIMM that will not incur a + read-modify-write cycle at the memory controller. + + When the nfit driver initializes it runs an ARS (Address Range + Scrub) operation across every pmem range. Part of that process + involves determining the ARS capabilities of a given address + range. One of the capabilities that is reported is the 'Clear + Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2 + section 9.20.7.4 Function Index 1 - Query ARS Capabilities). + This property indicates the boundary at which the NVDIMM may + need to perform read-modify-write cycles to maintain ECC (Error + Correcting Code) blocks. diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus index 0c9d9dcd2151..3eaffbb2d468 100644 --- a/Documentation/ABI/stable/sysfs-bus-vmbus +++ b/Documentation/ABI/stable/sysfs-bus-vmbus @@ -1,25 +1,25 @@ -What: /sys/bus/vmbus/devices/vmbus_*/id +What: /sys/bus/vmbus/devices/<UUID>/id Date: Jul 2009 KernelVersion: 2.6.31 Contact: K. Y. Srinivasan <kys@microsoft.com> Description: The VMBus child_relid of the device's primary channel Users: tools/hv/lsvmbus -What: /sys/bus/vmbus/devices/vmbus_*/class_id +What: /sys/bus/vmbus/devices/<UUID>/class_id Date: Jul 2009 KernelVersion: 2.6.31 Contact: K. Y. Srinivasan <kys@microsoft.com> Description: The VMBus interface type GUID of the device Users: tools/hv/lsvmbus -What: /sys/bus/vmbus/devices/vmbus_*/device_id +What: /sys/bus/vmbus/devices/<UUID>/device_id Date: Jul 2009 KernelVersion: 2.6.31 Contact: K. Y. Srinivasan <kys@microsoft.com> Description: The VMBus interface instance GUID of the device Users: tools/hv/lsvmbus -What: /sys/bus/vmbus/devices/vmbus_*/channel_vp_mapping +What: /sys/bus/vmbus/devices/<UUID>/channel_vp_mapping Date: Jul 2015 KernelVersion: 4.2.0 Contact: K. Y. Srinivasan <kys@microsoft.com> @@ -28,112 +28,112 @@ Description: The mapping of which primary/sub channels are bound to which Format: <channel's child_relid:the bound cpu's number> Users: tools/hv/lsvmbus -What: /sys/bus/vmbus/devices/vmbus_*/device +What: /sys/bus/vmbus/devices/<UUID>/device Date: Dec. 2015 KernelVersion: 4.5 Contact: K. Y. Srinivasan <kys@microsoft.com> Description: The 16 bit device ID of the device Users: tools/hv/lsvmbus and user level RDMA libraries -What: /sys/bus/vmbus/devices/vmbus_*/vendor +What: /sys/bus/vmbus/devices/<UUID>/vendor Date: Dec. 2015 KernelVersion: 4.5 Contact: K. Y. Srinivasan <kys@microsoft.com> Description: The 16 bit vendor ID of the device Users: tools/hv/lsvmbus and user level RDMA libraries -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN +What: /sys/bus/vmbus/devices/<UUID>/channels/<N> Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Directory for per-channel information NN is the VMBUS relid associtated with the channel. -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/cpu +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/cpu Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: VCPU (sub)channel is affinitized to Users: tools/hv/lsvmbus and other debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/cpu +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/cpu Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: VCPU (sub)channel is affinitized to Users: tools/hv/lsvmbus and other debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/in_mask +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/in_mask Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Host to guest channel interrupt mask Users: Debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/latency +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/latency Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Channel signaling latency Users: Debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/out_mask +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/out_mask Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Guest to host channel interrupt mask Users: Debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/pending +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/pending Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Channel interrupt pending state Users: Debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/read_avail +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/read_avail Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Bytes available to read Users: Debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/write_avail +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/write_avail Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Bytes available to write Users: Debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/events +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/events Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Number of times we have signaled the host Users: Debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/interrupts +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/interrupts Date: September. 2017 KernelVersion: 4.14 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Number of times we have taken an interrupt (incoming) Users: Debugging tools -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/subchannel_id +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/subchannel_id Date: January. 2018 KernelVersion: 4.16 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Subchannel ID associated with VMBUS channel Users: Debugging tools and userspace drivers -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/monitor_id +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/monitor_id Date: January. 2018 KernelVersion: 4.16 Contact: Stephen Hemminger <sthemmin@microsoft.com> Description: Monitor bit associated with channel Users: Debugging tools and userspace drivers -What: /sys/bus/vmbus/devices/vmbus_*/channels/NN/ring +What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/ring Date: January. 2018 KernelVersion: 4.16 Contact: Stephen Hemminger <sthemmin@microsoft.com> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index 5b2d0f08867c..3e90e1f3bf0a 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -90,4 +90,4 @@ Date: December 2009 Contact: Lee Schermerhorn <lee.schermerhorn@hp.com> Description: The node's huge page size control/query attributes. - See Documentation/vm/hugetlbpage.txt
\ No newline at end of file + See Documentation/admin-guide/mm/hugetlbpage.rst
\ No newline at end of file diff --git a/Documentation/ABI/testing/evm b/Documentation/ABI/testing/evm index d12cb2eae9ee..201d10319fa1 100644 --- a/Documentation/ABI/testing/evm +++ b/Documentation/ABI/testing/evm @@ -57,3 +57,16 @@ Description: dracut (via 97masterkey and 98integrity) and systemd (via core/ima-setup) have support for loading keys at boot time. + +What: security/integrity/evm/evm_xattrs +Date: April 2018 +Contact: Matthew Garrett <mjg59@google.com> +Description: + Shows the set of extended attributes used to calculate or + validate the EVM signature, and allows additional attributes + to be added at runtime. Any signatures generated after + additional attributes are added (and on files posessing those + additional attributes) will only be valid if the same + additional attributes are configured on system boot. Writing + a single period (.) will lock the xattr list from any further + modification. diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy index b8465e00ba5f..74c6702de74e 100644 --- a/Documentation/ABI/testing/ima_policy +++ b/Documentation/ABI/testing/ima_policy @@ -21,7 +21,7 @@ Description: audit | hash | dont_hash condition:= base | lsm [option] base: [[func=] [mask=] [fsmagic=] [fsuuid=] [uid=] - [euid=] [fowner=]] + [euid=] [fowner=] [fsname=]] lsm: [[subj_user=] [subj_role=] [subj_type=] [obj_user=] [obj_role=] [obj_type=]] option: [[appraise_type=]] [permit_directio] diff --git a/Documentation/ABI/testing/sysfs-bus-iio b/Documentation/ABI/testing/sysfs-bus-iio index 6a5f34b4d5b9..731146c3b138 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio +++ b/Documentation/ABI/testing/sysfs-bus-iio @@ -190,6 +190,13 @@ Description: but should match other such assignments on device). Units after application of scale and offset are m/s^2. +What: /sys/bus/iio/devices/iio:deviceX/in_angl_raw +KernelVersion: 4.17 +Contact: linux-iio@vger.kernel.org +Description: + Angle of rotation. Units after application of scale and offset + are radians. + What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_x_raw What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_y_raw What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_z_raw @@ -297,6 +304,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_pressure_offset What: /sys/bus/iio/devices/iio:deviceX/in_humidityrelative_offset What: /sys/bus/iio/devices/iio:deviceX/in_magn_offset What: /sys/bus/iio/devices/iio:deviceX/in_rot_offset +What: /sys/bus/iio/devices/iio:deviceX/in_angl_offset KernelVersion: 2.6.35 Contact: linux-iio@vger.kernel.org Description: @@ -350,6 +358,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_humidityrelative_scale What: /sys/bus/iio/devices/iio:deviceX/in_velocity_sqrt(x^2+y^2+z^2)_scale What: /sys/bus/iio/devices/iio:deviceX/in_illuminance_scale What: /sys/bus/iio/devices/iio:deviceX/in_countY_scale +What: /sys/bus/iio/devices/iio:deviceX/in_angl_scale KernelVersion: 2.6.35 Contact: linux-iio@vger.kernel.org Description: diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit index 619eb8ca0f99..a1cb44dcb908 100644 --- a/Documentation/ABI/testing/sysfs-bus-nfit +++ b/Documentation/ABI/testing/sysfs-bus-nfit @@ -212,22 +212,3 @@ Description: range. Used by NVDIMM Region Mapping Structure to uniquely refer to this structure. Value of 0 is reserved and not used as an index. - - -What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size -Date: Aug, 2017 -KernelVersion: v4.14 -Contact: linux-nvdimm@lists.01.org -Description: - (RO) Size of a write request to a DIMM that will not incur a - read-modify-write cycle at the memory controller. - - When the nfit driver initializes it runs an ARS (Address Range - Scrub) operation across every pmem range. Part of that process - involves determining the ARS capabilities of a given address - range. One of the capabilities that is reported is the 'Clear - Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2 - section 9.20.7.4 Function Index 1 - Query ARS Capabilities). - This property indicates the boundary at which the NVDIMM may - need to perform read-modify-write cycles to maintain ECC (Error - Correcting Code) blocks. diff --git a/Documentation/ABI/testing/sysfs-bus-usb b/Documentation/ABI/testing/sysfs-bus-usb index c702c78f24d8..08d456e07b53 100644 --- a/Documentation/ABI/testing/sysfs-bus-usb +++ b/Documentation/ABI/testing/sysfs-bus-usb @@ -189,6 +189,28 @@ Description: The file will read "hotplug", "wired" and "not used" if the information is available, and "unknown" otherwise. +What: /sys/bus/usb/devices/.../(hub interface)/portX/quirks +Date: May 2018 +Contact: Nicolas Boichat <drinkcat@chromium.org> +Description: + In some cases, we care about time-to-active for devices + connected on a specific port (e.g. non-standard USB port like + pogo pins), where the device to be connected is known in + advance, and behaves well according to the specification. + This attribute is a bit-field that controls the behavior of + a specific port: + - Bit 0 of this field selects the "old" enumeration scheme, + as it is considerably faster (it only causes one USB reset + instead of 2). + The old enumeration scheme can also be selected globally + using /sys/module/usbcore/parameters/old_scheme_first, but + it is often not desirable as the new scheme was introduced to + increase compatibility with more devices. + - Bit 1 reduces TRSTRCY to the 10 ms that are required by the + USB 2.0 specification, instead of the 50 ms that are normally + used to help make enumeration work better on some high speed + devices. + What: /sys/bus/usb/devices/.../(hub interface)/portX/over_current_count Date: February 2018 Contact: Richard Leitner <richard.leitner@skidata.com> @@ -236,3 +258,21 @@ Description: Supported values are 0 - 15. More information on how besl values map to microseconds can be found in USB 2.0 ECN Errata for Link Power Management, section 4.10) + +What: /sys/bus/usb/devices/.../rx_lanes +Date: March 2018 +Contact: Mathias Nyman <mathias.nyman@linux.intel.com> +Description: + Number of rx lanes the device is using. + USB 3.2 adds Dual-lane support, 2 rx and 2 tx lanes over Type-C. + Inter-Chip SSIC devices support asymmetric lanes up to 4 lanes per + direction. Devices before USB 3.2 are single lane (rx_lanes = 1) + +What: /sys/bus/usb/devices/.../tx_lanes +Date: March 2018 +Contact: Mathias Nyman <mathias.nyman@linux.intel.com> +Description: + Number of tx lanes the device is using. + USB 3.2 adds Dual-lane support, 2 rx and 2 tx -lanes over Type-C. + Inter-Chip SSIC devices support asymmetric lanes up to 4 lanes per + direction. Devices before USB 3.2 are single lane (tx_lanes = 1) diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl index 640f65e79ef1..bbbabffc682a 100644 --- a/Documentation/ABI/testing/sysfs-class-cxl +++ b/Documentation/ABI/testing/sysfs-class-cxl @@ -69,7 +69,9 @@ Date: September 2014 Contact: linuxppc-dev@lists.ozlabs.org Description: read/write Set the mode for prefaulting in segments into the segment table - when performing the START_WORK ioctl. Possible values: + when performing the START_WORK ioctl. Only applicable when + running under hashed page table mmu. + Possible values: none: No prefaulting (default) work_element_descriptor: Treat the work element descriptor as an effective address and @@ -244,3 +246,11 @@ Description: read only Returns 1 if the psl timebase register is synchronized with the core timebase register, 0 otherwise. Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl/<card>/tunneled_ops_supported +Date: May 2018 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Returns 1 if tunneled operations are supported in capi mode, + 0 otherwise. +Users: https://github.com/ibm-capi/libcxl diff --git a/Documentation/ABI/testing/sysfs-class-mtd b/Documentation/ABI/testing/sysfs-class-mtd index f34e592301d1..3bc7c0a95c92 100644 --- a/Documentation/ABI/testing/sysfs-class-mtd +++ b/Documentation/ABI/testing/sysfs-class-mtd @@ -232,3 +232,11 @@ Description: of the parent (another partition or a flash device) in bytes. This attribute is absent on flash devices, so it can be used to distinguish them from partitions. + +What: /sys/class/mtd/mtdX/oobavail +Date: April 2018 +KernelVersion: 4.16 +Contact: linux-mtd@lists.infradead.org +Description: + Number of bytes available for a client to place data into + the out of band area. diff --git a/Documentation/ABI/testing/sysfs-class-power b/Documentation/ABI/testing/sysfs-class-power index f85ce9e327b9..5e23e22dce1b 100644 --- a/Documentation/ABI/testing/sysfs-class-power +++ b/Documentation/ABI/testing/sysfs-class-power @@ -1,3 +1,458 @@ +===== General Properties ===== + +What: /sys/class/power_supply/<supply_name>/manufacturer +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports the name of the device manufacturer. + + Access: Read + Valid values: Represented as string + +What: /sys/class/power_supply/<supply_name>/model_name +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports the name of the device model. + + Access: Read + Valid values: Represented as string + +What: /sys/class/power_supply/<supply_name>/serial_number +Date: January 2008 +Contact: linux-pm@vger.kernel.org +Description: + Reports the serial number of the device. + + Access: Read + Valid values: Represented as string + +What: /sys/class/power_supply/<supply_name>/type +Date: May 2010 +Contact: linux-pm@vger.kernel.org +Description: + Describes the main type of the supply. + + Access: Read + Valid values: "Battery", "UPS", "Mains", "USB" + +===== Battery Properties ===== + +What: /sys/class/power_supply/<supply_name>/capacity +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Fine grain representation of battery capacity. + Access: Read + Valid values: 0 - 100 (percent) + +What: /sys/class/power_supply/<supply_name>/capacity_alert_max +Date: July 2012 +Contact: linux-pm@vger.kernel.org +Description: + Maximum battery capacity trip-wire value where the supply will + notify user-space of the event. This is normally used for the + battery discharging scenario where user-space needs to know the + battery has dropped to an upper level so it can take + appropriate action (e.g. warning user that battery level is + low). + + Access: Read, Write + Valid values: 0 - 100 (percent) + +What: /sys/class/power_supply/<supply_name>/capacity_alert_min +Date: July 2012 +Contact: linux-pm@vger.kernel.org +Description: + Minimum battery capacity trip-wire value where the supply will + notify user-space of the event. This is normally used for the + battery discharging scenario where user-space needs to know the + battery has dropped to a lower level so it can take + appropriate action (e.g. warning user that battery level is + critically low). + + Access: Read, Write + Valid values: 0 - 100 (percent) + +What: /sys/class/power_supply/<supply_name>/capacity_level +Date: June 2009 +Contact: linux-pm@vger.kernel.org +Description: + Coarse representation of battery capacity. + + Access: Read + Valid values: "Unknown", "Critical", "Low", "Normal", "High", + "Full" + +What: /sys/class/power_supply/<supply_name>/current_avg +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports an average IBAT current reading for the battery, over a + fixed period. Normally devices will provide a fixed interval in + which they average readings to smooth out the reported value. + + Access: Read + Valid values: Represented in microamps + +What: /sys/class/power_supply/<supply_name>/current_max +Date: October 2010 +Contact: linux-pm@vger.kernel.org +Description: + Reports the maximum IBAT current allowed into the battery. + + Access: Read + Valid values: Represented in microamps + +What: /sys/class/power_supply/<supply_name>/current_now +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports an instant, single IBAT current reading for the battery. + This value is not averaged/smoothed. + + Access: Read + Valid values: Represented in microamps + +What: /sys/class/power_supply/<supply_name>/charge_type +Date: July 2009 +Contact: linux-pm@vger.kernel.org +Description: + Represents the type of charging currently being applied to the + battery. + + Access: Read + Valid values: "Unknown", "N/A", "Trickle", "Fast" + +What: /sys/class/power_supply/<supply_name>/charge_term_current +Date: July 2014 +Contact: linux-pm@vger.kernel.org +Description: + Reports the charging current value which is used to determine + when the battery is considered full and charging should end. + + Access: Read + Valid values: Represented in microamps + +What: /sys/class/power_supply/<supply_name>/health +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports the health of the battery or battery side of charger + functionality. + + Access: Read + Valid values: "Unknown", "Good", "Overheat", "Dead", + "Over voltage", "Unspecified failure", "Cold", + "Watchdog timer expire", "Safety timer expire" + +What: /sys/class/power_supply/<supply_name>/precharge_current +Date: June 2017 +Contact: linux-pm@vger.kernel.org +Description: + Reports the charging current applied during pre-charging phase + for a battery charge cycle. + + Access: Read + Valid values: Represented in microamps + +What: /sys/class/power_supply/<supply_name>/present +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports whether a battery is present or not in the system. + + Access: Read + Valid values: + 0: Absent + 1: Present + +What: /sys/class/power_supply/<supply_name>/status +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Represents the charging status of the battery. Normally this + is read-only reporting although for some supplies this can be + used to enable/disable charging to the battery. + + Access: Read, Write + Valid values: "Unknown", "Charging", "Discharging", + "Not charging", "Full" + +What: /sys/class/power_supply/<supply_name>/technology +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Describes the battery technology supported by the supply. + + Access: Read + Valid values: "Unknown", "NiMH", "Li-ion", "Li-poly", "LiFe", + "NiCd", "LiMn" + +What: /sys/class/power_supply/<supply_name>/temp +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports the current TBAT battery temperature reading. + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/temp_alert_max +Date: July 2012 +Contact: linux-pm@vger.kernel.org +Description: + Maximum TBAT temperature trip-wire value where the supply will + notify user-space of the event. This is normally used for the + battery charging scenario where user-space needs to know the + battery temperature has crossed an upper threshold so it can + take appropriate action (e.g. warning user that battery level is + critically high, and charging has stopped). + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/temp_alert_min +Date: July 2012 +Contact: linux-pm@vger.kernel.org +Description: + Minimum TBAT temperature trip-wire value where the supply will + notify user-space of the event. This is normally used for the + battery charging scenario where user-space needs to know the + battery temperature has crossed a lower threshold so it can take + appropriate action (e.g. warning user that battery level is + high, and charging current has been reduced accordingly to + remedy the situation). + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/temp_max +Date: July 2014 +Contact: linux-pm@vger.kernel.org +Description: + Reports the maximum allowed TBAT battery temperature for + charging. + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/temp_min +Date: July 2014 +Contact: linux-pm@vger.kernel.org +Description: + Reports the minimum allowed TBAT battery temperature for + charging. + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/voltage_avg, +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports an average VBAT voltage reading for the battery, over a + fixed period. Normally devices will provide a fixed interval in + which they average readings to smooth out the reported value. + + Access: Read + Valid values: Represented in microvolts + +What: /sys/class/power_supply/<supply_name>/voltage_max, +Date: January 2008 +Contact: linux-pm@vger.kernel.org +Description: + Reports the maximum safe VBAT voltage permitted for the battery, + during charging. + + Access: Read + Valid values: Represented in microvolts + +What: /sys/class/power_supply/<supply_name>/voltage_min, +Date: January 2008 +Contact: linux-pm@vger.kernel.org +Description: + Reports the minimum safe VBAT voltage permitted for the battery, + during discharging. + + Access: Read + Valid values: Represented in microvolts + +What: /sys/class/power_supply/<supply_name>/voltage_now, +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports an instant, single VBAT voltage reading for the battery. + This value is not averaged/smoothed. + + Access: Read + Valid values: Represented in microvolts + +===== USB Properties ===== + +What: /sys/class/power_supply/<supply_name>/current_avg +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports an average IBUS current reading over a fixed period. + Normally devices will provide a fixed interval in which they + average readings to smooth out the reported value. + + Access: Read + Valid values: Represented in microamps + + +What: /sys/class/power_supply/<supply_name>/current_max +Date: October 2010 +Contact: linux-pm@vger.kernel.org +Description: + Reports the maximum IBUS current the supply can support. + + Access: Read + Valid values: Represented in microamps + +What: /sys/class/power_supply/<supply_name>/current_now +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports the IBUS current supplied now. This value is generally + read-only reporting, unless the 'online' state of the supply + is set to be programmable, in which case this value can be set + within the reported min/max range. + + Access: Read, Write + Valid values: Represented in microamps + +What: /sys/class/power_supply/<supply_name>/input_current_limit +Date: July 2014 +Contact: linux-pm@vger.kernel.org +Description: + Details the incoming IBUS current limit currently set in the + supply. Normally this is configured based on the type of + connection made (e.g. A configured SDP should output a maximum + of 500mA so the input current limit is set to the same value). + + Access: Read, Write + Valid values: Represented in microamps + +What: /sys/class/power_supply/<supply_name>/online, +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Indicates if VBUS is present for the supply. When the supply is + online, and the supply allows it, then it's possible to switch + between online states (e.g. Fixed -> Programmable for a PD_PPS + USB supply so voltage and current can be controlled). + + Access: Read, Write + Valid values: + 0: Offline + 1: Online Fixed - Fixed Voltage Supply + 2: Online Programmable - Programmable Voltage Supply + +What: /sys/class/power_supply/<supply_name>/temp +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports the current supply temperature reading. This would + normally be the internal temperature of the device itself (e.g + TJUNC temperature of an IC) + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/temp_alert_max +Date: July 2012 +Contact: linux-pm@vger.kernel.org +Description: + Maximum supply temperature trip-wire value where the supply will + notify user-space of the event. This is normally used for the + charging scenario where user-space needs to know the supply + temperature has crossed an upper threshold so it can take + appropriate action (e.g. warning user that the supply + temperature is critically high, and charging has stopped to + remedy the situation). + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/temp_alert_min +Date: July 2012 +Contact: linux-pm@vger.kernel.org +Description: + Minimum supply temperature trip-wire value where the supply will + notify user-space of the event. This is normally used for the + charging scenario where user-space needs to know the supply + temperature has crossed a lower threshold so it can take + appropriate action (e.g. warning user that the supply + temperature is high, and charging current has been reduced + accordingly to remedy the situation). + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/temp_max +Date: July 2014 +Contact: linux-pm@vger.kernel.org +Description: + Reports the maximum allowed supply temperature for operation. + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/temp_min +Date: July 2014 +Contact: linux-pm@vger.kernel.org +Description: + Reports the mainimum allowed supply temperature for operation. + + Access: Read + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/<supply_name>/usb_type +Date: March 2018 +Contact: linux-pm@vger.kernel.org +Description: + Reports what type of USB connection is currently active for + the supply, for example it can show if USB-PD capable source + is attached. + + Access: Read-Only + Valid values: "Unknown", "SDP", "DCP", "CDP", "ACA", "C", "PD", + "PD_DRP", "PD_PPS", "BrickID" + +What: /sys/class/power_supply/<supply_name>/voltage_max +Date: January 2008 +Contact: linux-pm@vger.kernel.org +Description: + Reports the maximum VBUS voltage the supply can support. + + Access: Read + Valid values: Represented in microvolts + +What: /sys/class/power_supply/<supply_name>/voltage_min +Date: January 2008 +Contact: linux-pm@vger.kernel.org +Description: + Reports the minimum VBUS voltage the supply can support. + + Access: Read + Valid values: Represented in microvolts + +What: /sys/class/power_supply/<supply_name>/voltage_now +Date: May 2007 +Contact: linux-pm@vger.kernel.org +Description: + Reports the VBUS voltage supplied now. This value is generally + read-only reporting, unless the 'online' state of the supply + is set to be programmable, in which case this value can be set + within the reported min/max range. + + Access: Read, Write + Valid values: Represented in microvolts + +===== Device Specific Properties ===== + What: /sys/class/power/ds2760-battery.*/charge_now Date: May 2010 KernelVersion: 2.6.35 diff --git a/Documentation/ABI/testing/sysfs-class-rc b/Documentation/ABI/testing/sysfs-class-rc index 8be1fd3760e0..6c0d6c8cb911 100644 --- a/Documentation/ABI/testing/sysfs-class-rc +++ b/Documentation/ABI/testing/sysfs-class-rc @@ -1,7 +1,7 @@ What: /sys/class/rc/ Date: Apr 2010 KernelVersion: 2.6.35 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: The rc/ class sub-directory belongs to the Remote Controller core and provides a sysfs interface for configuring infrared @@ -10,7 +10,7 @@ Description: What: /sys/class/rc/rcN/ Date: Apr 2010 KernelVersion: 2.6.35 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: A /sys/class/rc/rcN directory is created for each remote control receiver device where N is the number of the receiver. @@ -18,7 +18,7 @@ Description: What: /sys/class/rc/rcN/protocols Date: Jun 2010 KernelVersion: 2.6.36 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: Reading this file returns a list of available protocols, something like: @@ -36,7 +36,7 @@ Description: What: /sys/class/rc/rcN/filter Date: Jan 2014 KernelVersion: 3.15 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: Sets the scancode filter expected value. Use in combination with /sys/class/rc/rcN/filter_mask to set the @@ -49,7 +49,7 @@ Description: What: /sys/class/rc/rcN/filter_mask Date: Jan 2014 KernelVersion: 3.15 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: Sets the scancode filter mask of bits to compare. Use in combination with /sys/class/rc/rcN/filter to set the bits @@ -64,7 +64,7 @@ Description: What: /sys/class/rc/rcN/wakeup_protocols Date: Feb 2017 KernelVersion: 4.11 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: Reading this file returns a list of available protocols to use for the wakeup filter, something like: @@ -83,7 +83,7 @@ Description: What: /sys/class/rc/rcN/wakeup_filter Date: Jan 2014 KernelVersion: 3.15 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: Sets the scancode wakeup filter expected value. Use in combination with /sys/class/rc/rcN/wakeup_filter_mask to @@ -98,7 +98,7 @@ Description: What: /sys/class/rc/rcN/wakeup_filter_mask Date: Jan 2014 KernelVersion: 3.15 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: Sets the scancode wakeup filter mask of bits to compare. Use in combination with /sys/class/rc/rcN/wakeup_filter to set diff --git a/Documentation/ABI/testing/sysfs-class-rc-nuvoton b/Documentation/ABI/testing/sysfs-class-rc-nuvoton index 905bcdeedef2..d3abe45f8690 100644 --- a/Documentation/ABI/testing/sysfs-class-rc-nuvoton +++ b/Documentation/ABI/testing/sysfs-class-rc-nuvoton @@ -1,7 +1,7 @@ What: /sys/class/rc/rcN/wakeup_data Date: Mar 2016 KernelVersion: 4.6 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Description: Reading this file returns the stored CIR wakeup sequence. It starts with a pulse, followed by a space, pulse etc. diff --git a/Documentation/ABI/testing/sysfs-devices-edac b/Documentation/ABI/testing/sysfs-devices-edac index 46ff929fd52a..256a9e990c0b 100644 --- a/Documentation/ABI/testing/sysfs-devices-edac +++ b/Documentation/ABI/testing/sysfs-devices-edac @@ -77,7 +77,7 @@ Description: Read/Write attribute file that controls memory scrubbing. What: /sys/devices/system/edac/mc/mc*/max_location Date: April 2012 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> linux-edac@vger.kernel.org Description: This attribute file displays the information about the last available memory slot in this memory controller. It is used by @@ -85,7 +85,7 @@ Description: This attribute file displays the information about the last What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/size Date: April 2012 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> linux-edac@vger.kernel.org Description: This attribute file will display the size of dimm or rank. For dimm*/size, this is the size, in MB of the DIMM memory @@ -96,14 +96,14 @@ Description: This attribute file will display the size of dimm or rank. What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_dev_type Date: April 2012 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> linux-edac@vger.kernel.org Description: This attribute file will display what type of DRAM device is being utilized on this DIMM (x1, x2, x4, x8, ...). What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_edac_mode Date: April 2012 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> linux-edac@vger.kernel.org Description: This attribute file will display what type of Error detection and correction is being utilized. For example: S4ECD4ED would @@ -111,7 +111,7 @@ Description: This attribute file will display what type of Error detection What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_label Date: April 2012 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> linux-edac@vger.kernel.org Description: This control file allows this DIMM to have a label assigned to it. With this label in the module, when errors occur @@ -126,14 +126,14 @@ Description: This control file allows this DIMM to have a label assigned What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_location Date: April 2012 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> linux-edac@vger.kernel.org Description: This attribute file will display the location (csrow/channel, branch/channel/slot or channel/slot) of the dimm or rank. What: /sys/devices/system/edac/mc/mc*/(dimm|rank)*/dimm_mem_type Date: April 2012 -Contact: Mauro Carvalho Chehab <m.chehab@samsung.com> +Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> linux-edac@vger.kernel.org Description: This attribute file will display what type of memory is currently on this csrow. Normally, either buffered or diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 025b7cf3768d..bd4975e132d3 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -478,6 +478,7 @@ What: /sys/devices/system/cpu/vulnerabilities /sys/devices/system/cpu/vulnerabilities/meltdown /sys/devices/system/cpu/vulnerabilities/spectre_v1 /sys/devices/system/cpu/vulnerabilities/spectre_v2 + /sys/devices/system/cpu/vulnerabilities/spec_store_bypass Date: January 2018 Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org> Description: Information about CPU vulnerabilities diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages index e21c00571cf4..fdaa2162fae1 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages +++ b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages @@ -12,4 +12,4 @@ Description: free_hugepages surplus_hugepages resv_hugepages - See Documentation/vm/hugetlbpage.txt for details. + See Documentation/admin-guide/mm/hugetlbpage.rst for details. diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-ksm b/Documentation/ABI/testing/sysfs-kernel-mm-ksm index 73e653ee2481..dfc13244cda3 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-ksm +++ b/Documentation/ABI/testing/sysfs-kernel-mm-ksm @@ -40,7 +40,7 @@ Description: Kernel Samepage Merging daemon sysfs interface sleep_millisecs: how many milliseconds ksm should sleep between scans. - See Documentation/vm/ksm.txt for more information. + See Documentation/vm/ksm.rst for more information. What: /sys/kernel/mm/ksm/merge_across_nodes Date: January 2013 diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index 2cc0a72b64be..29601d93a1c2 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -37,7 +37,7 @@ Description: The alloc_calls file is read-only and lists the kernel code locations from which allocations for this cache were performed. The alloc_calls file only contains information if debugging is - enabled for that cache (see Documentation/vm/slub.txt). + enabled for that cache (see Documentation/vm/slub.rst). What: /sys/kernel/slab/cache/alloc_fastpath Date: February 2008 @@ -219,7 +219,7 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>, Description: The free_calls file is read-only and lists the locations of object frees if slab debugging is enabled (see - Documentation/vm/slub.txt). + Documentation/vm/slub.rst). What: /sys/kernel/slab/cache/free_fastpath Date: February 2008 diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt index 0b6bb3ef449e..688b69121e82 100644 --- a/Documentation/PCI/pci-error-recovery.txt +++ b/Documentation/PCI/pci-error-recovery.txt @@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error event will be platform-dependent, but will follow the general sequence described below. -STEP 0: Error Event +STEP 0: Error Event: ERR_NONFATAL ------------------- A PCI bus error is detected by the PCI hardware. On powerpc, the slot is isolated, in that all I/O is blocked: all reads return 0xffffffff, @@ -228,13 +228,7 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform proceeds to STEP 4 (Slot Reset) -STEP 3: Link Reset ------------------- -The platform resets the link. This is a PCI-Express specific step -and is done whenever a fatal error has been detected that can be -"solved" by resetting the link. - -STEP 4: Slot Reset +STEP 3: Slot Reset ------------------ In response to a return value of PCI_ERS_RESULT_NEED_RESET, the @@ -320,7 +314,7 @@ Failure). >>> However, it probably should. -STEP 5: Resume Operations +STEP 4: Resume Operations ------------------------- The platform will call the resume() callback on all affected device drivers if all drivers on the segment have returned @@ -332,7 +326,7 @@ a result code. At this point, if a new error happens, the platform will restart a new error recovery sequence. -STEP 6: Permanent Failure +STEP 5: Permanent Failure ------------------------- A "permanent failure" has occurred, and the platform cannot recover the device. The platform will call error_detected() with a @@ -355,6 +349,27 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt for additional detail on real-life experience of the causes of software errors. +STEP 0: Error Event: ERR_FATAL +------------------- +PCI bus error is detected by the PCI hardware. On powerpc, the slot is +isolated, in that all I/O is blocked: all reads return 0xffffffff, all +writes are ignored. + +STEP 1: Remove devices +-------------------- +Platform removes the devices depending on the error agent, it could be +this port for all subordinates or upstream component (likely downstream +port) + +STEP 2: Reset link +-------------------- +The platform resets the link. This is a PCI-Express specific step and is +done whenever a fatal error has been detected that can be "solved" by +resetting the link. + +STEP 3: Re-enumerate the devices +-------------------- +Initiates the re-enumeration. Conclusion; General Remarks --------------------------- diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index a27fbfb0efb8..65eb856526b7 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -1,3 +1,5 @@ +What is RCU? -- "Read, Copy, Update" + Please note that the "What is RCU?" LWN series is an excellent place to start learning about RCU: diff --git a/Documentation/accelerators/ocxl.rst b/Documentation/accelerators/ocxl.rst index ddcc58d01cfb..14cefc020e2d 100644 --- a/Documentation/accelerators/ocxl.rst +++ b/Documentation/accelerators/ocxl.rst @@ -157,6 +157,17 @@ OCXL_IOCTL_GET_METADATA: Obtains configuration information from the card, such at the size of MMIO areas, the AFU version, and the PASID for the current context. +OCXL_IOCTL_ENABLE_P9_WAIT: + + Allows the AFU to wake a userspace thread executing 'wait'. Returns + information to userspace to allow it to configure the AFU. Note that + this is only available on POWER9. + +OCXL_IOCTL_GET_FEATURES: + + Reports on which CPU features that affect OpenCAPI are usable from + userspace. + mmap ---- diff --git a/Documentation/acpi/cppc_sysfs.txt b/Documentation/acpi/cppc_sysfs.txt new file mode 100644 index 000000000000..f20fb445135d --- /dev/null +++ b/Documentation/acpi/cppc_sysfs.txt @@ -0,0 +1,69 @@ + + Collaborative Processor Performance Control (CPPC) + +CPPC defined in the ACPI spec describes a mechanism for the OS to manage the +performance of a logical processor on a contigious and abstract performance +scale. CPPC exposes a set of registers to describe abstract performance scale, +to request performance levels and to measure per-cpu delivered performance. + +For more details on CPPC please refer to the ACPI specification at: + +http://uefi.org/specifications + +Some of the CPPC registers are exposed via sysfs under: + +/sys/devices/system/cpu/cpuX/acpi_cppc/ + +for each cpu X + +-------------------------------------------------------------------------------- + +$ ls -lR /sys/devices/system/cpu/cpu0/acpi_cppc/ +/sys/devices/system/cpu/cpu0/acpi_cppc/: +total 0 +-r--r--r-- 1 root root 65536 Mar 5 19:38 feedback_ctrs +-r--r--r-- 1 root root 65536 Mar 5 19:38 highest_perf +-r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_freq +-r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_nonlinear_perf +-r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_perf +-r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_freq +-r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_perf +-r--r--r-- 1 root root 65536 Mar 5 19:38 reference_perf +-r--r--r-- 1 root root 65536 Mar 5 19:38 wraparound_time + +-------------------------------------------------------------------------------- + +* highest_perf : Highest performance of this processor (abstract scale). +* nominal_perf : Highest sustained performance of this processor (abstract scale). +* lowest_nonlinear_perf : Lowest performance of this processor with nonlinear + power savings (abstract scale). +* lowest_perf : Lowest performance of this processor (abstract scale). + +* lowest_freq : CPU frequency corresponding to lowest_perf (in MHz). +* nominal_freq : CPU frequency corresponding to nominal_perf (in MHz). + The above frequencies should only be used to report processor performance in + freqency instead of abstract scale. These values should not be used for any + functional decisions. + +* feedback_ctrs : Includes both Reference and delivered performance counter. + Reference counter ticks up proportional to processor's reference performance. + Delivered counter ticks up proportional to processor's delivered performance. +* wraparound_time: Minimum time for the feedback counters to wraparound (seconds). +* reference_perf : Performance level at which reference performance counter + accumulates (abstract scale). + +-------------------------------------------------------------------------------- + + Computing Average Delivered Performance + +Below describes the steps to compute the average performance delivered by taking +two different snapshots of feedback counters at time T1 and T2. + +T1: Read feedback_ctrs as fbc_t1 + Wait or run some workload +T2: Read feedback_ctrs as fbc_t2 + +delivered_counter_delta = fbc_t2[del] - fbc_t1[del] +reference_counter_delta = fbc_t2[ref] - fbc_t1[ref] + +delivered_perf = (refernce_perf x delivered_counter_delta) / reference_counter_delta diff --git a/Documentation/bcache.txt b/Documentation/admin-guide/bcache.rst index c0ce64d75bbf..c0ce64d75bbf 100644 --- a/Documentation/bcache.txt +++ b/Documentation/admin-guide/bcache.rst diff --git a/Documentation/cgroup-v2.txt b/Documentation/admin-guide/cgroup-v2.rst index 74cdeaed9f7a..8a2c52d5c53b 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1001,14 +1001,44 @@ PAGE_SIZE multiple when read back. The total amount of memory currently being used by the cgroup and its descendants. + memory.min + A read-write single value file which exists on non-root + cgroups. The default is "0". + + Hard memory protection. If the memory usage of a cgroup + is within its effective min boundary, the cgroup's memory + won't be reclaimed under any conditions. If there is no + unprotected reclaimable memory available, OOM killer + is invoked. + + Effective min boundary is limited by memory.min values of + all ancestor cgroups. If there is memory.min overcommitment + (child cgroup or cgroups are requiring more protected memory + than parent will allow), then each child cgroup will get + the part of parent's protection proportional to its + actual memory usage below memory.min. + + Putting more memory than generally available under this + protection is discouraged and may lead to constant OOMs. + + If a memory cgroup is not populated with processes, + its memory.min is ignored. + memory.low A read-write single value file which exists on non-root cgroups. The default is "0". - Best-effort memory protection. If the memory usages of a - cgroup and all its ancestors are below their low boundaries, - the cgroup's memory won't be reclaimed unless memory can be - reclaimed from unprotected cgroups. + Best-effort memory protection. If the memory usage of a + cgroup is within its effective low boundary, the cgroup's + memory won't be reclaimed unless memory can be reclaimed + from unprotected cgroups. + + Effective low boundary is limited by memory.low values of + all ancestor cgroups. If there is memory.low overcommitment + (child cgroup or cgroups are requiring more protected memory + than parent will allow), then each child cgroup will get + the part of parent's protection proportional to its + actual memory usage below memory.low. Putting more memory than generally available under this protection is discouraged. @@ -1199,6 +1229,27 @@ PAGE_SIZE multiple when read back. Swap usage hard limit. If a cgroup's swap usage reaches this limit, anonymous memory of the cgroup will not be swapped out. + memory.swap.events + A read-only flat-keyed file which exists on non-root cgroups. + The following entries are defined. Unless specified + otherwise, a value change in this file generates a file + modified event. + + max + The number of times the cgroup's swap usage was about + to go over the max boundary and swap allocation + failed. + + fail + The number of times swap allocation failed either + because of running out of swap system-wide or max + limit. + + When reduced under the current usage, the existing swap + entries are reclaimed gradually and the swap usage may stay + higher than the limit for an extended period of time. This + reduces the impact on the workload and memory management. + Usage Guidelines ~~~~~~~~~~~~~~~~ @@ -1934,17 +1985,8 @@ system performance due to overreclaim, to the point where the feature becomes self-defeating. The memory.low boundary on the other hand is a top-down allocated -reserve. A cgroup enjoys reclaim protection when it and all its -ancestors are below their low boundaries, which makes delegation of -subtrees possible. Secondly, new cgroups have no reserve per default -and in the common case most cgroups are eligible for the preferred -reclaim pass. This allows the new low boundary to be efficiently -implemented with just a minor addition to the generic reclaim code, -without the need for out-of-band data structures and reclaim passes. -Because the generic reclaim code considers all cgroups except for the -ones running low in the preferred first reclaim pass, overreclaim of -individual groups is eliminated as well, resulting in much better -overall workload performance. +reserve. A cgroup enjoys reclaim protection when it's within its low, +which makes delegation of subtrees possible. The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 5bb9161dbe6a..48d70af11652 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -48,6 +48,7 @@ configure specific aspects of kernel behavior to your liking. :maxdepth: 1 initrd + cgroup-v2 serial-console braille-console parport @@ -60,9 +61,11 @@ configure specific aspects of kernel behavior to your liking. mono java ras + bcache pm/index thunderbolt LSM/index + mm/index .. only:: subproject and html diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 11fc28ecdb6d..638342d0a095 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -106,11 +106,11 @@ use by PCI Format: <irq>,<irq>... - acpi_mask_gpe= [HW,ACPI] + acpi_mask_gpe= [HW,ACPI] Due to the existence of _Lxx/_Exx, some GPEs triggered by unsupported hardware/firmware features can result in - GPE floodings that cannot be automatically disabled by - the GPE dispatcher. + GPE floodings that cannot be automatically disabled by + the GPE dispatcher. This facility can be used to prevent such uncontrolled GPE floodings. Format: <int> @@ -472,10 +472,10 @@ for platform specific values (SB1, Loongson3 and others). - ccw_timeout_log [S390] + ccw_timeout_log [S390] See Documentation/s390/CommonIO for details. - cgroup_disable= [KNL] Disable a particular controller + cgroup_disable= [KNL] Disable a particular controller Format: {name of the controller(s) to disable} The effects of cgroup_disable=foo are: - foo isn't auto-mounted if you mount all cgroups in @@ -518,7 +518,7 @@ those clocks in any way. This parameter is useful for debug and development, but should not be needed on a platform with proper driver support. For more - information, see Documentation/clk.txt. + information, see Documentation/driver-api/clk.rst. clock= [BUGS=X86-32, HW] gettimeofday clocksource override. [Deprecated] @@ -587,11 +587,6 @@ Sets the size of memory pool for coherent, atomic dma allocations, by default set to 256K. - code_bytes [X86] How many bytes of object code to print - in an oops report. - Range: 0 - 8192 - Default: 64 - com20020= [HW,NET] ARCnet - COM20020 chipset Format: <io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]] @@ -641,8 +636,8 @@ hvc<n> Use the hypervisor console device <n>. This is for both Xen and PowerPC hypervisors. - If the device connected to the port is not a TTY but a braille - device, prepend "brl," before the device type, for instance + If the device connected to the port is not a TTY but a braille + device, prepend "brl," before the device type, for instance console=brl,ttyS0 For now, only VisioBraille is supported. @@ -662,7 +657,7 @@ consoleblank= [KNL] The console blank (screen saver) timeout in seconds. A value of 0 disables the blank timer. - Defaults to 0. + Defaults to 0. coredump_filter= [KNL] Change the default value for @@ -730,7 +725,7 @@ or memory reserved is below 4G. cryptomgr.notests - [KNL] Disable crypto self-tests + [KNL] Disable crypto self-tests cs89x0_dma= [HW,NET] Format: <dma> @@ -746,7 +741,7 @@ Format: <port#>,<type> See also Documentation/input/devices/joystick-parport.rst - ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot + ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot time. See Documentation/admin-guide/dynamic-debug-howto.rst for details. Deprecated, see dyndbg. @@ -833,7 +828,7 @@ causing system reset or hang due to sending INIT from AP to BSP. - disable_ddw [PPC/PSERIES] + disable_ddw [PPC/PSERIES] Disable Dynamic DMA Window support. Use this if to workaround buggy firmware. @@ -1025,6 +1020,12 @@ address. The serial port must already be setup and configured. Options are not yet supported. + qcom_geni,<addr> + Start an early, polled-mode console on a Qualcomm + Generic Interface (GENI) based serial port at the + specified address. The serial port must already be + setup and configured. Options are not yet supported. + earlyprintk= [X86,SH,ARM,M68k,S390] earlyprintk=vga earlyprintk=efi @@ -1188,7 +1189,7 @@ parameter will force ia64_sal_cache_flush to call ia64_pal_cache_flush instead of SAL_CACHE_FLUSH. - forcepae [X86-32] + forcepae [X86-32] Forcefully enable Physical Address Extension (PAE). Many Pentium M systems disable PAE but may have a functionally usable PAE implementation. @@ -1247,7 +1248,7 @@ gamma= [HW,DRM] - gart_fix_e820= [X86_64] disable the fix e820 for K8 GART + gart_fix_e820= [X86_64] disable the fix e820 for K8 GART Format: off | on default: on @@ -1341,23 +1342,32 @@ x86-64 are 2M (when the CPU supports "pse") and 1G (when the CPU supports the "pdpe1gb" cpuinfo flag). - hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) - terminal devices. Valid values: 0..8 - hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs. - If specified, z/VM IUCV HVC accepts connections - from listed z/VM user IDs only. + hung_task_panic= + [KNL] Should the hung task detector generate panics. + Format: <integer> + A nonzero value instructs the kernel to panic when a + hung task is detected. The default value is controlled + by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time + option. The value selected by this boot parameter can + be changed later by the kernel.hung_task_panic sysctl. + + hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) + terminal devices. Valid values: 0..8 + hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs. + If specified, z/VM IUCV HVC accepts connections + from listed z/VM user IDs only. keep_bootcon [KNL] Do not unregister boot console at start. This is only useful for debugging when something happens in the window between unregistering the boot console and initializing the real console. - i2c_bus= [HW] Override the default board specific I2C bus speed - or register an additional I2C bus that is not - registered from board initialization code. - Format: - <bus_id>,<clkrate> + i2c_bus= [HW] Override the default board specific I2C bus speed + or register an additional I2C bus that is not + registered from board initialization code. + Format: + <bus_id>,<clkrate> i8042.debug [HW] Toggle i8042 debug mode i8042.unmask_kbd_data @@ -1386,7 +1396,7 @@ Default: only on s2r transitions on x86; most other architectures force reset to be always executed i8042.unlock [HW] Unlock (ignore) the keylock - i8042.kbdreset [HW] Reset device connected to KBD port + i8042.kbdreset [HW] Reset device connected to KBD port i810= [HW,DRM] @@ -1548,13 +1558,13 @@ programs exec'd, files mmap'd for exec, and all files opened for read by uid=0. - ima_template= [IMA] + ima_template= [IMA] Select one of defined IMA measurements template formats. Formats: { "ima" | "ima-ng" | "ima-sig" } Default: "ima-ng" ima_template_fmt= - [IMA] Define a custom template format. + [IMA] Define a custom template format. Format: { "field1|...|fieldN" } ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage @@ -1597,7 +1607,7 @@ inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver Format: <irq> - int_pln_enable [x86] Enable power limit notification interrupt + int_pln_enable [x86] Enable power limit notification interrupt integrity_audit=[IMA] Format: { "0" | "1" } @@ -1650,39 +1660,39 @@ 0 disables intel_idle and fall back on acpi_idle. 1 to 9 specify maximum depth of C-state. - intel_pstate= [X86] - disable - Do not enable intel_pstate as the default - scaling driver for the supported processors - passive - Use intel_pstate as a scaling driver, but configure it - to work with generic cpufreq governors (instead of - enabling its internal governor). This mode cannot be - used along with the hardware-managed P-states (HWP) - feature. - force - Enable intel_pstate on systems that prohibit it by default - in favor of acpi-cpufreq. Forcing the intel_pstate driver - instead of acpi-cpufreq may disable platform features, such - as thermal controls and power capping, that rely on ACPI - P-States information being indicated to OSPM and therefore - should be used with caution. This option does not work with - processors that aren't supported by the intel_pstate driver - or on platforms that use pcc-cpufreq instead of acpi-cpufreq. - no_hwp - Do not enable hardware P state control (HWP) - if available. - hwp_only - Only load intel_pstate on systems which support - hardware P state control (HWP) if available. - support_acpi_ppc - Enforce ACPI _PPC performance limits. If the Fixed ACPI - Description Table, specifies preferred power management - profile as "Enterprise Server" or "Performance Server", - then this feature is turned on by default. - per_cpu_perf_limits - Allow per-logical-CPU P-State performance control limits using - cpufreq sysfs interface + intel_pstate= [X86] + disable + Do not enable intel_pstate as the default + scaling driver for the supported processors + passive + Use intel_pstate as a scaling driver, but configure it + to work with generic cpufreq governors (instead of + enabling its internal governor). This mode cannot be + used along with the hardware-managed P-states (HWP) + feature. + force + Enable intel_pstate on systems that prohibit it by default + in favor of acpi-cpufreq. Forcing the intel_pstate driver + instead of acpi-cpufreq may disable platform features, such + as thermal controls and power capping, that rely on ACPI + P-States information being indicated to OSPM and therefore + should be used with caution. This option does not work with + processors that aren't supported by the intel_pstate driver + or on platforms that use pcc-cpufreq instead of acpi-cpufreq. + no_hwp + Do not enable hardware P state control (HWP) + if available. + hwp_only + Only load intel_pstate on systems which support + hardware P state control (HWP) if available. + support_acpi_ppc + Enforce ACPI _PPC performance limits. If the Fixed ACPI + Description Table, specifies preferred power management + profile as "Enterprise Server" or "Performance Server", + then this feature is turned on by default. + per_cpu_perf_limits + Allow per-logical-CPU P-State performance control limits using + cpufreq sysfs interface intremap= [X86-64, Intel-IOMMU] on enable Interrupt Remapping (default) @@ -1705,7 +1715,6 @@ nopanic merge nomerge - forcesac soft pt [x86, IA-64] nobypass [PPC/POWERNV] @@ -2027,7 +2036,7 @@ * [no]ncqtrim: Turn off queued DSM TRIM. * nohrst, nosrst, norst: suppress hard, soft - and both resets. + and both resets. * rstonce: only attempt one reset during hot-unplug link recovery @@ -2215,7 +2224,7 @@ [KNL,SH] Allow user to override the default size for per-device physically contiguous DMA buffers. - memhp_default_state=online/offline + memhp_default_state=online/offline [KNL] Set the initial state for the memory hotplug onlining policy. If not specified, the default value is set according to the @@ -2600,6 +2609,9 @@ emulation library even if a 387 maths coprocessor is present. + no5lvl [X86-64] Disable 5-level paging mode. Forces + kernel to use 4-level paging instead. + no_console_suspend [HW] Never suspend the console Disable suspending of consoles during suspend and @@ -2680,6 +2692,9 @@ allow data leaks with this option, which is equivalent to spectre_v2=off. + nospec_store_bypass_disable + [HW] Disable all mitigations for the Speculative Store Bypass vulnerability + noxsave [BUGS=X86] Disables x86 extended register state save and restore using xsave. The kernel will fallback to enabling legacy floating-point and sse state. @@ -2762,7 +2777,7 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. + no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. steal time is computed, but won't influence scheduler behaviour @@ -2823,7 +2838,7 @@ notsc [BUGS=X86-32] Disable Time Stamp Counter nowatchdog [KNL] Disable both lockup detectors, i.e. - soft-lockup and NMI watchdog (hard-lockup). + soft-lockup and NMI watchdog (hard-lockup). nowb [ARM] @@ -2843,7 +2858,7 @@ If the dependencies are under your control, you can turn on cpu0_hotplug. - nps_mtm_hs_ctr= [KNL,ARC] + nps_mtm_hs_ctr= [KNL,ARC] This parameter sets the maximum duration, in cycles, each HW thread of the CTOP can run without interruptions, before HW switches it. @@ -2984,7 +2999,7 @@ pci=option[,option...] [PCI] various PCI subsystem options: earlydump [X86] dump PCI config space before the kernel - changes anything + changes anything off [X86] don't probe for the PCI bus bios [X86-32] force use of PCI BIOS, don't access the hardware directly. Use this if your machine @@ -3072,7 +3087,7 @@ is enabled by default. If you need to use this, please report a bug. nocrs [X86] Ignore PCI host bridge windows from ACPI. - If you need to use this, please report a bug. + If you need to use this, please report a bug. routeirq Do IRQ routing for all PCI devices. This is normally done in pci_enable_device(), so this option is a temporary workaround @@ -3147,6 +3162,8 @@ on: Turn realloc on realloc same as realloc=on noari do not use PCIe ARI. + noats [PCIE, Intel-IOMMU, AMD-IOMMU] + do not use PCIe ATS (and IOMMU device IOTLB). pcie_scan_all Scan all possible PCIe devices. Otherwise we only look for one device below a PCIe downstream port. @@ -3915,7 +3932,7 @@ cache (risks via metadata attacks are mostly unchanged). Debug options disable merging on their own. - For more information see Documentation/vm/slub.txt. + For more information see Documentation/vm/slub.rst. slab_max_order= [MM, SLAB] Determines the maximum allowed order for slabs. @@ -3929,7 +3946,7 @@ slub_debug can create guard zones around objects and may poison objects when not in use. Also tracks the last alloc / free. For more information see - Documentation/vm/slub.txt. + Documentation/vm/slub.rst. slub_memcg_sysfs= [MM, SLUB] Determines whether to enable sysfs directories for @@ -3943,7 +3960,7 @@ Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. For more information see - Documentation/vm/slub.txt. + Documentation/vm/slub.rst. slub_min_objects= [MM, SLUB] The minimum number of objects per slab. SLUB will @@ -3952,12 +3969,12 @@ the number of objects indicated. The higher the number of objects the smaller the overhead of tracking slabs and the less frequently locks need to be acquired. - For more information see Documentation/vm/slub.txt. + For more information see Documentation/vm/slub.rst. slub_min_order= [MM, SLUB] Determines the minimum page order for slabs. Must be lower than slub_max_order. - For more information see Documentation/vm/slub.txt. + For more information see Documentation/vm/slub.rst. slub_nomerge [MM, SLUB] Same with slab_nomerge. This is supported for legacy. @@ -4025,6 +4042,48 @@ Not specifying this option is equivalent to spectre_v2=auto. + spec_store_bypass_disable= + [HW] Control Speculative Store Bypass (SSB) Disable mitigation + (Speculative Store Bypass vulnerability) + + Certain CPUs are vulnerable to an exploit against a + a common industry wide performance optimization known + as "Speculative Store Bypass" in which recent stores + to the same memory location may not be observed by + later loads during speculative execution. The idea + is that such stores are unlikely and that they can + be detected prior to instruction retirement at the + end of a particular speculation execution window. + + In vulnerable processors, the speculatively forwarded + store can be used in a cache side channel attack, for + example to read memory to which the attacker does not + directly have access (e.g. inside sandboxed code). + + This parameter controls whether the Speculative Store + Bypass optimization is used. + + on - Unconditionally disable Speculative Store Bypass + off - Unconditionally enable Speculative Store Bypass + auto - Kernel detects whether the CPU model contains an + implementation of Speculative Store Bypass and + picks the most appropriate mitigation. If the + CPU is not vulnerable, "off" is selected. If the + CPU is vulnerable the default mitigation is + architecture and Kconfig dependent. See below. + prctl - Control Speculative Store Bypass per thread + via prctl. Speculative Store Bypass is enabled + for a process by default. The state of the control + is inherited on fork. + seccomp - Same as "prctl" above, but all seccomp threads + will disable SSB unless they explicitly opt out. + + Not specifying this option is equivalent to + spec_store_bypass_disable=auto. + + Default mitigations: + X86: If CONFIG_SECCOMP=y "seccomp", otherwise "prctl" + spia_io_base= [HW,MTD] spia_fio_base= spia_pedr= @@ -4047,6 +4106,23 @@ expediting. Set to zero to disable automatic expediting. + ssbd= [ARM64,HW] + Speculative Store Bypass Disable control + + On CPUs that are vulnerable to the Speculative + Store Bypass vulnerability and offer a + firmware based mitigation, this parameter + indicates how the mitigation should be used: + + force-on: Unconditionally enable mitigation for + for both kernel and userspace + force-off: Unconditionally disable mitigation for + for both kernel and userspace + kernel: Always enable mitigation in the + kernel, and offer a prctl interface + to allow userspace to register its + interest in being mitigated too. + stack_guard_gap= [MM] override the default stack gap protection. The value is in page units and it defines how many pages prior @@ -4313,7 +4389,8 @@ Format: [always|madvise|never] Can be used to control the default behavior of the system with respect to transparent hugepages. - See Documentation/vm/transhuge.txt for more details. + See Documentation/admin-guide/mm/transhuge.rst + for more details. tsc= Disable clocksource stability checks for TSC. Format: <string> @@ -4391,7 +4468,7 @@ usbcore.initial_descriptor_timeout= [USB] Specifies timeout for the initial 64-byte - USB_REQ_GET_DESCRIPTOR request in milliseconds + USB_REQ_GET_DESCRIPTOR request in milliseconds (default 5000 = 5.0 seconds). usbcore.nousb [USB] Disable the USB subsystem diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst new file mode 100644 index 000000000000..291699c810d4 --- /dev/null +++ b/Documentation/admin-guide/mm/concepts.rst @@ -0,0 +1,222 @@ +.. _mm_concepts: + +================= +Concepts overview +================= + +The memory management in Linux is complex system that evolved over the +years and included more and more functionality to support variety of +systems from MMU-less microcontrollers to supercomputers. The memory +management for systems without MMU is called ``nommu`` and it +definitely deserves a dedicated document, which hopefully will be +eventually written. Yet, although some of the concepts are the same, +here we assume that MMU is available and CPU can translate a virtual +address to a physical address. + +.. contents:: :local: + +Virtual Memory Primer +===================== + +The physical memory in a computer system is a limited resource and +even for systems that support memory hotplug there is a hard limit on +the amount of memory that can be installed. The physical memory is not +necessary contiguous, it might be accessible as a set of distinct +address ranges. Besides, different CPU architectures, and even +different implementations of the same architecture have different view +how these address ranges defined. + +All this makes dealing directly with physical memory quite complex and +to avoid this complexity a concept of virtual memory was developed. + +The virtual memory abstracts the details of physical memory from the +application software, allows to keep only needed information in the +physical memory (demand paging) and provides a mechanism for the +protection and controlled sharing of data between processes. + +With virtual memory, each and every memory access uses a virtual +address. When the CPU decodes the an instruction that reads (or +writes) from (or to) the system memory, it translates the `virtual` +address encoded in that instruction to a `physical` address that the +memory controller can understand. + +The physical system memory is divided into page frames, or pages. The +size of each page is architecture specific. Some architectures allow +selection of the page size from several supported values; this +selection is performed at the kernel build time by setting an +appropriate kernel configuration option. + +Each physical memory page can be mapped as one or more virtual +pages. These mappings are described by page tables that allow +translation from virtual address used by programs to real address in +the physical memory. The page tables organized hierarchically. + +The tables at the lowest level of the hierarchy contain physical +addresses of actual pages used by the software. The tables at higher +levels contain physical addresses of the pages belonging to the lower +levels. The pointer to the top level page table resides in a +register. When the CPU performs the address translation, it uses this +register to access the top level page table. The high bits of the +virtual address are used to index an entry in the top level page +table. That entry is then used to access the next level in the +hierarchy with the next bits of the virtual address as the index to +that level page table. The lowest bits in the virtual address define +the offset inside the actual page. + +Huge Pages +========== + +The address translation requires several memory accesses and memory +accesses are slow relatively to CPU speed. To avoid spending precious +processor cycles on the address translation, CPUs maintain a cache of +such translations called Translation Lookaside Buffer (or +TLB). Usually TLB is pretty scarce resource and applications with +large memory working set will experience performance hit because of +TLB misses. + +Many modern CPU architectures allow mapping of the memory pages +directly by the higher levels in the page table. For instance, on x86, +it is possible to map 2M and even 1G pages using entries in the second +and the third level page tables. In Linux such pages are called +`huge`. Usage of huge pages significantly reduces pressure on TLB, +improves TLB hit-rate and thus improves overall system performance. + +There are two mechanisms in Linux that enable mapping of the physical +memory with the huge pages. The first one is `HugeTLB filesystem`, or +hugetlbfs. It is a pseudo filesystem that uses RAM as its backing +store. For the files created in this filesystem the data resides in +the memory and mapped using huge pages. The hugetlbfs is described at +:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`. + +Another, more recent, mechanism that enables use of the huge pages is +called `Transparent HugePages`, or THP. Unlike the hugetlbfs that +requires users and/or system administrators to configure what parts of +the system memory should and can be mapped by the huge pages, THP +manages such mappings transparently to the user and hence the +name. See +:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>` +for more details about THP. + +Zones +===== + +Often hardware poses restrictions on how different physical memory +ranges can be accessed. In some cases, devices cannot perform DMA to +all the addressable memory. In other cases, the size of the physical +memory exceeds the maximal addressable size of virtual memory and +special actions are required to access portions of the memory. Linux +groups memory pages into `zones` according to their possible +usage. For example, ZONE_DMA will contain memory that can be used by +devices for DMA, ZONE_HIGHMEM will contain memory that is not +permanently mapped into kernel's address space and ZONE_NORMAL will +contain normally addressed pages. + +The actual layout of the memory zones is hardware dependent as not all +architectures define all zones, and requirements for DMA are different +for different platforms. + +Nodes +===== + +Many multi-processor machines are NUMA - Non-Uniform Memory Access - +systems. In such systems the memory is arranged into banks that have +different access latency depending on the "distance" from the +processor. Each bank is referred as `node` and for each node Linux +constructs an independent memory management subsystem. A node has it's +own set of zones, lists of free and used pages and various statistics +counters. You can find more details about NUMA in +:ref:`Documentation/vm/numa.rst <numa>` and in +:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`. + +Page cache +========== + +The physical memory is volatile and the common case for getting data +into the memory is to read it from files. Whenever a file is read, the +data is put into the `page cache` to avoid expensive disk access on +the subsequent reads. Similarly, when one writes to a file, the data +is placed in the page cache and eventually gets into the backing +storage device. The written pages are marked as `dirty` and when Linux +decides to reuse them for other purposes, it makes sure to synchronize +the file contents on the device with the updated data. + +Anonymous Memory +================ + +The `anonymous memory` or `anonymous mappings` represent memory that +is not backed by a filesystem. Such mappings are implicitly created +for program's stack and heap or by explicit calls to mmap(2) system +call. Usually, the anonymous mappings only define virtual memory areas +that the program is allowed to access. The read accesses will result +in creation of a page table entry that references a special physical +page filled with zeroes. When the program performs a write, regular +physical page will be allocated to hold the written data. The page +will be marked dirty and if the kernel will decide to repurpose it, +the dirty page will be swapped out. + +Reclaim +======= + +Throughout the system lifetime, a physical page can be used for storing +different types of data. It can be kernel internal data structures, +DMA'able buffers for device drivers use, data read from a filesystem, +memory allocated by user space processes etc. + +Depending on the page usage it is treated differently by the Linux +memory management. The pages that can be freed at any time, either +because they cache the data available elsewhere, for instance, on a +hard disk, or because they can be swapped out, again, to the hard +disk, are called `reclaimable`. The most notable categories of the +reclaimable pages are page cache and anonymous memory. + +In most cases, the pages holding internal kernel data and used as DMA +buffers cannot be repurposed, and they remain pinned until freed by +their user. Such pages are called `unreclaimable`. However, in certain +circumstances, even pages occupied with kernel data structures can be +reclaimed. For instance, in-memory caches of filesystem metadata can +be re-read from the storage device and therefore it is possible to +discard them from the main memory when system is under memory +pressure. + +The process of freeing the reclaimable physical memory pages and +repurposing them is called (surprise!) `reclaim`. Linux can reclaim +pages either asynchronously or synchronously, depending on the state +of the system. When system is not loaded, most of the memory is free +and allocation request will be satisfied immediately from the free +pages supply. As the load increases, the amount of the free pages goes +down and when it reaches a certain threshold (high watermark), an +allocation request will awaken the ``kswapd`` daemon. It will +asynchronously scan memory pages and either just free them if the data +they contain is available elsewhere, or evict to the backing storage +device (remember those dirty pages?). As memory usage increases even +more and reaches another threshold - min watermark - an allocation +will trigger the `direct reclaim`. In this case allocation is stalled +until enough memory pages are reclaimed to satisfy the request. + +Compaction +========== + +As the system runs, tasks allocate and free the memory and it becomes +fragmented. Although with virtual memory it is possible to present +scattered physical pages as virtually contiguous range, sometimes it is +necessary to allocate large physically contiguous memory areas. Such +need may arise, for instance, when a device driver requires large +buffer for DMA, or when THP allocates a huge page. Memory `compaction` +addresses the fragmentation issue. This mechanism moves occupied pages +from the lower part of a memory zone to free pages in the upper part +of the zone. When a compaction scan is finished free pages are grouped +together at the beginning of the zone and allocations of large +physically contiguous areas become possible. + +Like reclaim, the compaction may happen asynchronously in ``kcompactd`` +daemon or synchronously as a result of memory allocation request. + +OOM killer +========== + +It may happen, that on a loaded machine memory will be exhausted. When +the kernel detects that the system runs out of memory (OOM) it invokes +`OOM killer`. Its mission is simple: all it has to do is to select a +task to sacrifice for the sake of the overall system health. The +selected task is killed in a hope that after it exits enough memory +will be freed to continue normal operation. diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/admin-guide/mm/hugetlbpage.rst index faf077d50d42..1cc0bc78d10e 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -1,3 +1,11 @@ +.. _hugetlbpage: + +============= +HugeTLB Pages +============= + +Overview +======== The intent of this file is to give a brief summary of hugetlbpage support in the Linux kernel. This support is built on top of multiple page size support @@ -18,53 +26,59 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS automatically when CONFIG_HUGETLBFS is selected) configuration options. -The /proc/meminfo file provides information about the total number of +The ``/proc/meminfo`` file provides information about the total number of persistent hugetlb pages in the kernel's huge page pool. It also displays default huge page size and information about the number of free, reserved and surplus huge pages in the pool of huge pages of default size. The huge page size is needed for generating the proper alignment and size of the arguments to system calls that map huge page regions. -The output of "cat /proc/meminfo" will include lines like: +The output of ``cat /proc/meminfo`` will include lines like:: -..... -HugePages_Total: uuu -HugePages_Free: vvv -HugePages_Rsvd: www -HugePages_Surp: xxx -Hugepagesize: yyy kB -Hugetlb: zzz kB + HugePages_Total: uuu + HugePages_Free: vvv + HugePages_Rsvd: www + HugePages_Surp: xxx + Hugepagesize: yyy kB + Hugetlb: zzz kB where: -HugePages_Total is the size of the pool of huge pages. -HugePages_Free is the number of huge pages in the pool that are not yet - allocated. -HugePages_Rsvd is short for "reserved," and is the number of huge pages for - which a commitment to allocate from the pool has been made, - but no allocation has yet been made. Reserved huge pages - guarantee that an application will be able to allocate a - huge page from the pool of huge pages at fault time. -HugePages_Surp is short for "surplus," and is the number of huge pages in - the pool above the value in /proc/sys/vm/nr_hugepages. The - maximum number of surplus huge pages is controlled by - /proc/sys/vm/nr_overcommit_hugepages. -Hugepagesize is the default hugepage size (in Kb). -Hugetlb is the total amount of memory (in kB), consumed by huge - pages of all sizes. - If huge pages of different sizes are in use, this number - will exceed HugePages_Total * Hugepagesize. To get more - detailed information, please, refer to - /sys/kernel/mm/hugepages (described below). - - -/proc/filesystems should also show a filesystem of type "hugetlbfs" configured -in the kernel. - -/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge + +HugePages_Total + is the size of the pool of huge pages. +HugePages_Free + is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd + is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. Reserved huge pages + guarantee that an application will be able to allocate a + huge page from the pool of huge pages at fault time. +HugePages_Surp + is short for "surplus," and is the number of huge pages in + the pool above the value in ``/proc/sys/vm/nr_hugepages``. The + maximum number of surplus huge pages is controlled by + ``/proc/sys/vm/nr_overcommit_hugepages``. +Hugepagesize + is the default hugepage size (in Kb). +Hugetlb + is the total amount of memory (in kB), consumed by huge + pages of all sizes. + If huge pages of different sizes are in use, this number + will exceed HugePages_Total \* Hugepagesize. To get more + detailed information, please, refer to + ``/sys/kernel/mm/hugepages`` (described below). + + +``/proc/filesystems`` should also show a filesystem of type "hugetlbfs" +configured in the kernel. + +``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge pages in the kernel's huge page pool. "Persistent" huge pages will be returned to the huge page pool when freed by a task. A user with root privileges can dynamically allocate more or free some persistent huge pages -by increasing or decreasing the value of 'nr_hugepages'. +by increasing or decreasing the value of ``nr_hugepages``. Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under @@ -73,7 +87,7 @@ memory pressure. Once a number of huge pages have been pre-allocated to the kernel huge page pool, a user with appropriate privilege can use either the mmap system call or shared memory system calls to use the huge pages. See the discussion of -Using Huge Pages, below. +:ref:`Using Huge Pages <using_huge_pages>`, below. The administrator can allocate persistent huge pages on the kernel boot command line by specifying the "hugepages=N" parameter, where 'N' = the @@ -86,10 +100,10 @@ with a huge page size selection parameter "hugepagesz=<size>". <size> must be specified in bytes with optional scale suffix [kKmMgG]. The default huge page size may be selected with the "default_hugepagesz=<size>" boot parameter. -When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages +When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` indicates the current number of pre-allocated huge pages of the default size. Thus, one can use the following command to dynamically allocate/deallocate -default sized persistent huge pages: +default sized persistent huge pages:: echo 20 > /proc/sys/vm/nr_hugepages @@ -98,11 +112,12 @@ huge page pool to 20, allocating or freeing huge pages, as required. On a NUMA platform, the kernel will attempt to distribute the huge page pool over all the set of allowed nodes specified by the NUMA memory policy of the -task that modifies nr_hugepages. The default for the allowed nodes--when the +task that modifies ``nr_hugepages``. The default for the allowed nodes--when the task has default memory policy--is all on-line nodes with memory. Allowed nodes with insufficient available, contiguous memory for a huge page will be -silently skipped when allocating persistent huge pages. See the discussion -below of the interaction of task memory policy, cpusets and per node attributes +silently skipped when allocating persistent huge pages. See the +:ref:`discussion below <mem_policy_and_hp_alloc>` +of the interaction of task memory policy, cpusets and per node attributes with the allocation and freeing of persistent huge pages. The success or failure of huge page allocation depends on the amount of @@ -117,51 +132,52 @@ init files. This will enable the kernel to allocate huge pages early in the boot process when the possibility of getting physical contiguous pages is still very high. Administrators can verify the number of huge pages actually allocated by checking the sysctl or meminfo. To check the per node -distribution of huge pages in a NUMA system, use: +distribution of huge pages in a NUMA system, use:: cat /sys/devices/system/node/node*/meminfo | fgrep Huge -/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of -huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are +``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of +huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are requested by applications. Writing any non-zero value into this file indicates that the hugetlb subsystem is allowed to try to obtain that number of "surplus" huge pages from the kernel's normal page pool, when the persistent huge page pool is exhausted. As these surplus huge pages become unused, they are freed back to the kernel's normal page pool. -When increasing the huge page pool size via nr_hugepages, any existing surplus -pages will first be promoted to persistent huge pages. Then, additional +When increasing the huge page pool size via ``nr_hugepages``, any existing +surplus pages will first be promoted to persistent huge pages. Then, additional huge pages will be allocated, if necessary and if possible, to fulfill the new persistent huge page pool size. The administrator may shrink the pool of persistent huge pages for -the default huge page size by setting the nr_hugepages sysctl to a +the default huge page size by setting the ``nr_hugepages`` sysctl to a smaller value. The kernel will attempt to balance the freeing of huge pages -across all nodes in the memory policy of the task modifying nr_hugepages. +across all nodes in the memory policy of the task modifying ``nr_hugepages``. Any free huge pages on the selected nodes will be freed back to the kernel's normal page pool. -Caveat: Shrinking the persistent huge page pool via nr_hugepages such that +Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that it becomes less than the number of huge pages in use will convert the balance of the in-use huge pages to surplus huge pages. This will occur even if -the number of surplus pages it would exceed the overcommit value. As long as -this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is +the number of surplus pages would exceed the overcommit value. As long as +this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is increased sufficiently, or the surplus huge pages go out of use and are freed-- no more surplus huge pages will be allowed to be allocated. With support for multiple huge page pools at run-time available, much of -the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs. -The /proc interfaces discussed above have been retained for backwards -compatibility. The root huge page control directory in sysfs is: +the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in +sysfs. +The ``/proc`` interfaces discussed above have been retained for backwards +compatibility. The root huge page control directory in sysfs is:: /sys/kernel/mm/hugepages For each huge page size supported by the running kernel, a subdirectory -will exist, of the form: +will exist, of the form:: hugepages-${size}kB -Inside each of these directories, the same set of files will exist: +Inside each of these directories, the same set of files will exist:: nr_hugepages nr_hugepages_mempolicy @@ -172,37 +188,39 @@ Inside each of these directories, the same set of files will exist: which function as described above for the default huge page-sized case. +.. _mem_policy_and_hp_alloc: Interaction of Task Memory Policy with Huge Page Allocation/Freeing =================================================================== -Whether huge pages are allocated and freed via the /proc interface or -the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA -nodes from which huge pages are allocated or freed are controlled by the -NUMA memory policy of the task that modifies the nr_hugepages_mempolicy -sysctl or attribute. When the nr_hugepages attribute is used, mempolicy +Whether huge pages are allocated and freed via the ``/proc`` interface or +the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the +NUMA nodes from which huge pages are allocated or freed are controlled by the +NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy`` +sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy is ignored. The recommended method to allocate or free huge pages to/from the kernel -huge page pool, using the nr_hugepages example above, is: +huge page pool, using the ``nr_hugepages`` example above, is:: numactl --interleave <node-list> echo 20 \ >/proc/sys/vm/nr_hugepages_mempolicy -or, more succinctly: +or, more succinctly:: numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy -This will allocate or free abs(20 - nr_hugepages) to or from the nodes +This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes specified in <node-list>, depending on whether number of persistent huge pages is initially less than or greater than 20, respectively. No huge pages will be allocated nor freed on any node not included in the specified <node-list>. -When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any +When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any memory policy mode--bind, preferred, local or interleave--may be used. The resulting effect on persistent huge page allocation is as follows: -1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt], +#. Regardless of mempolicy mode [see + :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. However, if a node in the policy does not contain sufficient contiguous @@ -212,7 +230,7 @@ resulting effect on persistent huge page allocation is as follows: possibly, allocation of persistent huge pages on nodes not allowed by the task's memory policy. -2) One or more nodes may be specified with the bind or interleave policy. +#. One or more nodes may be specified with the bind or interleave policy. If more than one node is specified with the preferred policy, only the lowest numeric id will be used. Local policy will select the node where the task is running at the time the nodes_allowed mask is constructed. @@ -222,20 +240,20 @@ resulting effect on persistent huge page allocation is as follows: indeterminate. Thus, local policy is not very useful for this purpose. Any of the other mempolicy modes may be used to specify a single node. -3) The nodes allowed mask will be derived from any non-default task mempolicy, +#. The nodes allowed mask will be derived from any non-default task mempolicy, whether this policy was set explicitly by the task itself or one of its ancestors, such as numactl. This means that if the task is invoked from a shell with non-default policy, that policy will be used. One can specify a node list of "all" with numactl --interleave or --membind [-m] to achieve interleaving over all nodes in the system or cpuset. -4) Any task mempolicy specified--e.g., using numactl--will be constrained by +#. Any task mempolicy specified--e.g., using numactl--will be constrained by the resource limits of any cpuset in which the task runs. Thus, there will be no way for a task with non-default policy running in a cpuset with a subset of the system nodes to allocate huge pages outside the cpuset without first moving to a cpuset that contains all of the desired nodes. -5) Boot-time huge page allocation attempts to distribute the requested number +#. Boot-time huge page allocation attempts to distribute the requested number of huge pages over all on-lines nodes with memory. Per Node Hugepages Attributes @@ -243,22 +261,22 @@ Per Node Hugepages Attributes A subset of the contents of the root huge page control directory in sysfs, described above, will be replicated under each the system device of each -NUMA node with memory in: +NUMA node with memory in:: /sys/devices/system/node/node[0-9]*/hugepages/ Under this directory, the subdirectory for each supported huge page size -contains the following attribute files: +contains the following attribute files:: nr_hugepages free_hugepages surplus_hugepages -The free_' and surplus_' attribute files are read-only. They return the number +The free\_' and surplus\_' attribute files are read-only. They return the number of free and surplus [overcommitted] huge pages, respectively, on the parent node. -The nr_hugepages attribute returns the total number of huge pages on the +The ``nr_hugepages`` attribute returns the total number of huge pages on the specified node. When this attribute is written, the number of persistent huge pages on the parent node will be adjusted to the specified value, if sufficient resources exist, regardless of the task's mempolicy or cpuset constraints. @@ -267,43 +285,58 @@ Note that the number of overcommit and reserve pages remain global quantities, as we don't know until fault time, when the faulting task's mempolicy is applied, from which node the huge page allocation will be attempted. +.. _using_huge_pages: Using Huge Pages ================ If the user applications are going to request huge pages using mmap system call, then it is required that system administrator mount a file system of -type hugetlbfs: +type hugetlbfs:: mount -t hugetlbfs \ -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\ min_size=<value>,nr_inodes=<value> none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory -/mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid -options sets the owner and group of the root of the file system. By default -the uid and gid of the current process are taken. The mode option sets the -mode of root of file system to value & 01777. This value is given in octal. -By default the value 0755 is picked. If the platform supports multiple huge -page sizes, the pagesize option can be used to specify the huge page size and -associated pool. pagesize is specified in bytes. If pagesize is not specified -the platform's default huge page size and associated pool will be used. The -size option sets the maximum value of memory (huge pages) allowed for that -filesystem (/mnt/huge). The size option can be specified in bytes, or as a -percentage of the specified huge page pool (nr_hugepages). The size is -rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum -value of memory (huge pages) allowed for the filesystem. min_size can be -specified in the same way as size, either bytes or a percentage of the -huge page pool. At mount time, the number of huge pages specified by -min_size are reserved for use by the filesystem. If there are not enough -free huge pages available, the mount will fail. As huge pages are allocated -to the filesystem and freed, the reserve count is adjusted so that the sum -of allocated and reserved huge pages is always at least min_size. The option -nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the -size, min_size or nr_inodes option is not provided on command line then -no limits are set. For pagesize, size, min_size and nr_inodes options, you -can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K -has the same meaning as size=2048. +``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages. + +The ``uid`` and ``gid`` options sets the owner and group of the root of the +file system. By default the ``uid`` and ``gid`` of the current process +are taken. + +The ``mode`` option sets the mode of root of file system to value & 01777. +This value is given in octal. By default the value 0755 is picked. + +If the platform supports multiple huge page sizes, the ``pagesize`` option can +be used to specify the huge page size and associated pool. ``pagesize`` +is specified in bytes. If ``pagesize`` is not specified the platform's +default huge page size and associated pool will be used. + +The ``size`` option sets the maximum value of memory (huge pages) allowed +for that filesystem (``/mnt/huge``). The ``size`` option can be specified +in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``). +The size is rounded down to HPAGE_SIZE boundary. + +The ``min_size`` option sets the minimum value of memory (huge pages) allowed +for the filesystem. ``min_size`` can be specified in the same way as ``size``, +either bytes or a percentage of the huge page pool. +At mount time, the number of huge pages specified by ``min_size`` are reserved +for use by the filesystem. +If there are not enough free huge pages available, the mount will fail. +As huge pages are allocated to the filesystem and freed, the reserve count +is adjusted so that the sum of allocated and reserved huge pages is always +at least ``min_size``. + +The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge`` +can use. + +If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on +command line then no limits are set. + +For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can +use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. +For example, size=2K has the same meaning as size=2048. While read system calls are supported on files that reside on hugetlb file systems, write system calls are not. @@ -313,12 +346,12 @@ used to change the file attributes on hugetlbfs. Also, it is important to note that no such mount command is required if applications are going to use only shmat/shmget system calls or mmap with -MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb -below. +MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see +:ref:`map_hugetlb <map_hugetlb>` below. -Users who wish to use hugetlb memory via shared memory segment should be a -member of a supplementary group and system admin needs to configure that gid -into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different +Users who wish to use hugetlb memory via shared memory segment should be +members of a supplementary group and system admin needs to configure that gid +into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different applications to use any combination of mmaps and shm* calls, though the mount of filesystem will be required for using mmap calls without MAP_HUGETLB. @@ -332,20 +365,18 @@ a hugetlb page and the length is smaller than the hugepage size. Examples ======== -1) map_hugetlb: see tools/testing/selftests/vm/map_hugetlb.c +.. _map_hugetlb: -2) hugepage-shm: see tools/testing/selftests/vm/hugepage-shm.c +``map_hugetlb`` + see tools/testing/selftests/vm/map_hugetlb.c -3) hugepage-mmap: see tools/testing/selftests/vm/hugepage-mmap.c +``hugepage-shm`` + see tools/testing/selftests/vm/hugepage-shm.c -4) The libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs) library - provides a wide range of userspace tools to help with huge page usability, - environment setup, and control. +``hugepage-mmap`` + see tools/testing/selftests/vm/hugepage-mmap.c -Kernel development regression testing -===================================== +The `libhugetlbfs`_ library provides a wide range of userspace tools +to help with huge page usability, environment setup, and control. -The most complete set of hugetlb tests are in the libhugetlbfs repository. -If you modify any hugetlb related code, use the libhugetlbfs test suite -to check for regressions. In addition, if you add any new hugetlb -functionality, please add appropriate tests to libhugetlbfs. +.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs diff --git a/Documentation/vm/idle_page_tracking.txt b/Documentation/admin-guide/mm/idle_page_tracking.rst index 85dcc3bb85dc..6f7b7ca1add3 100644 --- a/Documentation/vm/idle_page_tracking.txt +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -1,4 +1,11 @@ -MOTIVATION +.. _idle_page_tracking: + +================== +Idle Page Tracking +================== + +Motivation +========== The idle page tracking feature allows to track which memory pages are being accessed by a workload and which are idle. This information can be useful for @@ -8,10 +15,14 @@ or deciding where to place the workload within a compute cluster. It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. -USER API +.. _user_api: -The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently, -it consists of the only read-write file, /sys/kernel/mm/page_idle/bitmap. +User API +======== + +The idle page tracking API is located at ``/sys/kernel/mm/page_idle``. +Currently, it consists of the only read-write file, +``/sys/kernel/mm/page_idle/bitmap``. The file implements a bitmap where each bit corresponds to a memory page. The bitmap is represented by an array of 8-byte integers, and the page at PFN #i is @@ -19,8 +30,9 @@ mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle -(for more details on what "accessed" actually means see the IMPLEMENTATION -DETAILS section). To mark a page idle one has to set the bit corresponding to +(for more details on what "accessed" actually means see the :ref:`Implementation +Details <impl_details>` section). +To mark a page idle one has to set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. @@ -30,9 +42,9 @@ page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, and hence such pages are never reported idle. For huge pages the idle flag is set only on the head page, so one has to read -/proc/kpageflags in order to correctly count idle huge pages. +``/proc/kpageflags`` in order to correctly count idle huge pages. -Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return +Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return -EINVAL if you are not starting the read/write on an 8-byte boundary, or if the size of the read/write is not a multiple of 8 bytes. Writing to this file beyond max PFN will return -ENXIO. @@ -41,21 +53,26 @@ That said, in order to estimate the amount of pages that are not used by a workload one should: 1. Mark all the workload's pages as idle by setting corresponding bits in - /sys/kernel/mm/page_idle/bitmap. The pages can be found by reading - /proc/pid/pagemap if the workload is represented by a process, or by - filtering out alien pages using /proc/kpagecgroup in case the workload is - placed in a memory cgroup. + ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading + ``/proc/pid/pagemap`` if the workload is represented by a process, or by + filtering out alien pages using ``/proc/kpagecgroup`` in case the workload + is placed in a memory cgroup. 2. Wait until the workload accesses its working set. - 3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If - one wants to ignore certain types of pages, e.g. mlocked pages since they - are not reclaimable, he or she can filter them out using /proc/kpageflags. + 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set. + If one wants to ignore certain types of pages, e.g. mlocked pages since they + are not reclaimable, he or she can filter them out using + ``/proc/kpageflags``. + +See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more +information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and +``/proc/kpagecgroup``. -See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap, -/proc/kpageflags, and /proc/kpagecgroup. +.. _impl_details: -IMPLEMENTATION DETAILS +Implementation Details +====================== The kernel internally keeps track of accesses to user memory pages in order to reclaim unreferenced pages first on memory shortage conditions. A page is @@ -77,7 +94,8 @@ When a dirty page is written to swap or disk as a result of memory reclaim or exceeding the dirty memory limit, it is not marked referenced. The idle memory tracking feature adds a new page flag, the Idle flag. This flag -is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the +:ref:`User API <user_api>` section), and cleared automatically whenever a page is referenced as defined above. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst new file mode 100644 index 000000000000..ceead68c2df7 --- /dev/null +++ b/Documentation/admin-guide/mm/index.rst @@ -0,0 +1,36 @@ +================= +Memory Management +================= + +Linux memory management subsystem is responsible, as the name implies, +for managing the memory in the system. This includes implemnetation of +virtual memory and demand paging, memory allocation both for kernel +internal structures and user space programms, mapping of files into +processes address space and many other cool things. + +Linux memory management is a complex system with many configurable +settings. Most of these settings are available via ``/proc`` +filesystem and can be quired and adjusted using ``sysctl``. These APIs +are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. + +.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html + +Linux memory management has its own jargon and if you are not yet +familiar with it, consider reading +:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`. + +Here we document in detail how to interact with various mechanisms in +the Linux memory management. + +.. toctree:: + :maxdepth: 1 + + concepts + hugetlbpage + idle_page_tracking + ksm + numa_memory_policy + pagemap + soft-dirty + transhuge + userfaultfd diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst new file mode 100644 index 000000000000..9303786632d1 --- /dev/null +++ b/Documentation/admin-guide/mm/ksm.rst @@ -0,0 +1,189 @@ +.. _admin_guide_ksm: + +======================= +Kernel Samepage Merging +======================= + +Overview +======== + +KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, +added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation, +and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ + +KSM was originally developed for use with KVM (where it was known as +Kernel Shared Memory), to fit more virtual machines into physical memory, +by sharing the data common between them. But it can be useful to any +application which generates many instances of the same data. + +The KSM daemon ksmd periodically scans those areas of user memory +which have been registered with it, looking for pages of identical +content which can be replaced by a single write-protected page (which +is automatically copied if a process later wants to update its +content). The amount of pages that KSM daemon scans in a single pass +and the time between the passes are configured using :ref:`sysfs +intraface <ksm_sysfs>` + +KSM only merges anonymous (private) pages, never pagecache (file) pages. +KSM's merged pages were originally locked into kernel memory, but can now +be swapped out just like other user pages (but sharing is broken when they +are swapped back in: ksmd must rediscover their identity and merge again). + +Controlling KSM with madvise +============================ + +KSM only operates on those areas of address space which an application +has advised to be likely candidates for merging, by using the madvise(2) +system call:: + + int madvise(addr, length, MADV_MERGEABLE) + +The app may call + +:: + + int madvise(addr, length, MADV_UNMERGEABLE) + +to cancel that advice and restore unshared pages: whereupon KSM +unmerges whatever it merged in that range. Note: this unmerging call +may suddenly require more memory than is available - possibly failing +with EAGAIN, but more probably arousing the Out-Of-Memory killer. + +If KSM is not configured into the running kernel, madvise MADV_MERGEABLE +and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was +built with CONFIG_KSM=y, those calls will normally succeed: even if the +the KSM daemon is not currently running, MADV_MERGEABLE still registers +the range for whenever the KSM daemon is started; even if the range +cannot contain any pages which KSM could actually merge; even if +MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. + +If a region of memory must be split into at least one new MADV_MERGEABLE +or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process +will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt). + +Like other madvise calls, they are intended for use on mapped areas of +the user address space: they will report ENOMEM if the specified range +includes unmapped gaps (though working on the intervening mapped areas), +and might fail with EAGAIN if not enough memory for internal structures. + +Applications should be considerate in their use of MADV_MERGEABLE, +restricting its use to areas likely to benefit. KSM's scans may use a lot +of processing power: some installations will disable KSM for that reason. + +.. _ksm_sysfs: + +KSM daemon sysfs interface +========================== + +The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``, +readable by all but writable only by root: + +pages_to_scan + how many pages to scan before ksmd goes to sleep + e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``. + + Default: 100 (chosen for demonstration purposes) + +sleep_millisecs + how many milliseconds ksmd should sleep before next scan + e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs`` + + Default: 20 (chosen for demonstration purposes) + +merge_across_nodes + specifies if pages from different NUMA nodes can be merged. + When set to 0, ksm merges only pages which physically reside + in the memory area of same NUMA node. That brings lower + latency to access of shared pages. Systems with more nodes, at + significant NUMA distances, are likely to benefit from the + lower latency of setting 0. Smaller systems, which need to + minimize memory usage, are likely to benefit from the greater + sharing of setting 1 (default). You may wish to compare how + your system performs under each setting, before deciding on + which to use. ``merge_across_nodes`` setting can be changed only + when there are no ksm shared pages in the system: set run 2 to + unmerge pages first, then to 1 after changing + ``merge_across_nodes``, to remerge according to the new setting. + + Default: 1 (merging across nodes as in earlier releases) + +run + * set to 0 to stop ksmd from running but keep merged pages, + * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``, + * set to 2 to stop ksmd and unmerge all pages currently merged, but + leave mergeable areas registered for next run. + + Default: 0 (must be changed to 1 to activate KSM, except if + CONFIG_SYSFS is disabled) + +use_zero_pages + specifies whether empty pages (i.e. allocated pages that only + contain zeroes) should be treated specially. When set to 1, + empty pages are merged with the kernel zero page(s) instead of + with each other as it would happen normally. This can improve + the performance on architectures with coloured zero pages, + depending on the workload. Care should be taken when enabling + this setting, as it can potentially degrade the performance of + KSM for some workloads, for example if the checksums of pages + candidate for merging match the checksum of an empty + page. This setting can be changed at any time, it is only + effective for pages merged after the change. + + Default: 0 (normal KSM behaviour as in earlier releases) + +max_page_sharing + Maximum sharing allowed for each KSM page. This enforces a + deduplication limit to avoid high latency for virtual memory + operations that involve traversal of the virtual mappings that + share the KSM page. The minimum value is 2 as a newly created + KSM page will have at least two sharers. The higher this value + the faster KSM will merge the memory and the higher the + deduplication factor will be, but the slower the worst case + virtual mappings traversal could be for any given KSM + page. Slowing down this traversal means there will be higher + latency for certain virtual memory operations happening during + swapping, compaction, NUMA balancing and page migration, in + turn decreasing responsiveness for the caller of those virtual + memory operations. The scheduler latency of other tasks not + involved with the VM operations doing the virtual mappings + traversal is not affected by this parameter as these + traversals are always schedule friendly themselves. + +stable_node_chains_prune_millisecs + specifies how frequently KSM checks the metadata of the pages + that hit the deduplication limit for stale information. + Smaller milllisecs values will free up the KSM metadata with + lower latency, but they will make ksmd use more CPU during the + scan. It's a noop if not a single KSM page hit the + ``max_page_sharing`` yet. + +The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: + +pages_shared + how many shared pages are being used +pages_sharing + how many more sites are sharing them i.e. how much saved +pages_unshared + how many pages unique but repeatedly checked for merging +pages_volatile + how many pages changing too fast to be placed in a tree +full_scans + how many times all mergeable areas have been scanned +stable_node_chains + the number of KSM pages that hit the ``max_page_sharing`` limit +stable_node_dups + number of duplicated KSM pages + +A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good +sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing`` +indicates wasted effort. ``pages_volatile`` embraces several +different kinds of activity, but a high proportion there would also +indicate poor use of madvise MADV_MERGEABLE. + +The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the +``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must +be increased accordingly. + +-- +Izik Eidus, +Hugh Dickins, 17 Nov 2009 diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst new file mode 100644 index 000000000000..d78c5b315f72 --- /dev/null +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -0,0 +1,495 @@ +.. _numa_memory_policy: + +================== +NUMA Memory Policy +================== + +What is NUMA Memory Policy? +============================ + +In the Linux kernel, "memory policy" determines from which node the kernel will +allocate memory in a NUMA system or in an emulated NUMA system. Linux has +supported platforms with Non-Uniform Memory Access architectures since 2.4.?. +The current memory policy support was added to Linux 2.6 around May 2004. This +document attempts to describe the concepts and APIs of the 2.6 memory policy +support. + +Memory policies should not be confused with cpusets +(``Documentation/cgroup-v1/cpusets.txt``) +which is an administrative mechanism for restricting the nodes from which +memory may be allocated by a set of processes. Memory policies are a +programming interface that a NUMA-aware application can take advantage of. When +both cpusets and policies are applied to a task, the restrictions of the cpuset +takes priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>` +below for more details. + +Memory Policy Concepts +====================== + +Scope of Memory Policies +------------------------ + +The Linux kernel supports _scopes_ of memory policy, described here from +most general to most specific: + +System Default Policy + this policy is "hard coded" into the kernel. It is the policy + that governs all page allocations that aren't controlled by + one of the more specific policy scopes discussed below. When + the system is "up and running", the system default policy will + use "local allocation" described below. However, during boot + up, the system default policy will be set to interleave + allocations across all nodes with "sufficient" memory, so as + not to overload the initial boot node with boot-time + allocations. + +Task/Process Policy + this is an optional, per-task policy. When defined for a + specific task, this policy controls all page allocations made + by or on behalf of the task that aren't controlled by a more + specific scope. If a task does not define a task policy, then + all page allocations that would have been controlled by the + task policy "fall back" to the System Default Policy. + + The task policy applies to the entire address space of a task. Thus, + it is inheritable, and indeed is inherited, across both fork() + [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task + to establish the task policy for a child task exec()'d from an + executable image that has no awareness of memory policy. See the + :ref:`Memory Policy APIs <memory_policy_apis>` section, + below, for an overview of the system call + that a task may use to set/change its task/process policy. + + In a multi-threaded task, task policies apply only to the thread + [Linux kernel task] that installs the policy and any threads + subsequently created by that thread. Any sibling threads existing + at the time a new task policy is installed retain their current + policy. + + A task policy applies only to pages allocated after the policy is + installed. Any pages already faulted in by the task when the task + changes its task policy remain where they were allocated based on + the policy at the time they were allocated. + +.. _vma_policy: + +VMA Policy + A "VMA" or "Virtual Memory Area" refers to a range of a task's + virtual address space. A task may define a specific policy for a range + of its virtual address space. See the + :ref:`Memory Policy APIs <memory_policy_apis>` section, + below, for an overview of the mbind() system call used to set a VMA + policy. + + A VMA policy will govern the allocation of pages that back + this region of the address space. Any regions of the task's + address space that don't have an explicit VMA policy will fall + back to the task policy, which may itself fall back to the + System Default Policy. + + VMA policies have a few complicating details: + + * VMA policy applies ONLY to anonymous pages. These include + pages allocated for anonymous segments, such as the task + stack and heap, and any regions of the address space + mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is + applied to a file mapping, it will be ignored if the mapping + used the MAP_SHARED flag. If the file mapping used the + MAP_PRIVATE flag, the VMA policy will only be applied when + an anonymous page is allocated on an attempt to write to the + mapping-- i.e., at Copy-On-Write. + + * VMA policies are shared between all tasks that share a + virtual address space--a.k.a. threads--independent of when + the policy is installed; and they are inherited across + fork(). However, because VMA policies refer to a specific + region of a task's address space, and because the address + space is discarded and recreated on exec*(), VMA policies + are NOT inheritable across exec(). Thus, only NUMA-aware + applications may use VMA policies. + + * A task may install a new VMA policy on a sub-range of a + previously mmap()ed region. When this happens, Linux splits + the existing virtual memory area into 2 or 3 VMAs, each with + it's own policy. + + * By default, VMA policy applies only to pages allocated after + the policy is installed. Any pages already faulted into the + VMA range remain where they were allocated based on the + policy at the time they were allocated. However, since + 2.6.16, Linux supports page migration via the mbind() system + call, so that page contents can be moved to match a newly + installed policy. + +Shared Policy + Conceptually, shared policies apply to "memory objects" mapped + shared into one or more tasks' distinct address spaces. An + application installs shared policies the same way as VMA + policies--using the mbind() system call specifying a range of + virtual addresses that map the shared object. However, unlike + VMA policies, which can be considered to be an attribute of a + range of a task's address space, shared policies apply + directly to the shared object. Thus, all tasks that attach to + the object share the policy, and all pages allocated for the + shared object, by any task, will obey the shared policy. + + As of 2.6.22, only shared memory segments, created by shmget() or + mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared + policy support was added to Linux, the associated data structures were + added to hugetlbfs shmem segments. At the time, hugetlbfs did not + support allocation at fault time--a.k.a lazy allocation--so hugetlbfs + shmem segments were never "hooked up" to the shared policy support. + Although hugetlbfs segments now support lazy allocation, their support + for shared policy has not been completed. + + As mentioned above in :ref:`VMA policies <vma_policy>` section, + allocations of page cache pages for regular files mmap()ed + with MAP_SHARED ignore any VMA policy installed on the virtual + address range backed by the shared file mapping. Rather, + shared page cache pages, including pages backing private + mappings that have not yet been written by the task, follow + task policy, if any, else System Default Policy. + + The shared policy infrastructure supports different policies on subset + ranges of the shared object. However, Linux still splits the VMA of + the task that installs the policy for each range of distinct policy. + Thus, different tasks that attach to a shared memory segment can have + different VMA configurations mapping that one shared object. This + can be seen by examining the /proc/<pid>/numa_maps of tasks sharing + a shared memory region, when one task has installed shared policy on + one or more ranges of the region. + +Components of Memory Policies +----------------------------- + +A NUMA memory policy consists of a "mode", optional mode flags, and +an optional set of nodes. The mode determines the behavior of the +policy, the optional mode flags determine the behavior of the mode, +and the optional set of nodes can be viewed as the arguments to the +policy behavior. + +Internally, memory policies are implemented by a reference counted +structure, struct mempolicy. Details of this structure will be +discussed in context, below, as required to explain the behavior. + +NUMA memory policy supports the following 4 behavioral modes: + +Default Mode--MPOL_DEFAULT + This mode is only used in the memory policy APIs. Internally, + MPOL_DEFAULT is converted to the NULL memory policy in all + policy scopes. Any existing non-default policy will simply be + removed when MPOL_DEFAULT is specified. As a result, + MPOL_DEFAULT means "fall back to the next most specific policy + scope." + + For example, a NULL or default task policy will fall back to the + system default policy. A NULL or default vma policy will fall + back to the task policy. + + When specified in one of the memory policy APIs, the Default mode + does not use the optional set of nodes. + + It is an error for the set of nodes specified for this policy to + be non-empty. + +MPOL_BIND + This mode specifies that memory must come from the set of + nodes specified by the policy. Memory will be allocated from + the node in the set with sufficient free memory that is + closest to the node where the allocation takes place. + +MPOL_PREFERRED + This mode specifies that the allocation should be attempted + from the single node specified in the policy. If that + allocation fails, the kernel will search other nodes, in order + of increasing distance from the preferred node based on + information provided by the platform firmware. + + Internally, the Preferred policy uses a single node--the + preferred_node member of struct mempolicy. When the internal + mode flag MPOL_F_LOCAL is set, the preferred_node is ignored + and the policy is interpreted as local allocation. "Local" + allocation policy can be viewed as a Preferred policy that + starts at the node containing the cpu where the allocation + takes place. + + It is possible for the user to specify that local allocation + is always preferred by passing an empty nodemask with this + mode. If an empty nodemask is passed, the policy cannot use + the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags + described below. + +MPOL_INTERLEAVED + This mode specifies that page allocations be interleaved, on a + page granularity, across the nodes specified in the policy. + This mode also behaves slightly differently, based on the + context where it is used: + + For allocation of anonymous pages and shared memory pages, + Interleave mode indexes the set of nodes specified by the + policy using the page offset of the faulting address into the + segment [VMA] containing the address modulo the number of + nodes specified by the policy. It then attempts to allocate a + page, starting at the selected node, as if the node had been + specified by a Preferred policy or had been selected by a + local allocation. That is, allocation will follow the per + node zonelist. + + For allocation of page cache pages, Interleave mode indexes + the set of nodes specified by the policy using a node counter + maintained per task. This counter wraps around to the lowest + specified node after it reaches the highest specified node. + This will tend to spread the pages out over the nodes + specified by the policy based on the order in which they are + allocated, rather than based on any page offset into an + address range or file. During system boot up, the temporary + interleaved system default policy works in this mode. + +NUMA memory policy supports the following optional mode flags: + +MPOL_F_STATIC_NODES + This flag specifies that the nodemask passed by + the user should not be remapped if the task or VMA's set of allowed + nodes changes after the memory policy has been defined. + + Without this flag, any time a mempolicy is rebound because of a + change in the set of allowed nodes, the node (Preferred) or + nodemask (Bind, Interleave) is remapped to the new set of + allowed nodes. This may result in nodes being used that were + previously undesired. + + With this flag, if the user-specified nodes overlap with the + nodes allowed by the task's cpuset, then the memory policy is + applied to their intersection. If the two sets of nodes do not + overlap, the Default policy is used. + + For example, consider a task that is attached to a cpuset with + mems 1-3 that sets an Interleave policy over the same set. If + the cpuset's mems change to 3-5, the Interleave will now occur + over nodes 3, 4, and 5. With this flag, however, since only node + 3 is allowed from the user's nodemask, the "interleave" only + occurs over that node. If no nodes from the user's nodemask are + now allowed, the Default behavior is used. + + MPOL_F_STATIC_NODES cannot be combined with the + MPOL_F_RELATIVE_NODES flag. It also cannot be used for + MPOL_PREFERRED policies that were created with an empty nodemask + (local allocation). + +MPOL_F_RELATIVE_NODES + This flag specifies that the nodemask passed + by the user will be mapped relative to the set of the task or VMA's + set of allowed nodes. The kernel stores the user-passed nodemask, + and if the allowed nodes changes, then that original nodemask will + be remapped relative to the new set of allowed nodes. + + Without this flag (and without MPOL_F_STATIC_NODES), anytime a + mempolicy is rebound because of a change in the set of allowed + nodes, the node (Preferred) or nodemask (Bind, Interleave) is + remapped to the new set of allowed nodes. That remap may not + preserve the relative nature of the user's passed nodemask to its + set of allowed nodes upon successive rebinds: a nodemask of + 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of + allowed nodes is restored to its original state. + + With this flag, the remap is done so that the node numbers from + the user's passed nodemask are relative to the set of allowed + nodes. In other words, if nodes 0, 2, and 4 are set in the user's + nodemask, the policy will be effected over the first (and in the + Bind or Interleave case, the third and fifth) nodes in the set of + allowed nodes. The nodemask passed by the user represents nodes + relative to task or VMA's set of allowed nodes. + + If the user's nodemask includes nodes that are outside the range + of the new set of allowed nodes (for example, node 5 is set in + the user's nodemask when the set of allowed nodes is only 0-3), + then the remap wraps around to the beginning of the nodemask and, + if not already set, sets the node in the mempolicy nodemask. + + For example, consider a task that is attached to a cpuset with + mems 2-5 that sets an Interleave policy over the same set with + MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the + interleave now occurs over nodes 3,5-7. If the cpuset's mems + then change to 0,2-3,5, then the interleave occurs over nodes + 0,2-3,5. + + Thanks to the consistent remapping, applications preparing + nodemasks to specify memory policies using this flag should + disregard their current, actual cpuset imposed memory placement + and prepare the nodemask as if they were always located on + memory nodes 0 to N-1, where N is the number of memory nodes the + policy is intended to manage. Let the kernel then remap to the + set of memory nodes allowed by the task's cpuset, as that may + change over time. + + MPOL_F_RELATIVE_NODES cannot be combined with the + MPOL_F_STATIC_NODES flag. It also cannot be used for + MPOL_PREFERRED policies that were created with an empty nodemask + (local allocation). + +Memory Policy Reference Counting +================================ + +To resolve use/free races, struct mempolicy contains an atomic reference +count field. Internal interfaces, mpol_get()/mpol_put() increment and +decrement this reference count, respectively. mpol_put() will only free +the structure back to the mempolicy kmem cache when the reference count +goes to zero. + +When a new memory policy is allocated, its reference count is initialized +to '1', representing the reference held by the task that is installing the +new policy. When a pointer to a memory policy structure is stored in another +structure, another reference is added, as the task's reference will be dropped +on completion of the policy installation. + +During run-time "usage" of the policy, we attempt to minimize atomic operations +on the reference count, as this can lead to cache lines bouncing between cpus +and NUMA nodes. "Usage" here means one of the following: + +1) querying of the policy, either by the task itself [using the get_mempolicy() + API discussed below] or by another task using the /proc/<pid>/numa_maps + interface. + +2) examination of the policy to determine the policy mode and associated node + or node lists, if any, for page allocation. This is considered a "hot + path". Note that for MPOL_BIND, the "usage" extends across the entire + allocation process, which may sleep during page reclaimation, because the + BIND policy nodemask is used, by reference, to filter ineligible nodes. + +We can avoid taking an extra reference during the usages listed above as +follows: + +1) we never need to get/free the system default policy as this is never + changed nor freed, once the system is up and running. + +2) for querying the policy, we do not need to take an extra reference on the + target task's task policy nor vma policies because we always acquire the + task's mm's mmap_sem for read during the query. The set_mempolicy() and + mbind() APIs [see below] always acquire the mmap_sem for write when + installing or replacing task or vma policies. Thus, there is no possibility + of a task or thread freeing a policy while another task or thread is + querying it. + +3) Page allocation usage of task or vma policy occurs in the fault path where + we hold them mmap_sem for read. Again, because replacing the task or vma + policy requires that the mmap_sem be held for write, the policy can't be + freed out from under us while we're using it for page allocation. + +4) Shared policies require special consideration. One task can replace a + shared memory policy while another task, with a distinct mmap_sem, is + querying or allocating a page based on the policy. To resolve this + potential race, the shared policy infrastructure adds an extra reference + to the shared policy during lookup while holding a spin lock on the shared + policy management structure. This requires that we drop this extra + reference when we're finished "using" the policy. We must drop the + extra reference on shared policies in the same query/allocation paths + used for non-shared policies. For this reason, shared policies are marked + as such, and the extra reference is dropped "conditionally"--i.e., only + for shared policies. + + Because of this extra reference counting, and because we must lookup + shared policies in a tree structure under spinlock, shared policies are + more expensive to use in the page allocation path. This is especially + true for shared policies on shared memory regions shared by tasks running + on different NUMA nodes. This extra overhead can be avoided by always + falling back to task or system default policy for shared memory regions, + or by prefaulting the entire shared memory region into memory and locking + it down. However, this might not be appropriate for all applications. + +.. _memory_policy_apis: + +Memory Policy APIs +================== + +Linux supports 3 system calls for controlling memory policy. These APIS +always affect only the calling task, the calling task's address space, or +some shared object mapped into the calling task's address space. + +.. note:: + the headers that define these APIs and the parameter data types for + user space applications reside in a package that is not part of the + Linux kernel. The kernel system call interfaces, with the 'sys\_' + prefix, are defined in <linux/syscalls.h>; the mode and flag + definitions are defined in <linux/mempolicy.h>. + +Set [Task] Memory Policy:: + + long set_mempolicy(int mode, const unsigned long *nmask, + unsigned long maxnode); + +Set's the calling task's "task/process memory policy" to mode +specified by the 'mode' argument and the set of nodes defined by +'nmask'. 'nmask' points to a bit mask of node ids containing at least +'maxnode' ids. Optional mode flags may be passed by combining the +'mode' argument with the flag (for example: MPOL_INTERLEAVE | +MPOL_F_STATIC_NODES). + +See the set_mempolicy(2) man page for more details + + +Get [Task] Memory Policy or Related Information:: + + long get_mempolicy(int *mode, + const unsigned long *nmask, unsigned long maxnode, + void *addr, int flags); + +Queries the "task/process memory policy" of the calling task, or the +policy or location of a specified virtual address, depending on the +'flags' argument. + +See the get_mempolicy(2) man page for more details + + +Install VMA/Shared Policy for a Range of Task's Address Space:: + + long mbind(void *start, unsigned long len, int mode, + const unsigned long *nmask, unsigned long maxnode, + unsigned flags); + +mbind() installs the policy specified by (mode, nmask, maxnodes) as a +VMA policy for the range of the calling task's address space specified +by the 'start' and 'len' arguments. Additional actions may be +requested via the 'flags' argument. + +See the mbind(2) man page for more details. + +Memory Policy Command Line Interface +==================================== + +Although not strictly part of the Linux implementation of memory policy, +a command line tool, numactl(8), exists that allows one to: + ++ set the task policy for a specified program via set_mempolicy(2), fork(2) and + exec(2) + ++ set the shared policy for a shared memory segment via mbind(2) + +The numactl(8) tool is packaged with the run-time version of the library +containing the memory policy system call wrappers. Some distributions +package the headers and compile-time libraries in a separate development +package. + +.. _mem_pol_and_cpusets: + +Memory Policies and cpusets +=========================== + +Memory policies work within cpusets as described above. For memory policies +that require a node or set of nodes, the nodes are restricted to the set of +nodes whose memories are allowed by the cpuset constraints. If the nodemask +specified for the policy contains nodes that are not allowed by the cpuset and +MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes +specified for the policy and the set of nodes with memory is used. If the +result is the empty set, the policy is considered invalid and cannot be +installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped +onto and folded into the task's set of allowed nodes as previously described. + +The interaction of memory policies and cpusets can be problematic when tasks +in two cpusets share access to a memory region, such as shared memory segments +created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and +any of the tasks install shared policy on the region, only nodes whose +memories are allowed in both cpusets may be used in the policies. Obtaining +this information requires "stepping outside" the memory policy APIs to use the +cpuset information and requires that one know in what cpusets other task might +be attaching to the shared region. Furthermore, if the cpusets' allowed +memory sets are disjoint, "local" allocation is the only valid policy. diff --git a/Documentation/vm/pagemap.txt b/Documentation/admin-guide/mm/pagemap.rst index eafcefa15261..577af85beb41 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -1,21 +1,25 @@ -pagemap, from the userspace perspective ---------------------------------------- +.. _pagemap: + +============================= +Examining Process Page Tables +============================= pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow userspace programs to examine the page tables and related information by -reading files in /proc. +reading files in ``/proc``. There are four components to pagemap: - * /proc/pid/pagemap. This file lets a userspace process find out which + * ``/proc/pid/pagemap``. This file lets a userspace process find out which physical frame each virtual page is mapped to. It contains one 64-bit value for each virtual page, containing the following data (from - fs/proc/task_mmu.c, above pagemap_read): + ``fs/proc/task_mmu.c``, above pagemap_read): * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped - * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt) + * Bit 55 pte is soft-dirty (see + :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`) * Bit 56 page exclusively mapped (since 4.2) * Bits 57-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) @@ -33,28 +37,28 @@ There are four components to pagemap: precisely which pages are mapped (or in swap) and comparing mapped pages between processes. - Efficient users of this interface will use /proc/pid/maps to + Efficient users of this interface will use ``/proc/pid/maps`` to determine which areas of memory are actually mapped and llseek to skip over unmapped regions. - * /proc/kpagecount. This file contains a 64-bit count of the number of + * ``/proc/kpagecount``. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. - * /proc/kpageflags. This file contains a 64-bit set of flags for each + * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each page, indexed by PFN. - The flags are (from fs/proc/page.c, above kpageflags_read): - - 0. LOCKED - 1. ERROR - 2. REFERENCED - 3. UPTODATE - 4. DIRTY - 5. LRU - 6. ACTIVE - 7. SLAB - 8. WRITEBACK - 9. RECLAIM + The flags are (from ``fs/proc/page.c``, above kpageflags_read): + + 0. LOCKED + 1. ERROR + 2. REFERENCED + 3. UPTODATE + 4. DIRTY + 5. LRU + 6. ACTIVE + 7. SLAB + 8. WRITEBACK + 9. RECLAIM 10. BUDDY 11. MMAP 12. ANON @@ -72,98 +76,111 @@ There are four components to pagemap: 24. ZERO_PAGE 25. IDLE - * /proc/kpagecgroup. This file contains a 64-bit inode number of the + * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. -Short descriptions to the page flags: - - 0. LOCKED - page is being locked for exclusive access, eg. by undergoing read/write IO - - 7. SLAB - page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator - When compound page is used, SLUB/SLQB will only set this flag on the head - page; SLOB will not flag it at all. +Short descriptions to the page flags +==================================== -10. BUDDY +0 - LOCKED + page is being locked for exclusive access, e.g. by undergoing read/write IO +7 - SLAB + page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + When compound page is used, SLUB/SLQB will only set this flag on the head + page; SLOB will not flag it at all. +10 - BUDDY a free memory block managed by the buddy system allocator The buddy system organizes free memory in blocks of various orders. An order N block has 2^N physically contiguous pages, with the BUDDY flag set for and _only_ for the first page. - -15. COMPOUND_HEAD -16. COMPOUND_TAIL +15 - COMPOUND_HEAD A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc. - memory allocators and various device drivers. However in this interface, - only huge/giga pages are made visible to end users. -17. HUGE + pages are hugeTLB pages + (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`), + the SLUB etc. memory allocators and various device drivers. + However in this interface, only huge/giga pages are made visible + to end users. +16 - COMPOUND_TAIL + A compound page tail (see description above). +17 - HUGE this is an integral part of a HugeTLB page - -19. HWPOISON +19 - HWPOISON hardware detected memory corruption on this page: don't touch the data! - -20. NOPAGE +20 - NOPAGE no page frame exists at the requested address - -21. KSM +21 - KSM identical memory pages dynamically shared between one or more processes - -22. THP +22 - THP contiguous pages which construct transparent hugepages - -23. BALLOON +23 - BALLOON balloon compaction page - -24. ZERO_PAGE +24 - ZERO_PAGE zero page for pfn_zero or huge_zero page - -25. IDLE +25 - IDLE page has not been accessed since it was marked idle (see - Documentation/vm/idle_page_tracking.txt). Note that this flag may be - stale in case the page was accessed via a PTE. To make sure the flag - is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first. - - [IO related page flags] - 1. ERROR IO error occurred - 3. UPTODATE page has up-to-date data - ie. for file backed page: (in-memory data revision >= on-disk one) - 4. DIRTY page has been written to, hence contains new data - ie. for file backed page: (in-memory data revision > on-disk one) - 8. WRITEBACK page is being synced to disk - - [LRU related page flags] - 5. LRU page is in one of the LRU lists - 6. ACTIVE page is in the active LRU list -18. UNEVICTABLE page is in the unevictable (non-)LRU list - It is somehow pinned and not a candidate for LRU page reclaims, - eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments - 2. REFERENCED page has been referenced since last LRU list enqueue/requeue - 9. RECLAIM page will be reclaimed soon after its pageout IO completed -11. MMAP a memory mapped page -12. ANON a memory mapped page that is not part of a file -13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry -14. SWAPBACKED page is backed by swap/RAM + :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`). + Note that this flag may be stale in case the page was accessed via + a PTE. To make sure the flag is up-to-date one has to read + ``/sys/kernel/mm/page_idle/bitmap`` first. + +IO related page flags +--------------------- + +1 - ERROR + IO error occurred +3 - UPTODATE + page has up-to-date data + ie. for file backed page: (in-memory data revision >= on-disk one) +4 - DIRTY + page has been written to, hence contains new data + i.e. for file backed page: (in-memory data revision > on-disk one) +8 - WRITEBACK + page is being synced to disk + +LRU related page flags +---------------------- + +5 - LRU + page is in one of the LRU lists +6 - ACTIVE + page is in the active LRU list +18 - UNEVICTABLE + page is in the unevictable (non-)LRU list It is somehow pinned and + not a candidate for LRU page reclaims, e.g. ramfs pages, + shmctl(SHM_LOCK) and mlock() memory segments +2 - REFERENCED + page has been referenced since last LRU list enqueue/requeue +9 - RECLAIM + page will be reclaimed soon after its pageout IO completed +11 - MMAP + a memory mapped page +12 - ANON + a memory mapped page that is not part of a file +13 - SWAPCACHE + page is mapped to swap space, i.e. has an associated swap entry +14 - SWAPBACKED + page is backed by swap/RAM The page-types tool in the tools/vm directory can be used to query the above flags. -Using pagemap to do something useful: +Using pagemap to do something useful +==================================== The general procedure for using pagemap to find out about a process' memory usage goes like this: - 1. Read /proc/pid/maps to determine which parts of the memory space are + 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are mapped to what. 2. Select the maps you are interested in -- all of them, or a particular library, or the stack or the heap, etc. - 3. Open /proc/pid/pagemap and seek to the pages you would like to examine. + 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. 4. Read a u64 for each page from pagemap. - 5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just - read, seek to that entry in the file, and read the data you want. + 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you + just read, seek to that entry in the file, and read the data you want. For example, to find the "unique set size" (USS), which is the amount of memory that a process is using that is not shared with any other process, @@ -171,7 +188,8 @@ you can go through every map in the process, find the PFNs, look those up in kpagecount, and tally up the number of pages that are only referenced once. -Other notes: +Other notes +=========== Reading from any of the files will return -EINVAL if you are not starting the read on an 8-byte boundary (e.g., if you sought an odd number of bytes diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/admin-guide/mm/soft-dirty.rst index 55684d11a1e8..cb0cfd6672fa 100644 --- a/Documentation/vm/soft-dirty.txt +++ b/Documentation/admin-guide/mm/soft-dirty.rst @@ -1,34 +1,38 @@ - SOFT-DIRTY PTEs +.. _soft_dirty: - The soft-dirty is a bit on a PTE which helps to track which pages a task +=============== +Soft-Dirty PTEs +=============== + +The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from the task's PTEs. - This is done by writing "4" into the /proc/PID/clear_refs file of the + This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the task in question. 2. Wait some time. 3. Read soft-dirty bits from the PTEs. - This is done by reading from the /proc/PID/pagemap. The bit 55 of the + This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the 64-bit qword is the soft-dirty one. If set, the respective PTE was written to since step 1. - Internally, to do this tracking, the writable bit is cleared from PTEs +Internally, to do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is cleared. So, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. - Note, that although all the task's address space is marked as r/o after the +Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts both writable and soft-dirty bits on the PTE. - While in most cases tracking memory changes by #PF-s is more than enough +While in most cases tracking memory changes by #PF-s is more than enough there is still a scenario when we can lose soft dirty bits -- a task unmaps a previously mapped memory region and then maps a new one at exactly the same place. When unmap is called, the kernel internally clears PTE values @@ -36,7 +40,7 @@ including soft dirty bits. To notify user space application about such memory region renewal the kernel always marks new memory regions (and expanded regions) as soft dirty. - This feature is actively used by the checkpoint-restore project. You +This feature is actively used by the checkpoint-restore project. You can find more details about it on http://criu.org diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst new file mode 100644 index 000000000000..7ab93a8404b9 --- /dev/null +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -0,0 +1,418 @@ +.. _admin_guide_transhuge: + +============================ +Transparent Hugepage Support +============================ + +Objective +========= + +Performance critical computing applications dealing with large memory +working sets are already running on top of libhugetlbfs and in turn +hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of +using huge pages for the backing of virtual memory with huge pages +that supports the automatic promotion and demotion of page sizes and +without the shortcomings of hugetlbfs. + +Currently THP only works for anonymous memory mappings and tmpfs/shmem. +But in the future it can expand to other filesystems. + +.. note:: + in the examples below we presume that the basic page size is 4K and + the huge page size is 2M, although the actual numbers may vary + depending on the CPU architecture. + +The reason applications are running faster is because of two +factors. The first factor is almost completely irrelevant and it's not +of significant interest because it'll also have the downside of +requiring larger clear-page copy-page in page faults which is a +potentially negative effect. The first factor consists in taking a +single page fault for each 2M virtual region touched by userland (so +reducing the enter/exit kernel frequency by a 512 times factor). This +only matters the first time the memory is accessed for the lifetime of +a memory mapping. The second long lasting and much more important +factor will affect all subsequent accesses to the memory for the whole +runtime of the application. The second factor consist of two +components: + +1) the TLB miss will run faster (especially with virtualization using + nested pagetables but almost always also on bare metal without + virtualization) + +2) a single TLB entry will be mapping a much larger amount of virtual + memory in turn reducing the number of TLB misses. With + virtualization and nested pagetables the TLB can be mapped of + larger size only if both KVM and the Linux guest are using + hugepages but a significant speedup already happens if only one of + the two is using hugepages just because of the fact the TLB miss is + going to run faster. + +THP can be enabled system wide or restricted to certain tasks or even +memory ranges inside task's address space. Unless THP is completely +disabled, there is ``khugepaged`` daemon that scans memory and +collapses sequences of basic pages into huge pages. + +The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>` +interface and using madivse(2) and prctl(2) system calls. + +Transparent Hugepage Support maximizes the usefulness of free memory +if compared to the reservation approach of hugetlbfs by allowing all +unused memory to be used as cache or other movable (or even unmovable +entities). It doesn't require reservation to prevent hugepage +allocation failures to be noticeable from userland. It allows paging +and all other advanced VM features to be available on the +hugepages. It requires no modifications for applications to take +advantage of it. + +Applications however can be further optimized to take advantage of +this feature, like for example they've been optimized before to avoid +a flood of mmap system calls for every malloc(4k). Optimizing userland +is by far not mandatory and khugepaged already can take care of long +lived page allocations even for hugepage unaware applications that +deals with large amounts of memory. + +In certain cases when hugepages are enabled system wide, application +may end up allocating more memory resources. An application may mmap a +large region but only touch 1 byte of it, in that case a 2M page might +be allocated instead of a 4k page for no good. This is why it's +possible to disable hugepages system-wide and to only have them inside +MADV_HUGEPAGE madvise regions. + +Embedded systems should enable hugepages only inside madvise regions +to eliminate any risk of wasting any precious byte of memory and to +only run faster. + +Applications that gets a lot of benefit from hugepages and that don't +risk to lose memory by using hugepages, should use +madvise(MADV_HUGEPAGE) on their critical mmapped regions. + +.. _thp_sysfs: + +sysfs +===== + +Global THP controls +------------------- + +Transparent Hugepage Support for anonymous memory can be entirely disabled +(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE +regions (to avoid the risk of consuming more memory resources) or enabled +system wide. This can be achieved with one of:: + + echo always >/sys/kernel/mm/transparent_hugepage/enabled + echo madvise >/sys/kernel/mm/transparent_hugepage/enabled + echo never >/sys/kernel/mm/transparent_hugepage/enabled + +It's also possible to limit defrag efforts in the VM to generate +anonymous hugepages in case they're not immediately free to madvise +regions or to never try to defrag memory and simply fallback to regular +pages unless hugepages are immediately available. Clearly if we spend CPU +time to defrag memory, we would expect to gain even more by the fact we +use hugepages later instead of regular pages. This isn't always +guaranteed, but it may be more likely in case the allocation is for a +MADV_HUGEPAGE region. + +:: + + echo always >/sys/kernel/mm/transparent_hugepage/defrag + echo defer >/sys/kernel/mm/transparent_hugepage/defrag + echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo never >/sys/kernel/mm/transparent_hugepage/defrag + +always + means that an application requesting THP will stall on + allocation failure and directly reclaim pages and compact + memory in an effort to allocate a THP immediately. This may be + desirable for virtual machines that benefit heavily from THP + use and are willing to delay the VM start to utilise them. + +defer + means that an application will wake kswapd in the background + to reclaim pages and wake kcompactd to compact memory so that + THP is available in the near future. It's the responsibility + of khugepaged to then install the THP pages later. + +defer+madvise + will enter direct reclaim and compaction like ``always``, but + only for regions that have used madvise(MADV_HUGEPAGE); all + other regions will wake kswapd in the background to reclaim + pages and wake kcompactd to compact memory so that THP is + available in the near future. + +madvise + will enter direct reclaim like ``always`` but only for regions + that are have used madvise(MADV_HUGEPAGE). This is the default + behaviour. + +never + should be self-explanatory. + +By default kernel tries to use huge zero page on read page fault to +anonymous mapping. It's possible to disable huge zero page by writing 0 +or enable it back by writing 1:: + + echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page + echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page + +Some userspace (such as a test program, or an optimized memory allocation +library) may want to know the size (in bytes) of a transparent hugepage:: + + cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size + +khugepaged will be automatically started when +transparent_hugepage/enabled is set to "always" or "madvise, and it'll +be automatically shutdown if it's set to "never". + +Khugepaged controls +------------------- + +khugepaged runs usually at low frequency so while one may not want to +invoke defrag algorithms synchronously during the page faults, it +should be worth invoking defrag at least in khugepaged. However it's +also possible to disable defrag in khugepaged by writing 0 or enable +defrag in khugepaged by writing 1:: + + echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag + echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag + +You can also control how many pages khugepaged should scan at each +pass:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan + +and how many milliseconds to wait in khugepaged between each pass (you +can set this to 0 to run khugepaged at 100% utilization of one core):: + + /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs + +and how many milliseconds to wait in khugepaged if there's an hugepage +allocation failure to throttle the next allocation attempt:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs + +The khugepaged progress can be seen in the number of pages collapsed:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed + +for each pass:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans + +``max_ptes_none`` specifies how many extra small pages (that are +not already mapped) can be allocated when collapsing a group +of small pages into one large page:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none + +A higher value leads to use additional memory for programs. +A lower value leads to gain less thp performance. Value of +max_ptes_none can waste cpu time very little, you can +ignore it. + +``max_ptes_swap`` specifies how many pages can be brought in from +swap when collapsing a group of pages into a transparent huge page:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap + +A higher value can cause excessive swap IO and waste +memory. A lower value can prevent THPs from being +collapsed, resulting fewer pages being collapsed into +THPs, and lower memory access performance. + +Boot parameter +============== + +You can change the sysfs boot time defaults of Transparent Hugepage +Support by passing the parameter ``transparent_hugepage=always`` or +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` +to the kernel command line. + +Hugepages in tmpfs/shmem +======================== + +You can control hugepage allocation policy in tmpfs with mount option +``huge=``. It can have following values: + +always + Attempt to allocate huge pages every time we need a new page; + +never + Do not allocate huge pages; + +within_size + Only allocate huge page if it will be fully within i_size. + Also respect fadvise()/madvise() hints; + +advise + Only allocate huge pages if requested with fadvise()/madvise(); + +The default policy is ``never``. + +``mount -o remount,huge= /mountpoint`` works fine after mount: remounting +``huge=never`` will not attempt to break up huge pages at all, just stop more +from being allocated. + +There's also sysfs knob to control hugepage allocation policy for internal +shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount +is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or +MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. + +In addition to policies listed above, shmem_enabled allows two further +values: + +deny + For use in emergencies, to force the huge option off from + all mounts; +force + Force the huge option on for all - very useful for testing; + +Need of application restart +=========================== + +The transparent_hugepage/enabled values and tmpfs mount option only affect +future behavior. So to make them effective you need to restart any +application that could have been using hugepages. This also applies to the +regions registered in khugepaged. + +Monitoring usage +================ + +The number of anonymous transparent huge pages currently used by the +system is available by reading the AnonHugePages field in ``/proc/meminfo``. +To identify what applications are using anonymous transparent huge pages, +it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields +for each mapping. + +The number of file transparent huge pages mapped to userspace is available +by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. +To identify what applications are mapping file transparent huge pages, it +is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields +for each mapping. + +Note that reading the smaps file is expensive and reading it +frequently will incur overhead. + +There are a number of counters in ``/proc/vmstat`` that may be used to +monitor how successfully the system is providing huge pages for use. + +thp_fault_alloc + is incremented every time a huge page is successfully + allocated to handle a page fault. This applies to both the + first time a page is faulted and for COW faults. + +thp_collapse_alloc + is incremented by khugepaged when it has found + a range of pages to collapse into one huge page and has + successfully allocated a new huge page to store the data. + +thp_fault_fallback + is incremented if a page fault fails to allocate + a huge page and instead falls back to using small pages. + +thp_collapse_alloc_failed + is incremented if khugepaged found a range + of pages that should be collapsed into one huge page but failed + the allocation. + +thp_file_alloc + is incremented every time a file huge page is successfully + allocated. + +thp_file_mapped + is incremented every time a file huge page is mapped into + user address space. + +thp_split_page + is incremented every time a huge page is split into base + pages. This can happen for a variety of reasons but a common + reason is that a huge page is old and is being reclaimed. + This action implies splitting all PMD the page mapped with. + +thp_split_page_failed + is incremented if kernel fails to split huge + page. This can happen if the page was pinned by somebody. + +thp_deferred_split_page + is incremented when a huge page is put onto split + queue. This happens when a huge page is partially unmapped and + splitting it would free up some memory. Pages on split queue are + going to be split under memory pressure. + +thp_split_pmd + is incremented every time a PMD split into table of PTEs. + This can happen, for instance, when application calls mprotect() or + munmap() on part of huge page. It doesn't split huge page, only + page table entry. + +thp_zero_page_alloc + is incremented every time a huge zero page is + successfully allocated. It includes allocations which where + dropped due race with other allocation. Note, it doesn't count + every map of the huge zero page, only its allocation. + +thp_zero_page_alloc_failed + is incremented if kernel fails to allocate + huge zero page and falls back to using small pages. + +thp_swpout + is incremented every time a huge page is swapout in one + piece without splitting. + +thp_swpout_fallback + is incremented if a huge page has to be split before swapout. + Usually because failed to allocate some continuous swap space + for the huge page. + +As the system ages, allocating huge pages may be expensive as the +system uses memory compaction to copy data around memory to free a +huge page for use. There are some counters in ``/proc/vmstat`` to help +monitor this overhead. + +compact_stall + is incremented every time a process stalls to run + memory compaction so that a huge page is free for use. + +compact_success + is incremented if the system compacted memory and + freed a huge page for use. + +compact_fail + is incremented if the system tries to compact memory + but failed. + +compact_pages_moved + is incremented each time a page is moved. If + this value is increasing rapidly, it implies that the system + is copying a lot of data to satisfy the huge page allocation. + It is possible that the cost of copying exceeds any savings + from reduced TLB misses. + +compact_pagemigrate_failed + is incremented when the underlying mechanism + for moving a page failed. + +compact_blocks_moved + is incremented each time memory compaction examines + a huge page aligned range of pages. + +It is possible to establish how long the stalls were using the function +tracer to record how long was spent in __alloc_pages_nodemask and +using the mm_page_alloc tracepoint to identify which allocations were +for huge pages. + +Optimizing the applications +=========================== + +To be guaranteed that the kernel will map a 2M page immediately in any +memory region, the mmap region has to be hugepage naturally +aligned. posix_memalign() can provide that guarantee. + +Hugetlbfs +========= + +You can use hugetlbfs on a kernel that has transparent hugepage +support enabled just fine as always. No difference can be noted in +hugetlbfs other than there will be less overall fragmentation. All +usual features belonging to hugetlbfs are preserved and +unaffected. libhugetlbfs will also work fine as usual. diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/admin-guide/mm/userfaultfd.rst index bb2f945f87ab..5048cf661a8a 100644 --- a/Documentation/vm/userfaultfd.txt +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -1,6 +1,11 @@ -= Userfaultfd = +.. _userfaultfd: -== Objective == +=========== +Userfaultfd +=========== + +Objective +========= Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various @@ -9,7 +14,8 @@ memory page faults, something otherwise only the kernel code could do. For example userfaults allows a proper and more optimal implementation of the PROT_NONE+SIGSEGV trick. -== Design == +Design +====== Userfaults are delivered and resolved through the userfaultfd syscall. @@ -41,7 +47,8 @@ different processes without them being aware about what is going on themselves on the same region the manager is already tracking, which is a corner case that would currently return -EBUSY). -== API == +API +=== When first opened the userfaultfd must be enabled invoking the UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or @@ -101,7 +108,8 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an half copied page since it'll keep userfaulting until the copy has finished. -== QEMU/KVM == +QEMU/KVM +======== QEMU/KVM is using the userfaultfd syscall to implement postcopy live migration. Postcopy live migration is one form of memory @@ -163,7 +171,8 @@ sending the same page twice (in case the userfault is read by the postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration thread). -== Non-cooperative userfaultfd == +Non-cooperative userfaultfd +=========================== When the userfaultfd is monitored by an external manager, the manager must be able to track changes in the process virtual memory @@ -172,27 +181,30 @@ the same read(2) protocol as for the page fault notifications. The manager has to explicitly enable these events by setting appropriate bits in uffdio_api.features passed to UFFDIO_API ioctl: -UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When -this feature is enabled, the userfaultfd context of the parent process -is duplicated into the newly created process. The manager receives -UFFD_EVENT_FORK with file descriptor of the new userfaultfd context in -the uffd_msg.fork. - -UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap() -calls. When the non-cooperative process moves a virtual memory area to -a different location, the manager will receive UFFD_EVENT_REMAP. The -uffd_msg.remap will contain the old and new addresses of the area and -its original length. - -UFFD_FEATURE_EVENT_REMOVE - enable notifications about -madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event -UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The -uffd_msg.remove will contain start and end addresses of the removed -area. - -UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory -unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove -containing start and end addresses of the unmapped area. +UFFD_FEATURE_EVENT_FORK + enable userfaultfd hooks for fork(). When this feature is + enabled, the userfaultfd context of the parent process is + duplicated into the newly created process. The manager + receives UFFD_EVENT_FORK with file descriptor of the new + userfaultfd context in the uffd_msg.fork. + +UFFD_FEATURE_EVENT_REMAP + enable notifications about mremap() calls. When the + non-cooperative process moves a virtual memory area to a + different location, the manager will receive + UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and + new addresses of the area and its original length. + +UFFD_FEATURE_EVENT_REMOVE + enable notifications about madvise(MADV_REMOVE) and + madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will + be generated upon these calls to madvise. The uffd_msg.remove + will contain start and end addresses of the removed area. + +UFFD_FEATURE_EVENT_UNMAP + enable notifications about memory unmapping. The manager will + get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and + end addresses of the unmapped area. Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP are pretty similar, they quite differ in the action expected from the diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index d2b6fda3d67b..ab2fe0eda1d7 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -145,7 +145,7 @@ feature enabled.] In this mode ``intel_pstate`` registers utilization update callbacks with the CPU scheduler in order to run a P-state selection algorithm, either -``powersave`` or ``performance``, depending on the ``scaling_cur_freq`` policy +``powersave`` or ``performance``, depending on the ``scaling_governor`` policy setting in ``sysfs``. The current CPU frequency information to be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is periodically updated by those utilization update callbacks too. diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst index 1e5c0f00cb2f..dbf5acd49f35 100644 --- a/Documentation/admin-guide/pm/sleep-states.rst +++ b/Documentation/admin-guide/pm/sleep-states.rst @@ -15,7 +15,7 @@ Sleep States That Can Be Supported ================================== Depending on its configuration and the capabilities of the platform it runs on, -the Linux kernel can support up to four system sleep states, includig +the Linux kernel can support up to four system sleep states, including hibernation and up to three variants of system suspend. The sleep states that can be supported by the kernel are listed below. diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index 4efd7ce77565..6dbcc5481000 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -61,7 +61,7 @@ Setting the ramoops parameters can be done in several different manners: mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1 B. Use Device Tree bindings, as described in - ``Documentation/device-tree/bindings/reserved-memory/admin-guide/ramoops.rst``. + ``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``. For example:: reserved-memory { diff --git a/Documentation/arm/Marvell/README b/Documentation/arm/Marvell/README index b5bb7f518840..56ada27c53be 100644 --- a/Documentation/arm/Marvell/README +++ b/Documentation/arm/Marvell/README @@ -302,19 +302,15 @@ Berlin family (Multimedia Solutions) 88DE3010, Armada 1000 (no Linux support) Core: Marvell PJ1 (ARMv5TE), Dual-core Product Brief: http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf - 88DE3005, Armada 1500-mini 88DE3005, Armada 1500 Mini Design name: BG2CD Core: ARM Cortex-A9, PL310 L2CC - Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini/ - 88DE3006, Armada 1500 Mini Plus - Design name: BG2CDP - Core: Dual Core ARM Cortex-A7 - Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini-plus/ + 88DE3006, Armada 1500 Mini Plus + Design name: BG2CDP + Core: Dual Core ARM Cortex-A7 88DE3100, Armada 1500 Design name: BG2 Core: Marvell PJ4B-MP (ARMv7), Tauros3 L2CC - Product Brief: http://www.marvell.com/digital-entertainment/armada-1500/assets/Marvell-ARMADA-1500-Product-Brief.pdf 88DE3114, Armada 1500 Pro Design name: BG2Q Core: Quad Core ARM Cortex-A9, PL310 L2CC @@ -324,13 +320,16 @@ Berlin family (Multimedia Solutions) 88DE3218, ARMADA 1500 Ultra Core: ARM Cortex-A53 - Homepage: http://www.marvell.com/multimedia-solutions/ + Homepage: https://www.synaptics.com/products/multimedia-solutions Directory: arch/arm/mach-berlin Comments: + * This line of SoCs is based on Marvell Sheeva or ARM Cortex CPUs with Synopsys DesignWare (IRQ, GPIO, Timers, ...) and PXA IP (SDHCI, USB, ETH, ...). + * The Berlin family was acquired by Synaptics from Marvell in 2017. + CPU Cores --------- diff --git a/Documentation/misc-devices/lcd-panel-cgram.txt b/Documentation/auxdisplay/lcd-panel-cgram.txt index 7f82c905763d..7f82c905763d 100644 --- a/Documentation/misc-devices/lcd-panel-cgram.txt +++ b/Documentation/auxdisplay/lcd-panel-cgram.txt diff --git a/Documentation/block/cmdline-partition.txt b/Documentation/block/cmdline-partition.txt index 525b9f6d7fb4..760a3f7c3ed4 100644 --- a/Documentation/block/cmdline-partition.txt +++ b/Documentation/block/cmdline-partition.txt @@ -1,7 +1,9 @@ Embedded device command line partition parsing ===================================================================== -Support for reading the block device partition table from the command line. +The "blkdevparts" command line option adds support for reading the +block device partition table from the kernel command line. + It is typically used for fixed block (eMMC) embedded devices. It has no MBR, so saves storage space. Bootloader can be easily accessed by absolute address of data on the block device. @@ -14,22 +16,27 @@ blkdevparts=<blkdev-def>[;<blkdev-def>] <partdef> := <size>[@<offset>](part-name) <blkdev-id> - block device disk name, embedded device used fixed block device, - it's disk name also fixed. such as: mmcblk0, mmcblk1, mmcblk0boot0. + block device disk name. Embedded device uses fixed block device. + Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0. <size> partition size, in bytes, such as: 512, 1m, 1G. + size may contain an optional suffix of (upper or lower case): + K, M, G, T, P, E. + "-" is used to denote all remaining space. <offset> partition start address, in bytes. + offset may contain an optional suffix of (upper or lower case): + K, M, G, T, P, E. (part-name) - partition name, kernel send uevent with "PARTNAME". application can create - a link to block device partition with the name "PARTNAME". - user space application can access partition by partition name. + partition name. Kernel sends uevent with "PARTNAME". Application can + create a link to block device partition with the name "PARTNAME". + User space application can access partition by partition name. Example: - eMMC disk name is "mmcblk0" and "mmcblk0boot0" + eMMC disk names are "mmcblk0" and "mmcblk0boot0". bootargs: 'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)' diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt index 733927a7b501..07f147381f32 100644 --- a/Documentation/block/null_blk.txt +++ b/Documentation/block/null_blk.txt @@ -71,13 +71,16 @@ use_per_node_hctx=[0/1]: Default: 0 1: The multi-queue block layer is instantiated with a hardware dispatch queue for each CPU node in the system. -use_lightnvm=[0/1]: Default: 0 - Register device with LightNVM. Requires blk-mq and CONFIG_NVM to be enabled. - no_sched=[0/1]: Default: 0 0: nullb* use default blk-mq io scheduler. 1: nullb* doesn't use io scheduler. +blocking=[0/1]: Default: 0 + 0: Register as a non-blocking blk-mq driver device. + 1: Register as a blocking blk-mq driver device, null_blk will set + the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always + needs to block in its ->queue_rq() function. + shared_tags=[0/1]: Default: 0 0: Tag set is not shared. 1: Tag set shared between devices for blk-mq. Only makes sense with diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 257e65714c6a..875b2b56b87f 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -218,6 +218,7 @@ line of text and contains the following stats separated by whitespace: same_pages the number of same element filled pages written to this disk. No memory is allocated for such pages. pages_compacted the number of pages freed during compaction + huge_pages the number of incompressible pages 9) Deactivate: swapoff /dev/zram0 @@ -242,5 +243,29 @@ to backing storage rather than keeping it in memory. User should set up backing device via /sys/block/zramX/backing_dev before disksize setting. += memory tracking + +With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the +zram block. It could be useful to catch cold or incompressible +pages of the process with*pagemap. +If you enable the feature, you could see block state via +/sys/kernel/debug/zram/zram0/block_state". The output is as follows, + + 300 75.033841 .wh + 301 63.806904 s.. + 302 63.806919 ..h + +First column is zram's block index. +Second column is access time since the system was booted +Third column is state of the block. +(s: same page +w: written page to backing store +h: huge page) + +First line of above example says 300th block is accessed at 75.033841sec +and the block's state is huge so it is written back to the backing +storage. It's a debugging feature so anyone shouldn't rely on it to work +properly. + Nitin Gupta ngupta@vflare.org diff --git a/Documentation/bpf/README.rst b/Documentation/bpf/README.rst new file mode 100644 index 000000000000..b9a80c9e9392 --- /dev/null +++ b/Documentation/bpf/README.rst @@ -0,0 +1,36 @@ +================= +BPF documentation +================= + +This directory contains documentation for the BPF (Berkeley Packet +Filter) facility, with a focus on the extended BPF version (eBPF). + +This kernel side documentation is still work in progress. The main +textual documentation is (for historical reasons) described in +`Documentation/networking/filter.txt`_, which describe both classical +and extended BPF instruction-set. +The Cilium project also maintains a `BPF and XDP Reference Guide`_ +that goes into great technical depth about the BPF Architecture. + +The primary info for the bpf syscall is available in the `man-pages`_ +for `bpf(2)`_. + + + +Frequently asked questions (FAQ) +================================ + +Two sets of Questions and Answers (Q&A) are maintained. + +* QA for common questions about BPF see: bpf_design_QA_ + +* QA for developers interacting with BPF subsystem: bpf_devel_QA_ + + +.. Links: +.. _bpf_design_QA: bpf_design_QA.rst +.. _bpf_devel_QA: bpf_devel_QA.rst +.. _Documentation/networking/filter.txt: ../networking/filter.txt +.. _man-pages: https://www.kernel.org/doc/man-pages/ +.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html +.. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/ diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst new file mode 100644 index 000000000000..6780a6d81745 --- /dev/null +++ b/Documentation/bpf/bpf_design_QA.rst @@ -0,0 +1,221 @@ +============== +BPF Design Q&A +============== + +BPF extensibility and applicability to networking, tracing, security +in the linux kernel and several user space implementations of BPF +virtual machine led to a number of misunderstanding on what BPF actually is. +This short QA is an attempt to address that and outline a direction +of where BPF is heading long term. + +.. contents:: + :local: + :depth: 3 + +Questions and Answers +===================== + +Q: Is BPF a generic instruction set similar to x64 and arm64? +------------------------------------------------------------- +A: NO. + +Q: Is BPF a generic virtual machine ? +------------------------------------- +A: NO. + +BPF is generic instruction set *with* C calling convention. +----------------------------------------------------------- + +Q: Why C calling convention was chosen? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A: Because BPF programs are designed to run in the linux kernel +which is written in C, hence BPF defines instruction set compatible +with two most used architectures x64 and arm64 (and takes into +consideration important quirks of other architectures) and +defines calling convention that is compatible with C calling +convention of the linux kernel on those architectures. + +Q: can multiple return values be supported in the future? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A: NO. BPF allows only register R0 to be used as return value. + +Q: can more than 5 function arguments be supported in the future? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A: NO. BPF calling convention only allows registers R1-R5 to be used +as arguments. BPF is not a standalone instruction set. +(unlike x64 ISA that allows msft, cdecl and other conventions) + +Q: can BPF programs access instruction pointer or return address? +----------------------------------------------------------------- +A: NO. + +Q: can BPF programs access stack pointer ? +------------------------------------------ +A: NO. + +Only frame pointer (register R10) is accessible. +From compiler point of view it's necessary to have stack pointer. +For example LLVM defines register R11 as stack pointer in its +BPF backend, but it makes sure that generated code never uses it. + +Q: Does C-calling convention diminishes possible use cases? +----------------------------------------------------------- +A: YES. + +BPF design forces addition of major functionality in the form +of kernel helper functions and kernel objects like BPF maps with +seamless interoperability between them. It lets kernel call into +BPF programs and programs call kernel helpers with zero overhead. +As all of them were native C code. That is particularly the case +for JITed BPF programs that are indistinguishable from +native kernel C code. + +Q: Does it mean that 'innovative' extensions to BPF code are disallowed? +------------------------------------------------------------------------ +A: Soft yes. + +At least for now until BPF core has support for +bpf-to-bpf calls, indirect calls, loops, global variables, +jump tables, read only sections and all other normal constructs +that C code can produce. + +Q: Can loops be supported in a safe way? +---------------------------------------- +A: It's not clear yet. + +BPF developers are trying to find a way to +support bounded loops where the verifier can guarantee that +the program terminates in less than 4096 instructions. + +Instruction level questions +--------------------------- + +Q: LD_ABS and LD_IND instructions vs C code +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Q: How come LD_ABS and LD_IND instruction are present in BPF whereas +C code cannot express them and has to use builtin intrinsics? + +A: This is artifact of compatibility with classic BPF. Modern +networking code in BPF performs better without them. +See 'direct packet access'. + +Q: BPF instructions mapping not one-to-one to native CPU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Q: It seems not all BPF instructions are one-to-one to native CPU. +For example why BPF_JNE and other compare and jumps are not cpu-like? + +A: This was necessary to avoid introducing flags into ISA which are +impossible to make generic and efficient across CPU architectures. + +Q: why BPF_DIV instruction doesn't map to x64 div? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A: Because if we picked one-to-one relationship to x64 it would have made +it more complicated to support on arm64 and other archs. Also it +needs div-by-zero runtime check. + +Q: why there is no BPF_SDIV for signed divide operation? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A: Because it would be rarely used. llvm errors in such case and +prints a suggestion to use unsigned divide instead + +Q: Why BPF has implicit prologue and epilogue? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A: Because architectures like sparc have register windows and in general +there are enough subtle differences between architectures, so naive +store return address into stack won't work. Another reason is BPF has +to be safe from division by zero (and legacy exception path +of LD_ABS insn). Those instructions need to invoke epilogue and +return implicitly. + +Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A: Because classic BPF didn't have them and BPF authors felt that compiler +workaround would be acceptable. Turned out that programs lose performance +due to lack of these compare instructions and they were added. +These two instructions is a perfect example what kind of new BPF +instructions are acceptable and can be added in the future. +These two already had equivalent instructions in native CPUs. +New instructions that don't have one-to-one mapping to HW instructions +will not be accepted. + +Q: BPF 32-bit subregister requirements +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF +registers which makes BPF inefficient virtual machine for 32-bit +CPU architectures and 32-bit HW accelerators. Can true 32-bit registers +be added to BPF in the future? + +A: NO. The first thing to improve performance on 32-bit archs is to teach +LLVM to generate code that uses 32-bit subregisters. Then second step +is to teach verifier to mark operations where zero-ing upper bits +is unnecessary. Then JITs can take advantage of those markings and +drastically reduce size of generated code and improve performance. + +Q: Does BPF have a stable ABI? +------------------------------ +A: YES. BPF instructions, arguments to BPF programs, set of helper +functions and their arguments, recognized return codes are all part +of ABI. However when tracing programs are using bpf_probe_read() helper +to walk kernel internal datastructures and compile with kernel +internal headers these accesses can and will break with newer +kernels. The union bpf_attr -> kern_version is checked at load time +to prevent accidentally loading kprobe-based bpf programs written +for a different kernel. Networking programs don't do kern_version check. + +Q: How much stack space a BPF program uses? +------------------------------------------- +A: Currently all program types are limited to 512 bytes of stack +space, but the verifier computes the actual amount of stack used +and both interpreter and most JITed code consume necessary amount. + +Q: Can BPF be offloaded to HW? +------------------------------ +A: YES. BPF HW offload is supported by NFP driver. + +Q: Does classic BPF interpreter still exist? +-------------------------------------------- +A: NO. Classic BPF programs are converted into extend BPF instructions. + +Q: Can BPF call arbitrary kernel functions? +------------------------------------------- +A: NO. BPF programs can only call a set of helper functions which +is defined for every program type. + +Q: Can BPF overwrite arbitrary kernel memory? +--------------------------------------------- +A: NO. + +Tracing bpf programs can *read* arbitrary memory with bpf_probe_read() +and bpf_probe_read_str() helpers. Networking programs cannot read +arbitrary memory, since they don't have access to these helpers. +Programs can never read or write arbitrary memory directly. + +Q: Can BPF overwrite arbitrary user memory? +------------------------------------------- +A: Sort-of. + +Tracing BPF programs can overwrite the user memory +of the current task with bpf_probe_write_user(). Every time such +program is loaded the kernel will print warning message, so +this helper is only useful for experiments and prototypes. +Tracing BPF programs are root only. + +Q: bpf_trace_printk() helper warning +------------------------------------ +Q: When bpf_trace_printk() helper is used the kernel prints nasty +warning message. Why is that? + +A: This is done to nudge program authors into better interfaces when +programs need to pass data to user space. Like bpf_perf_event_output() +can be used to efficiently stream data via perf ring buffer. +BPF maps can be used for asynchronous data sharing between kernel +and user space. bpf_trace_printk() should only be used for debugging. + +Q: New functionality via kernel modules? +---------------------------------------- +Q: Can BPF functionality such as new program or map types, new +helpers, etc be added out of kernel module code? + +A: NO. diff --git a/Documentation/bpf/bpf_design_QA.txt b/Documentation/bpf/bpf_design_QA.txt deleted file mode 100644 index f3e458a0bb2f..000000000000 --- a/Documentation/bpf/bpf_design_QA.txt +++ /dev/null @@ -1,156 +0,0 @@ -BPF extensibility and applicability to networking, tracing, security -in the linux kernel and several user space implementations of BPF -virtual machine led to a number of misunderstanding on what BPF actually is. -This short QA is an attempt to address that and outline a direction -of where BPF is heading long term. - -Q: Is BPF a generic instruction set similar to x64 and arm64? -A: NO. - -Q: Is BPF a generic virtual machine ? -A: NO. - -BPF is generic instruction set _with_ C calling convention. - -Q: Why C calling convention was chosen? -A: Because BPF programs are designed to run in the linux kernel - which is written in C, hence BPF defines instruction set compatible - with two most used architectures x64 and arm64 (and takes into - consideration important quirks of other architectures) and - defines calling convention that is compatible with C calling - convention of the linux kernel on those architectures. - -Q: can multiple return values be supported in the future? -A: NO. BPF allows only register R0 to be used as return value. - -Q: can more than 5 function arguments be supported in the future? -A: NO. BPF calling convention only allows registers R1-R5 to be used - as arguments. BPF is not a standalone instruction set. - (unlike x64 ISA that allows msft, cdecl and other conventions) - -Q: can BPF programs access instruction pointer or return address? -A: NO. - -Q: can BPF programs access stack pointer ? -A: NO. Only frame pointer (register R10) is accessible. - From compiler point of view it's necessary to have stack pointer. - For example LLVM defines register R11 as stack pointer in its - BPF backend, but it makes sure that generated code never uses it. - -Q: Does C-calling convention diminishes possible use cases? -A: YES. BPF design forces addition of major functionality in the form - of kernel helper functions and kernel objects like BPF maps with - seamless interoperability between them. It lets kernel call into - BPF programs and programs call kernel helpers with zero overhead. - As all of them were native C code. That is particularly the case - for JITed BPF programs that are indistinguishable from - native kernel C code. - -Q: Does it mean that 'innovative' extensions to BPF code are disallowed? -A: Soft yes. At least for now until BPF core has support for - bpf-to-bpf calls, indirect calls, loops, global variables, - jump tables, read only sections and all other normal constructs - that C code can produce. - -Q: Can loops be supported in a safe way? -A: It's not clear yet. BPF developers are trying to find a way to - support bounded loops where the verifier can guarantee that - the program terminates in less than 4096 instructions. - -Q: How come LD_ABS and LD_IND instruction are present in BPF whereas - C code cannot express them and has to use builtin intrinsics? -A: This is artifact of compatibility with classic BPF. Modern - networking code in BPF performs better without them. - See 'direct packet access'. - -Q: It seems not all BPF instructions are one-to-one to native CPU. - For example why BPF_JNE and other compare and jumps are not cpu-like? -A: This was necessary to avoid introducing flags into ISA which are - impossible to make generic and efficient across CPU architectures. - -Q: why BPF_DIV instruction doesn't map to x64 div? -A: Because if we picked one-to-one relationship to x64 it would have made - it more complicated to support on arm64 and other archs. Also it - needs div-by-zero runtime check. - -Q: why there is no BPF_SDIV for signed divide operation? -A: Because it would be rarely used. llvm errors in such case and - prints a suggestion to use unsigned divide instead - -Q: Why BPF has implicit prologue and epilogue? -A: Because architectures like sparc have register windows and in general - there are enough subtle differences between architectures, so naive - store return address into stack won't work. Another reason is BPF has - to be safe from division by zero (and legacy exception path - of LD_ABS insn). Those instructions need to invoke epilogue and - return implicitly. - -Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning? -A: Because classic BPF didn't have them and BPF authors felt that compiler - workaround would be acceptable. Turned out that programs lose performance - due to lack of these compare instructions and they were added. - These two instructions is a perfect example what kind of new BPF - instructions are acceptable and can be added in the future. - These two already had equivalent instructions in native CPUs. - New instructions that don't have one-to-one mapping to HW instructions - will not be accepted. - -Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF - registers which makes BPF inefficient virtual machine for 32-bit - CPU architectures and 32-bit HW accelerators. Can true 32-bit registers - be added to BPF in the future? -A: NO. The first thing to improve performance on 32-bit archs is to teach - LLVM to generate code that uses 32-bit subregisters. Then second step - is to teach verifier to mark operations where zero-ing upper bits - is unnecessary. Then JITs can take advantage of those markings and - drastically reduce size of generated code and improve performance. - -Q: Does BPF have a stable ABI? -A: YES. BPF instructions, arguments to BPF programs, set of helper - functions and their arguments, recognized return codes are all part - of ABI. However when tracing programs are using bpf_probe_read() helper - to walk kernel internal datastructures and compile with kernel - internal headers these accesses can and will break with newer - kernels. The union bpf_attr -> kern_version is checked at load time - to prevent accidentally loading kprobe-based bpf programs written - for a different kernel. Networking programs don't do kern_version check. - -Q: How much stack space a BPF program uses? -A: Currently all program types are limited to 512 bytes of stack - space, but the verifier computes the actual amount of stack used - and both interpreter and most JITed code consume necessary amount. - -Q: Can BPF be offloaded to HW? -A: YES. BPF HW offload is supported by NFP driver. - -Q: Does classic BPF interpreter still exist? -A: NO. Classic BPF programs are converted into extend BPF instructions. - -Q: Can BPF call arbitrary kernel functions? -A: NO. BPF programs can only call a set of helper functions which - is defined for every program type. - -Q: Can BPF overwrite arbitrary kernel memory? -A: NO. Tracing bpf programs can _read_ arbitrary memory with bpf_probe_read() - and bpf_probe_read_str() helpers. Networking programs cannot read - arbitrary memory, since they don't have access to these helpers. - Programs can never read or write arbitrary memory directly. - -Q: Can BPF overwrite arbitrary user memory? -A: Sort-of. Tracing BPF programs can overwrite the user memory - of the current task with bpf_probe_write_user(). Every time such - program is loaded the kernel will print warning message, so - this helper is only useful for experiments and prototypes. - Tracing BPF programs are root only. - -Q: When bpf_trace_printk() helper is used the kernel prints nasty - warning message. Why is that? -A: This is done to nudge program authors into better interfaces when - programs need to pass data to user space. Like bpf_perf_event_output() - can be used to efficiently stream data via perf ring buffer. - BPF maps can be used for asynchronous data sharing between kernel - and user space. bpf_trace_printk() should only be used for debugging. - -Q: Can BPF functionality such as new program or map types, new - helpers, etc be added out of kernel module code? -A: NO. diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst new file mode 100644 index 000000000000..0e7c1d946e83 --- /dev/null +++ b/Documentation/bpf/bpf_devel_QA.rst @@ -0,0 +1,640 @@ +================================= +HOWTO interact with BPF subsystem +================================= + +This document provides information for the BPF subsystem about various +workflows related to reporting bugs, submitting patches, and queueing +patches for stable kernels. + +For general information about submitting patches, please refer to +`Documentation/process/`_. This document only describes additional specifics +related to BPF. + +.. contents:: + :local: + :depth: 2 + +Reporting bugs +============== + +Q: How do I report bugs for BPF kernel code? +-------------------------------------------- +A: Since all BPF kernel development as well as bpftool and iproute2 BPF +loader development happens through the netdev kernel mailing list, +please report any found issues around BPF to the following mailing +list: + + netdev@vger.kernel.org + +This may also include issues related to XDP, BPF tracing, etc. + +Given netdev has a high volume of traffic, please also add the BPF +maintainers to Cc (from kernel MAINTAINERS_ file): + +* Alexei Starovoitov <ast@kernel.org> +* Daniel Borkmann <daniel@iogearbox.net> + +In case a buggy commit has already been identified, make sure to keep +the actual commit authors in Cc as well for the report. They can +typically be identified through the kernel's git tree. + +**Please do NOT report BPF issues to bugzilla.kernel.org since it +is a guarantee that the reported issue will be overlooked.** + +Submitting patches +================== + +Q: To which mailing list do I need to submit my BPF patches? +------------------------------------------------------------ +A: Please submit your BPF patches to the netdev kernel mailing list: + + netdev@vger.kernel.org + +Historically, BPF came out of networking and has always been maintained +by the kernel networking community. Although these days BPF touches +many other subsystems as well, the patches are still routed mainly +through the networking community. + +In case your patch has changes in various different subsystems (e.g. +tracing, security, etc), make sure to Cc the related kernel mailing +lists and maintainers from there as well, so they are able to review +the changes and provide their Acked-by's to the patches. + +Q: Where can I find patches currently under discussion for BPF subsystem? +------------------------------------------------------------------------- +A: All patches that are Cc'ed to netdev are queued for review under netdev +patchwork project: + + http://patchwork.ozlabs.org/project/netdev/list/ + +Those patches which target BPF, are assigned to a 'bpf' delegate for +further processing from BPF maintainers. The current queue with +patches under review can be found at: + + https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 + +Once the patches have been reviewed by the BPF community as a whole +and approved by the BPF maintainers, their status in patchwork will be +changed to 'Accepted' and the submitter will be notified by mail. This +means that the patches look good from a BPF perspective and have been +applied to one of the two BPF kernel trees. + +In case feedback from the community requires a respin of the patches, +their status in patchwork will be set to 'Changes Requested', and purged +from the current review queue. Likewise for cases where patches would +get rejected or are not applicable to the BPF trees (but assigned to +the 'bpf' delegate). + +Q: How do the changes make their way into Linux? +------------------------------------------------ +A: There are two BPF kernel trees (git repositories). Once patches have +been accepted by the BPF maintainers, they will be applied to one +of the two BPF trees: + + * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/ + * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ + +The bpf tree itself is for fixes only, whereas bpf-next for features, +cleanups or other kind of improvements ("next-like" content). This is +analogous to net and net-next trees for networking. Both bpf and +bpf-next will only have a master branch in order to simplify against +which branch patches should get rebased to. + +Accumulated BPF patches in the bpf tree will regularly get pulled +into the net kernel tree. Likewise, accumulated BPF patches accepted +into the bpf-next tree will make their way into net-next tree. net and +net-next are both run by David S. Miller. From there, they will go +into the kernel mainline tree run by Linus Torvalds. To read up on the +process of net and net-next being merged into the mainline tree, see +the `netdev FAQ`_ under: + + `Documentation/networking/netdev-FAQ.txt`_ + +Occasionally, to prevent merge conflicts, we might send pull requests +to other trees (e.g. tracing) with a small subset of the patches, but +net and net-next are always the main trees targeted for integration. + +The pull requests will contain a high-level summary of the accumulated +patches and can be searched on netdev kernel mailing list through the +following subject lines (``yyyy-mm-dd`` is the date of the pull +request):: + + pull-request: bpf yyyy-mm-dd + pull-request: bpf-next yyyy-mm-dd + +Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to? +--------------------------------------------------------------------------------- + +A: The process is the very same as described in the `netdev FAQ`_, so +please read up on it. The subject line must indicate whether the +patch is a fix or rather "next-like" content in order to let the +maintainers know whether it is targeted at bpf or bpf-next. + +For fixes eventually landing in bpf -> net tree, the subject must +look like:: + + git format-patch --subject-prefix='PATCH bpf' start..finish + +For features/improvements/etc that should eventually land in +bpf-next -> net-next, the subject must look like:: + + git format-patch --subject-prefix='PATCH bpf-next' start..finish + +If unsure whether the patch or patch series should go into bpf +or net directly, or bpf-next or net-next directly, it is not a +problem either if the subject line says net or net-next as target. +It is eventually up to the maintainers to do the delegation of +the patches. + +If it is clear that patches should go into bpf or bpf-next tree, +please make sure to rebase the patches against those trees in +order to reduce potential conflicts. + +In case the patch or patch series has to be reworked and sent out +again in a second or later revision, it is also required to add a +version number (``v2``, ``v3``, ...) into the subject prefix:: + + git format-patch --subject-prefix='PATCH net-next v2' start..finish + +When changes have been requested to the patch series, always send the +whole patch series again with the feedback incorporated (never send +individual diffs on top of the old series). + +Q: What does it mean when a patch gets applied to bpf or bpf-next tree? +----------------------------------------------------------------------- +A: It means that the patch looks good for mainline inclusion from +a BPF point of view. + +Be aware that this is not a final verdict that the patch will +automatically get accepted into net or net-next trees eventually: + +On the netdev kernel mailing list reviews can come in at any point +in time. If discussions around a patch conclude that they cannot +get included as-is, we will either apply a follow-up fix or drop +them from the trees entirely. Therefore, we also reserve to rebase +the trees when deemed necessary. After all, the purpose of the tree +is to: + +i) accumulate and stage BPF patches for integration into trees + like net and net-next, and + +ii) run extensive BPF test suite and + workloads on the patches before they make their way any further. + +Once the BPF pull request was accepted by David S. Miller, then +the patches end up in net or net-next tree, respectively, and +make their way from there further into mainline. Again, see the +`netdev FAQ`_ for additional information e.g. on how often they are +merged to mainline. + +Q: How long do I need to wait for feedback on my BPF patches? +------------------------------------------------------------- +A: We try to keep the latency low. The usual time to feedback will +be around 2 or 3 business days. It may vary depending on the +complexity of changes and current patch load. + +Q: How often do you send pull requests to major kernel trees like net or net-next? +---------------------------------------------------------------------------------- + +A: Pull requests will be sent out rather often in order to not +accumulate too many patches in bpf or bpf-next. + +As a rule of thumb, expect pull requests for each tree regularly +at the end of the week. In some cases pull requests could additionally +come also in the middle of the week depending on the current patch +load or urgency. + +Q: Are patches applied to bpf-next when the merge window is open? +----------------------------------------------------------------- +A: For the time when the merge window is open, bpf-next will not be +processed. This is roughly analogous to net-next patch processing, +so feel free to read up on the `netdev FAQ`_ about further details. + +During those two weeks of merge window, we might ask you to resend +your patch series once bpf-next is open again. Once Linus released +a ``v*-rc1`` after the merge window, we continue processing of bpf-next. + +For non-subscribers to kernel mailing lists, there is also a status +page run by David S. Miller on net-next that provides guidance: + + http://vger.kernel.org/~davem/net-next.html + +Q: Verifier changes and test cases +---------------------------------- +Q: I made a BPF verifier change, do I need to add test cases for +BPF kernel selftests_? + +A: If the patch has changes to the behavior of the verifier, then yes, +it is absolutely necessary to add test cases to the BPF kernel +selftests_ suite. If they are not present and we think they are +needed, then we might ask for them before accepting any changes. + +In particular, test_verifier.c is tracking a high number of BPF test +cases, including a lot of corner cases that LLVM BPF back end may +generate out of the restricted C code. Thus, adding test cases is +absolutely crucial to make sure future changes do not accidentally +affect prior use-cases. Thus, treat those test cases as: verifier +behavior that is not tracked in test_verifier.c could potentially +be subject to change. + +Q: samples/bpf preference vs selftests? +--------------------------------------- +Q: When should I add code to `samples/bpf/`_ and when to BPF kernel +selftests_ ? + +A: In general, we prefer additions to BPF kernel selftests_ rather than +`samples/bpf/`_. The rationale is very simple: kernel selftests are +regularly run by various bots to test for kernel regressions. + +The more test cases we add to BPF selftests, the better the coverage +and the less likely it is that those could accidentally break. It is +not that BPF kernel selftests cannot demo how a specific feature can +be used. + +That said, `samples/bpf/`_ may be a good place for people to get started, +so it might be advisable that simple demos of features could go into +`samples/bpf/`_, but advanced functional and corner-case testing rather +into kernel selftests. + +If your sample looks like a test case, then go for BPF kernel selftests +instead! + +Q: When should I add code to the bpftool? +----------------------------------------- +A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide +a central user space tool for debugging and introspection of BPF programs +and maps that are active in the kernel. If UAPI changes related to BPF +enable for dumping additional information of programs or maps, then +bpftool should be extended as well to support dumping them. + +Q: When should I add code to iproute2's BPF loader? +--------------------------------------------------- +A: For UAPI changes related to the XDP or tc layer (e.g. ``cls_bpf``), +the convention is that those control-path related changes are added to +iproute2's BPF loader as well from user space side. This is not only +useful to have UAPI changes properly designed to be usable, but also +to make those changes available to a wider user base of major +downstream distributions. + +Q: Do you accept patches as well for iproute2's BPF loader? +----------------------------------------------------------- +A: Patches for the iproute2's BPF loader have to be sent to: + + netdev@vger.kernel.org + +While those patches are not processed by the BPF kernel maintainers, +please keep them in Cc as well, so they can be reviewed. + +The official git repository for iproute2 is run by Stephen Hemminger +and can be found at: + + https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/ + +The patches need to have a subject prefix of '``[PATCH iproute2 +master]``' or '``[PATCH iproute2 net-next]``'. '``master``' or +'``net-next``' describes the target branch where the patch should be +applied to. Meaning, if kernel changes went into the net-next kernel +tree, then the related iproute2 changes need to go into the iproute2 +net-next branch, otherwise they can be targeted at master branch. The +iproute2 net-next branch will get merged into the master branch after +the current iproute2 version from master has been released. + +Like BPF, the patches end up in patchwork under the netdev project and +are delegated to 'shemminger' for further processing: + + http://patchwork.ozlabs.org/project/netdev/list/?delegate=389 + +Q: What is the minimum requirement before I submit my BPF patches? +------------------------------------------------------------------ +A: When submitting patches, always take the time and properly test your +patches *prior* to submission. Never rush them! If maintainers find +that your patches have not been properly tested, it is a good way to +get them grumpy. Testing patch submissions is a hard requirement! + +Note, fixes that go to bpf tree *must* have a ``Fixes:`` tag included. +The same applies to fixes that target bpf-next, where the affected +commit is in net-next (or in some cases bpf-next). The ``Fixes:`` tag is +crucial in order to identify follow-up commits and tremendously helps +for people having to do backporting, so it is a must have! + +We also don't accept patches with an empty commit message. Take your +time and properly write up a high quality commit message, it is +essential! + +Think about it this way: other developers looking at your code a month +from now need to understand *why* a certain change has been done that +way, and whether there have been flaws in the analysis or assumptions +that the original author did. Thus providing a proper rationale and +describing the use-case for the changes is a must. + +Patch submissions with >1 patch must have a cover letter which includes +a high level description of the series. This high level summary will +then be placed into the merge commit by the BPF maintainers such that +it is also accessible from the git log for future reference. + +Q: Features changing BPF JIT and/or LLVM +---------------------------------------- +Q: What do I need to consider when adding a new instruction or feature +that would require BPF JIT and/or LLVM integration as well? + +A: We try hard to keep all BPF JITs up to date such that the same user +experience can be guaranteed when running BPF programs on different +architectures without having the program punt to the less efficient +interpreter in case the in-kernel BPF JIT is enabled. + +If you are unable to implement or test the required JIT changes for +certain architectures, please work together with the related BPF JIT +developers in order to get the feature implemented in a timely manner. +Please refer to the git log (``arch/*/net/``) to locate the necessary +people for helping out. + +Also always make sure to add BPF test cases (e.g. test_bpf.c and +test_verifier.c) for new instructions, so that they can receive +broad test coverage and help run-time testing the various BPF JITs. + +In case of new BPF instructions, once the changes have been accepted +into the Linux kernel, please implement support into LLVM's BPF back +end. See LLVM_ section below for further information. + +Stable submission +================= + +Q: I need a specific BPF commit in stable kernels. What should I do? +-------------------------------------------------------------------- +A: In case you need a specific fix in stable kernels, first check whether +the commit has already been applied in the related ``linux-*.y`` branches: + + https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/ + +If not the case, then drop an email to the BPF maintainers with the +netdev kernel mailing list in Cc and ask for the fix to be queued up: + + netdev@vger.kernel.org + +The process in general is the same as on netdev itself, see also the +`netdev FAQ`_ document. + +Q: Do you also backport to kernels not currently maintained as stable? +---------------------------------------------------------------------- +A: No. If you need a specific BPF commit in kernels that are currently not +maintained by the stable maintainers, then you are on your own. + +The current stable and longterm stable kernels are all listed here: + + https://www.kernel.org/ + +Q: The BPF patch I am about to submit needs to go to stable as well +------------------------------------------------------------------- +What should I do? + +A: The same rules apply as with netdev patch submissions in general, see +`netdev FAQ`_ under: + + `Documentation/networking/netdev-FAQ.txt`_ + +Never add "``Cc: stable@vger.kernel.org``" to the patch description, but +ask the BPF maintainers to queue the patches instead. This can be done +with a note, for example, under the ``---`` part of the patch which does +not go into the git log. Alternatively, this can be done as a simple +request by mail instead. + +Q: Queue stable patches +----------------------- +Q: Where do I find currently queued BPF patches that will be submitted +to stable? + +A: Once patches that fix critical bugs got applied into the bpf tree, they +are queued up for stable submission under: + + http://patchwork.ozlabs.org/bundle/bpf/stable/?state=* + +They will be on hold there at minimum until the related commit made its +way into the mainline kernel tree. + +After having been under broader exposure, the queued patches will be +submitted by the BPF maintainers to the stable maintainers. + +Testing patches +=============== + +Q: How to run BPF selftests +--------------------------- +A: After you have booted into the newly compiled kernel, navigate to +the BPF selftests_ suite in order to test BPF functionality (current +working directory points to the root of the cloned git tree):: + + $ cd tools/testing/selftests/bpf/ + $ make + +To run the verifier tests:: + + $ sudo ./test_verifier + +The verifier tests print out all the current checks being +performed. The summary at the end of running all tests will dump +information of test successes and failures:: + + Summary: 418 PASSED, 0 FAILED + +In order to run through all BPF selftests, the following command is +needed:: + + $ sudo make run_tests + +See the kernels selftest `Documentation/dev-tools/kselftest.rst`_ +document for further documentation. + +Q: Which BPF kernel selftests version should I run my kernel against? +--------------------------------------------------------------------- +A: If you run a kernel ``xyz``, then always run the BPF kernel selftests +from that kernel ``xyz`` as well. Do not expect that the BPF selftest +from the latest mainline tree will pass all the time. + +In particular, test_bpf.c and test_verifier.c have a large number of +test cases and are constantly updated with new BPF test sequences, or +existing ones are adapted to verifier changes e.g. due to verifier +becoming smarter and being able to better track certain things. + +LLVM +==== + +Q: Where do I find LLVM with BPF support? +----------------------------------------- +A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1. + +All major distributions these days ship LLVM with BPF back end enabled, +so for the majority of use-cases it is not required to compile LLVM by +hand anymore, just install the distribution provided package. + +LLVM's static compiler lists the supported targets through +``llc --version``, make sure BPF targets are listed. Example:: + + $ llc --version + LLVM (http://llvm.org/): + LLVM version 6.0.0svn + Optimized build. + Default target: x86_64-unknown-linux-gnu + Host CPU: skylake + + Registered Targets: + bpf - BPF (host endian) + bpfeb - BPF (big endian) + bpfel - BPF (little endian) + x86 - 32-bit X86: Pentium-Pro and above + x86-64 - 64-bit X86: EM64T and AMD64 + +For developers in order to utilize the latest features added to LLVM's +BPF back end, it is advisable to run the latest LLVM releases. Support +for new BPF kernel features such as additions to the BPF instruction +set are often developed together. + +All LLVM releases can be found at: http://releases.llvm.org/ + +Q: Got it, so how do I build LLVM manually anyway? +-------------------------------------------------- +A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have +that set up, proceed with building the latest LLVM and clang version +from the git repositories:: + + $ git clone http://llvm.org/git/llvm.git + $ cd llvm/tools + $ git clone --depth 1 http://llvm.org/git/clang.git + $ cd ..; mkdir build; cd build + $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \ + -DBUILD_SHARED_LIBS=OFF \ + -DCMAKE_BUILD_TYPE=Release \ + -DLLVM_BUILD_RUNTIME=OFF + $ make -j $(getconf _NPROCESSORS_ONLN) + +The built binaries can then be found in the build/bin/ directory, where +you can point the PATH variable to. + +Q: Reporting LLVM BPF issues +---------------------------- +Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code +generation back end or about LLVM generated code that the verifier +refuses to accept? + +A: Yes, please do! + +LLVM's BPF back end is a key piece of the whole BPF +infrastructure and it ties deeply into verification of programs from the +kernel side. Therefore, any issues on either side need to be investigated +and fixed whenever necessary. + +Therefore, please make sure to bring them up at netdev kernel mailing +list and Cc BPF maintainers for LLVM and kernel bits: + +* Yonghong Song <yhs@fb.com> +* Alexei Starovoitov <ast@kernel.org> +* Daniel Borkmann <daniel@iogearbox.net> + +LLVM also has an issue tracker where BPF related bugs can be found: + + https://bugs.llvm.org/buglist.cgi?quicksearch=bpf + +However, it is better to reach out through mailing lists with having +maintainers in Cc. + +Q: New BPF instruction for kernel and LLVM +------------------------------------------ +Q: I have added a new BPF instruction to the kernel, how can I integrate +it into LLVM? + +A: LLVM has a ``-mcpu`` selector for the BPF back end in order to allow +the selection of BPF instruction set extensions. By default the +``generic`` processor target is used, which is the base instruction set +(v1) of BPF. + +LLVM has an option to select ``-mcpu=probe`` where it will probe the host +kernel for supported BPF instruction set extensions and selects the +optimal set automatically. + +For cross-compilation, a specific version can be select manually as well :: + + $ llc -march bpf -mcpu=help + Available CPUs for this target: + + generic - Select the generic processor. + probe - Select the probe processor. + v1 - Select the v1 processor. + v2 - Select the v2 processor. + [...] + +Newly added BPF instructions to the Linux kernel need to follow the same +scheme, bump the instruction set version and implement probing for the +extensions such that ``-mcpu=probe`` users can benefit from the +optimization transparently when upgrading their kernels. + +If you are unable to implement support for the newly added BPF instruction +please reach out to BPF developers for help. + +By the way, the BPF kernel selftests run with ``-mcpu=probe`` for better +test coverage. + +Q: clang flag for target bpf? +----------------------------- +Q: In some cases clang flag ``-target bpf`` is used but in other cases the +default clang target, which matches the underlying architecture, is used. +What is the difference and when I should use which? + +A: Although LLVM IR generation and optimization try to stay architecture +independent, ``-target <arch>`` still has some impact on generated code: + +- BPF program may recursively include header file(s) with file scope + inline assembly codes. The default target can handle this well, + while ``bpf`` target may fail if bpf backend assembler does not + understand these assembly codes, which is true in most cases. + +- When compiled without ``-g``, additional elf sections, e.g., + .eh_frame and .rela.eh_frame, may be present in the object file + with default target, but not with ``bpf`` target. + +- The default target may turn a C switch statement into a switch table + lookup and jump operation. Since the switch table is placed + in the global readonly section, the bpf program will fail to load. + The bpf target does not support switch table optimization. + The clang option ``-fno-jump-tables`` can be used to disable + switch table generation. + +- For clang ``-target bpf``, it is guaranteed that pointer or long / + unsigned long types will always have a width of 64 bit, no matter + whether underlying clang binary or default target (or kernel) is + 32 bit. However, when native clang target is used, then it will + compile these types based on the underlying architecture's conventions, + meaning in case of 32 bit architecture, pointer or long / unsigned + long types e.g. in BPF context structure will have width of 32 bit + while the BPF LLVM back end still operates in 64 bit. The native + target is mostly needed in tracing for the case of walking ``pt_regs`` + or other kernel structures where CPU's register width matters. + Otherwise, ``clang -target bpf`` is generally recommended. + +You should use default target when: + +- Your program includes a header file, e.g., ptrace.h, which eventually + pulls in some header files containing file scope host assembly codes. + +- You can add ``-fno-jump-tables`` to work around the switch table issue. + +Otherwise, you can use ``bpf`` target. Additionally, you *must* use bpf target +when: + +- Your program uses data structures with pointer or long / unsigned long + types that interface with BPF helpers or context data structures. Access + into these structures is verified by the BPF verifier and may result + in verification failures if the native architecture is not aligned with + the BPF architecture, e.g. 64-bit. An example of this is + BPF_PROG_TYPE_SK_MSG require ``-target bpf`` + + +.. Links +.. _Documentation/process/: https://www.kernel.org/doc/html/latest/process/ +.. _MAINTAINERS: ../../MAINTAINERS +.. _Documentation/networking/netdev-FAQ.txt: ../networking/netdev-FAQ.txt +.. _netdev FAQ: ../networking/netdev-FAQ.txt +.. _samples/bpf/: ../../samples/bpf/ +.. _selftests: ../../tools/testing/selftests/bpf/ +.. _Documentation/dev-tools/kselftest.rst: + https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html + +Happy BPF hacking! diff --git a/Documentation/bpf/bpf_devel_QA.txt b/Documentation/bpf/bpf_devel_QA.txt deleted file mode 100644 index 1a0b704e1a38..000000000000 --- a/Documentation/bpf/bpf_devel_QA.txt +++ /dev/null @@ -1,562 +0,0 @@ -This document provides information for the BPF subsystem about various -workflows related to reporting bugs, submitting patches, and queueing -patches for stable kernels. - -For general information about submitting patches, please refer to -Documentation/process/. This document only describes additional specifics -related to BPF. - -Reporting bugs: ---------------- - -Q: How do I report bugs for BPF kernel code? - -A: Since all BPF kernel development as well as bpftool and iproute2 BPF - loader development happens through the netdev kernel mailing list, - please report any found issues around BPF to the following mailing - list: - - netdev@vger.kernel.org - - This may also include issues related to XDP, BPF tracing, etc. - - Given netdev has a high volume of traffic, please also add the BPF - maintainers to Cc (from kernel MAINTAINERS file): - - Alexei Starovoitov <ast@kernel.org> - Daniel Borkmann <daniel@iogearbox.net> - - In case a buggy commit has already been identified, make sure to keep - the actual commit authors in Cc as well for the report. They can - typically be identified through the kernel's git tree. - - Please do *not* report BPF issues to bugzilla.kernel.org since it - is a guarantee that the reported issue will be overlooked. - -Submitting patches: -------------------- - -Q: To which mailing list do I need to submit my BPF patches? - -A: Please submit your BPF patches to the netdev kernel mailing list: - - netdev@vger.kernel.org - - Historically, BPF came out of networking and has always been maintained - by the kernel networking community. Although these days BPF touches - many other subsystems as well, the patches are still routed mainly - through the networking community. - - In case your patch has changes in various different subsystems (e.g. - tracing, security, etc), make sure to Cc the related kernel mailing - lists and maintainers from there as well, so they are able to review - the changes and provide their Acked-by's to the patches. - -Q: Where can I find patches currently under discussion for BPF subsystem? - -A: All patches that are Cc'ed to netdev are queued for review under netdev - patchwork project: - - http://patchwork.ozlabs.org/project/netdev/list/ - - Those patches which target BPF, are assigned to a 'bpf' delegate for - further processing from BPF maintainers. The current queue with - patches under review can be found at: - - https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 - - Once the patches have been reviewed by the BPF community as a whole - and approved by the BPF maintainers, their status in patchwork will be - changed to 'Accepted' and the submitter will be notified by mail. This - means that the patches look good from a BPF perspective and have been - applied to one of the two BPF kernel trees. - - In case feedback from the community requires a respin of the patches, - their status in patchwork will be set to 'Changes Requested', and purged - from the current review queue. Likewise for cases where patches would - get rejected or are not applicable to the BPF trees (but assigned to - the 'bpf' delegate). - -Q: How do the changes make their way into Linux? - -A: There are two BPF kernel trees (git repositories). Once patches have - been accepted by the BPF maintainers, they will be applied to one - of the two BPF trees: - - https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/ - https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ - - The bpf tree itself is for fixes only, whereas bpf-next for features, - cleanups or other kind of improvements ("next-like" content). This is - analogous to net and net-next trees for networking. Both bpf and - bpf-next will only have a master branch in order to simplify against - which branch patches should get rebased to. - - Accumulated BPF patches in the bpf tree will regularly get pulled - into the net kernel tree. Likewise, accumulated BPF patches accepted - into the bpf-next tree will make their way into net-next tree. net and - net-next are both run by David S. Miller. From there, they will go - into the kernel mainline tree run by Linus Torvalds. To read up on the - process of net and net-next being merged into the mainline tree, see - the netdev FAQ under: - - Documentation/networking/netdev-FAQ.txt - - Occasionally, to prevent merge conflicts, we might send pull requests - to other trees (e.g. tracing) with a small subset of the patches, but - net and net-next are always the main trees targeted for integration. - - The pull requests will contain a high-level summary of the accumulated - patches and can be searched on netdev kernel mailing list through the - following subject lines (yyyy-mm-dd is the date of the pull request): - - pull-request: bpf yyyy-mm-dd - pull-request: bpf-next yyyy-mm-dd - -Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be - applied to? - -A: The process is the very same as described in the netdev FAQ, so - please read up on it. The subject line must indicate whether the - patch is a fix or rather "next-like" content in order to let the - maintainers know whether it is targeted at bpf or bpf-next. - - For fixes eventually landing in bpf -> net tree, the subject must - look like: - - git format-patch --subject-prefix='PATCH bpf' start..finish - - For features/improvements/etc that should eventually land in - bpf-next -> net-next, the subject must look like: - - git format-patch --subject-prefix='PATCH bpf-next' start..finish - - If unsure whether the patch or patch series should go into bpf - or net directly, or bpf-next or net-next directly, it is not a - problem either if the subject line says net or net-next as target. - It is eventually up to the maintainers to do the delegation of - the patches. - - If it is clear that patches should go into bpf or bpf-next tree, - please make sure to rebase the patches against those trees in - order to reduce potential conflicts. - - In case the patch or patch series has to be reworked and sent out - again in a second or later revision, it is also required to add a - version number (v2, v3, ...) into the subject prefix: - - git format-patch --subject-prefix='PATCH net-next v2' start..finish - - When changes have been requested to the patch series, always send the - whole patch series again with the feedback incorporated (never send - individual diffs on top of the old series). - -Q: What does it mean when a patch gets applied to bpf or bpf-next tree? - -A: It means that the patch looks good for mainline inclusion from - a BPF point of view. - - Be aware that this is not a final verdict that the patch will - automatically get accepted into net or net-next trees eventually: - - On the netdev kernel mailing list reviews can come in at any point - in time. If discussions around a patch conclude that they cannot - get included as-is, we will either apply a follow-up fix or drop - them from the trees entirely. Therefore, we also reserve to rebase - the trees when deemed necessary. After all, the purpose of the tree - is to i) accumulate and stage BPF patches for integration into trees - like net and net-next, and ii) run extensive BPF test suite and - workloads on the patches before they make their way any further. - - Once the BPF pull request was accepted by David S. Miller, then - the patches end up in net or net-next tree, respectively, and - make their way from there further into mainline. Again, see the - netdev FAQ for additional information e.g. on how often they are - merged to mainline. - -Q: How long do I need to wait for feedback on my BPF patches? - -A: We try to keep the latency low. The usual time to feedback will - be around 2 or 3 business days. It may vary depending on the - complexity of changes and current patch load. - -Q: How often do you send pull requests to major kernel trees like - net or net-next? - -A: Pull requests will be sent out rather often in order to not - accumulate too many patches in bpf or bpf-next. - - As a rule of thumb, expect pull requests for each tree regularly - at the end of the week. In some cases pull requests could additionally - come also in the middle of the week depending on the current patch - load or urgency. - -Q: Are patches applied to bpf-next when the merge window is open? - -A: For the time when the merge window is open, bpf-next will not be - processed. This is roughly analogous to net-next patch processing, - so feel free to read up on the netdev FAQ about further details. - - During those two weeks of merge window, we might ask you to resend - your patch series once bpf-next is open again. Once Linus released - a v*-rc1 after the merge window, we continue processing of bpf-next. - - For non-subscribers to kernel mailing lists, there is also a status - page run by David S. Miller on net-next that provides guidance: - - http://vger.kernel.org/~davem/net-next.html - -Q: I made a BPF verifier change, do I need to add test cases for - BPF kernel selftests? - -A: If the patch has changes to the behavior of the verifier, then yes, - it is absolutely necessary to add test cases to the BPF kernel - selftests suite. If they are not present and we think they are - needed, then we might ask for them before accepting any changes. - - In particular, test_verifier.c is tracking a high number of BPF test - cases, including a lot of corner cases that LLVM BPF back end may - generate out of the restricted C code. Thus, adding test cases is - absolutely crucial to make sure future changes do not accidentally - affect prior use-cases. Thus, treat those test cases as: verifier - behavior that is not tracked in test_verifier.c could potentially - be subject to change. - -Q: When should I add code to samples/bpf/ and when to BPF kernel - selftests? - -A: In general, we prefer additions to BPF kernel selftests rather than - samples/bpf/. The rationale is very simple: kernel selftests are - regularly run by various bots to test for kernel regressions. - - The more test cases we add to BPF selftests, the better the coverage - and the less likely it is that those could accidentally break. It is - not that BPF kernel selftests cannot demo how a specific feature can - be used. - - That said, samples/bpf/ may be a good place for people to get started, - so it might be advisable that simple demos of features could go into - samples/bpf/, but advanced functional and corner-case testing rather - into kernel selftests. - - If your sample looks like a test case, then go for BPF kernel selftests - instead! - -Q: When should I add code to the bpftool? - -A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide - a central user space tool for debugging and introspection of BPF programs - and maps that are active in the kernel. If UAPI changes related to BPF - enable for dumping additional information of programs or maps, then - bpftool should be extended as well to support dumping them. - -Q: When should I add code to iproute2's BPF loader? - -A: For UAPI changes related to the XDP or tc layer (e.g. cls_bpf), the - convention is that those control-path related changes are added to - iproute2's BPF loader as well from user space side. This is not only - useful to have UAPI changes properly designed to be usable, but also - to make those changes available to a wider user base of major - downstream distributions. - -Q: Do you accept patches as well for iproute2's BPF loader? - -A: Patches for the iproute2's BPF loader have to be sent to: - - netdev@vger.kernel.org - - While those patches are not processed by the BPF kernel maintainers, - please keep them in Cc as well, so they can be reviewed. - - The official git repository for iproute2 is run by Stephen Hemminger - and can be found at: - - https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/ - - The patches need to have a subject prefix of '[PATCH iproute2 master]' - or '[PATCH iproute2 net-next]'. 'master' or 'net-next' describes the - target branch where the patch should be applied to. Meaning, if kernel - changes went into the net-next kernel tree, then the related iproute2 - changes need to go into the iproute2 net-next branch, otherwise they - can be targeted at master branch. The iproute2 net-next branch will get - merged into the master branch after the current iproute2 version from - master has been released. - - Like BPF, the patches end up in patchwork under the netdev project and - are delegated to 'shemminger' for further processing: - - http://patchwork.ozlabs.org/project/netdev/list/?delegate=389 - -Q: What is the minimum requirement before I submit my BPF patches? - -A: When submitting patches, always take the time and properly test your - patches *prior* to submission. Never rush them! If maintainers find - that your patches have not been properly tested, it is a good way to - get them grumpy. Testing patch submissions is a hard requirement! - - Note, fixes that go to bpf tree *must* have a Fixes: tag included. The - same applies to fixes that target bpf-next, where the affected commit - is in net-next (or in some cases bpf-next). The Fixes: tag is crucial - in order to identify follow-up commits and tremendously helps for people - having to do backporting, so it is a must have! - - We also don't accept patches with an empty commit message. Take your - time and properly write up a high quality commit message, it is - essential! - - Think about it this way: other developers looking at your code a month - from now need to understand *why* a certain change has been done that - way, and whether there have been flaws in the analysis or assumptions - that the original author did. Thus providing a proper rationale and - describing the use-case for the changes is a must. - - Patch submissions with >1 patch must have a cover letter which includes - a high level description of the series. This high level summary will - then be placed into the merge commit by the BPF maintainers such that - it is also accessible from the git log for future reference. - -Q: What do I need to consider when adding a new instruction or feature - that would require BPF JIT and/or LLVM integration as well? - -A: We try hard to keep all BPF JITs up to date such that the same user - experience can be guaranteed when running BPF programs on different - architectures without having the program punt to the less efficient - interpreter in case the in-kernel BPF JIT is enabled. - - If you are unable to implement or test the required JIT changes for - certain architectures, please work together with the related BPF JIT - developers in order to get the feature implemented in a timely manner. - Please refer to the git log (arch/*/net/) to locate the necessary - people for helping out. - - Also always make sure to add BPF test cases (e.g. test_bpf.c and - test_verifier.c) for new instructions, so that they can receive - broad test coverage and help run-time testing the various BPF JITs. - - In case of new BPF instructions, once the changes have been accepted - into the Linux kernel, please implement support into LLVM's BPF back - end. See LLVM section below for further information. - -Stable submission: ------------------- - -Q: I need a specific BPF commit in stable kernels. What should I do? - -A: In case you need a specific fix in stable kernels, first check whether - the commit has already been applied in the related linux-*.y branches: - - https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/ - - If not the case, then drop an email to the BPF maintainers with the - netdev kernel mailing list in Cc and ask for the fix to be queued up: - - netdev@vger.kernel.org - - The process in general is the same as on netdev itself, see also the - netdev FAQ document. - -Q: Do you also backport to kernels not currently maintained as stable? - -A: No. If you need a specific BPF commit in kernels that are currently not - maintained by the stable maintainers, then you are on your own. - - The current stable and longterm stable kernels are all listed here: - - https://www.kernel.org/ - -Q: The BPF patch I am about to submit needs to go to stable as well. What - should I do? - -A: The same rules apply as with netdev patch submissions in general, see - netdev FAQ under: - - Documentation/networking/netdev-FAQ.txt - - Never add "Cc: stable@vger.kernel.org" to the patch description, but - ask the BPF maintainers to queue the patches instead. This can be done - with a note, for example, under the "---" part of the patch which does - not go into the git log. Alternatively, this can be done as a simple - request by mail instead. - -Q: Where do I find currently queued BPF patches that will be submitted - to stable? - -A: Once patches that fix critical bugs got applied into the bpf tree, they - are queued up for stable submission under: - - http://patchwork.ozlabs.org/bundle/bpf/stable/?state=* - - They will be on hold there at minimum until the related commit made its - way into the mainline kernel tree. - - After having been under broader exposure, the queued patches will be - submitted by the BPF maintainers to the stable maintainers. - -Testing patches: ----------------- - -Q: Which BPF kernel selftests version should I run my kernel against? - -A: If you run a kernel xyz, then always run the BPF kernel selftests from - that kernel xyz as well. Do not expect that the BPF selftest from the - latest mainline tree will pass all the time. - - In particular, test_bpf.c and test_verifier.c have a large number of - test cases and are constantly updated with new BPF test sequences, or - existing ones are adapted to verifier changes e.g. due to verifier - becoming smarter and being able to better track certain things. - -LLVM: ------ - -Q: Where do I find LLVM with BPF support? - -A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1. - - All major distributions these days ship LLVM with BPF back end enabled, - so for the majority of use-cases it is not required to compile LLVM by - hand anymore, just install the distribution provided package. - - LLVM's static compiler lists the supported targets through 'llc --version', - make sure BPF targets are listed. Example: - - $ llc --version - LLVM (http://llvm.org/): - LLVM version 6.0.0svn - Optimized build. - Default target: x86_64-unknown-linux-gnu - Host CPU: skylake - - Registered Targets: - bpf - BPF (host endian) - bpfeb - BPF (big endian) - bpfel - BPF (little endian) - x86 - 32-bit X86: Pentium-Pro and above - x86-64 - 64-bit X86: EM64T and AMD64 - - For developers in order to utilize the latest features added to LLVM's - BPF back end, it is advisable to run the latest LLVM releases. Support - for new BPF kernel features such as additions to the BPF instruction - set are often developed together. - - All LLVM releases can be found at: http://releases.llvm.org/ - -Q: Got it, so how do I build LLVM manually anyway? - -A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have - that set up, proceed with building the latest LLVM and clang version - from the git repositories: - - $ git clone http://llvm.org/git/llvm.git - $ cd llvm/tools - $ git clone --depth 1 http://llvm.org/git/clang.git - $ cd ..; mkdir build; cd build - $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \ - -DBUILD_SHARED_LIBS=OFF \ - -DCMAKE_BUILD_TYPE=Release \ - -DLLVM_BUILD_RUNTIME=OFF - $ make -j $(getconf _NPROCESSORS_ONLN) - - The built binaries can then be found in the build/bin/ directory, where - you can point the PATH variable to. - -Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code - generation back end or about LLVM generated code that the verifier - refuses to accept? - -A: Yes, please do! LLVM's BPF back end is a key piece of the whole BPF - infrastructure and it ties deeply into verification of programs from the - kernel side. Therefore, any issues on either side need to be investigated - and fixed whenever necessary. - - Therefore, please make sure to bring them up at netdev kernel mailing - list and Cc BPF maintainers for LLVM and kernel bits: - - Yonghong Song <yhs@fb.com> - Alexei Starovoitov <ast@kernel.org> - Daniel Borkmann <daniel@iogearbox.net> - - LLVM also has an issue tracker where BPF related bugs can be found: - - https://bugs.llvm.org/buglist.cgi?quicksearch=bpf - - However, it is better to reach out through mailing lists with having - maintainers in Cc. - -Q: I have added a new BPF instruction to the kernel, how can I integrate - it into LLVM? - -A: LLVM has a -mcpu selector for the BPF back end in order to allow the - selection of BPF instruction set extensions. By default the 'generic' - processor target is used, which is the base instruction set (v1) of BPF. - - LLVM has an option to select -mcpu=probe where it will probe the host - kernel for supported BPF instruction set extensions and selects the - optimal set automatically. - - For cross-compilation, a specific version can be select manually as well. - - $ llc -march bpf -mcpu=help - Available CPUs for this target: - - generic - Select the generic processor. - probe - Select the probe processor. - v1 - Select the v1 processor. - v2 - Select the v2 processor. - [...] - - Newly added BPF instructions to the Linux kernel need to follow the same - scheme, bump the instruction set version and implement probing for the - extensions such that -mcpu=probe users can benefit from the optimization - transparently when upgrading their kernels. - - If you are unable to implement support for the newly added BPF instruction - please reach out to BPF developers for help. - - By the way, the BPF kernel selftests run with -mcpu=probe for better - test coverage. - -Q: In some cases clang flag "-target bpf" is used but in other cases the - default clang target, which matches the underlying architecture, is used. - What is the difference and when I should use which? - -A: Although LLVM IR generation and optimization try to stay architecture - independent, "-target <arch>" still has some impact on generated code: - - - BPF program may recursively include header file(s) with file scope - inline assembly codes. The default target can handle this well, - while bpf target may fail if bpf backend assembler does not - understand these assembly codes, which is true in most cases. - - - When compiled without -g, additional elf sections, e.g., - .eh_frame and .rela.eh_frame, may be present in the object file - with default target, but not with bpf target. - - - The default target may turn a C switch statement into a switch table - lookup and jump operation. Since the switch table is placed - in the global readonly section, the bpf program will fail to load. - The bpf target does not support switch table optimization. - The clang option "-fno-jump-tables" can be used to disable - switch table generation. - - - For clang -target bpf, it is guaranteed that pointer or long / - unsigned long types will always have a width of 64 bit, no matter - whether underlying clang binary or default target (or kernel) is - 32 bit. However, when native clang target is used, then it will - compile these types based on the underlying architecture's conventions, - meaning in case of 32 bit architecture, pointer or long / unsigned - long types e.g. in BPF context structure will have width of 32 bit - while the BPF LLVM back end still operates in 64 bit. The native - target is mostly needed in tracing for the case of walking pt_regs - or other kernel structures where CPU's register width matters. - Otherwise, clang -target bpf is generally recommended. - - You should use default target when: - - - Your program includes a header file, e.g., ptrace.h, which eventually - pulls in some header files containing file scope host assembly codes. - - You can add "-fno-jump-tables" to work around the switch table issue. - - Otherwise, you can use bpf target. - -Happy BPF hacking! diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst index fce929144ccd..2e7165f86f55 100644 --- a/Documentation/core-api/atomic_ops.rst +++ b/Documentation/core-api/atomic_ops.rst @@ -111,7 +111,6 @@ If the compiler can prove that do_something() does not store to the variable a, then the compiler is within its rights transforming this to the following:: - tmp = a; if (a > 0) for (;;) do_something(); @@ -119,7 +118,7 @@ the following:: If you don't want the compiler to do this (and you probably don't), then you should use something like the following:: - while (READ_ONCE(a) < 0) + while (READ_ONCE(a) > 0) do_something(); Alternatively, you could place a barrier() call in the loop. @@ -467,10 +466,12 @@ Like the above, except that these routines return a boolean which indicates whether the changed bit was set _BEFORE_ the atomic bit operation. -WARNING! It is incredibly important that the value be a boolean, -ie. "0" or "1". Do not try to be fancy and save a few instructions by -declaring the above to return "long" and just returning something like -"old_val & mask" because that will not work. + +.. warning:: + It is incredibly important that the value be a boolean, ie. "0" or "1". + Do not try to be fancy and save a few instructions by declaring the + above to return "long" and just returning something like "old_val & + mask" because that will not work. For one thing, this return value gets truncated to int in many code paths using these interfaces, so on 64-bit if the bit is set in the diff --git a/Documentation/cachetlb.txt b/Documentation/core-api/cachetlb.rst index 6eb9d3f090cd..6eb9d3f090cd 100644 --- a/Documentation/cachetlb.txt +++ b/Documentation/core-api/cachetlb.rst diff --git a/Documentation/circular-buffers.txt b/Documentation/core-api/circular-buffers.rst index 53e51caa3347..53e51caa3347 100644 --- a/Documentation/circular-buffers.txt +++ b/Documentation/core-api/circular-buffers.rst diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst new file mode 100644 index 000000000000..e0df8f416582 --- /dev/null +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst @@ -0,0 +1,66 @@ +================================= +GFP masks used from FS/IO context +================================= + +:Date: May, 2018 +:Author: Michal Hocko <mhocko@kernel.org> + +Introduction +============ + +Code paths in the filesystem and IO stacks must be careful when +allocating memory to prevent recursion deadlocks caused by direct +memory reclaim calling back into the FS or IO paths and blocking on +already held resources (e.g. locks - most commonly those used for the +transaction context). + +The traditional way to avoid this deadlock problem is to clear __GFP_FS +respectively __GFP_IO (note the latter implies clearing the first as well) in +the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be +used as shortcut. It turned out though that above approach has led to +abuses when the restricted gfp mask is used "just in case" without a +deeper consideration which leads to problems because an excessive use +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory +reclaim issues. + +New API +======== + +Since 4.12 we do have a generic scope API for both NOFS and NOIO context +``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``, +``memalloc_noio_restore`` which allow to mark a scope to be a critical +section from a filesystem or I/O point of view. Any allocation from that +scope will inherently drop __GFP_FS respectively __GFP_IO from the given +mask so no memory allocation can recurse back in the FS/IO. + +.. kernel-doc:: include/linux/sched/mm.h + :functions: memalloc_nofs_save memalloc_nofs_restore +.. kernel-doc:: include/linux/sched/mm.h + :functions: memalloc_noio_save memalloc_noio_restore + +FS/IO code then simply calls the appropriate save function before +any critical section with respect to the reclaim is started - e.g. +lock shared with the reclaim context or when a transaction context +nesting would be possible via reclaim. The restore function should be +called when the critical section ends. All that ideally along with an +explanation what is the reclaim context for easier maintenance. + +Please note that the proper pairing of save/restore functions +allows nesting so it is safe to call ``memalloc_noio_save`` or +``memalloc_noio_restore`` respectively from an existing NOIO or NOFS +scope. + +What about __vmalloc(GFP_NOFS) +============================== + +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is +almost always a bug. The good news is that the NOFS/NOIO semantic can be +achieved by the scope API. + +In the ideal world, upper layers should already mark dangerous contexts +and so no special care is required and vmalloc should be called without +any problems. Sometimes if the context is not really clear or there are +layering violations then the recommended way around that is to wrap ``vmalloc`` +by the scope API with a comment explaining the problem. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index c670a8031786..f5a66b72f984 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -14,6 +14,7 @@ Core utilities kernel-api assoc_array atomic_ops + cachetlb refcount-vs-atomic cpu_hotplug idr @@ -25,6 +26,8 @@ Core utilities genalloc errseq printk-formats + circular-buffers + gfp_mask-from-fs-io Interfaces for kernel debugging =============================== diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index ff335f8aeb39..8e44aea366c2 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -39,17 +39,17 @@ String Manipulation .. kernel-doc:: lib/string.c :export: +Basic Kernel Library Functions +============================== + +The Linux kernel provides more basic utility functions. + Bit Operations -------------- .. kernel-doc:: arch/x86/include/asm/bitops.h :internal: -Basic Kernel Library Functions -============================== - -The Linux kernel provides more basic utility functions. - Bitmap Operations ----------------- @@ -80,6 +80,31 @@ Command-line Parsing .. kernel-doc:: lib/cmdline.c :export: +Sorting +------- + +.. kernel-doc:: lib/sort.c + :export: + +.. kernel-doc:: lib/list_sort.c + :export: + +Text Searching +-------------- + +.. kernel-doc:: lib/textsearch.c + :doc: ts_intro + +.. kernel-doc:: lib/textsearch.c + :export: + +.. kernel-doc:: include/linux/textsearch.h + :functions: textsearch_find textsearch_next \ + textsearch_get_pattern textsearch_get_pattern_len + +CRC and Math Functions in Linux +=============================== + CRC Functions ------------- @@ -103,9 +128,6 @@ CRC Functions .. kernel-doc:: lib/crc-itu-t.c :export: -Math Functions in Linux -======================= - Base 2 log and power Functions ------------------------------ @@ -127,15 +149,6 @@ Division Functions .. kernel-doc:: lib/gcd.c :export: -Sorting -------- - -.. kernel-doc:: lib/sort.c - :export: - -.. kernel-doc:: lib/list_sort.c - :export: - UUID/GUID --------- diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index eb30efdd2e78..25dc591cb110 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -419,11 +419,10 @@ struct clk %pC pll1 %pCn pll1 - %pCr 1560000000 For printing struct clk structures. %pC and %pCn print the name (Common Clock Framework) or address (legacy clock framework) of the -structure; %pCr prints the current clock rate. +structure. Passed by reference. diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst index 83351c258cdb..322851bada16 100644 --- a/Documentation/core-api/refcount-vs-atomic.rst +++ b/Documentation/core-api/refcount-vs-atomic.rst @@ -17,7 +17,7 @@ in order to help maintainers validate their code against the change in these memory ordering guarantees. The terms used through this document try to follow the formal LKMM defined in -github.com/aparri/memory-model/blob/master/Documentation/explanation.txt +tools/memory-model/Documentation/explanation.txt. memory-barriers.txt and atomic_t.txt provide more background to the memory ordering in general and for atomic operations specifically. diff --git a/Documentation/crypto/index.rst b/Documentation/crypto/index.rst index 94c4786f2573..c4ff5d791233 100644 --- a/Documentation/crypto/index.rst +++ b/Documentation/crypto/index.rst @@ -20,5 +20,6 @@ for cryptographic use cases, as well as programming examples. architecture devel-algos userspace-if + crypto_engine api api-samples diff --git a/Documentation/dell_rbu.txt b/Documentation/dell_rbu.txt index 0fdb6aa2704c..5d1ce7bcd04d 100644 --- a/Documentation/dell_rbu.txt +++ b/Documentation/dell_rbu.txt @@ -121,10 +121,7 @@ read back the image downloaded. .. note:: - This driver requires a patch for firmware_class.c which has the modified - request_firmware_nowait function. - - Also after updating the BIOS image a user mode application needs to execute + After updating the BIOS image a user mode application needs to execute code which sends the BIOS update request to the BIOS. So on the next reboot the BIOS knows about the new image downloaded and it updates itself. Also don't unload the rbu driver if the image has to be updated. diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index f7a18f274357..aabc8738b3d8 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -120,7 +120,7 @@ A typical out of bounds access report looks like this:: The header of the report discribe what kind of bug happened and what kind of access caused it. It's followed by the description of the accessed slub object -(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and +(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and the description of the accessed memory page. In the last section the report shows memory state around the accessed address. diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst index e80850eefe13..3bf371a938d0 100644 --- a/Documentation/dev-tools/kselftest.rst +++ b/Documentation/dev-tools/kselftest.rst @@ -151,6 +151,11 @@ Contributing new tests (details) TEST_FILES, TEST_GEN_FILES mean it is the file which is used by test. + * First use the headers inside the kernel source and/or git repo, and then the + system headers. Headers for the kernel release as opposed to headers + installed by the distro on the system should be the primary focus to be able + to find regressions. + Test Harness ============ diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt index 4bcd4b7f79f9..3d01948ea061 100644 --- a/Documentation/device-mapper/thin-provisioning.txt +++ b/Documentation/device-mapper/thin-provisioning.txt @@ -264,7 +264,10 @@ i) Constructor data device, but just remove the mapping. read_only: Don't allow any changes to be made to the pool - metadata. + metadata. This mode is only available after the + thin-pool has been created and first used in full + read/write mode. It cannot be specified on initial + thin-pool creation. error_if_no_space: Error IOs, instead of queueing, if no space. diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,g3dsys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,g3dsys.txt new file mode 100644 index 000000000000..7de43bf41fdc --- /dev/null +++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,g3dsys.txt @@ -0,0 +1,30 @@ +MediaTek g3dsys controller +============================ + +The MediaTek g3dsys controller provides various clocks and reset controller to +the GPU. + +Required Properties: + +- compatible: Should be: + - "mediatek,mt2701-g3dsys", "syscon": + for MT2701 SoC + - "mediatek,mt7623-g3dsys", "mediatek,mt2701-g3dsys", "syscon": + for MT7623 SoC +- #clock-cells: Must be 1 +- #reset-cells: Must be 1 + +The g3dsys controller uses the common clk binding from +Documentation/devicetree/bindings/clock/clock-bindings.txt +The available clocks are defined in dt-bindings/clock/mt*-clk.h. + +Example: + +g3dsys: clock-controller@13000000 { + compatible = "mediatek,mt7623-g3dsys", + "mediatek,mt2701-g3dsys", + "syscon"; + reg = <0 0x13000000 0 0x200>; + #clock-cells = <1>; + #reset-cells = <1>; +}; diff --git a/Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt b/Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt new file mode 100644 index 000000000000..99980aee26e5 --- /dev/null +++ b/Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt @@ -0,0 +1,14 @@ +STMicroelectronics STM32 Platforms System Controller + +Properties: + - compatible : should contain two values. First value must be : + - " st,stm32mp157-syscfg " - for stm32mp157 based SoCs, + second value must be always "syscon". + - reg : offset and length of the register set. + + Example: + syscfg: syscon@50020000 { + compatible = "st,stm32mp157-syscfg", "syscon"; + reg = <0x50020000 0x400>; + }; + diff --git a/Documentation/devicetree/bindings/arm/stm32.txt b/Documentation/devicetree/bindings/arm/stm32/stm32.txt index 6808ed9ddfd5..6808ed9ddfd5 100644 --- a/Documentation/devicetree/bindings/arm/stm32.txt +++ b/Documentation/devicetree/bindings/arm/stm32/stm32.txt diff --git a/Documentation/devicetree/bindings/arm/ux500/boards.txt b/Documentation/devicetree/bindings/arm/ux500/boards.txt index 7334c24625fc..0fa429534f49 100644 --- a/Documentation/devicetree/bindings/arm/ux500/boards.txt +++ b/Documentation/devicetree/bindings/arm/ux500/boards.txt @@ -26,7 +26,7 @@ interrupt-controller: see binding for interrupt-controller/arm,gic.txt timer: - see binding for arm/twd.txt + see binding for timer/arm,twd.txt clocks: see binding for clocks/ux500.txt diff --git a/Documentation/devicetree/bindings/ata/ahci-platform.txt b/Documentation/devicetree/bindings/ata/ahci-platform.txt index f4006d3c9fdf..c760ecb81381 100644 --- a/Documentation/devicetree/bindings/ata/ahci-platform.txt +++ b/Documentation/devicetree/bindings/ata/ahci-platform.txt @@ -30,7 +30,6 @@ compatible: Optional properties: - dma-coherent : Present if dma operations are coherent - clocks : a list of phandle + clock specifier pairs -- resets : a list of phandle + reset specifier pairs - target-supply : regulator for SATA target power - phys : reference to the SATA PHY node - phy-names : must be "sata-phy" diff --git a/Documentation/devicetree/bindings/clock/actions,s900-cmu.txt b/Documentation/devicetree/bindings/clock/actions,s900-cmu.txt new file mode 100644 index 000000000000..93e4fb827cd6 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/actions,s900-cmu.txt @@ -0,0 +1,47 @@ +* Actions S900 Clock Management Unit (CMU) + +The Actions S900 clock management unit generates and supplies clock to various +controllers within the SoC. The clock binding described here is applicable to +S900 SoC. + +Required Properties: + +- compatible: should be "actions,s900-cmu" +- reg: physical base address of the controller and length of memory mapped + region. +- clocks: Reference to the parent clocks ("hosc", "losc") +- #clock-cells: should be 1. + +Each clock is assigned an identifier, and client nodes can use this identifier +to specify the clock which they consume. + +All available clocks are defined as preprocessor macros in +dt-bindings/clock/actions,s900-cmu.h header and can be used in device +tree sources. + +External clocks: + +The hosc clock used as input for the plls is generated outside the SoC. It is +expected that it is defined using standard clock bindings as "hosc". + +Actions S900 CMU also requires one more clock: + - "losc" - internal low frequency oscillator + +Example: Clock Management Unit node: + + cmu: clock-controller@e0160000 { + compatible = "actions,s900-cmu"; + reg = <0x0 0xe0160000 0x0 0x1000>; + clocks = <&hosc>, <&losc>; + #clock-cells = <1>; + }; + +Example: UART controller node that consumes clock generated by the clock +management unit: + + uart: serial@e012a000 { + compatible = "actions,s900-uart", "actions,owl-uart"; + reg = <0x0 0xe012a000 0x0 0x2000>; + interrupts = <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cmu CLK_UART5>; + }; diff --git a/Documentation/devicetree/bindings/clock/amlogic,gxbb-aoclkc.txt b/Documentation/devicetree/bindings/clock/amlogic,gxbb-aoclkc.txt index 786dc39ca904..3a880528030e 100644 --- a/Documentation/devicetree/bindings/clock/amlogic,gxbb-aoclkc.txt +++ b/Documentation/devicetree/bindings/clock/amlogic,gxbb-aoclkc.txt @@ -9,6 +9,7 @@ Required Properties: - GXBB (S905) : "amlogic,meson-gxbb-aoclkc" - GXL (S905X, S905D) : "amlogic,meson-gxl-aoclkc" - GXM (S912) : "amlogic,meson-gxm-aoclkc" + - AXG (A113D, A113X) : "amlogic,meson-axg-aoclkc" followed by the common "amlogic,meson-gx-aoclkc" - #clock-cells: should be 1. diff --git a/Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt b/Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt index f8e4a93466cb..ab730ea0a560 100644 --- a/Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt +++ b/Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt @@ -276,36 +276,38 @@ These clock IDs are defined in: clk_ts_500_ref genpll2 2 BCM_SR_GENPLL2_TS_500_REF_CLK clk_125_nitro genpll2 3 BCM_SR_GENPLL2_125_NITRO_CLK clk_chimp genpll2 4 BCM_SR_GENPLL2_CHIMP_CLK - clk_nic_flash genpll2 5 BCM_SR_GENPLL2_NIC_FLASH + clk_nic_flash genpll2 5 BCM_SR_GENPLL2_NIC_FLASH_CLK + clk_fs genpll2 6 BCM_SR_GENPLL2_FS_CLK genpll3 crystal 0 BCM_SR_GENPLL3 clk_hsls genpll3 1 BCM_SR_GENPLL3_HSLS_CLK clk_sdio genpll3 2 BCM_SR_GENPLL3_SDIO_CLK genpll4 crystal 0 BCM_SR_GENPLL4 - ccn genpll4 1 BCM_SR_GENPLL4_CCN_CLK + clk_ccn genpll4 1 BCM_SR_GENPLL4_CCN_CLK clk_tpiu_pll genpll4 2 BCM_SR_GENPLL4_TPIU_PLL_CLK - noc_clk genpll4 3 BCM_SR_GENPLL4_NOC_CLK + clk_noc genpll4 3 BCM_SR_GENPLL4_NOC_CLK clk_chclk_fs4 genpll4 4 BCM_SR_GENPLL4_CHCLK_FS4_CLK clk_bridge_fscpu genpll4 5 BCM_SR_GENPLL4_BRIDGE_FSCPU_CLK - genpll5 crystal 0 BCM_SR_GENPLL5 - fs4_hf_clk genpll5 1 BCM_SR_GENPLL5_FS4_HF_CLK - crypto_ae_clk genpll5 2 BCM_SR_GENPLL5_CRYPTO_AE_CLK - raid_ae_clk genpll5 3 BCM_SR_GENPLL5_RAID_AE_CLK + clk_fs4_hf genpll5 1 BCM_SR_GENPLL5_FS4_HF_CLK + clk_crypto_ae genpll5 2 BCM_SR_GENPLL5_CRYPTO_AE_CLK + clk_raid_ae genpll5 3 BCM_SR_GENPLL5_RAID_AE_CLK genpll6 crystal 0 BCM_SR_GENPLL6 - 48_usb genpll6 1 BCM_SR_GENPLL6_48_USB_CLK + clk_48_usb genpll6 1 BCM_SR_GENPLL6_48_USB_CLK lcpll0 crystal 0 BCM_SR_LCPLL0 clk_sata_refp lcpll0 1 BCM_SR_LCPLL0_SATA_REFP_CLK clk_sata_refn lcpll0 2 BCM_SR_LCPLL0_SATA_REFN_CLK - clk_usb_ref lcpll0 3 BCM_SR_LCPLL0_USB_REF_CLK - sata_refpn lcpll0 3 BCM_SR_LCPLL0_SATA_REFPN_CLK + clk_sata_350 lcpll0 3 BCM_SR_LCPLL0_SATA_350_CLK + clk_sata_500 lcpll0 4 BCM_SR_LCPLL0_SATA_500_CLK lcpll1 crystal 0 BCM_SR_LCPLL1 - wan lcpll1 1 BCM_SR_LCPLL0_WAN_CLK + clk_wan lcpll1 1 BCM_SR_LCPLL1_WAN_CLK + clk_usb_ref lcpll1 2 BCM_SR_LCPLL1_USB_REF_CLK + clk_crmu_ts lcpll1 3 BCM_SR_LCPLL1_CRMU_TS_CLK lcpll_pcie crystal 0 BCM_SR_LCPLL_PCIE - pcie_phy_ref lcpll1 1 BCM_SR_LCPLL_PCIE_PHY_REF_CLK + clk_pcie_phy_ref lcpll1 1 BCM_SR_LCPLL_PCIE_PHY_REF_CLK diff --git a/Documentation/devicetree/bindings/clock/nuvoton,npcm750-clk.txt b/Documentation/devicetree/bindings/clock/nuvoton,npcm750-clk.txt new file mode 100644 index 000000000000..f82064546d11 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/nuvoton,npcm750-clk.txt @@ -0,0 +1,100 @@ +* Nuvoton NPCM7XX Clock Controller + +Nuvoton Poleg BMC NPCM7XX contains an integrated clock controller, which +generates and supplies clocks to all modules within the BMC. + +External clocks: + +There are six fixed clocks that are generated outside the BMC. All clocks are of +a known fixed value that cannot be changed. clk_refclk, clk_mcbypck and +clk_sysbypck are inputs to the clock controller. +clk_rg1refck, clk_rg2refck and clk_xin are external clocks suppling the +network. They are set on the device tree, but not used by the clock module. The +network devices use them directly. +Example can be found below. + +All available clocks are defined as preprocessor macros in: +dt-bindings/clock/nuvoton,npcm7xx-clock.h +and can be reused as DT sources. + +Required Properties of clock controller: + + - compatible: "nuvoton,npcm750-clk" : for clock controller of Nuvoton + Poleg BMC NPCM750 + + - reg: physical base address of the clock controller and length of + memory mapped region. + + - #clock-cells: should be 1. + +Example: Clock controller node: + + clk: clock-controller@f0801000 { + compatible = "nuvoton,npcm750-clk"; + #clock-cells = <1>; + reg = <0xf0801000 0x1000>; + clock-names = "refclk", "sysbypck", "mcbypck"; + clocks = <&clk_refclk>, <&clk_sysbypck>, <&clk_mcbypck>; + }; + +Example: Required external clocks for network: + + /* external reference clock */ + clk_refclk: clk-refclk { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <25000000>; + clock-output-names = "refclk"; + }; + + /* external reference clock for cpu. float in normal operation */ + clk_sysbypck: clk-sysbypck { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <800000000>; + clock-output-names = "sysbypck"; + }; + + /* external reference clock for MC. float in normal operation */ + clk_mcbypck: clk-mcbypck { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <800000000>; + clock-output-names = "mcbypck"; + }; + + /* external clock signal rg1refck, supplied by the phy */ + clk_rg1refck: clk-rg1refck { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <125000000>; + clock-output-names = "clk_rg1refck"; + }; + + /* external clock signal rg2refck, supplied by the phy */ + clk_rg2refck: clk-rg2refck { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <125000000>; + clock-output-names = "clk_rg2refck"; + }; + + clk_xin: clk-xin { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <50000000>; + clock-output-names = "clk_xin"; + }; + + +Example: GMAC controller node that consumes two clocks: a generated clk by the +clock controller and a fixed clock from DT (clk_rg1refck). + + ethernet0: ethernet@f0802000 { + compatible = "snps,dwmac"; + reg = <0xf0802000 0x2000>; + interrupts = <0 14 4>; + interrupt-names = "macirq"; + clocks = <&clk_rg1refck>, <&clk NPCM7XX_CLK_AHB>; + clock-names = "stmmaceth", "clk_gmac"; + }; diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc.txt b/Documentation/devicetree/bindings/clock/qcom,gcc.txt index 551d03be9665..664ea1fd6c76 100644 --- a/Documentation/devicetree/bindings/clock/qcom,gcc.txt +++ b/Documentation/devicetree/bindings/clock/qcom,gcc.txt @@ -17,7 +17,9 @@ Required properties : "qcom,gcc-msm8974pro-ac" "qcom,gcc-msm8994" "qcom,gcc-msm8996" + "qcom,gcc-msm8998" "qcom,gcc-mdm9615" + "qcom,gcc-sdm845" - reg : shall contain base register location and length - #clock-cells : shall contain 1 diff --git a/Documentation/devicetree/bindings/clock/qcom,rpmh-clk.txt b/Documentation/devicetree/bindings/clock/qcom,rpmh-clk.txt new file mode 100644 index 000000000000..3c007653da31 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/qcom,rpmh-clk.txt @@ -0,0 +1,22 @@ +Qualcomm Technologies, Inc. RPMh Clocks +------------------------------------------------------- + +Resource Power Manager Hardened (RPMh) manages shared resources on +some Qualcomm Technologies Inc. SoCs. It accepts clock requests from +other hardware subsystems via RSC to control clocks. + +Required properties : +- compatible : shall contain "qcom,sdm845-rpmh-clk" + +- #clock-cells : must contain 1 + +Example : + +#include <dt-bindings/clock/qcom,rpmh.h> + + &apps_rsc { + rpmhcc: clock-controller { + compatible = "qcom,sdm845-rpmh-clk"; + #clock-cells = <1>; + }; + }; diff --git a/Documentation/devicetree/bindings/clock/qcom,videocc.txt b/Documentation/devicetree/bindings/clock/qcom,videocc.txt new file mode 100644 index 000000000000..e7c035afa778 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/qcom,videocc.txt @@ -0,0 +1,19 @@ +Qualcomm Video Clock & Reset Controller Binding +----------------------------------------------- + +Required properties : +- compatible : shall contain "qcom,sdm845-videocc" +- reg : shall contain base register location and length +- #clock-cells : from common clock binding, shall contain 1. +- #power-domain-cells : from generic power domain binding, shall contain 1. + +Optional properties : +- #reset-cells : from common reset binding, shall contain 1. + +Example: + videocc: clock-controller@ab00000 { + compatible = "qcom,sdm845-videocc"; + reg = <0xab00000 0x10000>; + #clock-cells = <1>; + #power-domain-cells = <1>; + }; diff --git a/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt b/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt index 773a5226342f..db542abadb75 100644 --- a/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt +++ b/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt @@ -15,6 +15,7 @@ Required Properties: - compatible: Must be one of: - "renesas,r8a7743-cpg-mssr" for the r8a7743 SoC (RZ/G1M) - "renesas,r8a7745-cpg-mssr" for the r8a7745 SoC (RZ/G1E) + - "renesas,r8a77470-cpg-mssr" for the r8a77470 SoC (RZ/G1C) - "renesas,r8a7790-cpg-mssr" for the r8a7790 SoC (R-Car H2) - "renesas,r8a7791-cpg-mssr" for the r8a7791 SoC (R-Car M2-W) - "renesas,r8a7792-cpg-mssr" for the r8a7792 SoC (R-Car V2H) @@ -25,6 +26,7 @@ Required Properties: - "renesas,r8a77965-cpg-mssr" for the r8a77965 SoC (R-Car M3-N) - "renesas,r8a77970-cpg-mssr" for the r8a77970 SoC (R-Car V3M) - "renesas,r8a77980-cpg-mssr" for the r8a77980 SoC (R-Car V3H) + - "renesas,r8a77990-cpg-mssr" for the r8a77990 SoC (R-Car E3) - "renesas,r8a77995-cpg-mssr" for the r8a77995 SoC (R-Car D3) - reg: Base address and length of the memory resource used by the CPG/MSSR @@ -33,10 +35,12 @@ Required Properties: - clocks: References to external parent clocks, one entry for each entry in clock-names - clock-names: List of external parent clock names. Valid names are: - - "extal" (r8a7743, r8a7745, r8a7790, r8a7791, r8a7792, r8a7793, r8a7794, - r8a7795, r8a7796, r8a77965, r8a77970, r8a77980, r8a77995) + - "extal" (r8a7743, r8a7745, r8a77470, r8a7790, r8a7791, r8a7792, + r8a7793, r8a7794, r8a7795, r8a7796, r8a77965, r8a77970, + r8a77980, r8a77990, r8a77995) - "extalr" (r8a7795, r8a7796, r8a77965, r8a77970, r8a77980) - - "usb_extal" (r8a7743, r8a7745, r8a7790, r8a7791, r8a7793, r8a7794) + - "usb_extal" (r8a7743, r8a7745, r8a77470, r8a7790, r8a7791, r8a7793, + r8a7794) - #clock-cells: Must be 2 - For CPG core clocks, the two clock specifier cells must be "CPG_CORE" diff --git a/Documentation/devicetree/bindings/clock/rockchip.txt b/Documentation/devicetree/bindings/clock/rockchip.txt deleted file mode 100644 index 22f6769e5d4a..000000000000 --- a/Documentation/devicetree/bindings/clock/rockchip.txt +++ /dev/null @@ -1,77 +0,0 @@ -Device Tree Clock bindings for arch-rockchip - -This binding uses the common clock binding[1]. - -[1] Documentation/devicetree/bindings/clock/clock-bindings.txt - -== Gate clocks == - -These bindings are deprecated! -Please use the soc specific CRU bindings instead. - -The gate registers form a continuos block which makes the dt node -structure a matter of taste, as either all gates can be put into -one gate clock spanning all registers or they can be divided into -the 10 individual gates containing 16 clocks each. -The code supports both approaches. - -Required properties: -- compatible : "rockchip,rk2928-gate-clk" -- reg : shall be the control register address(es) for the clock. -- #clock-cells : from common clock binding; shall be set to 1 -- clock-output-names : the corresponding gate names that the clock controls -- clocks : should contain the parent clock for each individual gate, - therefore the number of clocks elements should match the number of - clock-output-names - -Example using multiple gate clocks: - - clk_gates0: gate-clk@200000d0 { - compatible = "rockchip,rk2928-gate-clk"; - reg = <0x200000d0 0x4>; - clocks = <&dummy>, <&dummy>, - <&dummy>, <&dummy>, - <&dummy>, <&dummy>, - <&dummy>, <&dummy>, - <&dummy>, <&dummy>, - <&dummy>, <&dummy>, - <&dummy>, <&dummy>, - <&dummy>, <&dummy>; - - clock-output-names = - "gate_core_periph", "gate_cpu_gpll", - "gate_ddrphy", "gate_aclk_cpu", - "gate_hclk_cpu", "gate_pclk_cpu", - "gate_atclk_cpu", "gate_i2s0", - "gate_i2s0_frac", "gate_i2s1", - "gate_i2s1_frac", "gate_i2s2", - "gate_i2s2_frac", "gate_spdif", - "gate_spdif_frac", "gate_testclk"; - - #clock-cells = <1>; - }; - - clk_gates1: gate-clk@200000d4 { - compatible = "rockchip,rk2928-gate-clk"; - reg = <0x200000d4 0x4>; - clocks = <&xin24m>, <&xin24m>, - <&xin24m>, <&dummy>, - <&dummy>, <&xin24m>, - <&xin24m>, <&dummy>, - <&xin24m>, <&dummy>, - <&xin24m>, <&dummy>, - <&xin24m>, <&dummy>, - <&xin24m>, <&dummy>; - - clock-output-names = - "gate_timer0", "gate_timer1", - "gate_timer2", "gate_jtag", - "gate_aclk_lcdc1_src", "gate_otgphy0", - "gate_otgphy1", "gate_ddr_gpll", - "gate_uart0", "gate_frac_uart0", - "gate_uart1", "gate_frac_uart1", - "gate_uart2", "gate_frac_uart2", - "gate_uart3", "gate_frac_uart3"; - - #clock-cells = <1>; - }; diff --git a/Documentation/devicetree/bindings/clock/sunxi-ccu.txt b/Documentation/devicetree/bindings/clock/sunxi-ccu.txt index 460ef27b1008..47d2e902ced4 100644 --- a/Documentation/devicetree/bindings/clock/sunxi-ccu.txt +++ b/Documentation/devicetree/bindings/clock/sunxi-ccu.txt @@ -21,6 +21,7 @@ Required properties : - "allwinner,sun50i-a64-r-ccu" - "allwinner,sun50i-h5-ccu" - "allwinner,sun50i-h6-ccu" + - "allwinner,sun50i-h6-r-ccu" - "nextthing,gr8-ccu" - reg: Must contain the registers base address and length @@ -35,7 +36,7 @@ Required properties : For the main CCU on H6, one more clock is needed: - "iosc": the SoC's internal frequency oscillator -For the PRCM CCUs on A83T/H3/A64, two more clocks are needed: +For the PRCM CCUs on A83T/H3/A64/H6, two more clocks are needed: - "pll-periph": the SoC's peripheral PLL from the main CCU - "iosc": the SoC's internal frequency oscillator diff --git a/Documentation/devicetree/bindings/display/bridge/adi,adv7511.txt b/Documentation/devicetree/bindings/display/bridge/adi,adv7511.txt index 0047b1394c70..2c887536258c 100644 --- a/Documentation/devicetree/bindings/display/bridge/adi,adv7511.txt +++ b/Documentation/devicetree/bindings/display/bridge/adi,adv7511.txt @@ -14,7 +14,13 @@ Required properties: "adi,adv7513" "adi,adv7533" -- reg: I2C slave address +- reg: I2C slave addresses + The ADV7511 internal registers are split into four pages exposed through + different I2C addresses, creating four register maps. Each map has it own + I2C address and acts as a standard slave device on the I2C bus. The main + address is mandatory, others are optional and revert to defaults if not + specified. + The ADV7511 supports a large number of input data formats that differ by their color depth, color format, clock mode, bit justification and random @@ -70,6 +76,9 @@ Optional properties: rather than generate its own timings for HDMI output. - clocks: from common clock binding: reference to the CEC clock. - clock-names: from common clock binding: must be "cec". +- reg-names : Names of maps with programmable addresses. + It can contain any map needing a non-default address. + Possible maps names are : "main", "edid", "cec", "packet" Required nodes: @@ -88,7 +97,12 @@ Example adv7511w: hdmi@39 { compatible = "adi,adv7511w"; - reg = <39>; + /* + * The EDID page will be accessible on address 0x66 on the I2C + * bus. All other maps continue to use their default addresses. + */ + reg = <0x39>, <0x66>; + reg-names = "main", "edid"; interrupt-parent = <&gpio3>; interrupts = <29 IRQ_TYPE_EDGE_FALLING>; clocks = <&cec_clock>; diff --git a/Documentation/devicetree/bindings/display/bridge/cdns,dsi.txt b/Documentation/devicetree/bindings/display/bridge/cdns,dsi.txt new file mode 100644 index 000000000000..f5725bb6c61c --- /dev/null +++ b/Documentation/devicetree/bindings/display/bridge/cdns,dsi.txt @@ -0,0 +1,133 @@ +Cadence DSI bridge +================== + +The Cadence DSI bridge is a DPI to DSI bridge supporting up to 4 DSI lanes. + +Required properties: +- compatible: should be set to "cdns,dsi". +- reg: physical base address and length of the controller's registers. +- interrupts: interrupt line connected to the DSI bridge. +- clocks: DSI bridge clocks. +- clock-names: must contain "dsi_p_clk" and "dsi_sys_clk". +- phys: phandle link to the MIPI D-PHY controller. +- phy-names: must contain "dphy". +- #address-cells: must be set to 1. +- #size-cells: must be set to 0. + +Optional properties: +- resets: DSI reset lines. +- reset-names: can contain "dsi_p_rst". + +Required subnodes: +- ports: Ports as described in Documentation/devicetree/bindings/graph.txt. + 2 ports are available: + * port 0: this port is only needed if some of your DSI devices are + controlled through an external bus like I2C or SPI. Can have at + most 4 endpoints. The endpoint number is directly encoding the + DSI virtual channel used by this device. + * port 1: represents the DPI input. + Other ports will be added later to support the new kind of inputs. + +- one subnode per DSI device connected on the DSI bus. Each DSI device should + contain a reg property encoding its virtual channel. + +Cadence DPHY +============ + +Cadence DPHY block. + +Required properties: +- compatible: should be set to "cdns,dphy". +- reg: physical base address and length of the DPHY registers. +- clocks: DPHY reference clocks. +- clock-names: must contain "psm" and "pll_ref". +- #phy-cells: must be set to 0. + + +Example: + dphy0: dphy@fd0e0000{ + compatible = "cdns,dphy"; + reg = <0x0 0xfd0e0000 0x0 0x1000>; + clocks = <&psm_clk>, <&pll_ref_clk>; + clock-names = "psm", "pll_ref"; + #phy-cells = <0>; + }; + + dsi0: dsi@fd0c0000 { + compatible = "cdns,dsi"; + reg = <0x0 0xfd0c0000 0x0 0x1000>; + clocks = <&pclk>, <&sysclk>; + clock-names = "dsi_p_clk", "dsi_sys_clk"; + interrupts = <1>; + phys = <&dphy0>; + phy-names = "dphy"; + #address-cells = <1>; + #size-cells = <0>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@1 { + reg = <1>; + dsi0_dpi_input: endpoint { + remote-endpoint = <&xxx_dpi_output>; + }; + }; + }; + + panel: dsi-dev@0 { + compatible = "<vendor,panel>"; + reg = <0>; + }; + }; + +or + + dsi0: dsi@fd0c0000 { + compatible = "cdns,dsi"; + reg = <0x0 0xfd0c0000 0x0 0x1000>; + clocks = <&pclk>, <&sysclk>; + clock-names = "dsi_p_clk", "dsi_sys_clk"; + interrupts = <1>; + phys = <&dphy1>; + phy-names = "dphy"; + #address-cells = <1>; + #size-cells = <0>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + reg = <0>; + #address-cells = <1>; + #size-cells = <0>; + + dsi0_output: endpoint@0 { + reg = <0>; + remote-endpoint = <&dsi_panel_input>; + }; + }; + + port@1 { + reg = <1>; + dsi0_dpi_input: endpoint { + remote-endpoint = <&xxx_dpi_output>; + }; + }; + }; + }; + + i2c@xxx { + panel: panel@59 { + compatible = "<vendor,panel>"; + reg = <0x59>; + + port { + dsi_panel_input: endpoint { + remote-endpoint = <&dsi0_output>; + }; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt b/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt index 3a72a103a18a..a41d280c3f9f 100644 --- a/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt +++ b/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt @@ -14,6 +14,7 @@ Required properties: - compatible : Shall contain one or more of - "renesas,r8a7795-hdmi" for R8A7795 (R-Car H3) compatible HDMI TX - "renesas,r8a7796-hdmi" for R8A7796 (R-Car M3-W) compatible HDMI TX + - "renesas,r8a77965-hdmi" for R8A77965 (R-Car M3-N) compatible HDMI TX - "renesas,rcar-gen3-hdmi" for the generic R-Car Gen3 compatible HDMI TX When compatible with generic versions, nodes must list the SoC-specific diff --git a/Documentation/devicetree/bindings/display/bridge/tda998x.txt b/Documentation/devicetree/bindings/display/bridge/tda998x.txt index 24cc2466185a..1a4eaca40d94 100644 --- a/Documentation/devicetree/bindings/display/bridge/tda998x.txt +++ b/Documentation/devicetree/bindings/display/bridge/tda998x.txt @@ -27,6 +27,9 @@ Optional properties: in question is used. The implementation allows one or two DAIs. If two DAIs are defined, they must be of different type. + - nxp,calib-gpios: calibration GPIO, which must correspond with the + gpio used for the TDA998x interrupt pin. + [1] Documentation/sound/alsa/soc/DAI.txt [2] include/dt-bindings/display/tda998x.h diff --git a/Documentation/devicetree/bindings/display/bridge/thine,thc63lvd1024.txt b/Documentation/devicetree/bindings/display/bridge/thine,thc63lvd1024.txt new file mode 100644 index 000000000000..37f0c04d5a28 --- /dev/null +++ b/Documentation/devicetree/bindings/display/bridge/thine,thc63lvd1024.txt @@ -0,0 +1,60 @@ +Thine Electronics THC63LVD1024 LVDS decoder +------------------------------------------- + +The THC63LVD1024 is a dual link LVDS receiver designed to convert LVDS streams +to parallel data outputs. The chip supports single/dual input/output modes, +handling up to two LVDS input streams and up to two digital CMOS/TTL outputs. + +Single or dual operation mode, output data mapping and DDR output modes are +configured through input signals and the chip does not expose any control bus. + +Required properties: +- compatible: Shall be "thine,thc63lvd1024" +- vcc-supply: Power supply for TTL output, TTL CLOCKOUT signal, LVDS input, + PPL and digital circuitry + +Optional properties: +- powerdown-gpios: Power down GPIO signal, pin name "/PDWN". Active low +- oe-gpios: Output enable GPIO signal, pin name "OE". Active high + +The THC63LVD1024 video port connections are modeled according +to OF graph bindings specified by Documentation/devicetree/bindings/graph.txt + +Required video port nodes: +- port@0: First LVDS input port +- port@2: First digital CMOS/TTL parallel output + +Optional video port nodes: +- port@1: Second LVDS input port +- port@3: Second digital CMOS/TTL parallel output + +Example: +-------- + + thc63lvd1024: lvds-decoder { + compatible = "thine,thc63lvd1024"; + + vcc-supply = <®_lvds_vcc>; + powerdown-gpios = <&gpio4 15 GPIO_ACTIVE_LOW>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + reg = <0>; + + lvds_dec_in_0: endpoint { + remote-endpoint = <&lvds_out>; + }; + }; + + port@2{ + reg = <2>; + + lvds_dec_out_2: endpoint { + remote-endpoint = <&adv7511_in>; + }; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt b/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt index fc2588292a68..775193e1c641 100644 --- a/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt +++ b/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt @@ -19,7 +19,8 @@ Required properties: clock-names property. - clock-names: list of clock names sorted in the same order as the clocks property. Must contain "pclk", "aclk_decon", "aclk_smmu_decon0x", - "aclk_xiu_decon0x", "pclk_smmu_decon0x", clk_decon_vclk", + "aclk_xiu_decon0x", "pclk_smmu_decon0x", "aclk_smmu_decon1x", + "aclk_xiu_decon1x", "pclk_smmu_decon1x", clk_decon_vclk", "sclk_decon_eclk" - ports: contains a port which is connected to mic node. address-cells and size-cells must 1 and 0, respectively. @@ -34,10 +35,14 @@ decon: decon@13800000 { clocks = <&cmu_disp CLK_ACLK_DECON>, <&cmu_disp CLK_ACLK_SMMU_DECON0X>, <&cmu_disp CLK_ACLK_XIU_DECON0X>, <&cmu_disp CLK_PCLK_SMMU_DECON0X>, + <&cmu_disp CLK_ACLK_SMMU_DECON1X>, + <&cmu_disp CLK_ACLK_XIU_DECON1X>, + <&cmu_disp CLK_PCLK_SMMU_DECON1X>, <&cmu_disp CLK_SCLK_DECON_VCLK>, <&cmu_disp CLK_SCLK_DECON_ECLK>; clock-names = "aclk_decon", "aclk_smmu_decon0x", "aclk_xiu_decon0x", - "pclk_smmu_decon0x", "sclk_decon_vclk", "sclk_decon_eclk"; + "pclk_smmu_decon0x", "aclk_smmu_decon1x", "aclk_xiu_decon1x", + "pclk_smmu_decon1x", "sclk_decon_vclk", "sclk_decon_eclk"; interrupt-names = "vsync", "lcd_sys"; interrupts = <0 202 0>, <0 203 0>; diff --git a/Documentation/devicetree/bindings/display/panel/panel-common.txt b/Documentation/devicetree/bindings/display/panel/panel-common.txt index 557fa765adcb..5d2519af4bb5 100644 --- a/Documentation/devicetree/bindings/display/panel/panel-common.txt +++ b/Documentation/devicetree/bindings/display/panel/panel-common.txt @@ -38,7 +38,7 @@ Display Timings require specific display timings. The panel-timing subnode expresses those timings as specified in the timing subnode section of the display timing bindings defined in - Documentation/devicetree/bindings/display/display-timing.txt. + Documentation/devicetree/bindings/display/panel/display-timing.txt. Connectivity diff --git a/Documentation/devicetree/bindings/display/renesas,du.txt b/Documentation/devicetree/bindings/display/renesas,du.txt index c9cd17f99702..7c6854bd0a04 100644 --- a/Documentation/devicetree/bindings/display/renesas,du.txt +++ b/Documentation/devicetree/bindings/display/renesas,du.txt @@ -13,6 +13,7 @@ Required Properties: - "renesas,du-r8a7794" for R8A7794 (R-Car E2) compatible DU - "renesas,du-r8a7795" for R8A7795 (R-Car H3) compatible DU - "renesas,du-r8a7796" for R8A7796 (R-Car M3-W) compatible DU + - "renesas,du-r8a77965" for R8A77965 (R-Car M3-N) compatible DU - "renesas,du-r8a77970" for R8A77970 (R-Car V3M) compatible DU - "renesas,du-r8a77995" for R8A77995 (R-Car D3) compatible DU @@ -47,20 +48,21 @@ bindings specified in Documentation/devicetree/bindings/graph.txt. The following table lists for each supported model the port number corresponding to each DU output. - Port0 Port1 Port2 Port3 + Port0 Port1 Port2 Port3 ----------------------------------------------------------------------------- - R8A7743 (RZ/G1M) DPAD 0 LVDS 0 - - - R8A7745 (RZ/G1E) DPAD 0 DPAD 1 - - - R8A7779 (R-Car H1) DPAD 0 DPAD 1 - - - R8A7790 (R-Car H2) DPAD 0 LVDS 0 LVDS 1 - - R8A7791 (R-Car M2-W) DPAD 0 LVDS 0 - - - R8A7792 (R-Car V2H) DPAD 0 DPAD 1 - - - R8A7793 (R-Car M2-N) DPAD 0 LVDS 0 - - - R8A7794 (R-Car E2) DPAD 0 DPAD 1 - - - R8A7795 (R-Car H3) DPAD 0 HDMI 0 HDMI 1 LVDS 0 - R8A7796 (R-Car M3-W) DPAD 0 HDMI 0 LVDS 0 - - R8A77970 (R-Car V3M) DPAD 0 LVDS 0 - - - R8A77995 (R-Car D3) DPAD 0 LVDS 0 LVDS 1 - + R8A7743 (RZ/G1M) DPAD 0 LVDS 0 - - + R8A7745 (RZ/G1E) DPAD 0 DPAD 1 - - + R8A7779 (R-Car H1) DPAD 0 DPAD 1 - - + R8A7790 (R-Car H2) DPAD 0 LVDS 0 LVDS 1 - + R8A7791 (R-Car M2-W) DPAD 0 LVDS 0 - - + R8A7792 (R-Car V2H) DPAD 0 DPAD 1 - - + R8A7793 (R-Car M2-N) DPAD 0 LVDS 0 - - + R8A7794 (R-Car E2) DPAD 0 DPAD 1 - - + R8A7795 (R-Car H3) DPAD 0 HDMI 0 HDMI 1 LVDS 0 + R8A7796 (R-Car M3-W) DPAD 0 HDMI 0 LVDS 0 - + R8A77965 (R-Car M3-N) DPAD 0 HDMI 0 LVDS 0 - + R8A77970 (R-Car V3M) DPAD 0 LVDS 0 - - + R8A77995 (R-Car D3) DPAD 0 LVDS 0 LVDS 1 - Example: R8A7795 (R-Car H3) ES2.0 DU diff --git a/Documentation/devicetree/bindings/display/sunxi/sun6i-dsi.txt b/Documentation/devicetree/bindings/display/sunxi/sun6i-dsi.txt new file mode 100644 index 000000000000..6a6cf5de08b0 --- /dev/null +++ b/Documentation/devicetree/bindings/display/sunxi/sun6i-dsi.txt @@ -0,0 +1,93 @@ +Allwinner A31 DSI Encoder +========================= + +The DSI pipeline consists of two separate blocks: the DSI controller +itself, and its associated D-PHY. + +DSI Encoder +----------- + +The DSI Encoder generates the DSI signal from the TCON's. + +Required properties: + - compatible: value must be one of: + * allwinner,sun6i-a31-mipi-dsi + - reg: base address and size of memory-mapped region + - interrupts: interrupt associated to this IP + - clocks: phandles to the clocks feeding the DSI encoder + * bus: the DSI interface clock + * mod: the DSI module clock + - clock-names: the clock names mentioned above + - phys: phandle to the D-PHY + - phy-names: must be "dphy" + - resets: phandle to the reset controller driving the encoder + + - ports: A ports node with endpoint definitions as defined in + Documentation/devicetree/bindings/media/video-interfaces.txt. The + first port should be the input endpoint, usually coming from the + associated TCON. + +Any MIPI-DSI device attached to this should be described according to +the bindings defined in ../mipi-dsi-bus.txt + +D-PHY +----- + +Required properties: + - compatible: value must be one of: + * allwinner,sun6i-a31-mipi-dphy + - reg: base address and size of memory-mapped region + - clocks: phandles to the clocks feeding the DSI encoder + * bus: the DSI interface clock + * mod: the DSI module clock + - clock-names: the clock names mentioned above + - resets: phandle to the reset controller driving the encoder + +Example: + +dsi0: dsi@1ca0000 { + compatible = "allwinner,sun6i-a31-mipi-dsi"; + reg = <0x01ca0000 0x1000>; + interrupts = <GIC_SPI 89 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&ccu CLK_BUS_MIPI_DSI>, + <&ccu CLK_DSI_SCLK>; + clock-names = "bus", "mod"; + resets = <&ccu RST_BUS_MIPI_DSI>; + phys = <&dphy0>; + phy-names = "dphy"; + #address-cells = <1>; + #size-cells = <0>; + + panel@0 { + compatible = "bananapi,lhr050h41", "ilitek,ili9881c"; + reg = <0>; + power-gpios = <&pio 1 7 GPIO_ACTIVE_HIGH>; /* PB07 */ + reset-gpios = <&r_pio 0 5 GPIO_ACTIVE_LOW>; /* PL05 */ + backlight = <&pwm_bl>; + }; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + #address-cells = <1>; + #size-cells = <0>; + reg = <0>; + + dsi0_in_tcon0: endpoint { + remote-endpoint = <&tcon0_out_dsi0>; + }; + }; + }; +}; + +dphy0: d-phy@1ca1000 { + compatible = "allwinner,sun6i-a31-mipi-dphy"; + reg = <0x01ca1000 0x1000>; + clocks = <&ccu CLK_BUS_MIPI_DSI>, + <&ccu CLK_DSI_DPHY>; + clock-names = "bus", "mod"; + resets = <&ccu RST_BUS_MIPI_DSI>; + #phy-cells = <0>; +}; diff --git a/Documentation/devicetree/bindings/dma/k3dma.txt b/Documentation/devicetree/bindings/dma/k3dma.txt index 23f8d712c3ce..4945aeac4dc4 100644 --- a/Documentation/devicetree/bindings/dma/k3dma.txt +++ b/Documentation/devicetree/bindings/dma/k3dma.txt @@ -23,7 +23,6 @@ Controller: dma-requests = <27>; interrupts = <0 12 4>; clocks = <&pclk>; - status = "disable"; }; Client: diff --git a/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt b/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt index aadfb236d53a..b1ba639554c0 100644 --- a/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt +++ b/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt @@ -26,8 +26,10 @@ Required Properties: - "renesas,dmac-r8a7794" (R-Car E2) - "renesas,dmac-r8a7795" (R-Car H3) - "renesas,dmac-r8a7796" (R-Car M3-W) + - "renesas,dmac-r8a77965" (R-Car M3-N) - "renesas,dmac-r8a77970" (R-Car V3M) - "renesas,dmac-r8a77980" (R-Car V3H) + - "renesas,dmac-r8a77995" (R-Car D3) - reg: base address and length of the registers block for the DMAC diff --git a/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt b/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt index 9dc935e24e55..482e54362d3e 100644 --- a/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt +++ b/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt @@ -12,6 +12,8 @@ Required Properties: - "renesas,r8a7795-usb-dmac" (R-Car H3) - "renesas,r8a7796-usb-dmac" (R-Car M3-W) - "renesas,r8a77965-usb-dmac" (R-Car M3-N) + - "renesas,r8a77990-usb-dmac" (R-Car E3) + - "renesas,r8a77995-usb-dmac" (R-Car D3) - reg: base address and length of the registers block for the DMAC - interrupts: interrupt specifiers for the DMAC, one for each entry in interrupt-names. diff --git a/Documentation/devicetree/bindings/dma/ti-edma.txt b/Documentation/devicetree/bindings/dma/ti-edma.txt index 66026dcf53e1..3f15f6644527 100644 --- a/Documentation/devicetree/bindings/dma/ti-edma.txt +++ b/Documentation/devicetree/bindings/dma/ti-edma.txt @@ -190,7 +190,6 @@ mmc0: mmc@23000000 { power-domains = <&k2g_pds 0xb>; clocks = <&k2g_clks 0xb 1>, <&k2g_clks 0xb 2>; clock-names = "fck", "mmchsdb_fck"; - status = "disabled"; }; ------------------------------------------------------------------------------ diff --git a/Documentation/devicetree/bindings/arm/altera/socfpga-eccmgr.txt b/Documentation/devicetree/bindings/edac/socfpga-eccmgr.txt index 4a1714f96bab..5626560a6cfd 100644 --- a/Documentation/devicetree/bindings/arm/altera/socfpga-eccmgr.txt +++ b/Documentation/devicetree/bindings/edac/socfpga-eccmgr.txt @@ -231,3 +231,38 @@ Example: <48 IRQ_TYPE_LEVEL_HIGH>; }; }; + +Stratix10 SoCFPGA ECC Manager +The Stratix10 SoC ECC Manager handles the IRQs for each peripheral +in a shared register similar to the Arria10. However, ECC requires +access to registers that can only be read from Secure Monitor with +SMC calls. Therefore the device tree is slightly different. + +Required Properties: +- compatible : Should be "altr,socfpga-s10-ecc-manager" +- interrupts : Should be single bit error interrupt, then double bit error + interrupt. +- interrupt-controller : boolean indicator that ECC Manager is an interrupt controller +- #interrupt-cells : must be set to 2. + +Subcomponents: + +SDRAM ECC +Required Properties: +- compatible : Should be "altr,sdram-edac-s10" +- interrupts : Should be single bit error interrupt, then double bit error + interrupt, in this order. + +Example: + + eccmgr { + compatible = "altr,socfpga-s10-ecc-manager"; + interrupts = <0 15 4>, <0 95 4>; + interrupt-controller; + #interrupt-cells = <2>; + + sdramedac { + compatible = "altr,sdram-edac-s10"; + interrupts = <16 4>, <48 4>; + }; + }; diff --git a/Documentation/devicetree/bindings/fpga/lattice-machxo2-spi.txt b/Documentation/devicetree/bindings/fpga/lattice-machxo2-spi.txt new file mode 100644 index 000000000000..a8c362eb160c --- /dev/null +++ b/Documentation/devicetree/bindings/fpga/lattice-machxo2-spi.txt @@ -0,0 +1,29 @@ +Lattice MachXO2 Slave SPI FPGA Manager + +Lattice MachXO2 FPGAs support a method of loading the bitstream over +'slave SPI' interface. + +See 'MachXO2ProgrammingandConfigurationUsageGuide.pdf' on www.latticesemi.com + +Required properties: +- compatible: should contain "lattice,machxo2-slave-spi" +- reg: spi chip select of the FPGA + +Example for full FPGA configuration: + + fpga-region0 { + compatible = "fpga-region"; + fpga-mgr = <&fpga_mgr_spi>; + #address-cells = <0x1>; + #size-cells = <0x1>; + }; + + spi1: spi@2000 { + ... + + fpga_mgr_spi: fpga-mgr@0 { + compatible = "lattice,machxo2-slave-spi"; + spi-max-frequency = <8000000>; + reg = <0>; + }; + }; diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-gpio.txt b/Documentation/devicetree/bindings/fsi/fsi-master-gpio.txt index a767259dedad..1e442450747f 100644 --- a/Documentation/devicetree/bindings/fsi/fsi-master-gpio.txt +++ b/Documentation/devicetree/bindings/fsi/fsi-master-gpio.txt @@ -11,6 +11,10 @@ Optional properties: - trans-gpios = <gpio-descriptor>; : GPIO for voltage translator enable - mux-gpios = <gpio-descriptor>; : GPIO for pin multiplexing with other functions (eg, external FSI masters) + - no-gpio-delays; : Don't add extra delays between GPIO + accesses. This is useful when the HW + GPIO block is running at a low enough + frequency. Examples: diff --git a/Documentation/devicetree/bindings/gpio/gpio-pca953x.txt b/Documentation/devicetree/bindings/gpio/gpio-pca953x.txt index d2a937682836..88f228665507 100644 --- a/Documentation/devicetree/bindings/gpio/gpio-pca953x.txt +++ b/Documentation/devicetree/bindings/gpio/gpio-pca953x.txt @@ -31,10 +31,15 @@ Required properties: ti,tca9554 onnn,pca9654 exar,xra1202 + - gpio-controller: if used as gpio expander. + - #gpio-cells: if used as gpio expander. + - interrupt-controller: if to be used as interrupt expander. + - #interrupt-cells: if to be used as interrupt expander. Optional properties: - reset-gpios: GPIO specification for the RESET input. This is an active low signal to the PCA953x. + - vcc-supply: power supply regulator. Example: @@ -47,3 +52,32 @@ Example: interrupt-parent = <&gpio3>; interrupts = <23 IRQ_TYPE_LEVEL_LOW>; }; + + +Example with Interrupts: + + + gpio99: gpio@22 { + compatible = "nxp,pcal6524"; + reg = <0x22>; + interrupt-parent = <&gpio6>; + interrupts = <1 IRQ_TYPE_EDGE_FALLING>; /* gpio6_161 */ + interrupt-controller; + #interrupt-cells = <2>; + vcc-supply = <&vdds_1v8_main>; + gpio-controller; + #gpio-cells = <2>; + gpio-line-names = + "hdmi-ct-hpd", "hdmi.ls-oe", "p02", "p03", "vibra", "fault2", "p06", "p07", + "en-usb", "en-host1", "en-host2", "chg-int", "p14", "p15", "mic-int", "en-modem", + "shdn-hs-amp", "chg-status+red", "green", "blue", "en-esata", "fault1", "p26", "p27"; + }; + + ts3a227@3b { + compatible = "ti,ts3a227e"; + reg = <0x3b>; + interrupt-parent = <&gpio99>; + interrupts = <14 IRQ_TYPE_EDGE_RISING>; + ti,micbias = <0>; /* 2.1V */ + }; + diff --git a/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt b/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt index 9474138d776e..378f1322211e 100644 --- a/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt +++ b/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt @@ -5,6 +5,7 @@ Required Properties: - compatible: should contain one or more of the following: - "renesas,gpio-r8a7743": for R8A7743 (RZ/G1M) compatible GPIO controller. - "renesas,gpio-r8a7745": for R8A7745 (RZ/G1E) compatible GPIO controller. + - "renesas,gpio-r8a77470": for R8A77470 (RZ/G1C) compatible GPIO controller. - "renesas,gpio-r8a7778": for R8A7778 (R-Car M1) compatible GPIO controller. - "renesas,gpio-r8a7779": for R8A7779 (R-Car H1) compatible GPIO controller. - "renesas,gpio-r8a7790": for R8A7790 (R-Car H2) compatible GPIO controller. @@ -14,7 +15,9 @@ Required Properties: - "renesas,gpio-r8a7794": for R8A7794 (R-Car E2) compatible GPIO controller. - "renesas,gpio-r8a7795": for R8A7795 (R-Car H3) compatible GPIO controller. - "renesas,gpio-r8a7796": for R8A7796 (R-Car M3-W) compatible GPIO controller. + - "renesas,gpio-r8a77965": for R8A77965 (R-Car M3-N) compatible GPIO controller. - "renesas,gpio-r8a77970": for R8A77970 (R-Car V3M) compatible GPIO controller. + - "renesas,gpio-r8a77990": for R8A77990 (R-Car E3) compatible GPIO controller. - "renesas,gpio-r8a77995": for R8A77995 (R-Car D3) compatible GPIO controller. - "renesas,rcar-gen1-gpio": for a generic R-Car Gen1 GPIO controller. - "renesas,rcar-gen2-gpio": for a generic R-Car Gen2 or RZ/G1 GPIO controller. diff --git a/Documentation/devicetree/bindings/gpio/snps-dwapb-gpio.txt b/Documentation/devicetree/bindings/gpio/snps-dwapb-gpio.txt index 4a75da7051bd..3c1118bc67f5 100644 --- a/Documentation/devicetree/bindings/gpio/snps-dwapb-gpio.txt +++ b/Documentation/devicetree/bindings/gpio/snps-dwapb-gpio.txt @@ -26,8 +26,13 @@ controller. the second encodes the triger flags encoded as described in Documentation/devicetree/bindings/interrupt-controller/interrupts.txt - interrupt-parent : The parent interrupt controller. -- interrupts : The interrupt to the parent controller raised when GPIOs - generate the interrupts. +- interrupts : The interrupts to the parent controller raised when GPIOs + generate the interrupts. If the controller provides one combined interrupt + for all GPIOs, specify a single interrupt. If the controller provides one + interrupt for each GPIO, provide a list of interrupts that correspond to each + of the GPIO pins. When specifying multiple interrupts, if any are unconnected, + use the interrupts-extended property to specify the interrupts and set the + interrupt controller handle for unused interrupts to 0. - snps,nr-gpios : The number of pins in the port, a single cell. - resets : Reset line for the controller. diff --git a/Documentation/devicetree/bindings/gpu/brcm,bcm-v3d.txt b/Documentation/devicetree/bindings/gpu/brcm,bcm-v3d.txt new file mode 100644 index 000000000000..c907aa8dd755 --- /dev/null +++ b/Documentation/devicetree/bindings/gpu/brcm,bcm-v3d.txt @@ -0,0 +1,28 @@ +Broadcom V3D GPU + +Only the Broadcom V3D 3.x and newer GPUs are covered by this binding. +For V3D 2.x, see brcm,bcm-vc4.txt. + +Required properties: +- compatible: Should be "brcm,7268-v3d" or "brcm,7278-v3d" +- reg: Physical base addresses and lengths of the register areas +- reg-names: Names for the register areas. The "hub", "bridge", and "core0" + register areas are always required. The "gca" register area + is required if the GCA cache controller is present. +- interrupts: The interrupt numbers. The first interrupt is for the hub, + while the following interrupts are for the cores. + See bindings/interrupt-controller/interrupts.txt + +Optional properties: +- clocks: The core clock the unit runs on + +v3d { + compatible = "brcm,7268-v3d"; + reg = <0xf1204000 0x100>, + <0xf1200000 0x4000>, + <0xf1208000 0x4000>, + <0xf1204100 0x100>; + reg-names = "bridge", "hub", "core0", "gca"; + interrupts = <0 78 4>, + <0 77 4>; +}; diff --git a/Documentation/devicetree/bindings/gpu/samsung-scaler.txt b/Documentation/devicetree/bindings/gpu/samsung-scaler.txt new file mode 100644 index 000000000000..9c3d98105dfd --- /dev/null +++ b/Documentation/devicetree/bindings/gpu/samsung-scaler.txt @@ -0,0 +1,27 @@ +* Samsung Exynos Image Scaler + +Required properties: + - compatible : value should be one of the following: + (a) "samsung,exynos5420-scaler" for Scaler IP in Exynos5420 + (b) "samsung,exynos5433-scaler" for Scaler IP in Exynos5433 + + - reg : Physical base address of the IP registers and length of memory + mapped region. + + - interrupts : Interrupt specifier for scaler interrupt, according to format + specific to interrupt parent. + + - clocks : Clock specifier for scaler clock, according to generic clock + bindings. (See Documentation/devicetree/bindings/clock/exynos*.txt) + + - clock-names : Names of clocks. For exynos scaler, it should be "mscl" + on 5420 and "pclk", "aclk" and "aclk_xiu" on 5433. + +Example: + scaler@12800000 { + compatible = "samsung,exynos5420-scaler"; + reg = <0x12800000 0x1294>; + interrupts = <0 220 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&clock CLK_MSCL0>; + clock-names = "mscl"; + }; diff --git a/Documentation/devicetree/bindings/hwmon/gpio-fan.txt b/Documentation/devicetree/bindings/hwmon/gpio-fan.txt index 439a7430fc68..2becdcfdc840 100644 --- a/Documentation/devicetree/bindings/hwmon/gpio-fan.txt +++ b/Documentation/devicetree/bindings/hwmon/gpio-fan.txt @@ -11,7 +11,7 @@ Optional properties: must have the RPM values in ascending order. - alarm-gpios: This pin going active indicates something is wrong with the fan, and a udev event will be fired. -- cooling-cells: If used as a cooling device, must be <2> +- #cooling-cells: If used as a cooling device, must be <2> Also see: Documentation/devicetree/bindings/thermal/thermal.txt min and max states are derived from the speed-map of the fan. diff --git a/Documentation/devicetree/bindings/hwmon/ltc2990.txt b/Documentation/devicetree/bindings/hwmon/ltc2990.txt new file mode 100644 index 000000000000..f92f54029e84 --- /dev/null +++ b/Documentation/devicetree/bindings/hwmon/ltc2990.txt @@ -0,0 +1,36 @@ +ltc2990: Linear Technology LTC2990 power monitor + +Required properties: +- compatible: Must be "lltc,ltc2990" +- reg: I2C slave address +- lltc,meas-mode: + An array of two integers for configuring the chip measurement mode. + + The first integer defines the bits 2..0 in the control register. In all + cases the internal temperature and supply voltage are measured. In + addition the following input measurements are enabled per mode: + + 0: V1, V2, TR2 + 1: V1-V2, TR2 + 2: V1-V2, V3, V4 + 3: TR1, V3, V4 + 4: TR1, V3-V4 + 5: TR1, TR2 + 6: V1-V2, V3-V4 + 7: V1, V2, V3, V4 + + The second integer defines the bits 4..3 in the control register. This + allows a subset of the measurements to be enabled: + + 0: Internal temperature and supply voltage only + 1: TR1, V1 or V1-V2 only per mode + 2: TR2, V3 or V3-V4 only per mode + 3: All measurements per mode + +Example: + +ltc2990@4c { + compatible = "lltc,ltc2990"; + reg = <0x4c>; + lltc,meas-mode = <7 3>; /* V1, V2, V3, V4 */ +}; diff --git a/Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt b/Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt index 1e6ee3deb4fa..d1acd5ea2737 100644 --- a/Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt +++ b/Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt @@ -7,6 +7,7 @@ Required properties: - "amlogic,meson-gxbb-saradc" for GXBB - "amlogic,meson-gxl-saradc" for GXL - "amlogic,meson-gxm-saradc" for GXM + - "amlogic,meson-axg-saradc" for AXG along with the generic "amlogic,meson-saradc" - reg: the physical base address and length of the registers - interrupts: the interrupt indicating end of sampling diff --git a/Documentation/devicetree/bindings/iio/adc/mcp320x.txt b/Documentation/devicetree/bindings/iio/adc/mcp320x.txt index 7d64753df949..56373d643f76 100644 --- a/Documentation/devicetree/bindings/iio/adc/mcp320x.txt +++ b/Documentation/devicetree/bindings/iio/adc/mcp320x.txt @@ -49,7 +49,7 @@ Required properties: Examples: spi_controller { mcp3x0x@0 { - compatible = "mcp3002"; + compatible = "microchip,mcp3002"; reg = <0>; spi-max-frequency = <1000000>; vref-supply = <&vref_reg>; diff --git a/Documentation/devicetree/bindings/arm/samsung/exynos-adc.txt b/Documentation/devicetree/bindings/iio/adc/samsung,exynos-adc.txt index 6c49db7f8ad2..6c49db7f8ad2 100644 --- a/Documentation/devicetree/bindings/arm/samsung/exynos-adc.txt +++ b/Documentation/devicetree/bindings/iio/adc/samsung,exynos-adc.txt diff --git a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt index e8bb8243e92c..f1ead43a1a95 100644 --- a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt +++ b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt @@ -24,8 +24,11 @@ Required properties: - compatible: Should be one of: "st,stm32f4-adc-core" "st,stm32h7-adc-core" + "st,stm32mp1-adc-core" - reg: Offset and length of the ADC block register set. -- interrupts: Must contain the interrupt for ADC block. +- interrupts: One or more interrupts for ADC block. Some parts like stm32f4 + and stm32h7 share a common ADC interrupt line. stm32mp1 has two separate + interrupt lines, one for each ADC within ADC block. - clocks: Core can use up to two clocks, depending on part used: - "adc" clock: for the analog circuitry, common to all ADCs. It's required on stm32f4. @@ -53,6 +56,7 @@ Required properties: - compatible: Should be one of: "st,stm32f4-adc" "st,stm32h7-adc" + "st,stm32mp1-adc" - reg: Offset of ADC instance in ADC block (e.g. may be 0x0, 0x100, 0x200). - clocks: Input clock private to this ADC instance. It's required only on stm32f4, that has per instance clock input for registers access. diff --git a/Documentation/devicetree/bindings/iio/adc/st,stm32-dfsdm-adc.txt b/Documentation/devicetree/bindings/iio/adc/st,stm32-dfsdm-adc.txt index ed7520d1d051..75ba25d062e1 100644 --- a/Documentation/devicetree/bindings/iio/adc/st,stm32-dfsdm-adc.txt +++ b/Documentation/devicetree/bindings/iio/adc/st,stm32-dfsdm-adc.txt @@ -8,14 +8,16 @@ It is mainly targeted for: - PDM microphones (audio digital microphone) It features up to 8 serial digital interfaces (SPI or Manchester) and -up to 4 filters on stm32h7. +up to 4 filters on stm32h7 or 6 filters on stm32mp1. Each child node match with a filter instance. Contents of a STM32 DFSDM root node: ------------------------------------ Required properties: -- compatible: Should be "st,stm32h7-dfsdm". +- compatible: Should be one of: + "st,stm32h7-dfsdm" + "st,stm32mp1-dfsdm" - reg: Offset and length of the DFSDM block register set. - clocks: IP and serial interfaces clocking. Should be set according to rcc clock ID and "clock-names". @@ -45,6 +47,7 @@ Required properties: "st,stm32-dfsdm-adc" for sigma delta ADCs "st,stm32-dfsdm-dmic" for audio digital microphone. - reg: Specifies the DFSDM filter instance used. + Valid values are from 0 to 3 on stm32h7, 0 to 5 on stm32mp1. - interrupts: IRQ lines connected to each DFSDM filter instance. - st,adc-channels: List of single-ended channels muxed for this ADC. valid values: diff --git a/Documentation/devicetree/bindings/iio/afe/current-sense-amplifier.txt b/Documentation/devicetree/bindings/iio/afe/current-sense-amplifier.txt new file mode 100644 index 000000000000..821b61b8c542 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/afe/current-sense-amplifier.txt @@ -0,0 +1,26 @@ +Current Sense Amplifier +======================= + +When an io-channel measures the output voltage from a current sense +amplifier, the interesting measurement is almost always the current +through the sense resistor, not the voltage output. This binding +describes such a current sense circuit. + +Required properties: +- compatible : "current-sense-amplifier" +- io-channels : Channel node of a voltage io-channel. +- sense-resistor-micro-ohms : The sense resistance in microohms. + +Optional properties: +- sense-gain-mult: Amplifier gain multiplier. The default is <1>. +- sense-gain-div: Amplifier gain divider. The default is <1>. + +Example: + +sysi { + compatible = "current-sense-amplifier"; + io-channels = <&tiadc 0>; + + sense-resistor-micro-ohms = <20000>; + sense-gain-mul = <50>; +}; diff --git a/Documentation/devicetree/bindings/iio/afe/current-sense-shunt.txt b/Documentation/devicetree/bindings/iio/afe/current-sense-shunt.txt new file mode 100644 index 000000000000..0f67108a07b6 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/afe/current-sense-shunt.txt @@ -0,0 +1,41 @@ +Current Sense Shunt +=================== + +When an io-channel measures the voltage over a current sense shunt, +the interesting measurement is almost always the current through the +shunt, not the voltage over it. This binding describes such a current +sense circuit. + +Required properties: +- compatible : "current-sense-shunt" +- io-channels : Channel node of a voltage io-channel. +- shunt-resistor-micro-ohms : The shunt resistance in microohms. + +Example: +The system current is measured by measuring the voltage over a +3.3 ohms shunt resistor. + +sysi { + compatible = "current-sense-shunt"; + io-channels = <&tiadc 0>; + + /* Divide the voltage by 3300000/1000000 (or 3.3) for the current. */ + shunt-resistor-micro-ohms = <3300000>; +}; + +&i2c { + tiadc: adc@48 { + compatible = "ti,ads1015"; + reg = <0x48>; + #io-channel-cells = <1>; + + #address-cells = <1>; + #size-cells = <0>; + + channel@0 { /* IN0,IN1 differential */ + reg = <0>; + ti,gain = <1>; + ti,datarate = <4>; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/iio/afe/voltage-divider.txt b/Documentation/devicetree/bindings/iio/afe/voltage-divider.txt new file mode 100644 index 000000000000..b452a8406107 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/afe/voltage-divider.txt @@ -0,0 +1,53 @@ +Voltage divider +=============== + +When an io-channel measures the midpoint of a voltage divider, the +interesting voltage is often the voltage over the full resistance +of the divider. This binding describes the voltage divider in such +a curcuit. + + Vin ----. + | + .-----. + | R | + '-----' + | + +---- Vout + | + .-----. + | Rout| + '-----' + | + GND + +Required properties: +- compatible : "voltage-divider" +- io-channels : Channel node of a voltage io-channel measuring Vout. +- output-ohms : Resistance Rout over which the output voltage is measured. + See full-ohms. +- full-ohms : Resistance R + Rout for the full divider. The io-channel + is scaled by the Rout / (R + Rout) quotient. + +Example: +The system voltage is circa 12V, but divided down with a 22/222 +voltage divider (R = 200 Ohms, Rout = 22 Ohms) and fed to an ADC. + +sysv { + compatible = "voltage-divider"; + io-channels = <&maxadc 1>; + + /* Scale the system voltage by 22/222 to fit the ADC range. */ + output-ohms = <22>; + full-ohms = <222>; /* 200 + 22 */ +}; + +&spi { + maxadc: adc@0 { + compatible = "maxim,max1027"; + reg = <0>; + #io-channel-cells = <1>; + interrupt-parent = <&gpio5>; + interrupts = <15 IRQ_TYPE_EDGE_RISING>; + spi-max-frequency = <1000000>; + }; +}; diff --git a/Documentation/devicetree/bindings/iio/dac/ltc2632.txt b/Documentation/devicetree/bindings/iio/dac/ltc2632.txt index eb911e5a8ab4..e0d5fea33031 100644 --- a/Documentation/devicetree/bindings/iio/dac/ltc2632.txt +++ b/Documentation/devicetree/bindings/iio/dac/ltc2632.txt @@ -12,12 +12,26 @@ Required properties: Property rules described in Documentation/devicetree/bindings/spi/spi-bus.txt apply. In particular, "reg" and "spi-max-frequency" properties must be given. +Optional properties: + - vref-supply: Phandle to the external reference voltage supply. This should + only be set if there is an external reference voltage connected to the VREF + pin. If the property is not set the internal reference is used. + Example: + vref: regulator-vref { + compatible = "regulator-fixed"; + regulator-name = "vref-ltc2632"; + regulator-min-microvolt = <1250000>; + regulator-max-microvolt = <1250000>; + regulator-always-on; + }; + spi_master { dac: ltc2632@0 { compatible = "lltc,ltc2632-l12"; reg = <0>; /* CS0 */ spi-max-frequency = <1000000>; + vref-supply = <&vref>; /* optional */ }; }; diff --git a/Documentation/devicetree/bindings/iio/dac/ti,dac5571.txt b/Documentation/devicetree/bindings/iio/dac/ti,dac5571.txt new file mode 100644 index 000000000000..03af6b9a4d07 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/dac/ti,dac5571.txt @@ -0,0 +1,24 @@ +* Texas Instruments DAC5571 Family + +Required properties: + - compatible: Should contain + "ti,dac5571" + "ti,dac6571" + "ti,dac7571" + "ti,dac5574" + "ti,dac6574" + "ti,dac7574" + "ti,dac5573" + "ti,dac6573" + "ti,dac7573" + - reg: Should contain the DAC I2C address + +Optional properties: + - vref-supply: The regulator supply for DAC reference voltage + +Example: +dac@0 { + compatible = "ti,dac5571"; + reg = <0x4C>; + vref-supply = <&vdd_supply>; +}; diff --git a/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt b/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt index 2b4514592f83..5f4777e8cc9e 100644 --- a/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt +++ b/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt @@ -8,10 +8,16 @@ Required properties: "invensense,mpu6500" "invensense,mpu9150" "invensense,mpu9250" + "invensense,mpu9255" "invensense,icm20608" - reg : the I2C address of the sensor - interrupt-parent : should be the phandle for the interrupt controller - - interrupts : interrupt mapping for GPIO IRQ + - interrupts: interrupt mapping for IRQ. It should be configured with flags + IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_EDGE_RISING, IRQ_TYPE_LEVEL_LOW or + IRQ_TYPE_EDGE_FALLING. + + Refer to interrupt-controller/interrupts.txt for generic interrupt client node + bindings. Optional properties: - mount-matrix: an optional 3x3 mounting rotation matrix @@ -24,7 +30,7 @@ Example: compatible = "invensense,mpu6050"; reg = <0x68>; interrupt-parent = <&gpio1>; - interrupts = <18 1>; + interrupts = <18 IRQ_TYPE_EDGE_RISING>; mount-matrix = "-0.984807753012208", /* x0 */ "0", /* y0 */ "-0.173648177666930", /* z0 */ @@ -41,7 +47,7 @@ Example: compatible = "invensense,mpu9250"; reg = <0x68>; interrupt-parent = <&gpio3>; - interrupts = <21 1>; + interrupts = <21 IRQ_TYPE_LEVEL_HIGH>; i2c-gate { #address-cells = <1>; #size-cells = <0>; diff --git a/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt b/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt index 1ff1af799c76..ef8a8566c63f 100644 --- a/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt +++ b/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt @@ -6,6 +6,7 @@ Required properties: "st,lsm6ds3h" "st,lsm6dsl" "st,lsm6dsm" + "st,ism330dlc" - reg: i2c address of the sensor / spi cs line Optional properties: diff --git a/Documentation/devicetree/bindings/iio/potentiostat/lmp91000.txt b/Documentation/devicetree/bindings/iio/potentiostat/lmp91000.txt index b9b621e94cd7..e6d0c2eb345c 100644 --- a/Documentation/devicetree/bindings/iio/potentiostat/lmp91000.txt +++ b/Documentation/devicetree/bindings/iio/potentiostat/lmp91000.txt @@ -1,10 +1,13 @@ -* Texas Instruments LMP91000 potentiostat +* Texas Instruments LMP91000 series of potentiostats -http://www.ti.com/lit/ds/symlink/lmp91000.pdf +LMP91000: http://www.ti.com/lit/ds/symlink/lmp91000.pdf +LMP91002: http://www.ti.com/lit/ds/symlink/lmp91002.pdf Required properties: - - compatible: should be "ti,lmp91000" + - compatible: should be one of the following: + "ti,lmp91000" + "ti,lmp91002" - reg: the I2C address of the device - io-channels: the phandle of the iio provider diff --git a/Documentation/devicetree/bindings/input/atmel,maxtouch.txt b/Documentation/devicetree/bindings/input/atmel,maxtouch.txt index 23e3abc3fdef..c88919480d37 100644 --- a/Documentation/devicetree/bindings/input/atmel,maxtouch.txt +++ b/Documentation/devicetree/bindings/input/atmel,maxtouch.txt @@ -4,6 +4,13 @@ Required properties: - compatible: atmel,maxtouch + The following compatibles have been used in various products but are + deprecated: + atmel,qt602240_ts + atmel,atmel_mxt_ts + atmel,atmel_mxt_tp + atmel,mXT224 + - reg: The I2C address of the device - interrupts: The sink for the touchpad's IRQ output diff --git a/Documentation/devicetree/bindings/input/elan_i2c.txt b/Documentation/devicetree/bindings/input/elan_i2c.txt index ee3242c4ba67..d80a83583238 100644 --- a/Documentation/devicetree/bindings/input/elan_i2c.txt +++ b/Documentation/devicetree/bindings/input/elan_i2c.txt @@ -14,6 +14,7 @@ Optional properties: - pinctrl-0: a phandle pointing to the pin settings for the device (see pinctrl binding [1]). - vcc-supply: a phandle for the regulator supplying 3.3V power. +- elan,trackpoint: touchpad can support a trackpoint (boolean) [0]: Documentation/devicetree/bindings/interrupt-controller/interrupts.txt [1]: Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt diff --git a/Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt b/Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt index a83f9a5734ca..89674ad8a097 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt +++ b/Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt @@ -9,11 +9,12 @@ number of interrupt exposed depends on the SoC. Required properties: -- compatible : must have "amlogic,meson8-gpio-intc” and either - “amlogic,meson8-gpio-intc” for meson8 SoCs (S802) or - “amlogic,meson8b-gpio-intc” for meson8b SoCs (S805) or - “amlogic,meson-gxbb-gpio-intc” for GXBB SoCs (S905) or - “amlogic,meson-gxl-gpio-intc” for GXL SoCs (S905X, S912) +- compatible : must have "amlogic,meson8-gpio-intc" and either + "amlogic,meson8-gpio-intc" for meson8 SoCs (S802) or + "amlogic,meson8b-gpio-intc" for meson8b SoCs (S805) or + "amlogic,meson-gxbb-gpio-intc" for GXBB SoCs (S905) or + "amlogic,meson-gxl-gpio-intc" for GXL SoCs (S905X, S912) + "amlogic,meson-axg-gpio-intc" for AXG SoCs (A113D, A113X) - interrupt-parent : a phandle to the GIC the interrupts are routed to. Usually this is provided at the root level of the device tree as it is common to most of the SoC. diff --git a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt index 0a57f2f4167d..3ea78c4ef887 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt +++ b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt @@ -57,6 +57,20 @@ Optional occupied by the redistributors. Required if more than one such region is present. +- msi-controller: Boolean property. Identifies the node as an MSI + controller. Only present if the Message Based Interrupt + functionnality is being exposed by the HW, and the mbi-ranges + property present. + +- mbi-ranges: A list of pairs <intid span>, where "intid" is the first + SPI of a range that can be used an MBI, and "span" the size of that + range. Multiple ranges can be provided. Requires "msi-controller" to + be set. + +- mbi-alias: Address property. Base address of an alias of the GICD + region containing only the {SET,CLR}SPI registers to be used if + isolation is required, and if supported by the HW. + Sub-nodes: PPI affinity can be expressed as a single "ppi-partitions" node, @@ -99,6 +113,9 @@ Examples: <0x0 0x2c020000 0 0x2000>; // GICV interrupts = <1 9 4>; + msi-controller; + mbi-ranges = <256 128>; + gic-its@2c200000 { compatible = "arm,gic-v3-its"; msi-controller; diff --git a/Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.txt b/Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.txt index edf03f09244b..136bd612bd83 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.txt +++ b/Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.txt @@ -5,11 +5,14 @@ Required properties: - compatible: Should be: "st,stm32-exti" "st,stm32h7-exti" + "st,stm32mp1-exti" - reg: Specifies base physical address and size of the registers - interrupt-controller: Indentifies the node as an interrupt controller - #interrupt-cells: Specifies the number of cells to encode an interrupt specifier, shall be 2 - interrupts: interrupts references to primary interrupt controller + (only needed for exti controller with multiple exti under + same parent interrupt: st,stm32-exti and st,stm32h7-exti") Example: diff --git a/Documentation/devicetree/bindings/ipmi/npcm7xx-kcs-bmc.txt b/Documentation/devicetree/bindings/ipmi/npcm7xx-kcs-bmc.txt new file mode 100644 index 000000000000..3538a214fff1 --- /dev/null +++ b/Documentation/devicetree/bindings/ipmi/npcm7xx-kcs-bmc.txt @@ -0,0 +1,39 @@ +* Nuvoton NPCM7xx KCS (Keyboard Controller Style) IPMI interface + +The Nuvoton SOCs (NPCM7xx) are commonly used as BMCs +(Baseboard Management Controllers) and the KCS interface can be +used to perform in-band IPMI communication with their host. + +Required properties: +- compatible : should be one of + "nuvoton,npcm750-kcs-bmc" +- interrupts : interrupt generated by the controller +- kcs_chan : The KCS channel number in the controller + +Example: + + lpc_kcs: lpc_kcs@f0007000 { + compatible = "nuvoton,npcm750-lpc-kcs", "simple-mfd", "syscon"; + reg = <0xf0007000 0x40>; + reg-io-width = <1>; + + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0xf0007000 0x40>; + + kcs1: kcs1@0 { + compatible = "nuvoton,npcm750-kcs-bmc"; + reg = <0x0 0x40>; + interrupts = <0 9 4>; + kcs_chan = <1>; + status = "disabled"; + }; + + kcs2: kcs2@0 { + compatible = "nuvoton,npcm750-kcs-bmc"; + reg = <0x0 0x40>; + interrupts = <0 9 4>; + kcs_chan = <2>; + status = "disabled"; + }; + };
\ No newline at end of file diff --git a/Documentation/devicetree/bindings/leds/leds-cr0014114.txt b/Documentation/devicetree/bindings/leds/leds-cr0014114.txt new file mode 100644 index 000000000000..4255b19ad25c --- /dev/null +++ b/Documentation/devicetree/bindings/leds/leds-cr0014114.txt @@ -0,0 +1,54 @@ +Crane Merchandising System - cr0014114 LED driver +------------------------------------------------- + +This LED Board is widely used in vending machines produced +by Crane Merchandising Systems. + +Required properties: +- compatible: "crane,cr0014114" + +Property rules described in Documentation/devicetree/bindings/spi/spi-bus.txt +apply. In particular, "reg" and "spi-max-frequency" properties must be given. + +LED sub-node properties: +- label : + see Documentation/devicetree/bindings/leds/common.txt +- linux,default-trigger : (optional) + see Documentation/devicetree/bindings/leds/common.txt + +Example +------- + +led-controller@0 { + compatible = "crane,cr0014114"; + reg = <0>; + spi-max-frequency = <50000>; + #address-cells = <1>; + #size-cells = <0>; + + led@0 { + reg = <0>; + label = "red:coin"; + }; + led@1 { + reg = <1>; + label = "green:coin"; + }; + led@2 { + reg = <2>; + label = "blue:coin"; + }; + led@3 { + reg = <3>; + label = "red:bill"; + }; + led@4 { + reg = <4>; + label = "green:bill"; + }; + led@5 { + reg = <5>; + label = "blue:bill"; + }; + ... +}; diff --git a/Documentation/devicetree/bindings/leds/leds-lm3601x.txt b/Documentation/devicetree/bindings/leds/leds-lm3601x.txt new file mode 100644 index 000000000000..a88b2c41e75b --- /dev/null +++ b/Documentation/devicetree/bindings/leds/leds-lm3601x.txt @@ -0,0 +1,45 @@ +* Texas Instruments - lm3601x Single-LED Flash Driver + +The LM3601X are ultra-small LED flash drivers that +provide a high level of adjustability. + +Required properties: + - compatible : Can be one of the following + "ti,lm36010" + "ti,lm36011" + - reg : I2C slave address + - #address-cells : 1 + - #size-cells : 0 + +Required child properties: + - reg : 0 - Indicates a IR mode + 1 - Indicates a Torch (white LED) mode + +Required properties for flash LED child nodes: + See Documentation/devicetree/bindings/leds/common.txt + - flash-max-microamp : Range from 11mA - 1.5A + - flash-max-timeout-us : Range from 40ms - 1600ms + - led-max-microamp : Range from 2.4mA - 376mA + +Optional child properties: + - label : see Documentation/devicetree/bindings/leds/common.txt + +Example: +led-controller@64 { + compatible = "ti,lm36010"; + #address-cells = <1>; + #size-cells = <0>; + reg = <0x64>; + + led@0 { + reg = <1>; + label = "white:torch"; + led-max-microamp = <376000>; + flash-max-microamp = <1500000>; + flash-max-timeout-us = <1600000>; + }; +} + +For more product information please see the links below: +http://www.ti.com/product/LM36010 +http://www.ti.com/product/LM36011 diff --git a/Documentation/devicetree/bindings/leds/leds-sc27xx-bltc.txt b/Documentation/devicetree/bindings/leds/leds-sc27xx-bltc.txt new file mode 100644 index 000000000000..dddf84f9c7ea --- /dev/null +++ b/Documentation/devicetree/bindings/leds/leds-sc27xx-bltc.txt @@ -0,0 +1,41 @@ +LEDs connected to Spreadtrum SC27XX PMIC breathing light controller + +The SC27xx breathing light controller supports to 3 outputs: +red LED, green LED and blue LED. Each LED can work at normal +PWM mode or breath light mode. + +Required properties: +- compatible: Should be "sprd,sc2731-bltc". +- #address-cells: Must be 1. +- #size-cells: Must be 0. +- reg: Specify the controller address. + +Required child properties: +- reg: Port this LED is connected to. + +Optional child properties: +- label: See Documentation/devicetree/bindings/leds/common.txt. + +Examples: + +led-controller@200 { + compatible = "sprd,sc2731-bltc"; + #address-cells = <1>; + #size-cells = <0>; + reg = <0x200>; + + led@0 { + label = "red"; + reg = <0x0>; + }; + + led@1 { + label = "green"; + reg = <0x1>; + }; + + led@2 { + label = "blue"; + reg = <0x2>; + }; +}; diff --git a/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.txt b/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.txt index 16964f0c1773..6e8a9ab0fdae 100644 --- a/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.txt +++ b/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.txt @@ -10,6 +10,8 @@ platforms. Definition: must be one of: "qcom,msm8916-apcs-kpss-global", "qcom,msm8996-apcs-hmss-global" + "qcom,msm8998-apcs-hmss-global" + "qcom,sdm845-apss-shared" - reg: Usage: required diff --git a/Documentation/devicetree/bindings/mailbox/stm32-ipcc.txt b/Documentation/devicetree/bindings/mailbox/stm32-ipcc.txt new file mode 100644 index 000000000000..1d2b7fee7b85 --- /dev/null +++ b/Documentation/devicetree/bindings/mailbox/stm32-ipcc.txt @@ -0,0 +1,47 @@ +* STMicroelectronics STM32 IPCC (Inter-Processor Communication Controller) + +The IPCC block provides a non blocking signaling mechanism to post and +retrieve messages in an atomic way between two processors. +It provides the signaling for N bidirectionnal channels. The number of channels +(N) can be read from a dedicated register. + +Required properties: +- compatible: Must be "st,stm32mp1-ipcc" +- reg: Register address range (base address and length) +- st,proc-id: Processor id using the mailbox (0 or 1) +- clocks: Input clock +- interrupt-names: List of names for the interrupts described by the interrupt + property. Must contain the following entries: + - "rx" + - "tx" + - "wakeup" +- interrupts: Interrupt specifiers for "rx channel occupied", "tx channel + free" and "system wakeup". +- #mbox-cells: Number of cells required for the mailbox specifier. Must be 1. + The data contained in the mbox specifier of the "mboxes" + property in the client node is the mailbox channel index. + +Optional properties: +- wakeup-source: Flag to indicate whether this device can wake up the system + + + +Example: + ipcc: mailbox@4c001000 { + compatible = "st,stm32mp1-ipcc"; + #mbox-cells = <1>; + reg = <0x4c001000 0x400>; + st,proc-id = <0>; + interrupts-extended = <&intc GIC_SPI 100 IRQ_TYPE_NONE>, + <&intc GIC_SPI 101 IRQ_TYPE_NONE>, + <&aiec 62 1>; + interrupt-names = "rx", "tx", "wakeup"; + clocks = <&rcc_clk IPCC>; + wakeup-source; + } + +Client: + mbox_test { + ... + mboxes = <&ipcc 0>, <&ipcc 1>; + }; diff --git a/Documentation/devicetree/bindings/marvell.txt b/Documentation/devicetree/bindings/marvell.txt deleted file mode 100644 index 7f722316458a..000000000000 --- a/Documentation/devicetree/bindings/marvell.txt +++ /dev/null @@ -1,516 +0,0 @@ -Marvell Discovery mv64[345]6x System Controller chips -=========================================================== - -The Marvell mv64[345]60 series of system controller chips contain -many of the peripherals needed to implement a complete computer -system. In this section, we define device tree nodes to describe -the system controller chip itself and each of the peripherals -which it contains. Compatible string values for each node are -prefixed with the string "marvell,", for Marvell Technology Group Ltd. - -1) The /system-controller node - - This node is used to represent the system-controller and must be - present when the system uses a system controller chip. The top-level - system-controller node contains information that is global to all - devices within the system controller chip. The node name begins - with "system-controller" followed by the unit address, which is - the base address of the memory-mapped register set for the system - controller chip. - - Required properties: - - - ranges : Describes the translation of system controller addresses - for memory mapped registers. - - clock-frequency: Contains the main clock frequency for the system - controller chip. - - reg : This property defines the address and size of the - memory-mapped registers contained within the system controller - chip. The address specified in the "reg" property should match - the unit address of the system-controller node. - - #address-cells : Address representation for system controller - devices. This field represents the number of cells needed to - represent the address of the memory-mapped registers of devices - within the system controller chip. - - #size-cells : Size representation for the memory-mapped - registers within the system controller chip. - - #interrupt-cells : Defines the width of cells used to represent - interrupts. - - Optional properties: - - - model : The specific model of the system controller chip. Such - as, "mv64360", "mv64460", or "mv64560". - - compatible : A string identifying the compatibility identifiers - of the system controller chip. - - The system-controller node contains child nodes for each system - controller device that the platform uses. Nodes should not be created - for devices which exist on the system controller chip but are not used - - Example Marvell Discovery mv64360 system-controller node: - - system-controller@f1000000 { /* Marvell Discovery mv64360 */ - #address-cells = <1>; - #size-cells = <1>; - model = "mv64360"; /* Default */ - compatible = "marvell,mv64360"; - clock-frequency = <133333333>; - reg = <0xf1000000 0x10000>; - virtual-reg = <0xf1000000>; - ranges = <0x88000000 0x88000000 0x1000000 /* PCI 0 I/O Space */ - 0x80000000 0x80000000 0x8000000 /* PCI 0 MEM Space */ - 0xa0000000 0xa0000000 0x4000000 /* User FLASH */ - 0x00000000 0xf1000000 0x0010000 /* Bridge's regs */ - 0xf2000000 0xf2000000 0x0040000>;/* Integrated SRAM */ - - [ child node definitions... ] - } - -2) Child nodes of /system-controller - - a) Marvell Discovery MDIO bus - - The MDIO is a bus to which the PHY devices are connected. For each - device that exists on this bus, a child node should be created. See - the definition of the PHY node below for an example of how to define - a PHY. - - Required properties: - - #address-cells : Should be <1> - - #size-cells : Should be <0> - - compatible : Should be "marvell,mv64360-mdio" - - Example: - - mdio { - #address-cells = <1>; - #size-cells = <0>; - compatible = "marvell,mv64360-mdio"; - - ethernet-phy@0 { - ...... - }; - }; - - - b) Marvell Discovery ethernet controller - - The Discover ethernet controller is described with two levels - of nodes. The first level describes an ethernet silicon block - and the second level describes up to 3 ethernet nodes within - that block. The reason for the multiple levels is that the - registers for the node are interleaved within a single set - of registers. The "ethernet-block" level describes the - shared register set, and the "ethernet" nodes describe ethernet - port-specific properties. - - Ethernet block node - - Required properties: - - #address-cells : <1> - - #size-cells : <0> - - compatible : "marvell,mv64360-eth-block" - - reg : Offset and length of the register set for this block - - Optional properties: - - clocks : Phandle to the clock control device and gate bit - - Example Discovery Ethernet block node: - ethernet-block@2000 { - #address-cells = <1>; - #size-cells = <0>; - compatible = "marvell,mv64360-eth-block"; - reg = <0x2000 0x2000>; - ethernet@0 { - ....... - }; - }; - - Ethernet port node - - Required properties: - - compatible : Should be "marvell,mv64360-eth". - - reg : Should be <0>, <1>, or <2>, according to which registers - within the silicon block the device uses. - - interrupts : <a> where a is the interrupt number for the port. - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - phy : the phandle for the PHY connected to this ethernet - controller. - - local-mac-address : 6 bytes, MAC address - - Example Discovery Ethernet port node: - ethernet@0 { - compatible = "marvell,mv64360-eth"; - reg = <0>; - interrupts = <32>; - interrupt-parent = <&PIC>; - phy = <&PHY0>; - local-mac-address = [ 00 00 00 00 00 00 ]; - }; - - - - c) Marvell Discovery PHY nodes - - Required properties: - - interrupts : <a> where a is the interrupt number for this phy. - - interrupt-parent : the phandle for the interrupt controller that - services interrupts for this device. - - reg : The ID number for the phy, usually a small integer - - Example Discovery PHY node: - ethernet-phy@1 { - compatible = "broadcom,bcm5421"; - interrupts = <76>; /* GPP 12 */ - interrupt-parent = <&PIC>; - reg = <1>; - }; - - - d) Marvell Discovery SDMA nodes - - Represent DMA hardware associated with the MPSC (multiprotocol - serial controllers). - - Required properties: - - compatible : "marvell,mv64360-sdma" - - reg : Offset and length of the register set for this device - - interrupts : <a> where a is the interrupt number for the DMA - device. - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery SDMA node: - sdma@4000 { - compatible = "marvell,mv64360-sdma"; - reg = <0x4000 0xc18>; - virtual-reg = <0xf1004000>; - interrupts = <36>; - interrupt-parent = <&PIC>; - }; - - - e) Marvell Discovery BRG nodes - - Represent baud rate generator hardware associated with the MPSC - (multiprotocol serial controllers). - - Required properties: - - compatible : "marvell,mv64360-brg" - - reg : Offset and length of the register set for this device - - clock-src : A value from 0 to 15 which selects the clock - source for the baud rate generator. This value corresponds - to the CLKS value in the BRGx configuration register. See - the mv64x60 User's Manual. - - clock-frequence : The frequency (in Hz) of the baud rate - generator's input clock. - - current-speed : The current speed setting (presumably by - firmware) of the baud rate generator. - - Example Discovery BRG node: - brg@b200 { - compatible = "marvell,mv64360-brg"; - reg = <0xb200 0x8>; - clock-src = <8>; - clock-frequency = <133333333>; - current-speed = <9600>; - }; - - - f) Marvell Discovery CUNIT nodes - - Represent the Serial Communications Unit device hardware. - - Required properties: - - reg : Offset and length of the register set for this device - - Example Discovery CUNIT node: - cunit@f200 { - reg = <0xf200 0x200>; - }; - - - g) Marvell Discovery MPSCROUTING nodes - - Represent the Discovery's MPSC routing hardware - - Required properties: - - reg : Offset and length of the register set for this device - - Example Discovery CUNIT node: - mpscrouting@b500 { - reg = <0xb400 0xc>; - }; - - - h) Marvell Discovery MPSCINTR nodes - - Represent the Discovery's MPSC DMA interrupt hardware registers - (SDMA cause and mask registers). - - Required properties: - - reg : Offset and length of the register set for this device - - Example Discovery MPSCINTR node: - mpsintr@b800 { - reg = <0xb800 0x100>; - }; - - - i) Marvell Discovery MPSC nodes - - Represent the Discovery's MPSC (Multiprotocol Serial Controller) - serial port. - - Required properties: - - compatible : "marvell,mv64360-mpsc" - - reg : Offset and length of the register set for this device - - sdma : the phandle for the SDMA node used by this port - - brg : the phandle for the BRG node used by this port - - cunit : the phandle for the CUNIT node used by this port - - mpscrouting : the phandle for the MPSCROUTING node used by this port - - mpscintr : the phandle for the MPSCINTR node used by this port - - cell-index : the hardware index of this cell in the MPSC core - - max_idle : value needed for MPSC CHR3 (Maximum Frame Length) - register - - interrupts : <a> where a is the interrupt number for the MPSC. - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery MPSCINTR node: - mpsc@8000 { - compatible = "marvell,mv64360-mpsc"; - reg = <0x8000 0x38>; - virtual-reg = <0xf1008000>; - sdma = <&SDMA0>; - brg = <&BRG0>; - cunit = <&CUNIT>; - mpscrouting = <&MPSCROUTING>; - mpscintr = <&MPSCINTR>; - cell-index = <0>; - max_idle = <40>; - interrupts = <40>; - interrupt-parent = <&PIC>; - }; - - - j) Marvell Discovery Watch Dog Timer nodes - - Represent the Discovery's watchdog timer hardware - - Required properties: - - compatible : "marvell,mv64360-wdt" - - reg : Offset and length of the register set for this device - - Example Discovery Watch Dog Timer node: - wdt@b410 { - compatible = "marvell,mv64360-wdt"; - reg = <0xb410 0x8>; - }; - - - k) Marvell Discovery I2C nodes - - Represent the Discovery's I2C hardware - - Required properties: - - device_type : "i2c" - - compatible : "marvell,mv64360-i2c" - - reg : Offset and length of the register set for this device - - interrupts : <a> where a is the interrupt number for the I2C. - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery I2C node: - compatible = "marvell,mv64360-i2c"; - reg = <0xc000 0x20>; - virtual-reg = <0xf100c000>; - interrupts = <37>; - interrupt-parent = <&PIC>; - }; - - - l) Marvell Discovery PIC (Programmable Interrupt Controller) nodes - - Represent the Discovery's PIC hardware - - Required properties: - - #interrupt-cells : <1> - - #address-cells : <0> - - compatible : "marvell,mv64360-pic" - - reg : Offset and length of the register set for this device - - interrupt-controller - - Example Discovery PIC node: - pic { - #interrupt-cells = <1>; - #address-cells = <0>; - compatible = "marvell,mv64360-pic"; - reg = <0x0 0x88>; - interrupt-controller; - }; - - - m) Marvell Discovery MPP (Multipurpose Pins) multiplexing nodes - - Represent the Discovery's MPP hardware - - Required properties: - - compatible : "marvell,mv64360-mpp" - - reg : Offset and length of the register set for this device - - Example Discovery MPP node: - mpp@f000 { - compatible = "marvell,mv64360-mpp"; - reg = <0xf000 0x10>; - }; - - - n) Marvell Discovery GPP (General Purpose Pins) nodes - - Represent the Discovery's GPP hardware - - Required properties: - - compatible : "marvell,mv64360-gpp" - - reg : Offset and length of the register set for this device - - Example Discovery GPP node: - gpp@f000 { - compatible = "marvell,mv64360-gpp"; - reg = <0xf100 0x20>; - }; - - - o) Marvell Discovery PCI host bridge node - - Represents the Discovery's PCI host bridge device. The properties - for this node conform to Rev 2.1 of the PCI Bus Binding to IEEE - 1275-1994. A typical value for the compatible property is - "marvell,mv64360-pci". - - Example Discovery PCI host bridge node - pci@80000000 { - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - device_type = "pci"; - compatible = "marvell,mv64360-pci"; - reg = <0xcf8 0x8>; - ranges = <0x01000000 0x0 0x0 - 0x88000000 0x0 0x01000000 - 0x02000000 0x0 0x80000000 - 0x80000000 0x0 0x08000000>; - bus-range = <0 255>; - clock-frequency = <66000000>; - interrupt-parent = <&PIC>; - interrupt-map-mask = <0xf800 0x0 0x0 0x7>; - interrupt-map = < - /* IDSEL 0x0a */ - 0x5000 0 0 1 &PIC 80 - 0x5000 0 0 2 &PIC 81 - 0x5000 0 0 3 &PIC 91 - 0x5000 0 0 4 &PIC 93 - - /* IDSEL 0x0b */ - 0x5800 0 0 1 &PIC 91 - 0x5800 0 0 2 &PIC 93 - 0x5800 0 0 3 &PIC 80 - 0x5800 0 0 4 &PIC 81 - - /* IDSEL 0x0c */ - 0x6000 0 0 1 &PIC 91 - 0x6000 0 0 2 &PIC 93 - 0x6000 0 0 3 &PIC 80 - 0x6000 0 0 4 &PIC 81 - - /* IDSEL 0x0d */ - 0x6800 0 0 1 &PIC 93 - 0x6800 0 0 2 &PIC 80 - 0x6800 0 0 3 &PIC 81 - 0x6800 0 0 4 &PIC 91 - >; - }; - - - p) Marvell Discovery CPU Error nodes - - Represent the Discovery's CPU error handler device. - - Required properties: - - compatible : "marvell,mv64360-cpu-error" - - reg : Offset and length of the register set for this device - - interrupts : the interrupt number for this device - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery CPU Error node: - cpu-error@70 { - compatible = "marvell,mv64360-cpu-error"; - reg = <0x70 0x10 0x128 0x28>; - interrupts = <3>; - interrupt-parent = <&PIC>; - }; - - - q) Marvell Discovery SRAM Controller nodes - - Represent the Discovery's SRAM controller device. - - Required properties: - - compatible : "marvell,mv64360-sram-ctrl" - - reg : Offset and length of the register set for this device - - interrupts : the interrupt number for this device - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery SRAM Controller node: - sram-ctrl@380 { - compatible = "marvell,mv64360-sram-ctrl"; - reg = <0x380 0x80>; - interrupts = <13>; - interrupt-parent = <&PIC>; - }; - - - r) Marvell Discovery PCI Error Handler nodes - - Represent the Discovery's PCI error handler device. - - Required properties: - - compatible : "marvell,mv64360-pci-error" - - reg : Offset and length of the register set for this device - - interrupts : the interrupt number for this device - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery PCI Error Handler node: - pci-error@1d40 { - compatible = "marvell,mv64360-pci-error"; - reg = <0x1d40 0x40 0xc28 0x4>; - interrupts = <12>; - interrupt-parent = <&PIC>; - }; - - - s) Marvell Discovery Memory Controller nodes - - Represent the Discovery's memory controller device. - - Required properties: - - compatible : "marvell,mv64360-mem-ctrl" - - reg : Offset and length of the register set for this device - - interrupts : the interrupt number for this device - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery Memory Controller node: - mem-ctrl@1400 { - compatible = "marvell,mv64360-mem-ctrl"; - reg = <0x1400 0x60>; - interrupts = <17>; - interrupt-parent = <&PIC>; - }; - - diff --git a/Documentation/devicetree/bindings/media/cdns,csi2rx.txt b/Documentation/devicetree/bindings/media/cdns,csi2rx.txt new file mode 100644 index 000000000000..6b02a0657ad9 --- /dev/null +++ b/Documentation/devicetree/bindings/media/cdns,csi2rx.txt @@ -0,0 +1,100 @@ +Cadence MIPI-CSI2 RX controller +=============================== + +The Cadence MIPI-CSI2 RX controller is a CSI-2 bridge supporting up to 4 CSI +lanes in input, and 4 different pixel streams in output. + +Required properties: + - compatible: must be set to "cdns,csi2rx" and an SoC-specific compatible + - reg: base address and size of the memory mapped region + - clocks: phandles to the clocks driving the controller + - clock-names: must contain: + * sys_clk: main clock + * p_clk: register bank clock + * pixel_if[0-3]_clk: pixel stream output clock, one for each stream + implemented in hardware, between 0 and 3 + +Optional properties: + - phys: phandle to the external D-PHY, phy-names must be provided + - phy-names: must contain "dphy", if the implementation uses an + external D-PHY + +Required subnodes: + - ports: A ports node with one port child node per device input and output + port, in accordance with the video interface bindings defined in + Documentation/devicetree/bindings/media/video-interfaces.txt. The + port nodes are numbered as follows: + + Port Description + ----------------------------- + 0 CSI-2 input + 1 Stream 0 output + 2 Stream 1 output + 3 Stream 2 output + 4 Stream 3 output + + The stream output port nodes are optional if they are not + connected to anything at the hardware level or implemented + in the design.Since there is only one endpoint per port, + the endpoints are not numbered. + + +Example: + +csi2rx: csi-bridge@0d060000 { + compatible = "cdns,csi2rx"; + reg = <0x0d060000 0x1000>; + clocks = <&byteclock>, <&byteclock> + <&coreclock>, <&coreclock>, + <&coreclock>, <&coreclock>; + clock-names = "sys_clk", "p_clk", + "pixel_if0_clk", "pixel_if1_clk", + "pixel_if2_clk", "pixel_if3_clk"; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + reg = <0>; + + csi2rx_in_sensor: endpoint { + remote-endpoint = <&sensor_out_csi2rx>; + clock-lanes = <0>; + data-lanes = <1 2>; + }; + }; + + port@1 { + reg = <1>; + + csi2rx_out_grabber0: endpoint { + remote-endpoint = <&grabber0_in_csi2rx>; + }; + }; + + port@2 { + reg = <2>; + + csi2rx_out_grabber1: endpoint { + remote-endpoint = <&grabber1_in_csi2rx>; + }; + }; + + port@3 { + reg = <3>; + + csi2rx_out_grabber2: endpoint { + remote-endpoint = <&grabber2_in_csi2rx>; + }; + }; + + port@4 { + reg = <4>; + + csi2rx_out_grabber3: endpoint { + remote-endpoint = <&grabber3_in_csi2rx>; + }; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/media/cdns,csi2tx.txt b/Documentation/devicetree/bindings/media/cdns,csi2tx.txt new file mode 100644 index 000000000000..459c6e332f52 --- /dev/null +++ b/Documentation/devicetree/bindings/media/cdns,csi2tx.txt @@ -0,0 +1,98 @@ +Cadence MIPI-CSI2 TX controller +=============================== + +The Cadence MIPI-CSI2 TX controller is a CSI-2 bridge supporting up to +4 CSI lanes in output, and up to 4 different pixel streams in input. + +Required properties: + - compatible: must be set to "cdns,csi2tx" + - reg: base address and size of the memory mapped region + - clocks: phandles to the clocks driving the controller + - clock-names: must contain: + * esc_clk: escape mode clock + * p_clk: register bank clock + * pixel_if[0-3]_clk: pixel stream output clock, one for each stream + implemented in hardware, between 0 and 3 + +Optional properties + - phys: phandle to the D-PHY. If it is set, phy-names need to be set + - phy-names: must contain "dphy" + +Required subnodes: + - ports: A ports node with one port child node per device input and output + port, in accordance with the video interface bindings defined in + Documentation/devicetree/bindings/media/video-interfaces.txt. The + port nodes are numbered as follows. + + Port Description + ----------------------------- + 0 CSI-2 output + 1 Stream 0 input + 2 Stream 1 input + 3 Stream 2 input + 4 Stream 3 input + + The stream input port nodes are optional if they are not + connected to anything at the hardware level or implemented + in the design. Since there is only one endpoint per port, + the endpoints are not numbered. + +Example: + +csi2tx: csi-bridge@0d0e1000 { + compatible = "cdns,csi2tx"; + reg = <0x0d0e1000 0x1000>; + clocks = <&byteclock>, <&byteclock>, + <&coreclock>, <&coreclock>, + <&coreclock>, <&coreclock>; + clock-names = "p_clk", "esc_clk", + "pixel_if0_clk", "pixel_if1_clk", + "pixel_if2_clk", "pixel_if3_clk"; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + reg = <0>; + + csi2tx_out: endpoint { + remote-endpoint = <&remote_in>; + clock-lanes = <0>; + data-lanes = <1 2>; + }; + }; + + port@1 { + reg = <1>; + + csi2tx_in_stream0: endpoint { + remote-endpoint = <&stream0_out>; + }; + }; + + port@2 { + reg = <2>; + + csi2tx_in_stream1: endpoint { + remote-endpoint = <&stream1_out>; + }; + }; + + port@3 { + reg = <3>; + + csi2tx_in_stream2: endpoint { + remote-endpoint = <&stream2_out>; + }; + }; + + port@4 { + reg = <4>; + + csi2tx_in_stream3: endpoint { + remote-endpoint = <&stream3_out>; + }; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/media/i2c/ov7251.txt b/Documentation/devicetree/bindings/media/i2c/ov7251.txt new file mode 100644 index 000000000000..8281151f7493 --- /dev/null +++ b/Documentation/devicetree/bindings/media/i2c/ov7251.txt @@ -0,0 +1,52 @@ +* Omnivision 1/7.5-Inch B&W VGA CMOS Digital Image Sensor + +The Omnivision OV7251 is a 1/7.5-Inch CMOS active pixel digital image sensor +with an active array size of 640H x 480V. It is programmable through a serial +I2C interface. + +Required Properties: +- compatible: Value should be "ovti,ov7251". +- clocks: Reference to the xclk clock. +- clock-names: Should be "xclk". +- clock-frequency: Frequency of the xclk clock. +- enable-gpios: Chip enable GPIO. Polarity is GPIO_ACTIVE_HIGH. This corresponds + to the hardware pin XSHUTDOWN which is physically active low. +- vdddo-supply: Chip digital IO regulator. +- vdda-supply: Chip analog regulator. +- vddd-supply: Chip digital core regulator. + +The device node shall contain one 'port' child node with a single 'endpoint' +subnode for its digital output video port, in accordance with the video +interface bindings defined in +Documentation/devicetree/bindings/media/video-interfaces.txt. + +Example: + + &i2c1 { + ... + + ov7251: camera-sensor@60 { + compatible = "ovti,ov7251"; + reg = <0x60>; + + enable-gpios = <&gpio1 6 GPIO_ACTIVE_HIGH>; + pinctrl-names = "default"; + pinctrl-0 = <&camera_bw_default>; + + clocks = <&clks 200>; + clock-names = "xclk"; + clock-frequency = <24000000>; + + vdddo-supply = <&camera_dovdd_1v8>; + vdda-supply = <&camera_avdd_2v8>; + vddd-supply = <&camera_dvdd_1v2>; + + port { + ov7251_ep: endpoint { + clock-lanes = <1>; + data-lanes = <0>; + remote-endpoint = <&csi0_ep>; + }; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/media/i2c/ov772x.txt b/Documentation/devicetree/bindings/media/i2c/ov772x.txt new file mode 100644 index 000000000000..0b3ede5b8e6a --- /dev/null +++ b/Documentation/devicetree/bindings/media/i2c/ov772x.txt @@ -0,0 +1,40 @@ +* Omnivision OV7720/OV7725 CMOS sensor + +The Omnivision OV7720/OV7725 sensor supports multiple resolutions output, +such as VGA, QVGA, and any size scaling down from CIF to 40x30. It also can +support the YUV422, RGB565/555/444, GRB422 or raw RGB output formats. + +Required Properties: +- compatible: shall be one of + "ovti,ov7720" + "ovti,ov7725" +- clocks: reference to the xclk input clock. + +Optional Properties: +- reset-gpios: reference to the GPIO connected to the RSTB pin which is + active low, if any. +- powerdown-gpios: reference to the GPIO connected to the PWDN pin which is + active high, if any. + +The device node shall contain one 'port' child node with one child 'endpoint' +subnode for its digital output video port, in accordance with the video +interface bindings defined in Documentation/devicetree/bindings/media/ +video-interfaces.txt. + +Example: + +&i2c0 { + ov772x: camera@21 { + compatible = "ovti,ov7725"; + reg = <0x21>; + reset-gpios = <&axi_gpio_0 0 GPIO_ACTIVE_LOW>; + powerdown-gpios = <&axi_gpio_0 1 GPIO_ACTIVE_LOW>; + clocks = <&xclk>; + + port { + ov772x_0: endpoint { + remote-endpoint = <&vcap1_in0>; + }; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/media/i2c/panasonic,amg88xx.txt b/Documentation/devicetree/bindings/media/i2c/panasonic,amg88xx.txt new file mode 100644 index 000000000000..4a3181a3dd7e --- /dev/null +++ b/Documentation/devicetree/bindings/media/i2c/panasonic,amg88xx.txt @@ -0,0 +1,19 @@ +* Panasonic AMG88xx + +The Panasonic family of AMG88xx Grid-Eye sensors allow recording +8x8 10Hz video which consists of thermal datapoints + +Required Properties: + - compatible : Must be "panasonic,amg88xx" + - reg : i2c address of the device + +Example: + + i2c0@1c22000 { + ... + amg88xx@69 { + compatible = "panasonic,amg88xx"; + reg = <0x69>; + }; + ... + }; diff --git a/Documentation/devicetree/bindings/media/rcar_vin.txt b/Documentation/devicetree/bindings/media/rcar_vin.txt index 1ce7ff9449c5..a19517e1c669 100644 --- a/Documentation/devicetree/bindings/media/rcar_vin.txt +++ b/Documentation/devicetree/bindings/media/rcar_vin.txt @@ -2,20 +2,28 @@ Renesas R-Car Video Input driver (rcar_vin) ------------------------------------------- The rcar_vin device provides video input capabilities for the Renesas R-Car -family of devices. The current blocks are always slaves and suppot one input -channel which can be either RGB, YUYV or BT656. +family of devices. + +Each VIN instance has a single parallel input that supports RGB and YUV video, +with both external synchronization and BT.656 synchronization for the latter. +Depending on the instance the VIN input is connected to external SoC pins, or +on Gen3 platforms to a CSI-2 receiver. - compatible: Must be one or more of the following - - "renesas,vin-r8a7795" for the R8A7795 device - - "renesas,vin-r8a7794" for the R8A7794 device - - "renesas,vin-r8a7793" for the R8A7793 device - - "renesas,vin-r8a7792" for the R8A7792 device - - "renesas,vin-r8a7791" for the R8A7791 device - - "renesas,vin-r8a7790" for the R8A7790 device - - "renesas,vin-r8a7779" for the R8A7779 device + - "renesas,vin-r8a7743" for the R8A7743 device + - "renesas,vin-r8a7745" for the R8A7745 device - "renesas,vin-r8a7778" for the R8A7778 device - - "renesas,rcar-gen2-vin" for a generic R-Car Gen2 compatible device. - - "renesas,rcar-gen3-vin" for a generic R-Car Gen3 compatible device. + - "renesas,vin-r8a7779" for the R8A7779 device + - "renesas,vin-r8a7790" for the R8A7790 device + - "renesas,vin-r8a7791" for the R8A7791 device + - "renesas,vin-r8a7792" for the R8A7792 device + - "renesas,vin-r8a7793" for the R8A7793 device + - "renesas,vin-r8a7794" for the R8A7794 device + - "renesas,vin-r8a7795" for the R8A7795 device + - "renesas,vin-r8a7796" for the R8A7796 device + - "renesas,vin-r8a77970" for the R8A77970 device + - "renesas,rcar-gen2-vin" for a generic R-Car Gen2 or RZ/G1 compatible + device. When compatible with the generic version nodes must list the SoC-specific version corresponding to the platform first @@ -28,21 +36,38 @@ channel which can be either RGB, YUYV or BT656. Additionally, an alias named vinX will need to be created to specify which video input device this is. -The per-board settings: +The per-board settings Gen2 platforms: - port sub-node describing a single endpoint connected to the vin as described in video-interfaces.txt[1]. Only the first one will be considered as each vin interface has one input port. - These settings are used to work out video input format and widths - into the system. +The per-board settings Gen3 platforms: +Gen3 platforms can support both a single connected parallel input source +from external SoC pins (port0) and/or multiple parallel input sources +from local SoC CSI-2 receivers (port1) depending on SoC. -Device node example -------------------- +- renesas,id - ID number of the VIN, VINx in the documentation. +- ports + - port 0 - sub-node describing a single endpoint connected to the VIN + from external SoC pins described in video-interfaces.txt[1]. + Describing more then one endpoint in port 0 is invalid. Only VIN + instances that are connected to external pins should have port 0. + - port 1 - sub-nodes describing one or more endpoints connected to + the VIN from local SoC CSI-2 receivers. The endpoint numbers must + use the following schema. - aliases { - vin0 = &vin0; - }; + - Endpoint 0 - sub-node describing the endpoint connected to CSI20 + - Endpoint 1 - sub-node describing the endpoint connected to CSI21 + - Endpoint 2 - sub-node describing the endpoint connected to CSI40 + - Endpoint 3 - sub-node describing the endpoint connected to CSI41 + +Device node example for Gen2 platforms +-------------------------------------- + + aliases { + vin0 = &vin0; + }; vin0: vin@e6ef0000 { compatible = "renesas,vin-r8a7790", "renesas,rcar-gen2-vin"; @@ -52,8 +77,8 @@ Device node example status = "disabled"; }; -Board setup example (vin1 composite video input) ------------------------------------------------- +Board setup example for Gen2 platforms (vin1 composite video input) +------------------------------------------------------------------- &i2c2 { status = "okay"; @@ -92,6 +117,77 @@ Board setup example (vin1 composite video input) }; }; +Device node example for Gen3 platforms +-------------------------------------- + + vin0: video@e6ef0000 { + compatible = "renesas,vin-r8a7795"; + reg = <0 0xe6ef0000 0 0x1000>; + interrupts = <GIC_SPI 188 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cpg CPG_MOD 811>; + power-domains = <&sysc R8A7795_PD_ALWAYS_ON>; + resets = <&cpg 811>; + renesas,id = <0>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@1 { + #address-cells = <1>; + #size-cells = <0>; + + reg = <1>; + + vin0csi20: endpoint@0 { + reg = <0>; + remote-endpoint= <&csi20vin0>; + }; + vin0csi21: endpoint@1 { + reg = <1>; + remote-endpoint= <&csi21vin0>; + }; + vin0csi40: endpoint@2 { + reg = <2>; + remote-endpoint= <&csi40vin0>; + }; + }; + }; + }; + csi20: csi2@fea80000 { + compatible = "renesas,r8a7795-csi2"; + reg = <0 0xfea80000 0 0x10000>; + interrupts = <GIC_SPI 184 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cpg CPG_MOD 714>; + power-domains = <&sysc R8A7795_PD_ALWAYS_ON>; + resets = <&cpg 714>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + reg = <0>; + csi20_in: endpoint { + clock-lanes = <0>; + data-lanes = <1>; + remote-endpoint = <&adv7482_txb>; + }; + }; + + port@1 { + #address-cells = <1>; + #size-cells = <0>; + + reg = <1>; + + csi20vin0: endpoint@0 { + reg = <0>; + remote-endpoint = <&vin0csi20>; + }; + }; + }; + }; [1] video-interfaces.txt common video media interface diff --git a/Documentation/devicetree/bindings/media/renesas,ceu.txt b/Documentation/devicetree/bindings/media/renesas,ceu.txt index 3fc66dfb192c..8a7a616e9019 100644 --- a/Documentation/devicetree/bindings/media/renesas,ceu.txt +++ b/Documentation/devicetree/bindings/media/renesas,ceu.txt @@ -2,14 +2,15 @@ Renesas Capture Engine Unit (CEU) ---------------------------------------------- The Capture Engine Unit is the image capture interface found in the Renesas -SH Mobile and RZ SoCs. +SH Mobile, R-Mobile and RZ SoCs. The interface supports a single parallel input with data bus width of 8 or 16 bits. Required properties: -- compatible: Shall be "renesas,r7s72100-ceu" for CEU units found in RZ/A1H - and RZ/A1M SoCs. +- compatible: Shall be one of the following values: + "renesas,r7s72100-ceu" for CEU units found in RZ/A1H and RZ/A1M SoCs + "renesas,r8a7740-ceu" for CEU units found in R-Mobile A1 R8A7740 SoCs - reg: Registers address base and size. - interrupts: The interrupt specifier. diff --git a/Documentation/devicetree/bindings/media/renesas,rcar-csi2.txt b/Documentation/devicetree/bindings/media/renesas,rcar-csi2.txt new file mode 100644 index 000000000000..2d385b65b275 --- /dev/null +++ b/Documentation/devicetree/bindings/media/renesas,rcar-csi2.txt @@ -0,0 +1,101 @@ +Renesas R-Car MIPI CSI-2 +------------------------ + +The R-Car CSI-2 receiver device provides MIPI CSI-2 capabilities for the +Renesas R-Car family of devices. It is used in conjunction with the +R-Car VIN module, which provides the video capture capabilities. + +Mandatory properties +-------------------- + - compatible: Must be one or more of the following + - "renesas,r8a7795-csi2" for the R8A7795 device. + - "renesas,r8a7796-csi2" for the R8A7796 device. + - "renesas,r8a77965-csi2" for the R8A77965 device. + - "renesas,r8a77970-csi2" for the R8A77970 device. + + - reg: the register base and size for the device registers + - interrupts: the interrupt for the device + - clocks: reference to the parent clock + +The device node shall contain two 'port' child nodes according to the +bindings defined in Documentation/devicetree/bindings/media/ +video-interfaces.txt. port@0 shall connect to the CSI-2 source. port@1 +shall connect to all the R-Car VIN modules that have a hardware +connection to the CSI-2 receiver. + +- port@0- Video source (mandatory) + - endpoint@0 - sub-node describing the endpoint that is the video source + +- port@1 - VIN instances (optional) + - One endpoint sub-node for every R-Car VIN instance which is connected + to the R-Car CSI-2 receiver. + +Example: + + csi20: csi2@fea80000 { + compatible = "renesas,r8a7796-csi2"; + reg = <0 0xfea80000 0 0x10000>; + interrupts = <0 184 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cpg CPG_MOD 714>; + power-domains = <&sysc R8A7796_PD_ALWAYS_ON>; + resets = <&cpg 714>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + #address-cells = <1>; + #size-cells = <0>; + + reg = <0>; + + csi20_in: endpoint@0 { + reg = <0>; + clock-lanes = <0>; + data-lanes = <1>; + remote-endpoint = <&adv7482_txb>; + }; + }; + + port@1 { + #address-cells = <1>; + #size-cells = <0>; + + reg = <1>; + + csi20vin0: endpoint@0 { + reg = <0>; + remote-endpoint = <&vin0csi20>; + }; + csi20vin1: endpoint@1 { + reg = <1>; + remote-endpoint = <&vin1csi20>; + }; + csi20vin2: endpoint@2 { + reg = <2>; + remote-endpoint = <&vin2csi20>; + }; + csi20vin3: endpoint@3 { + reg = <3>; + remote-endpoint = <&vin3csi20>; + }; + csi20vin4: endpoint@4 { + reg = <4>; + remote-endpoint = <&vin4csi20>; + }; + csi20vin5: endpoint@5 { + reg = <5>; + remote-endpoint = <&vin5csi20>; + }; + csi20vin6: endpoint@6 { + reg = <6>; + remote-endpoint = <&vin6csi20>; + }; + csi20vin7: endpoint@7 { + reg = <7>; + remote-endpoint = <&vin7csi20>; + }; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra20-mc.txt b/Documentation/devicetree/bindings/memory-controllers/nvidia,tegra20-mc.txt index f9632bacbd04..f9632bacbd04 100644 --- a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra20-mc.txt +++ b/Documentation/devicetree/bindings/memory-controllers/nvidia,tegra20-mc.txt diff --git a/Documentation/devicetree/bindings/mfd/axp20x.txt b/Documentation/devicetree/bindings/mfd/axp20x.txt index 9455503b0299..d1762f3b30af 100644 --- a/Documentation/devicetree/bindings/mfd/axp20x.txt +++ b/Documentation/devicetree/bindings/mfd/axp20x.txt @@ -43,7 +43,7 @@ Optional properties: regulator to drive the OTG VBus, rather then as an input pin which signals whether the board is driving OTG VBus or not. - (axp221 / axp223 / axp813 only) + (axp221 / axp223 / axp803/ axp813 only) - x-powers,master-mode: Boolean (axp806 only). Set this when the PMIC is wired for master mode. The default is slave mode. @@ -132,6 +132,7 @@ FLDO2 : LDO : fldoin-supply : shared supply LDO_IO0 : LDO : ips-supply : GPIO 0 LDO_IO1 : LDO : ips-supply : GPIO 1 RTC_LDO : LDO : ips-supply : always on +DRIVEVBUS : Enable output : drivevbus-supply : external regulator AXP806 regulators, type, and corresponding input supply names: diff --git a/Documentation/devicetree/bindings/mfd/bd9571mwv.txt b/Documentation/devicetree/bindings/mfd/bd9571mwv.txt index 9ab216a851d5..25d1f697eb25 100644 --- a/Documentation/devicetree/bindings/mfd/bd9571mwv.txt +++ b/Documentation/devicetree/bindings/mfd/bd9571mwv.txt @@ -25,6 +25,25 @@ Required properties: Each child node is defined using the standard binding for regulators. +Optional properties: + - rohm,ddr-backup-power : Value to use for DDR-Backup Power (default 0). + This is a bitmask that specifies which DDR power + rails need to be kept powered when backup mode is + entered, for system suspend: + - bit 0: DDR0 + - bit 1: DDR1 + - bit 2: DDR0C + - bit 3: DDR1C + These bits match the KEEPON_DDR* bits in the + documentation for the "BKUP Mode Cnt" register. + - rohm,rstbmode-level: The RSTB signal is configured for level mode, to + accommodate a toggle power switch (the RSTBMODE pin is + strapped low). + - rohm,rstbmode-pulse: The RSTB signal is configured for pulse mode, to + accommodate a momentary power switch (the RSTBMODE pin + is strapped high). + The two properties above are mutually exclusive. + Example: pmic: pmic@30 { @@ -36,6 +55,8 @@ Example: #interrupt-cells = <2>; gpio-controller; #gpio-cells = <2>; + rohm,ddr-backup-power = <0xf>; + rohm,rstbmode-pulse; regulators { dvfs: dvfs { diff --git a/Documentation/devicetree/bindings/mips/lantiq/rcu.txt b/Documentation/devicetree/bindings/mips/lantiq/rcu.txt index a086f1e1cdd7..7f0822b4beae 100644 --- a/Documentation/devicetree/bindings/mips/lantiq/rcu.txt +++ b/Documentation/devicetree/bindings/mips/lantiq/rcu.txt @@ -61,7 +61,6 @@ Example of the RCU bindings on a xRX200 SoC: usb_phy0: usb2-phy@18 { compatible = "lantiq,xrx200-usb2-phy"; reg = <0x18 4>, <0x38 4>; - status = "disabled"; resets = <&reset1 4 4>, <&reset0 4 4>; reset-names = "phy", "ctrl"; @@ -71,7 +70,6 @@ Example of the RCU bindings on a xRX200 SoC: usb_phy1: usb2-phy@34 { compatible = "lantiq,xrx200-usb2-phy"; reg = <0x34 4>, <0x3C 4>; - status = "disabled"; resets = <&reset1 5 4>, <&reset0 4 4>; reset-names = "phy", "ctrl"; diff --git a/Documentation/devicetree/bindings/mmc/amlogic,meson-gx.txt b/Documentation/devicetree/bindings/mmc/amlogic,meson-gx.txt index 50bf611a4d2c..13e70409e8ac 100644 --- a/Documentation/devicetree/bindings/mmc/amlogic,meson-gx.txt +++ b/Documentation/devicetree/bindings/mmc/amlogic,meson-gx.txt @@ -12,6 +12,7 @@ Required properties: - "amlogic,meson-gxbb-mmc" - "amlogic,meson-gxl-mmc" - "amlogic,meson-gxm-mmc" + - "amlogic,meson-axg-mmc" - clocks : A list of phandle + clock-specifier pairs for the clocks listed in clock-names. - clock-names: Should contain the following: "core" - Main peripheral bus clock @@ -19,6 +20,7 @@ Required properties: "clkin1" - Other parent clock of internal mux The driver has an internal mux clock which switches between clkin0 and clkin1 depending on the clock rate requested by the MMC core. +- resets : phandle of the internal reset line Example: @@ -29,4 +31,5 @@ Example: clocks = <&clkc CLKID_SD_EMMC_A>, <&xtal>, <&clkc CLKID_FCLK_DIV2>; clock-names = "core", "clkin0", "clkin1"; pinctrl-0 = <&emmc_pins>; + resets = <&reset RESET_SD_EMMC_A>; }; diff --git a/Documentation/devicetree/bindings/mmc/bluefield-dw-mshc.txt b/Documentation/devicetree/bindings/mmc/bluefield-dw-mshc.txt new file mode 100644 index 000000000000..b0f0999ea1a9 --- /dev/null +++ b/Documentation/devicetree/bindings/mmc/bluefield-dw-mshc.txt @@ -0,0 +1,29 @@ +* Mellanox Bluefield SoC specific extensions to the Synopsys Designware + Mobile Storage Host Controller + +Read synopsys-dw-mshc.txt for more details + +The Synopsys designware mobile storage host controller is used to interface +a SoC with storage medium such as eMMC or SD/MMC cards. This file documents +differences between the core Synopsys dw mshc controller properties described +by synopsys-dw-mshc.txt and the properties used by the Mellanox Bluefield SoC +specific extensions to the Synopsys Designware Mobile Storage Host Controller. + +Required Properties: + +* compatible: should be one of the following. + - "mellanox,bluefield-dw-mshc": for controllers with Mellanox Bluefield SoC + specific extensions. + +Example: + + /* Mellanox Bluefield SoC MMC */ + mmc@6008000 { + compatible = "mellanox,bluefield-dw-mshc"; + reg = <0x6008000 0x400>; + interrupts = <32>; + fifo-depth = <0x100>; + clock-frequency = <24000000>; + bus-width = <8>; + cap-mmc-highspeed; + }; diff --git a/Documentation/devicetree/bindings/mmc/jz4740.txt b/Documentation/devicetree/bindings/mmc/jz4740.txt new file mode 100644 index 000000000000..7cd8c432d7c8 --- /dev/null +++ b/Documentation/devicetree/bindings/mmc/jz4740.txt @@ -0,0 +1,38 @@ +* Ingenic JZ47xx MMC controllers + +This file documents the device tree properties used for the MMC controller in +Ingenic JZ4740/JZ4780 SoCs. These are in addition to the core MMC properties +described in mmc.txt. + +Required properties: +- compatible: Should be one of the following: + - "ingenic,jz4740-mmc" for the JZ4740 + - "ingenic,jz4780-mmc" for the JZ4780 +- reg: Should contain the MMC controller registers location and length. +- interrupts: Should contain the interrupt specifier of the MMC controller. +- clocks: Clock for the MMC controller. + +Optional properties: +- dmas: List of DMA specifiers with the controller specific format + as described in the generic DMA client binding. A tx and rx + specifier is required. +- dma-names: RX and TX DMA request names. + Should be "rx" and "tx", in that order. + +For additional details on DMA client bindings see ../dma/dma.txt. + +Example: + +mmc0: mmc@13450000 { + compatible = "ingenic,jz4780-mmc"; + reg = <0x13450000 0x1000>; + + interrupt-parent = <&intc>; + interrupts = <37>; + + clocks = <&cgu JZ4780_CLK_MSC0>; + clock-names = "mmc"; + + dmas = <&dma JZ4780_DMA_MSC0_RX 0xffffffff>, <&dma JZ4780_DMA_MSC0_TX 0xffffffff>; + dma-names = "rx", "tx"; +}; diff --git a/Documentation/devicetree/bindings/mmc/mmc.txt b/Documentation/devicetree/bindings/mmc/mmc.txt index 467cd7b147ce..f5a0923b34ca 100644 --- a/Documentation/devicetree/bindings/mmc/mmc.txt +++ b/Documentation/devicetree/bindings/mmc/mmc.txt @@ -19,6 +19,8 @@ Optional properties: - wp-gpios: Specify GPIOs for write protection, see gpio binding - cd-inverted: when present, polarity on the CD line is inverted. See the note below for the case, when a GPIO is used for the CD line +- cd-debounce-delay-ms: Set delay time before detecting card after card insert interrupt. + It's only valid when cd-gpios is present. - wp-inverted: when present, polarity on the WP line is inverted. See the note below for the case, when a GPIO is used for the WP line - disable-wp: When set no physical WP line is present. This property should @@ -56,6 +58,10 @@ Optional properties: - fixed-emmc-driver-type: for non-removable eMMC, enforce this driver type. The value <n> is the driver type as specified in the eMMC specification (table 206 in spec version 5.1). +- post-power-on-delay-ms : It was invented for MMC pwrseq-simple which could + be referred to mmc-pwrseq-simple.txt. But now it's reused as a tunable delay + waiting for I/O signalling and card power supply to be stable, regardless of + whether pwrseq-simple is used. Default to 10ms if no available. *NOTE* on CD and WP polarity. To use common for all SD/MMC host controllers line polarity properties, we have to fix the meaning of the "normal" and "inverted" diff --git a/Documentation/devicetree/bindings/mmc/sdhci-omap.txt b/Documentation/devicetree/bindings/mmc/sdhci-omap.txt index 51775a372c06..393848c2138e 100644 --- a/Documentation/devicetree/bindings/mmc/sdhci-omap.txt +++ b/Documentation/devicetree/bindings/mmc/sdhci-omap.txt @@ -4,7 +4,14 @@ Refer to mmc.txt for standard MMC bindings. Required properties: - compatible: Should be "ti,dra7-sdhci" for DRA7 and DRA72 controllers + Should be "ti,k2g-sdhci" for K2G - ti,hwmods: Must be "mmc<n>", <n> is controller instance starting 1 + (Not required for K2G). +- pinctrl-names: Should be subset of "default", "hs", "sdr12", "sdr25", "sdr50", + "ddr50-rev11", "sdr104-rev11", "ddr50", "sdr104", + "ddr_1_8v-rev11", "ddr_1_8v" or "ddr_3_3v", "hs200_1_8v-rev11", + "hs200_1_8v", +- pinctrl-<n> : Pinctrl states as described in bindings/pinctrl/pinctrl-bindings.txt Example: mmc1: mmc@4809c000 { diff --git a/Documentation/devicetree/bindings/mmc/tmio_mmc.txt b/Documentation/devicetree/bindings/mmc/tmio_mmc.txt index 2d5287eeed95..839f469f4525 100644 --- a/Documentation/devicetree/bindings/mmc/tmio_mmc.txt +++ b/Documentation/devicetree/bindings/mmc/tmio_mmc.txt @@ -26,6 +26,8 @@ Required properties: "renesas,sdhi-r8a7794" - SDHI IP on R8A7794 SoC "renesas,sdhi-r8a7795" - SDHI IP on R8A7795 SoC "renesas,sdhi-r8a7796" - SDHI IP on R8A7796 SoC + "renesas,sdhi-r8a77965" - SDHI IP on R8A77965 SoC + "renesas,sdhi-r8a77980" - SDHI IP on R8A77980 SoC "renesas,sdhi-r8a77995" - SDHI IP on R8A77995 SoC "renesas,sdhi-shmobile" - a generic sh-mobile SDHI controller "renesas,rcar-gen1-sdhi" - a generic R-Car Gen1 SDHI controller @@ -67,7 +69,6 @@ Example: R8A7790 (R-Car H2) SDHI controller nodes max-frequency = <195000000>; power-domains = <&sysc R8A7790_PD_ALWAYS_ON>; resets = <&cpg 314>; - status = "disabled"; }; sdhi1: sd@ee120000 { @@ -81,7 +82,6 @@ Example: R8A7790 (R-Car H2) SDHI controller nodes max-frequency = <195000000>; power-domains = <&sysc R8A7790_PD_ALWAYS_ON>; resets = <&cpg 313>; - status = "disabled"; }; sdhi2: sd@ee140000 { @@ -95,7 +95,6 @@ Example: R8A7790 (R-Car H2) SDHI controller nodes max-frequency = <97500000>; power-domains = <&sysc R8A7790_PD_ALWAYS_ON>; resets = <&cpg 312>; - status = "disabled"; }; sdhi3: sd@ee160000 { @@ -109,5 +108,4 @@ Example: R8A7790 (R-Car H2) SDHI controller nodes max-frequency = <97500000>; power-domains = <&sysc R8A7790_PD_ALWAYS_ON>; resets = <&cpg 311>; - status = "disabled"; }; diff --git a/Documentation/devicetree/bindings/mtd/gpmi-nand.txt b/Documentation/devicetree/bindings/mtd/gpmi-nand.txt index b289ef3c1b7e..393588385c6e 100644 --- a/Documentation/devicetree/bindings/mtd/gpmi-nand.txt +++ b/Documentation/devicetree/bindings/mtd/gpmi-nand.txt @@ -47,6 +47,11 @@ Optional properties: partitions written from Linux with this feature turned on may not be accessible by the BootROM code. + - nand-ecc-strength: integer representing the number of bits to correct + per ECC step. Needs to be a multiple of 2. + - nand-ecc-step-size: integer representing the number of data bytes + that are covered by a single ECC step. The driver + supports 512 and 1024. The device tree may optionally contain sub-nodes describing partitions of the address space. See partition.txt for more detail. diff --git a/Documentation/devicetree/bindings/powerpc/4xx/ndfc.txt b/Documentation/devicetree/bindings/mtd/ibm,ndfc.txt index 869f0b5f16e8..869f0b5f16e8 100644 --- a/Documentation/devicetree/bindings/powerpc/4xx/ndfc.txt +++ b/Documentation/devicetree/bindings/mtd/ibm,ndfc.txt diff --git a/Documentation/devicetree/bindings/mtd/mtk-nand.txt b/Documentation/devicetree/bindings/mtd/mtk-nand.txt index 1c88526dedfc..4d3ec5e4ff8a 100644 --- a/Documentation/devicetree/bindings/mtd/mtk-nand.txt +++ b/Documentation/devicetree/bindings/mtd/mtk-nand.txt @@ -20,7 +20,6 @@ Required NFI properties: - interrupts: Interrupts of NFI. - clocks: NFI required clocks. - clock-names: NFI clocks internal name. -- status: Disabled default. Then set "okay" by platform. - ecc-engine: Required ECC Engine node. - #address-cells: NAND chip index, should be 1. - #size-cells: Should be 0. @@ -34,7 +33,6 @@ Example: clocks = <&pericfg CLK_PERI_NFI>, <&pericfg CLK_PERI_NFI_PAD>; clock-names = "nfi_clk", "pad_clk"; - status = "disabled"; ecc-engine = <&bch>; #address-cells = <1>; #size-cells = <0>; @@ -50,14 +48,19 @@ Optional: - nand-on-flash-bbt: Store BBT on NAND Flash. - nand-ecc-mode: the NAND ecc mode (check driver for supported modes) - nand-ecc-step-size: Number of data bytes covered by a single ECC step. - valid values: 512 and 1024. + valid values: + 512 and 1024 on mt2701 and mt2712. + 512 only on mt7622. 1024 is recommended for large page NANDs. - nand-ecc-strength: Number of bits to correct per ECC step. - The valid values that the controller supports are: 4, 6, - 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, 32, 36, 40, 44, - 48, 52, 56, 60. + The valid values that each controller supports: + mt2701: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, + 32, 36, 40, 44, 48, 52, 56, 60. + mt2712: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, + 32, 36, 40, 44, 48, 52, 56, 60, 68, 72, 80. + mt7622: 4, 6, 8, 10, 12, 14, 16. The strength should be calculated as follows: - E = (S - F) * 8 / 14 + E = (S - F) * 8 / B S = O / (P / Q) E : nand-ecc-strength. S : spare size per sector. @@ -66,6 +69,15 @@ Optional: O : oob size. P : page size. Q : nand-ecc-step-size. + B : number of parity bits needed to correct + 1 bitflip. + According to MTK NAND controller design, + this number depends on max ecc step size + that MTK NAND controller supports. + If max ecc step size supported is 1024, + then it should be always 14. And if max + ecc step size is 512, then it should be + always 13. If the result does not match any one of the listed choices above, please select the smaller valid value from the list. @@ -152,7 +164,6 @@ Required BCH properties: - interrupts: Interrupts of ECC. - clocks: ECC required clocks. - clock-names: ECC clocks internal name. -- status: Disabled default. Then set "okay" by platform. Example: @@ -162,5 +173,4 @@ Example: interrupts = <GIC_SPI 55 IRQ_TYPE_LEVEL_LOW>; clocks = <&pericfg CLK_PERI_NFI_ECC>; clock-names = "nfiecc_clk"; - status = "disabled"; }; diff --git a/Documentation/devicetree/bindings/mtd/partition.txt b/Documentation/devicetree/bindings/mtd/partition.txt index 36f3b769a626..a8f382642ba9 100644 --- a/Documentation/devicetree/bindings/mtd/partition.txt +++ b/Documentation/devicetree/bindings/mtd/partition.txt @@ -14,7 +14,7 @@ method is used for a given flash device. To describe the method there should be a subnode of the flash device that is named 'partitions'. It must have a 'compatible' property, which is used to identify the method to use. -We currently only document a binding for fixed layouts. +Available bindings are listed in the "partitions" subdirectory. Fixed Partitions diff --git a/Documentation/devicetree/bindings/mtd/partitions/brcm,bcm947xx-cfe-partitions.txt b/Documentation/devicetree/bindings/mtd/partitions/brcm,bcm947xx-cfe-partitions.txt new file mode 100644 index 000000000000..1d61a029395e --- /dev/null +++ b/Documentation/devicetree/bindings/mtd/partitions/brcm,bcm947xx-cfe-partitions.txt @@ -0,0 +1,42 @@ +Broadcom BCM47xx Partitions +=========================== + +Broadcom is one of hardware manufacturers providing SoCs (BCM47xx) used in +home routers. Their BCM947xx boards using CFE bootloader have several partitions +without any on-flash partition table. On some devices their sizes and/or +meanings can also vary so fixed partitioning can't be used. + +Discovering partitions on these devices is possible thanks to having a special +header and/or magic signature at the beginning of each of them. They are also +block aligned which is important for determinig a size. + +Most of partitions use ASCII text based magic for determining a type. More +complex partitions (like TRX with its HDR0 magic) may include extra header +containing some details, including a length. + +A list of supported partitions includes: +1) Bootloader with Broadcom's CFE (Common Firmware Environment) +2) NVRAM with configuration/calibration data +3) Device manufacturer's data with some default values (e.g. SSIDs) +4) TRX firmware container which can hold up to 4 subpartitions +5) Backup TRX firmware used after failed upgrade + +As mentioned earlier, role of some partitions may depend on extra configuration. +For example both: main firmware and backup firmware use the same TRX format with +the same header. To distinguish currently used firmware a CFE's environment +variable "bootpartition" is used. + + +Devices using Broadcom partitions described above should should have flash node +with a subnode named "partitions" using following properties: + +Required properties: +- compatible : (required) must be "brcm,bcm947xx-cfe-partitions" + +Example: + +flash@0 { + partitions { + compatible = "brcm,bcm947xx-cfe-partitions"; + }; +}; diff --git a/Documentation/devicetree/bindings/mtd/sunxi-nand.txt b/Documentation/devicetree/bindings/mtd/sunxi-nand.txt index 0734f03bf3d3..dcd5a5d80dc0 100644 --- a/Documentation/devicetree/bindings/mtd/sunxi-nand.txt +++ b/Documentation/devicetree/bindings/mtd/sunxi-nand.txt @@ -22,8 +22,6 @@ Optional properties: - reset : phandle + reset specifier pair - reset-names : must contain "ahb" - allwinner,rb : shall contain the native Ready/Busy ids. - or -- rb-gpios : shall contain the gpios used as R/B pins. - nand-ecc-mode : one of the supported ECC modes ("hw", "soft", "soft_bch" or "none") diff --git a/Documentation/devicetree/bindings/net/can/rcar_canfd.txt b/Documentation/devicetree/bindings/net/can/rcar_canfd.txt index 93c3a6ae32f9..ac71daa46195 100644 --- a/Documentation/devicetree/bindings/net/can/rcar_canfd.txt +++ b/Documentation/devicetree/bindings/net/can/rcar_canfd.txt @@ -5,7 +5,9 @@ Required properties: - compatible: Must contain one or more of the following: - "renesas,rcar-gen3-canfd" for R-Car Gen3 compatible controller. - "renesas,r8a7795-canfd" for R8A7795 (R-Car H3) compatible controller. - - "renesas,r8a7796-canfd" for R8A7796 (R-Car M3) compatible controller. + - "renesas,r8a7796-canfd" for R8A7796 (R-Car M3-W) compatible controller. + - "renesas,r8a77970-canfd" for R8A77970 (R-Car V3M) compatible controller. + - "renesas,r8a77980-canfd" for R8A77980 (R-Car V3H) compatible controller. When compatible with the generic version, nodes must list the SoC-specific version corresponding to the platform first, followed by the diff --git a/Documentation/devicetree/bindings/net/dsa/b53.txt b/Documentation/devicetree/bindings/net/dsa/b53.txt index 8acf51a4dfa8..47a6a7fe0b86 100644 --- a/Documentation/devicetree/bindings/net/dsa/b53.txt +++ b/Documentation/devicetree/bindings/net/dsa/b53.txt @@ -10,6 +10,7 @@ Required properties: "brcm,bcm53128" "brcm,bcm5365" "brcm,bcm5395" + "brcm,bcm5389" "brcm,bcm5397" "brcm,bcm5398" diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt b/Documentation/devicetree/bindings/net/dsa/dsa.txt index cfe8f64eca4f..3ceeb8de1196 100644 --- a/Documentation/devicetree/bindings/net/dsa/dsa.txt +++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt @@ -82,8 +82,6 @@ linked into one DSA cluster. switch0: switch0@0 { compatible = "marvell,mv88e6085"; - #address-cells = <1>; - #size-cells = <0>; reg = <0>; dsa,member = <0 0>; @@ -135,8 +133,6 @@ linked into one DSA cluster. switch1: switch1@0 { compatible = "marvell,mv88e6085"; - #address-cells = <1>; - #size-cells = <0>; reg = <0>; dsa,member = <0 1>; @@ -204,8 +200,6 @@ linked into one DSA cluster. switch2: switch2@0 { compatible = "marvell,mv88e6085"; - #address-cells = <1>; - #size-cells = <0>; reg = <0>; dsa,member = <0 2>; diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.txt b/Documentation/devicetree/bindings/net/dsa/qca8k.txt index 9c67ee4890d7..bbcb255c3150 100644 --- a/Documentation/devicetree/bindings/net/dsa/qca8k.txt +++ b/Documentation/devicetree/bindings/net/dsa/qca8k.txt @@ -2,7 +2,10 @@ Required properties: -- compatible: should be "qca,qca8337" +- compatible: should be one of: + "qca,qca8334" + "qca,qca8337" + - #size-cells: must be 0 - #address-cells: must be 1 @@ -14,6 +17,20 @@ port and PHY id, each subnode describing a port needs to have a valid phandle referencing the internal PHY connected to it. The CPU port of this switch is always port 0. +A CPU port node has the following optional node: + +- fixed-link : Fixed-link subnode describing a link to a non-MDIO + managed entity. See + Documentation/devicetree/bindings/net/fixed-link.txt + for details. + +For QCA8K the 'fixed-link' sub-node supports only the following properties: + +- 'speed' (integer, mandatory), to indicate the link speed. Accepted + values are 10, 100 and 1000 +- 'full-duplex' (boolean, optional), to indicate that full duplex is + used. When absent, half duplex is assumed. + Example: @@ -53,6 +70,10 @@ Example: label = "cpu"; ethernet = <&gmac1>; phy-mode = "rgmii"; + fixed-link { + speed = 1000; + full-duplex; + }; }; port@1 { diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt index 3d6d5fa0c4d5..cfe724398a12 100644 --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt @@ -7,6 +7,7 @@ Required properties: - compatible: must be one of the following string: "allwinner,sun8i-a83t-emac" "allwinner,sun8i-h3-emac" + "allwinner,sun8i-r40-gmac" "allwinner,sun8i-v3s-emac" "allwinner,sun50i-a64-emac" - reg: address and length of the register for the device. @@ -20,18 +21,18 @@ Required properties: - phy-handle: See ethernet.txt - #address-cells: shall be 1 - #size-cells: shall be 0 -- syscon: A phandle to the syscon of the SoC with one of the following - compatible string: - - allwinner,sun8i-h3-system-controller - - allwinner,sun8i-v3s-system-controller - - allwinner,sun50i-a64-system-controller - - allwinner,sun8i-a83t-system-controller +- syscon: A phandle to the device containing the EMAC or GMAC clock register Optional properties: -- allwinner,tx-delay-ps: TX clock delay chain value in ps. Range value is 0-700. Default is 0) -- allwinner,rx-delay-ps: RX clock delay chain value in ps. Range value is 0-3100. Default is 0) -Both delay properties need to be a multiple of 100. They control the delay for -external PHY. +- allwinner,tx-delay-ps: TX clock delay chain value in ps. + Range is 0-700. Default is 0. + Unavailable for allwinner,sun8i-r40-gmac +- allwinner,rx-delay-ps: RX clock delay chain value in ps. + Range is 0-3100. Default is 0. + Range is 0-700 for allwinner,sun8i-r40-gmac +Both delay properties need to be a multiple of 100. They control the +clock delay for external RGMII PHY. They do not apply to the internal +PHY or external non-RGMII PHYs. Optional properties for the following compatibles: - "allwinner,sun8i-h3-emac", diff --git a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt index 79bf352e659c..047bdf7bdd2f 100644 --- a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt +++ b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt @@ -86,70 +86,4 @@ Example: * Gianfar PTP clock nodes -General Properties: - - - compatible Should be "fsl,etsec-ptp" - - reg Offset and length of the register set for the device - - interrupts There should be at least two interrupts. Some devices - have as many as four PTP related interrupts. - -Clock Properties: - - - fsl,cksel Timer reference clock source. - - fsl,tclk-period Timer reference clock period in nanoseconds. - - fsl,tmr-prsc Prescaler, divides the output clock. - - fsl,tmr-add Frequency compensation value. - - fsl,tmr-fiper1 Fixed interval period pulse generator. - - fsl,tmr-fiper2 Fixed interval period pulse generator. - - fsl,max-adj Maximum frequency adjustment in parts per billion. - - These properties set the operational parameters for the PTP - clock. You must choose these carefully for the clock to work right. - Here is how to figure good values: - - TimerOsc = selected reference clock MHz - tclk_period = desired clock period nanoseconds - NominalFreq = 1000 / tclk_period MHz - FreqDivRatio = TimerOsc / NominalFreq (must be greater that 1.0) - tmr_add = ceil(2^32 / FreqDivRatio) - OutputClock = NominalFreq / tmr_prsc MHz - PulseWidth = 1 / OutputClock microseconds - FiperFreq1 = desired frequency in Hz - FiperDiv1 = 1000000 * OutputClock / FiperFreq1 - tmr_fiper1 = tmr_prsc * tclk_period * FiperDiv1 - tclk_period - max_adj = 1000000000 * (FreqDivRatio - 1.0) - 1 - - The calculation for tmr_fiper2 is the same as for tmr_fiper1. The - driver expects that tmr_fiper1 will be correctly set to produce a 1 - Pulse Per Second (PPS) signal, since this will be offered to the PPS - subsystem to synchronize the Linux clock. - - Reference clock source is determined by the value, which is holded - in CKSEL bits in TMR_CTRL register. "fsl,cksel" property keeps the - value, which will be directly written in those bits, that is why, - according to reference manual, the next clock sources can be used: - - <0> - external high precision timer reference clock (TSEC_TMR_CLK - input is used for this purpose); - <1> - eTSEC system clock; - <2> - eTSEC1 transmit clock; - <3> - RTC clock input. - - When this attribute is not used, eTSEC system clock will serve as - IEEE 1588 timer reference clock. - -Example: - - ptp_clock@24e00 { - compatible = "fsl,etsec-ptp"; - reg = <0x24E00 0xB0>; - interrupts = <12 0x8 13 0x8>; - interrupt-parent = < &ipic >; - fsl,cksel = <1>; - fsl,tclk-period = <10>; - fsl,tmr-prsc = <100>; - fsl,tmr-add = <0x999999A4>; - fsl,tmr-fiper1 = <0x3B9AC9F6>; - fsl,tmr-fiper2 = <0x00018696>; - fsl,max-adj = <659999998>; - }; +Refer to Documentation/devicetree/bindings/ptp/ptp-qoriq.txt diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt b/Documentation/devicetree/bindings/net/ibm,emac.txt index 44b842b6ca15..44b842b6ca15 100644 --- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt +++ b/Documentation/devicetree/bindings/net/ibm,emac.txt diff --git a/Documentation/devicetree/bindings/net/marvell-pp2.txt b/Documentation/devicetree/bindings/net/marvell-pp2.txt index 1814fa13f6ab..fc019df0d863 100644 --- a/Documentation/devicetree/bindings/net/marvell-pp2.txt +++ b/Documentation/devicetree/bindings/net/marvell-pp2.txt @@ -21,9 +21,10 @@ Required properties: - main controller clock (for both armada-375-pp2 and armada-7k-pp2) - GOP clock (for both armada-375-pp2 and armada-7k-pp2) - MG clock (only for armada-7k-pp2) + - MG Core clock (only for armada-7k-pp2) - AXI clock (only for armada-7k-pp2) -- clock-names: names of used clocks, must be "pp_clk", "gop_clk", "mg_clk" - and "axi_clk" (the 2 latter only for armada-7k-pp2). +- clock-names: names of used clocks, must be "pp_clk", "gop_clk", "mg_clk", + "mg_core_clk" and "axi_clk" (the 3 latter only for armada-7k-pp2). The ethernet ports are represented by subnodes. At least one port is required. @@ -80,8 +81,8 @@ cpm_ethernet: ethernet@0 { compatible = "marvell,armada-7k-pp22"; reg = <0x0 0x100000>, <0x129000 0xb000>; clocks = <&cpm_syscon0 1 3>, <&cpm_syscon0 1 9>, - <&cpm_syscon0 1 5>, <&cpm_syscon0 1 18>; - clock-names = "pp_clk", "gop_clk", "gp_clk", "axi_clk"; + <&cpm_syscon0 1 5>, <&cpm_syscon0 1 6>, <&cpm_syscon0 1 18>; + clock-names = "pp_clk", "gop_clk", "mg_clk", "mg_core_clk", "axi_clk"; eth0: eth0 { interrupts = <ICU_GRP_NSR 39 IRQ_TYPE_LEVEL_HIGH>, diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt b/Documentation/devicetree/bindings/net/meson-dwmac.txt index 61cada22ae6c..1321bb194ed9 100644 --- a/Documentation/devicetree/bindings/net/meson-dwmac.txt +++ b/Documentation/devicetree/bindings/net/meson-dwmac.txt @@ -11,6 +11,7 @@ Required properties on all platforms: - "amlogic,meson8b-dwmac" - "amlogic,meson8m2-dwmac" - "amlogic,meson-gxbb-dwmac" + - "amlogic,meson-axg-dwmac" Additionally "snps,dwmac" and any applicable more detailed version number described in net/stmmac.txt should be used. diff --git a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt index 42a248301615..e22d8cfea687 100644 --- a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt +++ b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt @@ -57,6 +57,13 @@ KSZ9031: - txd2-skew-ps : Skew control of TX data 2 pad - txd3-skew-ps : Skew control of TX data 3 pad + - micrel,force-master: + Boolean, force phy to master mode. Only set this option if the phy + reference clock provided at CLK125_NDO pin is used as MAC reference + clock because the clock jitter in slave mode is to high (errata#2). + Attention: The link partner must be configurable as slave otherwise + no link will be established. + Examples: mdio { diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt new file mode 100644 index 000000000000..76786a0f6d3d --- /dev/null +++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt @@ -0,0 +1,54 @@ +Microchip LAN78xx Gigabit Ethernet controller + +The LAN78XX devices are usually configured by programming their OTP or with +an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have neither. +The Device Tree properties, if present, override the OTP and EEPROM. + +Required properties: +- compatible: Should be one of "usb424,7800", "usb424,7801" or "usb424,7850". + +Optional properties: +- local-mac-address: see ethernet.txt +- mac-address: see ethernet.txt + +Optional properties of the embedded PHY: +- microchip,led-modes: a 0..4 element vector, with each element configuring + the operating mode of an LED. Omitted LEDs are turned off. Allowed values + are defined in "include/dt-bindings/net/microchip-lan78xx.h". + +Example: + +/* Based on the configuration for a Raspberry Pi 3 B+ */ +&usb { + usb-port@1 { + compatible = "usb424,2514"; + reg = <1>; + #address-cells = <1>; + #size-cells = <0>; + + usb-port@1 { + compatible = "usb424,2514"; + reg = <1>; + #address-cells = <1>; + #size-cells = <0>; + + ethernet: ethernet@1 { + compatible = "usb424,7800"; + reg = <1>; + local-mac-address = [ 00 11 22 33 44 55 ]; + + mdio { + #address-cells = <0x1>; + #size-cells = <0x0>; + eth_phy: ethernet-phy@1 { + reg = <1>; + microchip,led-modes = < + LAN78XX_LINK_1000_ACTIVITY + LAN78XX_LINK_10_100_ACTIVITY + >; + }; + }; + }; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/net/mscc-miim.txt b/Documentation/devicetree/bindings/net/mscc-miim.txt new file mode 100644 index 000000000000..7104679cf59d --- /dev/null +++ b/Documentation/devicetree/bindings/net/mscc-miim.txt @@ -0,0 +1,26 @@ +Microsemi MII Management Controller (MIIM) / MDIO +================================================= + +Properties: +- compatible: must be "mscc,ocelot-miim" +- reg: The base address of the MDIO bus controller register bank. Optionally, a + second register bank can be defined if there is an associated reset register + for internal PHYs +- #address-cells: Must be <1>. +- #size-cells: Must be <0>. MDIO addresses have no size component. +- interrupts: interrupt specifier (refer to the interrupt binding) + +Typically an MDIO bus might have several children. + +Example: + mdio@107009c { + #address-cells = <1>; + #size-cells = <0>; + compatible = "mscc,ocelot-miim"; + reg = <0x107009c 0x36>, <0x10700f0 0x8>; + interrupts = <14>; + + phy0: ethernet-phy@0 { + reg = <0>; + }; + }; diff --git a/Documentation/devicetree/bindings/net/mscc-ocelot.txt b/Documentation/devicetree/bindings/net/mscc-ocelot.txt new file mode 100644 index 000000000000..0a84711abece --- /dev/null +++ b/Documentation/devicetree/bindings/net/mscc-ocelot.txt @@ -0,0 +1,82 @@ +Microsemi Ocelot network Switch +=============================== + +The Microsemi Ocelot network switch can be found on Microsemi SoCs (VSC7513, +VSC7514) + +Required properties: +- compatible: Should be "mscc,vsc7514-switch" +- reg: Must contain an (offset, length) pair of the register set for each + entry in reg-names. +- reg-names: Must include the following entries: + - "sys" + - "rew" + - "qs" + - "hsio" + - "qsys" + - "ana" + - "portX" with X from 0 to the number of last port index available on that + switch +- interrupts: Should contain the switch interrupts for frame extraction and + frame injection +- interrupt-names: should contain the interrupt names: "xtr", "inj" +- ethernet-ports: A container for child nodes representing switch ports. + +The ethernet-ports container has the following properties + +Required properties: + +- #address-cells: Must be 1 +- #size-cells: Must be 0 + +Each port node must have the following mandatory properties: +- reg: Describes the port address in the switch + +Port nodes may also contain the following optional standardised +properties, described in binding documents: + +- phy-handle: Phandle to a PHY on an MDIO bus. See + Documentation/devicetree/bindings/net/ethernet.txt for details. + +Example: + + switch@1010000 { + compatible = "mscc,vsc7514-switch"; + reg = <0x1010000 0x10000>, + <0x1030000 0x10000>, + <0x1080000 0x100>, + <0x10d0000 0x10000>, + <0x11e0000 0x100>, + <0x11f0000 0x100>, + <0x1200000 0x100>, + <0x1210000 0x100>, + <0x1220000 0x100>, + <0x1230000 0x100>, + <0x1240000 0x100>, + <0x1250000 0x100>, + <0x1260000 0x100>, + <0x1270000 0x100>, + <0x1280000 0x100>, + <0x1800000 0x80000>, + <0x1880000 0x10000>; + reg-names = "sys", "rew", "qs", "hsio", "port0", + "port1", "port2", "port3", "port4", "port5", + "port6", "port7", "port8", "port9", "port10", + "qsys", "ana"; + interrupts = <21 22>; + interrupt-names = "xtr", "inj"; + + ethernet-ports { + #address-cells = <1>; + #size-cells = <0>; + + port0: port@0 { + reg = <0>; + phy-handle = <&phy0>; + }; + port1: port@1 { + reg = <1>; + phy-handle = <&phy1>; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt b/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt new file mode 100644 index 000000000000..0ea18a53cc29 --- /dev/null +++ b/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt @@ -0,0 +1,30 @@ +Qualcomm Bluetooth Chips +--------------------- + +This documents the binding structure and common properties for serial +attached Qualcomm devices. + +Serial attached Qualcomm devices shall be a child node of the host UART +device the slave device is attached to. + +Required properties: + - compatible: should contain one of the following: + * "qcom,qca6174-bt" + +Optional properties: + - enable-gpios: gpio specifier used to enable chip + - clocks: clock provided to the controller (SUSCLK_32KHZ) + +Example: + +serial@7570000 { + label = "BT-UART"; + status = "okay"; + + bluetooth { + compatible = "qcom,qca6174-bt"; + + enable-gpios = <&pm8994_gpios 19 GPIO_ACTIVE_HIGH>; + clocks = <&divclk4>; + }; +}; diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt index c306f55d335b..fac897d54423 100644 --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt @@ -18,8 +18,10 @@ Required properties: - "renesas,etheravb-r8a7795" for the R8A7795 SoC. - "renesas,etheravb-r8a7796" for the R8A7796 SoC. + - "renesas,etheravb-r8a77965" for the R8A77965 SoC. - "renesas,etheravb-r8a77970" for the R8A77970 SoC. - "renesas,etheravb-r8a77980" for the R8A77980 SoC. + - "renesas,etheravb-r8a77990" for the R8A77990 SoC. - "renesas,etheravb-r8a77995" for the R8A77995 SoC. - "renesas,etheravb-rcar-gen3" as a fallback for the above R-Car Gen3 devices. diff --git a/Documentation/devicetree/bindings/net/sff,sfp.txt b/Documentation/devicetree/bindings/net/sff,sfp.txt index 929591d52ed6..832139919f20 100644 --- a/Documentation/devicetree/bindings/net/sff,sfp.txt +++ b/Documentation/devicetree/bindings/net/sff,sfp.txt @@ -7,11 +7,11 @@ Required properties: "sff,sfp" for SFP modules "sff,sff" for soldered down SFF modules -Optional Properties: - - i2c-bus : phandle of an I2C bus controller for the SFP two wire serial interface +Optional Properties: + - mod-def0-gpios : GPIO phandle and a specifier of the MOD-DEF0 (AKA Mod_ABS) module presence input gpio signal, active (module absent) high. Must not be present for SFF modules diff --git a/Documentation/devicetree/bindings/net/sh_eth.txt b/Documentation/devicetree/bindings/net/sh_eth.txt index 5172799a7f1a..82a4cf2c145d 100644 --- a/Documentation/devicetree/bindings/net/sh_eth.txt +++ b/Documentation/devicetree/bindings/net/sh_eth.txt @@ -14,6 +14,7 @@ Required properties: "renesas,ether-r8a7791" if the device is a part of R8A7791 SoC. "renesas,ether-r8a7793" if the device is a part of R8A7793 SoC. "renesas,ether-r8a7794" if the device is a part of R8A7794 SoC. + "renesas,gether-r8a77980" if the device is a part of R8A77980 SoC. "renesas,ether-r7s72100" if the device is a part of R7S72100 SoC. "renesas,rcar-gen1-ether" for a generic R-Car Gen1 device. "renesas,rcar-gen2-ether" for a generic R-Car Gen2 or RZ/G1 diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt index 96398cc2982f..fc8f01718690 100644 --- a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt +++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt @@ -13,13 +13,25 @@ Required properties: - reg: Address where registers are mapped and size of region. - interrupts: Should contain the MAC interrupt. - phy-mode: See ethernet.txt in the same directory. Allow to choose - "rgmii", "rmii", or "mii" according to the PHY. + "rgmii", "rmii", "mii", or "internal" according to the PHY. + The acceptable mode is SoC-dependent. - phy-handle: Should point to the external phy device. See ethernet.txt file in the same directory. - clocks: A phandle to the clock for the MAC. + For Pro4 SoC, that is "socionext,uniphier-pro4-ave4", + another MAC clock, GIO bus clock and PHY clock are also required. + - clock-names: Should contain + - "ether", "ether-gb", "gio", "ether-phy" for Pro4 SoC + - "ether" for others + - resets: A phandle to the reset control for the MAC. For Pro4 SoC, + GIO bus reset is also required. + - reset-names: Should contain + - "ether", "gio" for Pro4 SoC + - "ether" for others + - socionext,syscon-phy-mode: A phandle to syscon with one argument + that configures phy mode. The argument is the ID of MAC instance. Optional properties: - - resets: A phandle to the reset control for the MAC. - local-mac-address: See ethernet.txt in the same directory. Required subnode: @@ -34,8 +46,11 @@ Example: interrupts = <0 66 4>; phy-mode = "rgmii"; phy-handle = <ðphy>; + clock-names = "ether"; clocks = <&sys_clk 6>; + reset-names = "ether"; resets = <&sys_rst 6>; + socionext,syscon-phy-mode = <&soc_glue 0>; local-mac-address = [00 00 00 00 00 00]; mdio { diff --git a/Documentation/devicetree/bindings/net/stm32-dwmac.txt b/Documentation/devicetree/bindings/net/stm32-dwmac.txt index 489dbcb66c5a..1341012722aa 100644 --- a/Documentation/devicetree/bindings/net/stm32-dwmac.txt +++ b/Documentation/devicetree/bindings/net/stm32-dwmac.txt @@ -6,14 +6,28 @@ Please see stmmac.txt for the other unchanged properties. The device node has following properties. Required properties: -- compatible: Should be "st,stm32-dwmac" to select glue, and +- compatible: For MCU family should be "st,stm32-dwmac" to select glue, and "snps,dwmac-3.50a" to select IP version. + For MPU family should be "st,stm32mp1-dwmac" to select + glue, and "snps,dwmac-4.20a" to select IP version. - clocks: Must contain a phandle for each entry in clock-names. - clock-names: Should be "stmmaceth" for the host clock. Should be "mac-clk-tx" for the MAC TX clock. Should be "mac-clk-rx" for the MAC RX clock. + For MPU family need to add also "ethstp" for power mode clock and, + "syscfg-clk" for SYSCFG clock. +- interrupt-names: Should contain a list of interrupt names corresponding to + the interrupts in the interrupts property, if available. + Should be "macirq" for the main MAC IRQ + Should be "eth_wake_irq" for the IT which wake up system - st,syscon : Should be phandle/offset pair. The phandle to the syscon node which - encompases the glue register, and the offset of the control register. + encompases the glue register, and the offset of the control register. + +Optional properties: +- clock-names: For MPU family "mac-clk-ck" for PHY without quartz +- st,int-phyclk (boolean) : valid only where PHY do not have quartz and need to be clock + by RCC + Example: ethernet@40028000 { diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt index 3d2a031217da..7fd4e8ce4149 100644 --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt @@ -4,6 +4,7 @@ Required properties: - compatible: Should be one of the following: * "qcom,ath10k" * "qcom,ipq4019-wifi" + * "qcom,wcn3990-wifi" PCI based devices uses compatible string "qcom,ath10k" and takes calibration data along with board specific data via "qcom,ath10k-calibration-data". @@ -18,8 +19,12 @@ In general, entry "qcom,ath10k-pre-calibration-data" and "qcom,ath10k-calibration-data" conflict with each other and only one can be provided per device. +SNOC based devices (i.e. wcn3990) uses compatible string "qcom,wcn3990-wifi". + Optional properties: - reg: Address and length of the register set for the device. +- reg-names: Must include the list of following reg names, + "membase" - resets: Must contain an entry for each entry in reset-names. See ../reset/reseti.txt for details. - reset-names: Must include the list of following reset names, @@ -49,6 +54,8 @@ Optional properties: hw versions. - qcom,ath10k-pre-calibration-data : pre calibration data as an array, the length can vary between hw versions. +- <supply-name>-supply: handle to the regulator device tree node + optional "supply-name" is "vdd-0.8-cx-mx". Example (to supply the calibration data alone): @@ -119,3 +126,27 @@ wifi0: wifi@a000000 { qcom,msi_base = <0x40>; qcom,ath10k-pre-calibration-data = [ 01 02 03 ... ]; }; + +Example (to supply wcn3990 SoC wifi block details): + +wifi@18000000 { + compatible = "qcom,wcn3990-wifi"; + reg = <0x18800000 0x800000>; + reg-names = "membase"; + clocks = <&clock_gcc clk_aggre2_noc_clk>; + clock-names = "smmu_aggre2_noc_clk" + interrupts = + <0 130 0 /* CE0 */ >, + <0 131 0 /* CE1 */ >, + <0 132 0 /* CE2 */ >, + <0 133 0 /* CE3 */ >, + <0 134 0 /* CE4 */ >, + <0 135 0 /* CE5 */ >, + <0 136 0 /* CE6 */ >, + <0 137 0 /* CE7 */ >, + <0 138 0 /* CE8 */ >, + <0 139 0 /* CE9 */ >, + <0 140 0 /* CE10 */ >, + <0 141 0 /* CE11 */ >; + vdd-0.8-cx-mx-supply = <&pm8998_l5>; +}; diff --git a/Documentation/devicetree/bindings/nvmem/zii,rave-sp-eeprom.txt b/Documentation/devicetree/bindings/nvmem/zii,rave-sp-eeprom.txt new file mode 100644 index 000000000000..d5e22fc67d66 --- /dev/null +++ b/Documentation/devicetree/bindings/nvmem/zii,rave-sp-eeprom.txt @@ -0,0 +1,40 @@ +Zodiac Inflight Innovations RAVE EEPROM Bindings + +RAVE SP EEPROM device is a "MFD cell" device exposing physical EEPROM +attached to RAVE Supervisory Processor. It is expected that its Device +Tree node is specified as a child of the node corresponding to the +parent RAVE SP device (as documented in +Documentation/devicetree/bindings/mfd/zii,rave-sp.txt) + +Required properties: + +- compatible: Should be "zii,rave-sp-eeprom" + +Optional properties: + +- zii,eeprom-name: Unique EEPROM identifier describing its function in the + system. Will be used as created NVMEM deivce's name. + +Data cells: + +Data cells are child nodes of eerpom node, bindings for which are +documented in Documentation/bindings/nvmem/nvmem.txt + +Example: + + rave-sp { + compatible = "zii,rave-sp-rdu1"; + current-speed = <38400>; + + eeprom@a4 { + compatible = "zii,rave-sp-eeprom"; + reg = <0xa4 0x4000>; + #address-cells = <1>; + #size-cells = <1>; + zii,eeprom-name = "main-eeprom"; + + wdt_timeout: wdt-timeout@81 { + reg = <0x81 2>; + }; + }; + } diff --git a/Documentation/devicetree/bindings/opp/kryo-cpufreq.txt b/Documentation/devicetree/bindings/opp/kryo-cpufreq.txt new file mode 100644 index 000000000000..c2127b96805a --- /dev/null +++ b/Documentation/devicetree/bindings/opp/kryo-cpufreq.txt @@ -0,0 +1,680 @@ +Qualcomm Technologies, Inc. KRYO CPUFreq and OPP bindings +=================================== + +In Certain Qualcomm Technologies, Inc. SoCs like apq8096 and msm8996 +that have KRYO processors, the CPU ferequencies subset and voltage value +of each OPP varies based on the silicon variant in use. +Qualcomm Technologies, Inc. Process Voltage Scaling Tables +defines the voltage and frequency value based on the msm-id in SMEM +and speedbin blown in the efuse combination. +The qcom-cpufreq-kryo driver reads the msm-id and efuse value from the SoC +to provide the OPP framework with required information (existing HW bitmap). +This is used to determine the voltage and frequency value for each OPP of +operating-points-v2 table when it is parsed by the OPP framework. + +Required properties: +-------------------- +In 'cpus' nodes: +- operating-points-v2: Phandle to the operating-points-v2 table to use. + +In 'operating-points-v2' table: +- compatible: Should be + - 'operating-points-v2-kryo-cpu' for apq8096 and msm8996. +- nvmem-cells: A phandle pointing to a nvmem-cells node representing the + efuse registers that has information about the + speedbin that is used to select the right frequency/voltage + value pair. + Please refer the for nvmem-cells + bindings Documentation/devicetree/bindings/nvmem/nvmem.txt + and also examples below. + +In every OPP node: +- opp-supported-hw: A single 32 bit bitmap value, representing compatible HW. + Bitmap: + 0: MSM8996 V3, speedbin 0 + 1: MSM8996 V3, speedbin 1 + 2: MSM8996 V3, speedbin 2 + 3: unused + 4: MSM8996 SG, speedbin 0 + 5: MSM8996 SG, speedbin 1 + 6: MSM8996 SG, speedbin 2 + 7-31: unused + +Example 1: +--------- + + cpus { + #address-cells = <2>; + #size-cells = <0>; + + CPU0: cpu@0 { + device_type = "cpu"; + compatible = "qcom,kryo"; + reg = <0x0 0x0>; + enable-method = "psci"; + clocks = <&kryocc 0>; + cpu-supply = <&pm8994_s11_saw>; + operating-points-v2 = <&cluster0_opp>; + #cooling-cells = <2>; + next-level-cache = <&L2_0>; + L2_0: l2-cache { + compatible = "cache"; + cache-level = <2>; + }; + }; + + CPU1: cpu@1 { + device_type = "cpu"; + compatible = "qcom,kryo"; + reg = <0x0 0x1>; + enable-method = "psci"; + clocks = <&kryocc 0>; + cpu-supply = <&pm8994_s11_saw>; + operating-points-v2 = <&cluster0_opp>; + #cooling-cells = <2>; + next-level-cache = <&L2_0>; + }; + + CPU2: cpu@100 { + device_type = "cpu"; + compatible = "qcom,kryo"; + reg = <0x0 0x100>; + enable-method = "psci"; + clocks = <&kryocc 1>; + cpu-supply = <&pm8994_s11_saw>; + operating-points-v2 = <&cluster1_opp>; + #cooling-cells = <2>; + next-level-cache = <&L2_1>; + L2_1: l2-cache { + compatible = "cache"; + cache-level = <2>; + }; + }; + + CPU3: cpu@101 { + device_type = "cpu"; + compatible = "qcom,kryo"; + reg = <0x0 0x101>; + enable-method = "psci"; + clocks = <&kryocc 1>; + cpu-supply = <&pm8994_s11_saw>; + operating-points-v2 = <&cluster1_opp>; + #cooling-cells = <2>; + next-level-cache = <&L2_1>; + }; + + cpu-map { + cluster0 { + core0 { + cpu = <&CPU0>; + }; + + core1 { + cpu = <&CPU1>; + }; + }; + + cluster1 { + core0 { + cpu = <&CPU2>; + }; + + core1 { + cpu = <&CPU3>; + }; + }; + }; + }; + + cluster0_opp: opp_table0 { + compatible = "operating-points-v2-kryo-cpu"; + nvmem-cells = <&speedbin_efuse>; + opp-shared; + + opp-307200000 { + opp-hz = /bits/ 64 <307200000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x77>; + clock-latency-ns = <200000>; + }; + opp-384000000 { + opp-hz = /bits/ 64 <384000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-422400000 { + opp-hz = /bits/ 64 <422400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-460800000 { + opp-hz = /bits/ 64 <460800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-480000000 { + opp-hz = /bits/ 64 <480000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-537600000 { + opp-hz = /bits/ 64 <537600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-556800000 { + opp-hz = /bits/ 64 <556800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-614400000 { + opp-hz = /bits/ 64 <614400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-652800000 { + opp-hz = /bits/ 64 <652800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-691200000 { + opp-hz = /bits/ 64 <691200000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-729600000 { + opp-hz = /bits/ 64 <729600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-768000000 { + opp-hz = /bits/ 64 <768000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-844800000 { + opp-hz = /bits/ 64 <844800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x77>; + clock-latency-ns = <200000>; + }; + opp-902400000 { + opp-hz = /bits/ 64 <902400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-960000000 { + opp-hz = /bits/ 64 <960000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-979200000 { + opp-hz = /bits/ 64 <979200000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1036800000 { + opp-hz = /bits/ 64 <1036800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1056000000 { + opp-hz = /bits/ 64 <1056000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1113600000 { + opp-hz = /bits/ 64 <1113600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1132800000 { + opp-hz = /bits/ 64 <1132800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1190400000 { + opp-hz = /bits/ 64 <1190400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1209600000 { + opp-hz = /bits/ 64 <1209600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1228800000 { + opp-hz = /bits/ 64 <1228800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1286400000 { + opp-hz = /bits/ 64 <1286400000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1324800000 { + opp-hz = /bits/ 64 <1324800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x5>; + clock-latency-ns = <200000>; + }; + opp-1363200000 { + opp-hz = /bits/ 64 <1363200000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x72>; + clock-latency-ns = <200000>; + }; + opp-1401600000 { + opp-hz = /bits/ 64 <1401600000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x5>; + clock-latency-ns = <200000>; + }; + opp-1440000000 { + opp-hz = /bits/ 64 <1440000000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1478400000 { + opp-hz = /bits/ 64 <1478400000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x1>; + clock-latency-ns = <200000>; + }; + opp-1497600000 { + opp-hz = /bits/ 64 <1497600000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x4>; + clock-latency-ns = <200000>; + }; + opp-1516800000 { + opp-hz = /bits/ 64 <1516800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1593600000 { + opp-hz = /bits/ 64 <1593600000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x71>; + clock-latency-ns = <200000>; + }; + opp-1996800000 { + opp-hz = /bits/ 64 <1996800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x20>; + clock-latency-ns = <200000>; + }; + opp-2188800000 { + opp-hz = /bits/ 64 <2188800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x10>; + clock-latency-ns = <200000>; + }; + }; + + cluster1_opp: opp_table1 { + compatible = "operating-points-v2-kryo-cpu"; + nvmem-cells = <&speedbin_efuse>; + opp-shared; + + opp-307200000 { + opp-hz = /bits/ 64 <307200000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x77>; + clock-latency-ns = <200000>; + }; + opp-384000000 { + opp-hz = /bits/ 64 <384000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-403200000 { + opp-hz = /bits/ 64 <403200000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-460800000 { + opp-hz = /bits/ 64 <460800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-480000000 { + opp-hz = /bits/ 64 <480000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-537600000 { + opp-hz = /bits/ 64 <537600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-556800000 { + opp-hz = /bits/ 64 <556800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-614400000 { + opp-hz = /bits/ 64 <614400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-652800000 { + opp-hz = /bits/ 64 <652800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-691200000 { + opp-hz = /bits/ 64 <691200000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-729600000 { + opp-hz = /bits/ 64 <729600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-748800000 { + opp-hz = /bits/ 64 <748800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-806400000 { + opp-hz = /bits/ 64 <806400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-825600000 { + opp-hz = /bits/ 64 <825600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-883200000 { + opp-hz = /bits/ 64 <883200000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-902400000 { + opp-hz = /bits/ 64 <902400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-940800000 { + opp-hz = /bits/ 64 <940800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-979200000 { + opp-hz = /bits/ 64 <979200000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1036800000 { + opp-hz = /bits/ 64 <1036800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1056000000 { + opp-hz = /bits/ 64 <1056000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1113600000 { + opp-hz = /bits/ 64 <1113600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1132800000 { + opp-hz = /bits/ 64 <1132800000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1190400000 { + opp-hz = /bits/ 64 <1190400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1209600000 { + opp-hz = /bits/ 64 <1209600000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1248000000 { + opp-hz = /bits/ 64 <1248000000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1286400000 { + opp-hz = /bits/ 64 <1286400000>; + opp-microvolt = <905000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1324800000 { + opp-hz = /bits/ 64 <1324800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1363200000 { + opp-hz = /bits/ 64 <1363200000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1401600000 { + opp-hz = /bits/ 64 <1401600000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1440000000 { + opp-hz = /bits/ 64 <1440000000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1478400000 { + opp-hz = /bits/ 64 <1478400000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1516800000 { + opp-hz = /bits/ 64 <1516800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1555200000 { + opp-hz = /bits/ 64 <1555200000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1593600000 { + opp-hz = /bits/ 64 <1593600000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1632000000 { + opp-hz = /bits/ 64 <1632000000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1670400000 { + opp-hz = /bits/ 64 <1670400000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1708800000 { + opp-hz = /bits/ 64 <1708800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1747200000 { + opp-hz = /bits/ 64 <1747200000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x70>; + clock-latency-ns = <200000>; + }; + opp-1785600000 { + opp-hz = /bits/ 64 <1785600000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x7>; + clock-latency-ns = <200000>; + }; + opp-1804800000 { + opp-hz = /bits/ 64 <1804800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x6>; + clock-latency-ns = <200000>; + }; + opp-1824000000 { + opp-hz = /bits/ 64 <1824000000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x71>; + clock-latency-ns = <200000>; + }; + opp-1900800000 { + opp-hz = /bits/ 64 <1900800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x74>; + clock-latency-ns = <200000>; + }; + opp-1920000000 { + opp-hz = /bits/ 64 <1920000000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x1>; + clock-latency-ns = <200000>; + }; + opp-1977600000 { + opp-hz = /bits/ 64 <1977600000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x30>; + clock-latency-ns = <200000>; + }; + opp-1996800000 { + opp-hz = /bits/ 64 <1996800000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x1>; + clock-latency-ns = <200000>; + }; + opp-2054400000 { + opp-hz = /bits/ 64 <2054400000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x30>; + clock-latency-ns = <200000>; + }; + opp-2073600000 { + opp-hz = /bits/ 64 <2073600000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x1>; + clock-latency-ns = <200000>; + }; + opp-2150400000 { + opp-hz = /bits/ 64 <2150400000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x31>; + clock-latency-ns = <200000>; + }; + opp-2246400000 { + opp-hz = /bits/ 64 <2246400000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x10>; + clock-latency-ns = <200000>; + }; + opp-2342400000 { + opp-hz = /bits/ 64 <2342400000>; + opp-microvolt = <1140000 905000 1140000>; + opp-supported-hw = <0x10>; + clock-latency-ns = <200000>; + }; + }; + +.... + +reserved-memory { + #address-cells = <2>; + #size-cells = <2>; + ranges; +.... + smem_mem: smem-mem@86000000 { + reg = <0x0 0x86000000 0x0 0x200000>; + no-map; + }; +.... +}; + +smem { + compatible = "qcom,smem"; + memory-region = <&smem_mem>; + hwlocks = <&tcsr_mutex 3>; +}; + +soc { +.... + qfprom: qfprom@74000 { + compatible = "qcom,qfprom"; + reg = <0x00074000 0x8ff>; + #address-cells = <1>; + #size-cells = <1>; + .... + speedbin_efuse: speedbin@133 { + reg = <0x133 0x1>; + bits = <5 3>; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/opp/opp.txt b/Documentation/devicetree/bindings/opp/opp.txt index 4e4f30288c8b..c396c4c0af92 100644 --- a/Documentation/devicetree/bindings/opp/opp.txt +++ b/Documentation/devicetree/bindings/opp/opp.txt @@ -82,7 +82,10 @@ This defines voltage-current-frequency combinations along with other related properties. Required properties: -- opp-hz: Frequency in Hz, expressed as a 64-bit big-endian integer. +- opp-hz: Frequency in Hz, expressed as a 64-bit big-endian integer. This is a + required property for all device nodes but devices like power domains. The + power domain nodes must have another (implementation dependent) property which + uniquely identifies the OPP nodes. Optional properties: - opp-microvolt: voltage in micro Volts. @@ -159,7 +162,7 @@ Optional properties: - status: Marks the node enabled/disabled. -- required-opp: This contains phandle to an OPP node in another device's OPP +- required-opps: This contains phandle to an OPP node in another device's OPP table. It may contain an array of phandles, where each phandle points to an OPP of a different device. It should not contain multiple phandles to the OPP nodes in the same OPP table. This specifies the minimum required OPP of the diff --git a/Documentation/devicetree/bindings/pci/designware-pcie.txt b/Documentation/devicetree/bindings/pci/designware-pcie.txt index 1da7ade3183c..c124f9bc11f3 100644 --- a/Documentation/devicetree/bindings/pci/designware-pcie.txt +++ b/Documentation/devicetree/bindings/pci/designware-pcie.txt @@ -1,7 +1,9 @@ * Synopsys DesignWare PCIe interface Required properties: -- compatible: should contain "snps,dw-pcie" to identify the core. +- compatible: + "snps,dw-pcie" for RC mode; + "snps,dw-pcie-ep" for EP mode; - reg: Should contain the configuration address space. - reg-names: Must be "config" for the PCIe configuration space. (The old way of getting the configuration address space from "ranges" @@ -41,11 +43,11 @@ EP mode: Example configuration: - pcie: pcie@dffff000 { + pcie: pcie@dfc00000 { compatible = "snps,dw-pcie"; - reg = <0xdffff000 0x1000>, /* Controller registers */ - <0xd0000000 0x2000>; /* PCI config space */ - reg-names = "ctrlreg", "config"; + reg = <0xdfc00000 0x0001000>, /* IP registers */ + <0xd0000000 0x0002000>; /* Configuration space */ + reg-names = "dbi", "config"; #address-cells = <3>; #size-cells = <2>; device_type = "pci"; @@ -54,5 +56,15 @@ Example configuration: interrupts = <25>, <24>; #interrupt-cells = <1>; num-lanes = <1>; - num-viewport = <3>; + }; +or + pcie: pcie@dfc00000 { + compatible = "snps,dw-pcie-ep"; + reg = <0xdfc00000 0x0001000>, /* IP registers 1 */ + <0xdfc01000 0x0001000>, /* IP registers 2 */ + <0xd0000000 0x2000000>; /* Configuration space */ + reg-names = "dbi", "dbi2", "addr_space"; + num-ib-windows = <6>; + num-ob-windows = <2>; + num-lanes = <1>; }; diff --git a/Documentation/devicetree/bindings/pci/mobiveil-pcie.txt b/Documentation/devicetree/bindings/pci/mobiveil-pcie.txt new file mode 100644 index 000000000000..65038aa642e5 --- /dev/null +++ b/Documentation/devicetree/bindings/pci/mobiveil-pcie.txt @@ -0,0 +1,73 @@ +* Mobiveil AXI PCIe Root Port Bridge DT description + +Mobiveil's GPEX 4.0 is a PCIe Gen4 root port bridge IP. This configurable IP +has up to 8 outbound and inbound windows for the address translation. + +Required properties: +- #address-cells: Address representation for root ports, set to <3> +- #size-cells: Size representation for root ports, set to <2> +- #interrupt-cells: specifies the number of cells needed to encode an + interrupt source. The value must be 1. +- compatible: Should contain "mbvl,gpex40-pcie" +- reg: Should contain PCIe registers location and length + "config_axi_slave": PCIe controller registers + "csr_axi_slave" : Bridge config registers + "gpio_slave" : GPIO registers to control slot power + "apb_csr" : MSI registers + +- device_type: must be "pci" +- apio-wins : number of requested apio outbound windows + default 2 outbound windows are configured - + 1. Config window + 2. Memory window +- ppio-wins : number of requested ppio inbound windows + default 1 inbound memory window is configured. +- bus-range: PCI bus numbers covered +- interrupt-controller: identifies the node as an interrupt controller +- #interrupt-cells: specifies the number of cells needed to encode an + interrupt source. The value must be 1. +- interrupt-parent : phandle to the interrupt controller that + it is attached to, it should be set to gic to point to + ARM's Generic Interrupt Controller node in system DT. +- interrupts: The interrupt line of the PCIe controller + last cell of this field is set to 4 to + denote it as IRQ_TYPE_LEVEL_HIGH type interrupt. +- interrupt-map-mask, + interrupt-map: standard PCI properties to define the mapping of the + PCI interface to interrupt numbers. +- ranges: ranges for the PCI memory regions (I/O space region is not + supported by hardware) + Please refer to the standard PCI bus binding document for a more + detailed explanation + + +Example: +++++++++ + pcie0: pcie@a0000000 { + #address-cells = <3>; + #size-cells = <2>; + compatible = "mbvl,gpex40-pcie"; + reg = <0xa0000000 0x00001000>, + <0xb0000000 0x00010000>, + <0xff000000 0x00200000>, + <0xb0010000 0x00001000>; + reg-names = "config_axi_slave", + "csr_axi_slave", + "gpio_slave", + "apb_csr"; + device_type = "pci"; + apio-wins = <2>; + ppio-wins = <1>; + bus-range = <0x00000000 0x000000ff>; + interrupt-controller; + interrupt-parent = <&gic>; + #interrupt-cells = <1>; + interrupts = < 0 89 4 >; + interrupt-map-mask = <0 0 0 7>; + interrupt-map = <0 0 0 0 &pci_express 0>, + <0 0 0 1 &pci_express 1>, + <0 0 0 2 &pci_express 2>, + <0 0 0 3 &pci_express 3>; + ranges = < 0x83000000 0 0x00000000 0xa8000000 0 0x8000000>; + + }; diff --git a/Documentation/devicetree/bindings/pci/pci-armada8k.txt b/Documentation/devicetree/bindings/pci/pci-armada8k.txt index c1e4c3d10a74..9e3fc15e1af8 100644 --- a/Documentation/devicetree/bindings/pci/pci-armada8k.txt +++ b/Documentation/devicetree/bindings/pci/pci-armada8k.txt @@ -12,7 +12,10 @@ Required properties: - "ctrl" for the control register region - "config" for the config space region - interrupts: Interrupt specifier for the PCIe controler -- clocks: reference to the PCIe controller clock +- clocks: reference to the PCIe controller clocks +- clock-names: mandatory if there is a second clock, in this case the + name must be "core" for the first clock and "reg" for the second + one Example: diff --git a/Documentation/devicetree/bindings/pci/rcar-pci.txt b/Documentation/devicetree/bindings/pci/rcar-pci.txt index 1fb614e615da..a5f7fc62d10e 100644 --- a/Documentation/devicetree/bindings/pci/rcar-pci.txt +++ b/Documentation/devicetree/bindings/pci/rcar-pci.txt @@ -8,6 +8,7 @@ compatible: "renesas,pcie-r8a7743" for the R8A7743 SoC; "renesas,pcie-r8a7793" for the R8A7793 SoC; "renesas,pcie-r8a7795" for the R8A7795 SoC; "renesas,pcie-r8a7796" for the R8A7796 SoC; + "renesas,pcie-r8a77980" for the R8A77980 SoC; "renesas,pcie-rcar-gen2" for a generic R-Car Gen2 or RZ/G1 compatible device. "renesas,pcie-rcar-gen3" for a generic R-Car Gen3 compatible device. @@ -32,6 +33,11 @@ compatible: "renesas,pcie-r8a7743" for the R8A7743 SoC; and PCIe bus clocks. - clock-names: from common clock binding: should be "pcie" and "pcie_bus". +Optional properties: +- phys: from common PHY binding: PHY phandle and specifier (only make sense + for R-Car gen3 SoCs where the PCIe PHYs have their own register blocks). +- phy-names: from common PHY binding: should be "pcie". + Example: SoC-specific DT Entry: diff --git a/Documentation/devicetree/bindings/pci/rockchip-pcie-ep.txt b/Documentation/devicetree/bindings/pci/rockchip-pcie-ep.txt new file mode 100644 index 000000000000..778467307a93 --- /dev/null +++ b/Documentation/devicetree/bindings/pci/rockchip-pcie-ep.txt @@ -0,0 +1,62 @@ +* Rockchip AXI PCIe Endpoint Controller DT description + +Required properties: +- compatible: Should contain "rockchip,rk3399-pcie-ep" +- reg: Two register ranges as listed in the reg-names property +- reg-names: Must include the following names + - "apb-base" + - "mem-base" +- clocks: Must contain an entry for each entry in clock-names. + See ../clocks/clock-bindings.txt for details. +- clock-names: Must include the following entries: + - "aclk" + - "aclk-perf" + - "hclk" + - "pm" +- resets: Must contain seven entries for each entry in reset-names. + See ../reset/reset.txt for details. +- reset-names: Must include the following names + - "core" + - "mgmt" + - "mgmt-sticky" + - "pipe" + - "pm" + - "aclk" + - "pclk" +- pinctrl-names : The pin control state names +- pinctrl-0: The "default" pinctrl state +- phys: Must contain an phandle to a PHY for each entry in phy-names. +- phy-names: Must include 4 entries for all 4 lanes even if some of + them won't be used for your cases. Entries are of the form "pcie-phy-N": + where N ranges from 0 to 3. + (see example below and you MUST also refer to ../phy/rockchip-pcie-phy.txt + for changing the #phy-cells of phy node to support it) +- rockchip,max-outbound-regions: Maximum number of outbound regions + +Optional Property: +- num-lanes: number of lanes to use +- max-functions: Maximum number of functions that can be configured (default 1). + +pcie0-ep: pcie@f8000000 { + compatible = "rockchip,rk3399-pcie-ep"; + #address-cells = <3>; + #size-cells = <2>; + rockchip,max-outbound-regions = <16>; + clocks = <&cru ACLK_PCIE>, <&cru ACLK_PERF_PCIE>, + <&cru PCLK_PCIE>, <&cru SCLK_PCIE_PM>; + clock-names = "aclk", "aclk-perf", + "hclk", "pm"; + max-functions = /bits/ 8 <8>; + num-lanes = <4>; + reg = <0x0 0xfd000000 0x0 0x1000000>, <0x0 0x80000000 0x0 0x20000>; + reg-names = "apb-base", "mem-base"; + resets = <&cru SRST_PCIE_CORE>, <&cru SRST_PCIE_MGMT>, + <&cru SRST_PCIE_MGMT_STICKY>, <&cru SRST_PCIE_PIPE> , + <&cru SRST_PCIE_PM>, <&cru SRST_P_PCIE>, <&cru SRST_A_PCIE>; + reset-names = "core", "mgmt", "mgmt-sticky", "pipe", + "pm", "pclk", "aclk"; + phys = <&pcie_phy 0>, <&pcie_phy 1>, <&pcie_phy 2>, <&pcie_phy 3>; + phy-names = "pcie-phy-0", "pcie-phy-1", "pcie-phy-2", "pcie-phy-3"; + pinctrl-names = "default"; + pinctrl-0 = <&pcie_clkreq>; +}; diff --git a/Documentation/devicetree/bindings/pci/rockchip-pcie.txt b/Documentation/devicetree/bindings/pci/rockchip-pcie-host.txt index af34c65773fd..af34c65773fd 100644 --- a/Documentation/devicetree/bindings/pci/rockchip-pcie.txt +++ b/Documentation/devicetree/bindings/pci/rockchip-pcie-host.txt diff --git a/Documentation/devicetree/bindings/pci/xgene-pci.txt b/Documentation/devicetree/bindings/pci/xgene-pci.txt index 6fd2decfa66c..92490330dc1c 100644 --- a/Documentation/devicetree/bindings/pci/xgene-pci.txt +++ b/Documentation/devicetree/bindings/pci/xgene-pci.txt @@ -25,8 +25,6 @@ Optional properties: Example: -SoC-specific DT Entry: - pcie0: pcie@1f2b0000 { status = "disabled"; device_type = "pci"; @@ -50,8 +48,3 @@ SoC-specific DT Entry: clocks = <&pcie0clk 0>; }; - -Board-specific DT Entry: - &pcie0 { - status = "ok"; - }; diff --git a/Documentation/devicetree/bindings/phy/phy-mtk-xsphy.txt b/Documentation/devicetree/bindings/phy/phy-mtk-xsphy.txt new file mode 100644 index 000000000000..e7caefa0b9c2 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/phy-mtk-xsphy.txt @@ -0,0 +1,109 @@ +MediaTek XS-PHY binding +-------------------------- + +The XS-PHY controller supports physical layer functionality for USB3.1 +GEN2 controller on MediaTek SoCs. + +Required properties (controller (parent) node): + - compatible : should be "mediatek,<soc-model>-xsphy", "mediatek,xsphy", + soc-model is the name of SoC, such as mt3611 etc; + when using "mediatek,xsphy" compatible string, you need SoC specific + ones in addition, one of: + - "mediatek,mt3611-xsphy" + + - #address-cells, #size-cells : should use the same values as the root node + - ranges: must be present + +Optional properties (controller (parent) node): + - reg : offset and length of register shared by multiple U3 ports, + exclude port's private register, if only U2 ports provided, + shouldn't use the property. + - mediatek,src-ref-clk-mhz : u32, frequency of reference clock for slew rate + calibrate + - mediatek,src-coef : u32, coefficient for slew rate calibrate, depends on + SoC process + +Required nodes : a sub-node is required for each port the controller + provides. Address range information including the usual + 'reg' property is used inside these nodes to describe + the controller's topology. + +Required properties (port (child) node): +- reg : address and length of the register set for the port. +- clocks : a list of phandle + clock-specifier pairs, one for each + entry in clock-names +- clock-names : must contain + "ref": 48M reference clock for HighSpeed analog phy; and 26M + reference clock for SuperSpeedPlus analog phy, sometimes is + 24M, 25M or 27M, depended on platform. +- #phy-cells : should be 1 + cell after port phandle is phy type from: + - PHY_TYPE_USB2 + - PHY_TYPE_USB3 + +The following optional properties are only for debug or HQA test +Optional properties (PHY_TYPE_USB2 port (child) node): +- mediatek,eye-src : u32, the value of slew rate calibrate +- mediatek,eye-vrt : u32, the selection of VRT reference voltage +- mediatek,eye-term : u32, the selection of HS_TX TERM reference voltage +- mediatek,efuse-intr : u32, the selection of Internal Resistor + +Optional properties (PHY_TYPE_USB3 port (child) node): +- mediatek,efuse-intr : u32, the selection of Internal Resistor +- mediatek,efuse-tx-imp : u32, the selection of TX Impedance +- mediatek,efuse-rx-imp : u32, the selection of RX Impedance + +Banks layout of xsphy +------------------------------------------------------------- +port offset bank +u2 port0 0x0000 MISC + 0x0100 FMREG + 0x0300 U2PHY_COM +u2 port1 0x1000 MISC + 0x1100 FMREG + 0x1300 U2PHY_COM +u2 port2 0x2000 MISC + ... +u31 common 0x3000 DIG_GLB + 0x3100 PHYA_GLB +u31 port0 0x3400 DIG_LN_TOP + 0x3500 DIG_LN_TX0 + 0x3600 DIG_LN_RX0 + 0x3700 DIG_LN_DAIF + 0x3800 PHYA_LN +u31 port1 0x3a00 DIG_LN_TOP + 0x3b00 DIG_LN_TX0 + 0x3c00 DIG_LN_RX0 + 0x3d00 DIG_LN_DAIF + 0x3e00 PHYA_LN + ... + +DIG_GLB & PHYA_GLB are shared by U31 ports. + +Example: + +u3phy: usb-phy@11c40000 { + compatible = "mediatek,mt3611-xsphy", "mediatek,xsphy"; + reg = <0 0x11c43000 0 0x0200>; + mediatek,src-ref-clk-mhz = <26>; + mediatek,src-coef = <17>; + #address-cells = <2>; + #size-cells = <2>; + ranges; + + u2port0: usb-phy@11c40000 { + reg = <0 0x11c40000 0 0x0400>; + clocks = <&clk48m>; + clock-names = "ref"; + mediatek,eye-src = <4>; + #phy-cells = <1>; + }; + + u3port0: usb-phy@11c43000 { + reg = <0 0x11c43400 0 0x0500>; + clocks = <&clk26m>; + clock-names = "ref"; + mediatek,efuse-intr = <28>; + #phy-cells = <1>; + }; +}; diff --git a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt index dcf1b8f691d5..266a1bb8bb6e 100644 --- a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt +++ b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt @@ -9,7 +9,8 @@ Required properties: "qcom,ipq8074-qmp-pcie-phy" for PCIe phy on IPQ8074 "qcom,msm8996-qmp-pcie-phy" for 14nm PCIe phy on msm8996, "qcom,msm8996-qmp-usb3-phy" for 14nm USB3 phy on msm8996, - "qcom,qmp-v3-usb3-phy" for USB3 QMP V3 phy. + "qcom,sdm845-qmp-usb3-phy" for USB3 QMP V3 phy on sdm845, + "qcom,sdm845-qmp-usb3-uni-phy" for USB3 QMP V3 UNI phy on sdm845. - reg: offset and length of register set for PHY's common serdes block. diff --git a/Documentation/devicetree/bindings/phy/qcom-qusb2-phy.txt b/Documentation/devicetree/bindings/phy/qcom-qusb2-phy.txt index 42c97426836e..03025d97998b 100644 --- a/Documentation/devicetree/bindings/phy/qcom-qusb2-phy.txt +++ b/Documentation/devicetree/bindings/phy/qcom-qusb2-phy.txt @@ -6,7 +6,7 @@ QUSB2 controller supports LS/FS/HS usb connectivity on Qualcomm chipsets. Required properties: - compatible: compatible list, contains "qcom,msm8996-qusb2-phy" for 14nm PHY on msm8996, - "qcom,qusb2-v2-phy" for QUSB2 V2 PHY. + "qcom,sdm845-qusb2-phy" for 10nm PHY on sdm845. - reg: offset and length of the PHY register set. - #phy-cells: must be 0. @@ -27,6 +27,27 @@ Optional properties: tuning parameter value for qusb2 phy. - qcom,tcsr-syscon: Phandle to TCSR syscon register region. + - qcom,imp-res-offset-value: It is a 6 bit value that specifies offset to be + added to PHY refgen RESCODE via IMP_CTRL1 register. It is a PHY + tuning parameter that may vary for different boards of same SOC. + This property is applicable to only QUSB2 v2 PHY (sdm845). + - qcom,hstx-trim-value: It is a 4 bit value that specifies tuning for HSTX + output current. + Possible range is - 15mA to 24mA (stepsize of 600 uA). + See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. + This property is applicable to only QUSB2 v2 PHY (sdm845). + Default value is 22.2mA for sdm845. + - qcom,preemphasis-level: It is a 2 bit value that specifies pre-emphasis level. + Possible range is 0 to 15% (stepsize of 5%). + See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. + This property is applicable to only QUSB2 v2 PHY (sdm845). + Default value is 10% for sdm845. +- qcom,preemphasis-width: It is a 1 bit value that specifies how long the HSTX + pre-emphasis (specified using qcom,preemphasis-level) must be in + effect. Duration could be half-bit of full-bit. + See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. + This property is applicable to only QUSB2 v2 PHY (sdm845). + Default value is full-bit width for sdm845. Example: hsusb_phy: phy@7411000 { diff --git a/Documentation/devicetree/bindings/pinctrl/actions,s900-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/actions,s900-pinctrl.txt index fb87c7d74f2e..8fb5a53775e8 100644 --- a/Documentation/devicetree/bindings/pinctrl/actions,s900-pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/actions,s900-pinctrl.txt @@ -8,6 +8,17 @@ Required Properties: - reg: Should contain the register base address and size of the pin controller. - clocks: phandle of the clock feeding the pin controller +- gpio-controller: Marks the device node as a GPIO controller. +- gpio-ranges: Specifies the mapping between gpio controller and + pin-controller pins. +- #gpio-cells: Should be two. The first cell is the gpio pin number + and the second cell is used for optional parameters. +- interrupt-controller: Marks the device node as an interrupt controller. +- #interrupt-cells: Specifies the number of cells needed to encode an + interrupt. Shall be set to 2. The first cell + defines the interrupt number, the second encodes + the trigger flags described in + bindings/interrupt-controller/interrupts.txt Please refer to pinctrl-bindings.txt in this directory for details of the common pinctrl bindings used by client devices, including the meaning of the @@ -164,6 +175,11 @@ Example: compatible = "actions,s900-pinctrl"; reg = <0x0 0xe01b0000 0x0 0x1000>; clocks = <&cmu CLK_GPIO>; + gpio-controller; + gpio-ranges = <&pinctrl 0 0 146>; + #gpio-cells = <2>; + interrupt-controller; + #interrupt-cells = <2>; uart2-default: uart2-default { pinmux { diff --git a/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt index ed5eb547afc8..258a4648ab81 100644 --- a/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt @@ -28,6 +28,7 @@ Required properties: "allwinner,sun50i-a64-r-pinctrl" "allwinner,sun50i-h5-pinctrl" "allwinner,sun50i-h6-pinctrl" + "allwinner,sun50i-h6-r-pinctrl" "nextthing,gr8-pinctrl" - reg: Should contain the register physical address and length for the @@ -56,9 +57,9 @@ pins it needs, and how they should be configured, with regard to muxer configuration, drive strength and pullups. If one of these options is not set, its actual value will be unspecified. -This driver supports the generic pin multiplexing and configuration -bindings. For details on each properties, you can refer to -./pinctrl-bindings.txt. +Allwinner A1X Pin Controller supports the generic pin multiplexing and +configuration bindings. For details on each properties, you can refer to + ./pinctrl-bindings.txt. Required sub-node properties: - pins diff --git a/Documentation/devicetree/bindings/pinctrl/brcm,bcm2835-gpio.txt b/Documentation/devicetree/bindings/pinctrl/brcm,bcm2835-gpio.txt index 2569866c692f..3fac0a061bcc 100644 --- a/Documentation/devicetree/bindings/pinctrl/brcm,bcm2835-gpio.txt +++ b/Documentation/devicetree/bindings/pinctrl/brcm,bcm2835-gpio.txt @@ -36,6 +36,24 @@ listed. In other words, a subnode that lists only a mux function implies no information about any pull configuration. Similarly, a subnode that lists only a pul parameter implies no information about the mux function. +The BCM2835 pin configuration and multiplexing supports the generic bindings. +For details on each properties, you can refer to ./pinctrl-bindings.txt. + +Required sub-node properties: + - pins + - function + +Optional sub-node properties: + - bias-disable + - bias-pull-up + - bias-pull-down + - output-high + - output-low + +Legacy pin configuration and multiplexing binding: +*** (Its use is deprecated, use generic multiplexing and configuration +bindings instead) + Required subnode-properties: - brcm,pins: An array of cells. Each cell contains the ID of a pin. Valid IDs are the integer GPIO IDs; 0==GPIO0, 1==GPIO1, ... 53==GPIO53. diff --git a/Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt index 2c12f9789116..54ecb8ab7788 100644 --- a/Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt @@ -3,8 +3,10 @@ Required properties for the root node: - compatible: one of "amlogic,meson8-cbus-pinctrl" "amlogic,meson8b-cbus-pinctrl" + "amlogic,meson8m2-cbus-pinctrl" "amlogic,meson8-aobus-pinctrl" "amlogic,meson8b-aobus-pinctrl" + "amlogic,meson8m2-aobus-pinctrl" "amlogic,meson-gxbb-periphs-pinctrl" "amlogic,meson-gxbb-aobus-pinctrl" "amlogic,meson-gxl-periphs-pinctrl" diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-mcp23s08.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-mcp23s08.txt index a5a8322a31bd..a677145ae6d1 100644 --- a/Documentation/devicetree/bindings/pinctrl/pinctrl-mcp23s08.txt +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-mcp23s08.txt @@ -18,7 +18,9 @@ Required properties: removed. - #gpio-cells : Should be two. - first cell is the pin number - - second cell is used to specify flags. Flags are currently unused. + - second cell is used to specify flags as described in + 'Documentation/devicetree/bindings/gpio/gpio.txt'. Allowed values defined by + 'include/dt-bindings/gpio/gpio.h' (e.g. GPIO_ACTIVE_LOW). - gpio-controller : Marks the device node as a GPIO controller. - reg : For an address on its bus. I2C uses this a the I2C address of the chip. SPI uses this to specify the chipselect line which the chip is diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-mt7622.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-mt7622.txt index f18ed99f6e14..def8fcad8941 100644 --- a/Documentation/devicetree/bindings/pinctrl/pinctrl-mt7622.txt +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-mt7622.txt @@ -9,6 +9,16 @@ Required properties for the root node: - #gpio-cells: Should be two. The first cell is the pin number and the second is the GPIO flags. +Optional properties: +- interrupt-controller : Marks the device node as an interrupt controller + +If the property interrupt-controller is defined, following property is required +- reg-names: A string describing the "reg" entries. Must contain "eint". +- interrupts : The interrupt output from the controller. +- #interrupt-cells: Should be two. +- interrupt-parent: Phandle of the interrupt parent to which the external + GPIO interrupts are forwarded to. + Please refer to pinctrl-bindings.txt in this directory for details of the common pinctrl bindings used by client devices, including the meaning of the phrase "pin configuration node". diff --git a/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt index 892d8fd7b700..abd8fbcf1e62 100644 --- a/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt @@ -15,6 +15,7 @@ Required Properties: - "renesas,pfc-r8a7740": for R8A7740 (R-Mobile A1) compatible pin-controller. - "renesas,pfc-r8a7743": for R8A7743 (RZ/G1M) compatible pin-controller. - "renesas,pfc-r8a7745": for R8A7745 (RZ/G1E) compatible pin-controller. + - "renesas,pfc-r8a77470": for R8A77470 (RZ/G1C) compatible pin-controller. - "renesas,pfc-r8a7778": for R8A7778 (R-Car M1) compatible pin-controller. - "renesas,pfc-r8a7779": for R8A7779 (R-Car H1) compatible pin-controller. - "renesas,pfc-r8a7790": for R8A7790 (R-Car H2) compatible pin-controller. @@ -27,6 +28,7 @@ Required Properties: - "renesas,pfc-r8a77965": for R8A77965 (R-Car M3-N) compatible pin-controller. - "renesas,pfc-r8a77970": for R8A77970 (R-Car V3M) compatible pin-controller. - "renesas,pfc-r8a77980": for R8A77980 (R-Car V3H) compatible pin-controller. + - "renesas,pfc-r8a77990": for R8A77990 (R-Car E3) compatible pin-controller. - "renesas,pfc-r8a77995": for R8A77995 (R-Car D3) compatible pin-controller. - "renesas,pfc-sh73a0": for SH73A0 (SH-Mobile AG5) compatible pin-controller. diff --git a/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt index a01a3b8a2363..0919db294c17 100644 --- a/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt @@ -20,6 +20,7 @@ defined as gpio sub-nodes of the pinmux controller. Required properties for iomux controller: - compatible: should be + "rockchip,px30-pinctrl": for Rockchip PX30 "rockchip,rv1108-pinctrl": for Rockchip RV1108 "rockchip,rk2928-pinctrl": for Rockchip RK2928 "rockchip,rk3066a-pinctrl": for Rockchip RK3066a diff --git a/Documentation/devicetree/bindings/power/power_domain.txt b/Documentation/devicetree/bindings/power/power_domain.txt index f3355313c020..4733f76cbe48 100644 --- a/Documentation/devicetree/bindings/power/power_domain.txt +++ b/Documentation/devicetree/bindings/power/power_domain.txt @@ -127,7 +127,7 @@ inside a PM domain with index 0 of a power controller represented by a node with the label "power". Optional properties: -- required-opp: This contains phandle to an OPP node in another device's OPP +- required-opps: This contains phandle to an OPP node in another device's OPP table. It may contain an array of phandles, where each phandle points to an OPP of a different device. It should not contain multiple phandles to the OPP nodes in the same OPP table. This specifies the minimum required OPP of the @@ -175,14 +175,14 @@ Example: compatible = "foo,i-leak-current"; reg = <0x12350000 0x1000>; power-domains = <&power 0>; - required-opp = <&domain0_opp_0>; + required-opps = <&domain0_opp_0>; }; leaky-device1@12350000 { compatible = "foo,i-leak-current"; reg = <0x12350000 0x1000>; power-domains = <&power 1>; - required-opp = <&domain1_opp_1>; + required-opps = <&domain1_opp_1>; }; [1]. Documentation/devicetree/bindings/power/domain-idle-state.txt diff --git a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt index 4a4766e9c254..e66fd4eab71c 100644 --- a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt +++ b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt @@ -31,6 +31,8 @@ SoC is on the same page. Required properties: - compatible: should be one of: + - "rockchip,px30-io-voltage-domain" for px30 + - "rockchip,px30-pmu-io-voltage-domain" for px30 pmu-domains - "rockchip,rk3188-io-voltage-domain" for rk3188 - "rockchip,rk3228-io-voltage-domain" for rk3228 - "rockchip,rk3288-io-voltage-domain" for rk3288 @@ -51,6 +53,19 @@ a phandle the relevant regulator. All specified supplies must be able to report their voltage. The IO Voltage Domain for any non-specified supplies will be not be touched. +Possible supplies for PX30: +- vccio6-supply: The supply connected to VCCIO6. +- vccio1-supply: The supply connected to VCCIO1. +- vccio2-supply: The supply connected to VCCIO2. +- vccio3-supply: The supply connected to VCCIO3. +- vccio4-supply: The supply connected to VCCIO4. +- vccio5-supply: The supply connected to VCCIO5. +- vccio-oscgpi-supply: The supply connected to VCCIO_OSCGPI. + +Possible supplies for PX30 pmu-domains: +- pmuio1-supply: The supply connected to PMUIO1. +- pmuio2-supply: The supply connected to PMUIO2. + Possible supplies for rk3188: - ap0-supply: The supply connected to AP0_VCC. - ap1-supply: The supply connected to AP1_VCC. diff --git a/Documentation/devicetree/bindings/power/supply/bq27xxx.txt b/Documentation/devicetree/bindings/power/supply/bq27xxx.txt index 615c1cb6889f..37994fdb18ca 100644 --- a/Documentation/devicetree/bindings/power/supply/bq27xxx.txt +++ b/Documentation/devicetree/bindings/power/supply/bq27xxx.txt @@ -25,6 +25,7 @@ Required properties: * "ti,bq27545" - BQ27545 * "ti,bq27421" - BQ27421 * "ti,bq27425" - BQ27425 + * "ti,bq27426" - BQ27426 * "ti,bq27441" - BQ27441 * "ti,bq27621" - BQ27621 - reg: integer, I2C address of the fuel gauge. diff --git a/Documentation/devicetree/bindings/pps/pps-gpio.txt b/Documentation/devicetree/bindings/pps/pps-gpio.txt index 0de23b793657..3683874832ae 100644 --- a/Documentation/devicetree/bindings/pps/pps-gpio.txt +++ b/Documentation/devicetree/bindings/pps/pps-gpio.txt @@ -20,5 +20,4 @@ Example: assert-falling-edge; compatible = "pps-gpio"; - status = "okay"; }; diff --git a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt new file mode 100644 index 000000000000..0f569d8e73a3 --- /dev/null +++ b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt @@ -0,0 +1,69 @@ +* Freescale QorIQ 1588 timer based PTP clock + +General Properties: + + - compatible Should be "fsl,etsec-ptp" + - reg Offset and length of the register set for the device + - interrupts There should be at least two interrupts. Some devices + have as many as four PTP related interrupts. + +Clock Properties: + + - fsl,cksel Timer reference clock source. + - fsl,tclk-period Timer reference clock period in nanoseconds. + - fsl,tmr-prsc Prescaler, divides the output clock. + - fsl,tmr-add Frequency compensation value. + - fsl,tmr-fiper1 Fixed interval period pulse generator. + - fsl,tmr-fiper2 Fixed interval period pulse generator. + - fsl,max-adj Maximum frequency adjustment in parts per billion. + + These properties set the operational parameters for the PTP + clock. You must choose these carefully for the clock to work right. + Here is how to figure good values: + + TimerOsc = selected reference clock MHz + tclk_period = desired clock period nanoseconds + NominalFreq = 1000 / tclk_period MHz + FreqDivRatio = TimerOsc / NominalFreq (must be greater that 1.0) + tmr_add = ceil(2^32 / FreqDivRatio) + OutputClock = NominalFreq / tmr_prsc MHz + PulseWidth = 1 / OutputClock microseconds + FiperFreq1 = desired frequency in Hz + FiperDiv1 = 1000000 * OutputClock / FiperFreq1 + tmr_fiper1 = tmr_prsc * tclk_period * FiperDiv1 - tclk_period + max_adj = 1000000000 * (FreqDivRatio - 1.0) - 1 + + The calculation for tmr_fiper2 is the same as for tmr_fiper1. The + driver expects that tmr_fiper1 will be correctly set to produce a 1 + Pulse Per Second (PPS) signal, since this will be offered to the PPS + subsystem to synchronize the Linux clock. + + Reference clock source is determined by the value, which is holded + in CKSEL bits in TMR_CTRL register. "fsl,cksel" property keeps the + value, which will be directly written in those bits, that is why, + according to reference manual, the next clock sources can be used: + + <0> - external high precision timer reference clock (TSEC_TMR_CLK + input is used for this purpose); + <1> - eTSEC system clock; + <2> - eTSEC1 transmit clock; + <3> - RTC clock input. + + When this attribute is not used, eTSEC system clock will serve as + IEEE 1588 timer reference clock. + +Example: + + ptp_clock@24e00 { + compatible = "fsl,etsec-ptp"; + reg = <0x24E00 0xB0>; + interrupts = <12 0x8 13 0x8>; + interrupt-parent = < &ipic >; + fsl,cksel = <1>; + fsl,tclk-period = <10>; + fsl,tmr-prsc = <100>; + fsl,tmr-add = <0x999999A4>; + fsl,tmr-fiper1 = <0x3B9AC9F6>; + fsl,tmr-fiper2 = <0x00018696>; + fsl,max-adj = <659999998>; + }; diff --git a/Documentation/devicetree/bindings/pwm/pwm-omap-dmtimer.txt b/Documentation/devicetree/bindings/pwm/pwm-omap-dmtimer.txt index 2e53324fb720..5ccfcc82da08 100644 --- a/Documentation/devicetree/bindings/pwm/pwm-omap-dmtimer.txt +++ b/Documentation/devicetree/bindings/pwm/pwm-omap-dmtimer.txt @@ -2,7 +2,7 @@ Required properties: - compatible: Shall contain "ti,omap-dmtimer-pwm". -- ti,timers: phandle to PWM capable OMAP timer. See arm/omap/timer.txt for info +- ti,timers: phandle to PWM capable OMAP timer. See timer/ti,timer.txt for info about these timers. - #pwm-cells: Should be 3. See pwm.txt in this directory for a description of the cells format. diff --git a/Documentation/devicetree/bindings/regulator/pfuze100.txt b/Documentation/devicetree/bindings/regulator/pfuze100.txt index c6dd3f5e485b..f0ada3b14d70 100644 --- a/Documentation/devicetree/bindings/regulator/pfuze100.txt +++ b/Documentation/devicetree/bindings/regulator/pfuze100.txt @@ -21,7 +21,7 @@ Each regulator is defined using the standard binding for regulators. Example 1: PFUZE100 - pmic: pfuze100@8 { + pfuze100: pmic@8 { compatible = "fsl,pfuze100"; reg = <0x08>; @@ -122,7 +122,7 @@ Example 1: PFUZE100 Example 2: PFUZE200 - pmic: pfuze200@8 { + pfuze200: pmic@8 { compatible = "fsl,pfuze200"; reg = <0x08>; @@ -216,7 +216,7 @@ Example 2: PFUZE200 Example 3: PFUZE3000 - pmic: pfuze3000@8 { + pfuze3000: pmic@8 { compatible = "fsl,pfuze3000"; reg = <0x08>; diff --git a/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt b/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt index 57d2c65899df..406f2e570c50 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt +++ b/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt @@ -110,6 +110,11 @@ Qualcomm SPMI Regulators Definition: Reference to regulator supplying the input pin, as described in the data sheet. +- qcom,saw-reg: + Usage: optional + Value type: <phandle> + Description: Reference to syscon node defining the SAW registers. + The regulator node houses sub-nodes for each regulator within the device. Each sub-node is identified using the node's name, with valid values listed for each @@ -201,6 +206,17 @@ see regulator.txt - with additional custom properties described below: 2 = 0.55 uA 3 = 0.75 uA +- qcom,saw-slave: + Usage: optional + Value type: <boo> + Description: SAW controlled gang slave. Will not be configured. + +- qcom,saw-leader: + Usage: optional + Value type: <boo> + Description: SAW controlled gang leader. Will be configured as + SAW regulator. + Example: regulators { @@ -221,3 +237,32 @@ Example: .... }; + +Example 2: + + saw3: syscon@9A10000 { + compatible = "syscon"; + reg = <0x9A10000 0x1000>; + }; + + ... + + spm-regulators { + compatible = "qcom,pm8994-regulators"; + qcom,saw-reg = <&saw3>; + s8 { + qcom,saw-slave; + }; + s9 { + qcom,saw-slave; + }; + s10 { + qcom,saw-slave; + }; + pm8994_s11_saw: s11 { + qcom,saw-leader; + regulator-always-on; + regulator-min-microvolt = <900000>; + regulator-max-microvolt = <1140000>; + }; + }; diff --git a/Documentation/devicetree/bindings/regulator/regulator.txt b/Documentation/devicetree/bindings/regulator/regulator.txt index 2babe15b618d..a7cd36877bfe 100644 --- a/Documentation/devicetree/bindings/regulator/regulator.txt +++ b/Documentation/devicetree/bindings/regulator/regulator.txt @@ -59,6 +59,11 @@ Optional properties: - regulator-initial-mode: initial operating mode. The set of possible operating modes depends on the capabilities of every hardware so each device binding documentation explains which values the regulator supports. +- regulator-allowed-modes: list of operating modes that software is allowed to + configure for the regulator at run-time. Elements may be specified in any + order. The set of possible operating modes depends on the capabilities of + every hardware so each device binding document explains which values the + regulator supports. - regulator-system-load: Load in uA present on regulator that is not captured by any consumer request. - regulator-pull-down: Enable pull down resistor when the regulator is disabled. @@ -68,6 +73,11 @@ Optional properties: 0: Disable active discharge. 1: Enable active discharge. Absence of this property will leave configuration to default. +- regulator-coupled-with: Regulators with which the regulator + is coupled. The linkage is 2-way - all coupled regulators should be linked + with each other. A regulator should not be coupled with its supplier. +- regulator-coupled-max-spread: Max spread between voltages of coupled regulators + in microvolts. Deprecated properties: - regulator-compatible: If a regulator chip contains multiple diff --git a/Documentation/devicetree/bindings/regulator/rohm,bd71837-regulator.txt b/Documentation/devicetree/bindings/regulator/rohm,bd71837-regulator.txt new file mode 100644 index 000000000000..4edf3137d9f7 --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/rohm,bd71837-regulator.txt @@ -0,0 +1,126 @@ +ROHM BD71837 Power Management Integrated Circuit (PMIC) regulator bindings + +BD71837MWV is a programmable Power Management +IC (PMIC) for powering single-core, dual-core, and +quad-core SoC’s such as NXP-i.MX 8M. It is optimized +for low BOM cost and compact solution footprint. It +integrates 8 Buck regulators and 7 LDO’s to provide all +the power rails required by the SoC and the commonly +used peripherals. + +Required properties: + - regulator-name: should be "buck1", ..., "buck8" and "ldo1", ..., "ldo7" + +List of regulators provided by this controller. BD71837 regulators node +should be sub node of the BD71837 MFD node. See BD71837 MFD bindings at +Documentation/devicetree/bindings/mfd/rohm,bd71837-pmic.txt +Regulator nodes should be named to BUCK_<number> and LDO_<number>. The +definition for each of these nodes is defined using the standard +binding for regulators at +Documentation/devicetree/bindings/regulator/regulator.txt. +Note that if BD71837 starts at RUN state you probably want to use +regulator-boot-on at least for BUCK6 and BUCK7 so that those are not +disabled by driver at startup. LDO5 and LDO6 are supplied by those and +if they are disabled at startup the voltage monitoring for LDO5/LDO6 will +cause PMIC to reset. + +The valid names for regulator nodes are: +BUCK1, BUCK2, BUCK3, BUCK4, BUCK5, BUCK6, BUCK7, BUCK8 +LDO1, LDO2, LDO3, LDO4, LDO5, LDO6, LDO7 + +Optional properties: +- Any optional property defined in bindings/regulator/regulator.txt + +Example: +regulators { + buck1: BUCK1 { + regulator-name = "buck1"; + regulator-min-microvolt = <700000>; + regulator-max-microvolt = <1300000>; + regulator-boot-on; + regulator-ramp-delay = <1250>; + }; + buck2: BUCK2 { + regulator-name = "buck2"; + regulator-min-microvolt = <700000>; + regulator-max-microvolt = <1300000>; + regulator-boot-on; + regulator-always-on; + regulator-ramp-delay = <1250>; + }; + buck3: BUCK3 { + regulator-name = "buck3"; + regulator-min-microvolt = <700000>; + regulator-max-microvolt = <1300000>; + regulator-boot-on; + }; + buck4: BUCK4 { + regulator-name = "buck4"; + regulator-min-microvolt = <700000>; + regulator-max-microvolt = <1300000>; + regulator-boot-on; + }; + buck5: BUCK5 { + regulator-name = "buck5"; + regulator-min-microvolt = <700000>; + regulator-max-microvolt = <1350000>; + regulator-boot-on; + }; + buck6: BUCK6 { + regulator-name = "buck6"; + regulator-min-microvolt = <3000000>; + regulator-max-microvolt = <3300000>; + regulator-boot-on; + }; + buck7: BUCK7 { + regulator-name = "buck7"; + regulator-min-microvolt = <1605000>; + regulator-max-microvolt = <1995000>; + regulator-boot-on; + }; + buck8: BUCK8 { + regulator-name = "buck8"; + regulator-min-microvolt = <800000>; + regulator-max-microvolt = <1400000>; + }; + + ldo1: LDO1 { + regulator-name = "ldo1"; + regulator-min-microvolt = <3000000>; + regulator-max-microvolt = <3300000>; + regulator-boot-on; + }; + ldo2: LDO2 { + regulator-name = "ldo2"; + regulator-min-microvolt = <900000>; + regulator-max-microvolt = <900000>; + regulator-boot-on; + }; + ldo3: LDO3 { + regulator-name = "ldo3"; + regulator-min-microvolt = <1800000>; + regulator-max-microvolt = <3300000>; + }; + ldo4: LDO4 { + regulator-name = "ldo4"; + regulator-min-microvolt = <900000>; + regulator-max-microvolt = <1800000>; + }; + ldo5: LDO5 { + regulator-name = "ldo5"; + regulator-min-microvolt = <1800000>; + regulator-max-microvolt = <3300000>; + }; + ldo6: LDO6 { + regulator-name = "ldo6"; + regulator-min-microvolt = <900000>; + regulator-max-microvolt = <1800000>; + }; + ldo7_reg: LDO7 { + regulator-name = "ldo7"; + regulator-min-microvolt = <1800000>; + regulator-max-microvolt = <3300000>; + }; +}; + + diff --git a/Documentation/devicetree/bindings/regulator/sy8106a-regulator.txt b/Documentation/devicetree/bindings/regulator/sy8106a-regulator.txt new file mode 100644 index 000000000000..39a8ca73f572 --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/sy8106a-regulator.txt @@ -0,0 +1,23 @@ +SY8106A Voltage regulator + +Required properties: +- compatible: Must be "silergy,sy8106a" +- reg: I2C slave address - must be <0x65> +- silergy,fixed-microvolt - the voltage when I2C regulating is disabled (set + by external resistor like a fixed voltage) + +Any property defined as part of the core regulator binding, defined in +./regulator.txt, can also be used. + +Example: + + sy8106a { + compatible = "silergy,sy8106a"; + reg = <0x65>; + regulator-name = "sy8106a-vdd"; + silergy,fixed-microvolt = <1200000>; + regulator-min-microvolt = <1000000>; + regulator-max-microvolt = <1400000>; + regulator-boot-on; + regulator-always-on; + }; diff --git a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt b/Documentation/devicetree/bindings/rng/samsung,exynos4-rng.txt index a13fbdb4bd88..a13fbdb4bd88 100644 --- a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt +++ b/Documentation/devicetree/bindings/rng/samsung,exynos4-rng.txt diff --git a/Documentation/devicetree/bindings/sparc_sun_oracle_rng.txt b/Documentation/devicetree/bindings/rng/sparc_sun_oracle_rng.txt index b0b211194c71..b0b211194c71 100644 --- a/Documentation/devicetree/bindings/sparc_sun_oracle_rng.txt +++ b/Documentation/devicetree/bindings/rng/sparc_sun_oracle_rng.txt diff --git a/Documentation/devicetree/bindings/rtc/nxp,rtc-2123.txt b/Documentation/devicetree/bindings/rtc/nxp,rtc-2123.txt index 5cbc0b145a61..811124a36d16 100644 --- a/Documentation/devicetree/bindings/rtc/nxp,rtc-2123.txt +++ b/Documentation/devicetree/bindings/rtc/nxp,rtc-2123.txt @@ -9,7 +9,7 @@ Optional properties: Example: -rtc: nxp,rtc-pcf2123@3 { +pcf2123: rtc@3 { compatible = "nxp,rtc-pcf2123" reg = <3> spi-cs-high; diff --git a/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt b/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt index a66692a08ace..c920e2736991 100644 --- a/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt +++ b/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt @@ -1,23 +1,29 @@ STM32 Real Time Clock Required properties: -- compatible: can be either "st,stm32-rtc" or "st,stm32h7-rtc", depending on - the device is compatible with stm32(f4/f7) or stm32h7. +- compatible: can be one of the following: + - "st,stm32-rtc" for devices compatible with stm32(f4/f7). + - "st,stm32h7-rtc" for devices compatible with stm32h7. + - "st,stm32mp1-rtc" for devices compatible with stm32mp1. - reg: address range of rtc register set. - clocks: can use up to two clocks, depending on part used: - "rtc_ck": RTC clock source. - It is required on stm32(f4/f7) and stm32h7. - "pclk": RTC APB interface clock. It is not present on stm32(f4/f7). - It is required on stm32h7. + It is required on stm32(h7/mp1). - clock-names: must be "rtc_ck" and "pclk". - It is required only on stm32h7. + It is required on stm32(h7/mp1). - interrupt-parent: phandle for the interrupt controller. -- interrupts: rtc alarm interrupt. -- st,syscfg: phandle for pwrcfg, mandatory to disable/enable backup domain - (RTC registers) write protection. + It is required on stm32(f4/f7/h7). +- interrupts: rtc alarm interrupt. On stm32mp1, a second interrupt is required + for rtc alarm wakeup interrupt. +- st,syscfg: phandle/offset/mask triplet. The phandle to pwrcfg used to + access control register at offset, and change the dbp (Disable Backup + Protection) bit represented by the mask, mandatory to disable/enable backup + domain (RTC registers) write protection. + It is required on stm32(f4/f7/h7). -Optional properties (to override default rtc_ck parent clock): +Optional properties (to override default rtc_ck parent clock on stm32(f4/f7/h7): - assigned-clocks: reference to the rtc_ck clock entry. - assigned-clock-parents: phandle of the new parent clock of rtc_ck. @@ -31,7 +37,7 @@ Example: assigned-clock-parents = <&rcc 1 CLK_LSE>; interrupt-parent = <&exti>; interrupts = <17 1>; - st,syscfg = <&pwrcfg>; + st,syscfg = <&pwrcfg 0x00 0x100>; }; rtc: rtc@58004000 { @@ -44,5 +50,14 @@ Example: interrupt-parent = <&exti>; interrupts = <17 1>; interrupt-names = "alarm"; - st,syscfg = <&pwrcfg>; + st,syscfg = <&pwrcfg 0x00 0x100>; + }; + + rtc: rtc@5c004000 { + compatible = "st,stm32mp1-rtc"; + reg = <0x5c004000 0x400>; + clocks = <&rcc RTCAPB>, <&rcc RTC>; + clock-names = "pclk", "rtc_ck"; + interrupts-extended = <&intc GIC_SPI 3 IRQ_TYPE_NONE>, + <&exti 19 1>; }; diff --git a/Documentation/devicetree/bindings/serial/amlogic,meson-uart.txt b/Documentation/devicetree/bindings/serial/amlogic,meson-uart.txt index 8ff65fa632fd..c06c045126fc 100644 --- a/Documentation/devicetree/bindings/serial/amlogic,meson-uart.txt +++ b/Documentation/devicetree/bindings/serial/amlogic,meson-uart.txt @@ -21,7 +21,7 @@ Required properties: - interrupts : identifier to the device interrupt - clocks : a list of phandle + clock-specifier pairs, one for each entry in clock names. -- clocks-names : +- clock-names : * "xtal" for external xtal clock identifier * "pclk" for the bus core clock, either the clk81 clock or the gate clock * "baud" for the source of the baudrate generator, can be either the xtal diff --git a/Documentation/devicetree/bindings/serial/mvebu-uart.txt b/Documentation/devicetree/bindings/serial/mvebu-uart.txt index 2ae2fee7e023..b7e0e32b9ac6 100644 --- a/Documentation/devicetree/bindings/serial/mvebu-uart.txt +++ b/Documentation/devicetree/bindings/serial/mvebu-uart.txt @@ -24,7 +24,7 @@ Required properties: - Must contain two elements for the extended variant of the IP (marvell,armada-3700-uart-ext): "uart-tx" and "uart-rx", respectively the UART TX interrupt and the UART RX interrupt. A - corresponding interrupts-names property must be defined. + corresponding interrupt-names property must be defined. - For backward compatibility reasons, a single element interrupts property is also supported for the standard variant of the IP, containing only the UART sum interrupt. This form is deprecated diff --git a/Documentation/devicetree/bindings/serial/renesas,sci-serial.txt b/Documentation/devicetree/bindings/serial/renesas,sci-serial.txt index ad962f4ec3aa..106808b55b6d 100644 --- a/Documentation/devicetree/bindings/serial/renesas,sci-serial.txt +++ b/Documentation/devicetree/bindings/serial/renesas,sci-serial.txt @@ -17,6 +17,8 @@ Required properties: - "renesas,scifa-r8a7745" for R8A7745 (RZ/G1E) SCIFA compatible UART. - "renesas,scifb-r8a7745" for R8A7745 (RZ/G1E) SCIFB compatible UART. - "renesas,hscif-r8a7745" for R8A7745 (RZ/G1E) HSCIF compatible UART. + - "renesas,scif-r8a77470" for R8A77470 (RZ/G1C) SCIF compatible UART. + - "renesas,hscif-r8a77470" for R8A77470 (RZ/G1C) HSCIF compatible UART. - "renesas,scif-r8a7778" for R8A7778 (R-Car M1) SCIF compatible UART. - "renesas,scif-r8a7779" for R8A7779 (R-Car H1) SCIF compatible UART. - "renesas,scif-r8a7790" for R8A7790 (R-Car H2) SCIF compatible UART. @@ -41,6 +43,8 @@ Required properties: - "renesas,hscif-r8a7795" for R8A7795 (R-Car H3) HSCIF compatible UART. - "renesas,scif-r8a7796" for R8A7796 (R-Car M3-W) SCIF compatible UART. - "renesas,hscif-r8a7796" for R8A7796 (R-Car M3-W) HSCIF compatible UART. + - "renesas,scif-r8a77965" for R8A77965 (R-Car M3-N) SCIF compatible UART. + - "renesas,hscif-r8a77965" for R8A77965 (R-Car M3-N) HSCIF compatible UART. - "renesas,scif-r8a77970" for R8A77970 (R-Car V3M) SCIF compatible UART. - "renesas,hscif-r8a77970" for R8A77970 (R-Car V3M) HSCIF compatible UART. - "renesas,scif-r8a77980" for R8A77980 (R-Car V3H) SCIF compatible UART. diff --git a/Documentation/devicetree/bindings/soc/qcom/qcom,apr.txt b/Documentation/devicetree/bindings/soc/qcom/qcom,apr.txt new file mode 100644 index 000000000000..bcc612cc7423 --- /dev/null +++ b/Documentation/devicetree/bindings/soc/qcom/qcom,apr.txt @@ -0,0 +1,84 @@ +Qualcomm APR (Asynchronous Packet Router) binding + +This binding describes the Qualcomm APR. APR is a IPC protocol for +communication between Application processor and QDSP. APR is mainly +used for audio/voice services on the QDSP. + +- compatible: + Usage: required + Value type: <stringlist> + Definition: must be "qcom,apr-v<VERSION-NUMBER>", example "qcom,apr-v2" + +- reg + Usage: required + Value type: <u32> + Definition: Destination processor ID. + Possible values are : + 1 - APR simulator + 2 - PC + 3 - MODEM + 4 - ADSP + 5 - APPS + 6 - MODEM2 + 7 - APPS2 + += APR SERVICES +Each subnode of the APR node represents service tied to this apr. The name +of the nodes are not important. The properties of these nodes are defined +by the individual bindings for the specific service +- All APR services MUST contain the following property: + +- reg + Usage: required + Value type: <u32> + Definition: APR Service ID + Possible values are : + 3 - DSP Core Service + 4 - Audio Front End Service. + 5 - Voice Stream Manager Service. + 6 - Voice processing manager. + 7 - Audio Stream Manager Service. + 8 - Audio Device Manager Service. + 9 - Multimode voice manager. + 10 - Core voice stream. + 11 - Core voice processor. + 12 - Ultrasound stream manager. + 13 - Listen stream manager. + += EXAMPLE +The following example represents a QDSP based sound card on a MSM8996 device +which uses apr as communication between Apps and QDSP. + + apr@4 { + compatible = "qcom,apr-v2"; + reg = <APR_DOMAIN_ADSP>; + + q6core@3 { + compatible = "qcom,q6core"; + reg = <APR_SVC_ADSP_CORE>; + }; + + q6afe@4 { + compatible = "qcom,q6afe"; + reg = <APR_SVC_AFE>; + + dais { + #sound-dai-cells = <1>; + hdmi@1 { + reg = <1>; + }; + }; + }; + + q6asm@7 { + compatible = "qcom,q6asm"; + reg = <APR_SVC_ASM>; + ... + }; + + q6adm@8 { + compatible = "qcom,q6adm"; + reg = <APR_SVC_ADM>; + ... + }; + }; diff --git a/Documentation/devicetree/bindings/soc/ti/keystone-navigator-qmss.txt b/Documentation/devicetree/bindings/soc/ti/keystone-navigator-qmss.txt index 77cd42cc5f54..b025770eeb92 100644 --- a/Documentation/devicetree/bindings/soc/ti/keystone-navigator-qmss.txt +++ b/Documentation/devicetree/bindings/soc/ti/keystone-navigator-qmss.txt @@ -17,7 +17,8 @@ pool management. Required properties: -- compatible : Must be "ti,keystone-navigator-qmss"; +- compatible : Must be "ti,keystone-navigator-qmss". + : Must be "ti,66ak2g-navss-qm" for QMSS on K2G SoC. - clocks : phandle to the reference clock for this device. - queue-range : <start number> total range of queue numbers for the device. - linkram0 : <address size> for internal link ram, where size is the total @@ -39,6 +40,12 @@ Required properties: - Descriptor memory setup region. - Queue Management/Queue Proxy region for queue Push. - Queue Management/Queue Proxy region for queue Pop. + +For QMSS on K2G SoC, following QM reg indexes are used in that order + - Queue Peek region. + - Queue configuration region. + - Queue Management/Queue Proxy region for queue Push/Pop. + - queue-pools : child node classifying the queue ranges into pools. Queue ranges are grouped into 3 type of pools: - qpend : pool of qpend(interruptible) queues diff --git a/Documentation/devicetree/bindings/sound/adi,ssm2305.txt b/Documentation/devicetree/bindings/sound/adi,ssm2305.txt new file mode 100644 index 000000000000..a9c9d83c8a30 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/adi,ssm2305.txt @@ -0,0 +1,14 @@ +Analog Devices SSM2305 Speaker Amplifier +======================================== + +Required properties: + - compatible : "adi,ssm2305" + - shutdown-gpios : The gpio connected to the shutdown pin. + The gpio signal is ACTIVE_LOW. + +Example: + +ssm2305: analog-amplifier { + compatible = "adi,ssm2305"; + shutdown-gpios = <&gpio3 20 GPIO_ACTIVE_LOW>; +}; diff --git a/Documentation/devicetree/bindings/sound/atmel-i2s.txt b/Documentation/devicetree/bindings/sound/atmel-i2s.txt new file mode 100644 index 000000000000..735368b8a73f --- /dev/null +++ b/Documentation/devicetree/bindings/sound/atmel-i2s.txt @@ -0,0 +1,47 @@ +* Atmel I2S controller + +Required properties: +- compatible: Should be "atmel,sama5d2-i2s". +- reg: Should be the physical base address of the controller and the + length of memory mapped region. +- interrupts: Should contain the interrupt for the controller. +- dmas: Should be one per channel name listed in the dma-names property, + as described in atmel-dma.txt and dma.txt files. +- dma-names: Two dmas have to be defined, "tx" and "rx". + This IP also supports one shared channel for both rx and tx; + if this mode is used, one "rx-tx" name must be used. +- clocks: Must contain an entry for each entry in clock-names. + Please refer to clock-bindings.txt. +- clock-names: Should be one of each entry matching the clocks phandles list: + - "pclk" (peripheral clock) Required. + - "gclk" (generated clock) Optional (1). + - "aclk" (Audio PLL clock) Optional (1). + - "muxclk" (I2S mux clock) Optional (1). + +Optional properties: +- pinctrl-0: Should specify pin control groups used for this controller. +- princtrl-names: Should contain only one value - "default". + + +(1) : Only the peripheral clock is required. The generated clock, the Audio + PLL clock adn the I2S mux clock are optional and should only be set + together, when Master Mode is required. + +Example: + + i2s@f8050000 { + compatible = "atmel,sama5d2-i2s"; + reg = <0xf8050000 0x300>; + interrupts = <54 IRQ_TYPE_LEVEL_HIGH 7>; + dmas = <&dma0 + (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) | + AT91_XDMAC_DT_PERID(31))>, + <&dma0 + (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) | + AT91_XDMAC_DT_PERID(32))>; + dma-names = "tx", "rx"; + clocks = <&i2s0_clk>, <&i2s0_gclk>, <&audio_pll_pmc>, <&i2s0muxck>; + clock-names = "pclk", "gclk", "aclk", "muxclk"; + pinctrl-names = "default"; + pinctrl-0 = <&pinctrl_i2s0_default>; + }; diff --git a/Documentation/devicetree/bindings/sound/cs42xx8.txt b/Documentation/devicetree/bindings/sound/cs42xx8.txt index f631fbca6284..8619a156d038 100644 --- a/Documentation/devicetree/bindings/sound/cs42xx8.txt +++ b/Documentation/devicetree/bindings/sound/cs42xx8.txt @@ -16,7 +16,7 @@ Required properties: Example: -codec: cs42888@48 { +cs42888: codec@48 { compatible = "cirrus,cs42888"; reg = <0x48>; clocks = <&codec_mclk 0>; diff --git a/Documentation/devicetree/bindings/sound/fsl,asrc.txt b/Documentation/devicetree/bindings/sound/fsl,asrc.txt index f5a14115b459..1d4d9f938689 100644 --- a/Documentation/devicetree/bindings/sound/fsl,asrc.txt +++ b/Documentation/devicetree/bindings/sound/fsl,asrc.txt @@ -31,14 +31,16 @@ Required properties: it. This property is optional depending on the SoC design. - - big-endian : If this property is absent, the little endian mode - will be in use as default. Otherwise, the big endian - mode will be in use for all the device registers. - - fsl,asrc-rate : Defines a mutual sample rate used by DPCM Back Ends. - fsl,asrc-width : Defines a mutual sample width used by DPCM Back Ends. +Optional properties: + + - big-endian : If this property is absent, the little endian mode + will be in use as default. Otherwise, the big endian + mode will be in use for all the device registers. + Example: asrc: asrc@2034000 { diff --git a/Documentation/devicetree/bindings/sound/fsl,esai.txt b/Documentation/devicetree/bindings/sound/fsl,esai.txt index cacd18bb9ba6..5b9914367610 100644 --- a/Documentation/devicetree/bindings/sound/fsl,esai.txt +++ b/Documentation/devicetree/bindings/sound/fsl,esai.txt @@ -42,6 +42,8 @@ Required properties: means all the settings for Receiving would be duplicated from Transmition related registers. +Optional properties: + - big-endian : If this property is absent, the native endian mode will be in use as default, or the big endian mode will be in use for all the device registers. diff --git a/Documentation/devicetree/bindings/sound/fsl,spdif.txt b/Documentation/devicetree/bindings/sound/fsl,spdif.txt index 38cfa7573441..8b324f82a782 100644 --- a/Documentation/devicetree/bindings/sound/fsl,spdif.txt +++ b/Documentation/devicetree/bindings/sound/fsl,spdif.txt @@ -33,6 +33,8 @@ Required properties: it. This property is optional depending on the SoC design. +Optional properties: + - big-endian : If this property is absent, the native endian mode will be in use as default, or the big endian mode will be in use for all the device registers. diff --git a/Documentation/devicetree/bindings/sound/fsl-sai.txt b/Documentation/devicetree/bindings/sound/fsl-sai.txt index 740b467adf7d..dd9e59738e08 100644 --- a/Documentation/devicetree/bindings/sound/fsl-sai.txt +++ b/Documentation/devicetree/bindings/sound/fsl-sai.txt @@ -28,9 +28,6 @@ Required properties: pinctrl-names. See ../pinctrl/pinctrl-bindings.txt for details of the property values. - - big-endian : Boolean property, required if all the FTM_PWM - registers are big-endian rather than little-endian. - - lsb-first : Configures whether the LSB or the MSB is transmitted first for the fifo data. If this property is absent, the MSB is transmitted first as default, or the LSB @@ -48,6 +45,11 @@ Required properties: receive data by following their own bit clocks and frame sync clocks separately. +Optional properties: + + - big-endian : Boolean property, required if all the SAI + registers are big-endian rather than little-endian. + Optional properties (for mx6ul): - fsl,sai-mclk-direction-output: This is a boolean property. If present, diff --git a/Documentation/devicetree/bindings/sound/mt2701-afe-pcm.txt b/Documentation/devicetree/bindings/sound/mt2701-afe-pcm.txt index e2f7f4951215..560762e0a168 100644 --- a/Documentation/devicetree/bindings/sound/mt2701-afe-pcm.txt +++ b/Documentation/devicetree/bindings/sound/mt2701-afe-pcm.txt @@ -1,7 +1,9 @@ Mediatek AFE PCM controller for mt2701 Required properties: -- compatible = "mediatek,mt2701-audio"; +- compatible: should be one of the followings. + - "mediatek,mt2701-audio" + - "mediatek,mt7622-audio" - interrupts: should contain AFE and ASYS interrupts - interrupt-names: should be "afe" and "asys" - power-domains: should define the power domain diff --git a/Documentation/devicetree/bindings/sound/mt6351.txt b/Documentation/devicetree/bindings/sound/mt6351.txt new file mode 100644 index 000000000000..7fb2cb99245e --- /dev/null +++ b/Documentation/devicetree/bindings/sound/mt6351.txt @@ -0,0 +1,16 @@ +Mediatek MT6351 Audio Codec + +The communication between MT6351 and SoC is through Mediatek PMIC wrapper. +For more detail, please visit Mediatek PMIC wrapper documentation. + +Must be a child node of PMIC wrapper. + +Required properties: + +- compatible : "mediatek,mt6351-sound". + +Example: + +mt6351_snd { + compatible = "mediatek,mt6351-sound"; +}; diff --git a/Documentation/devicetree/bindings/sound/mt6797-afe-pcm.txt b/Documentation/devicetree/bindings/sound/mt6797-afe-pcm.txt new file mode 100644 index 000000000000..0ae29de15bfd --- /dev/null +++ b/Documentation/devicetree/bindings/sound/mt6797-afe-pcm.txt @@ -0,0 +1,42 @@ +Mediatek AFE PCM controller for mt6797 + +Required properties: +- compatible = "mediatek,mt6797-audio"; +- reg: register location and size +- interrupts: should contain AFE interrupt +- power-domains: should define the power domain +- clocks: Must contain an entry for each entry in clock-names +- clock-names: should have these clock names: + "infra_sys_audio_clk", + "infra_sys_audio_26m", + "mtkaif_26m_clk", + "top_mux_audio", + "top_mux_aud_intbus", + "top_sys_pll3_d4", + "top_sys_pll1_d4", + "top_clk26m_clk"; + +Example: + + afe: mt6797-afe-pcm@11220000 { + compatible = "mediatek,mt6797-audio"; + reg = <0 0x11220000 0 0x1000>; + interrupts = <GIC_SPI 151 IRQ_TYPE_LEVEL_LOW>; + power-domains = <&scpsys MT6797_POWER_DOMAIN_AUDIO>; + clocks = <&infrasys CLK_INFRA_AUDIO>, + <&infrasys CLK_INFRA_AUDIO_26M>, + <&infrasys CLK_INFRA_AUDIO_26M_PAD_TOP>, + <&topckgen CLK_TOP_MUX_AUDIO>, + <&topckgen CLK_TOP_MUX_AUD_INTBUS>, + <&topckgen CLK_TOP_SYSPLL3_D4>, + <&topckgen CLK_TOP_SYSPLL1_D4>, + <&clk26m>; + clock-names = "infra_sys_audio_clk", + "infra_sys_audio_26m", + "mtkaif_26m_clk", + "top_mux_audio", + "top_mux_aud_intbus", + "top_sys_pll3_d4", + "top_sys_pll1_d4", + "top_clk26m_clk"; + }; diff --git a/Documentation/devicetree/bindings/sound/mt6797-mt6351.txt b/Documentation/devicetree/bindings/sound/mt6797-mt6351.txt new file mode 100644 index 000000000000..1d95a8840f19 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/mt6797-mt6351.txt @@ -0,0 +1,14 @@ +MT6797 with MT6351 CODEC + +Required properties: +- compatible: "mediatek,mt6797-mt6351-sound" +- mediatek,platform: the phandle of MT6797 ASoC platform +- mediatek,audio-codec: the phandles of MT6351 codec + +Example: + + sound { + compatible = "mediatek,mt6797-mt6351-sound"; + mediatek,audio-codec = <&mt6351_snd>; + mediatek,platform = <&afe>; + }; diff --git a/Documentation/devicetree/bindings/sound/qcom,apq8096.txt b/Documentation/devicetree/bindings/sound/qcom,apq8096.txt new file mode 100644 index 000000000000..aa54e49fc8a2 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/qcom,apq8096.txt @@ -0,0 +1,109 @@ +* Qualcomm Technologies APQ8096 ASoC sound card driver + +This binding describes the APQ8096 sound card, which uses qdsp for audio. + +- compatible: + Usage: required + Value type: <stringlist> + Definition: must be "qcom,apq8096-sndcard" + +- qcom,audio-routing: + Usage: Optional + Value type: <stringlist> + Definition: A list of the connections between audio components. + Each entry is a pair of strings, the first being the + connection's sink, the second being the connection's + source. Valid names could be power supplies, MicBias + of codec and the jacks on the board: + Valid names include: + + Board Connectors: + "Headphone Left" + "Headphone Right" + "Earphone" + "Line Out1" + "Line Out2" + "Line Out3" + "Line Out4" + "Analog Mic1" + "Analog Mic2" + "Analog Mic3" + "Analog Mic4" + "Analog Mic5" + "Analog Mic6" + "Digital Mic2" + "Digital Mic3" + + Audio pins and MicBias on WCD9335 Codec: + "MIC_BIAS1 + "MIC_BIAS2" + "MIC_BIAS3" + "MIC_BIAS4" + "AMIC1" + "AMIC2" + "AMIC3" + "AMIC4" + "AMIC5" + "AMIC6" + "AMIC6" + "DMIC1" + "DMIC2" + "DMIC3" += dailinks +Each subnode of sndcard represents either a dailink, and subnodes of each +dailinks would be cpu/codec/platform dais. + +- link-name: + Usage: required + Value type: <string> + Definition: User friendly name for dai link + += CPU, PLATFORM, CODEC dais subnodes +- cpu: + Usage: required + Value type: <subnode> + Definition: cpu dai sub-node + +- codec: + Usage: Optional + Value type: <subnode> + Definition: codec dai sub-node + +- platform: + Usage: Optional + Value type: <subnode> + Definition: platform dai sub-node + +- sound-dai: + Usage: required + Value type: <phandle with arguments> + Definition: dai phandle/s and port of CPU/CODEC/PLATFORM node. + +Example: + +audio { + compatible = "qcom,apq8096-sndcard"; + qcom,model = "DB820c"; + + mm1-dai-link { + link-name = "MultiMedia1"; + cpu { + sound-dai = <&q6asmdai MSM_FRONTEND_DAI_MULTIMEDIA1>; + }; + }; + + hdmi-dai-link { + link-name = "HDMI Playback"; + cpu { + sound-dai = <&q6afe HDMI_RX>; + }; + + platform { + sound-dai = <&q6adm>; + }; + + codec { + sound-dai = <&hdmi 0>; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/sound/qcom,q6adm.txt b/Documentation/devicetree/bindings/sound/qcom,q6adm.txt new file mode 100644 index 000000000000..cb709e5dbc44 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/qcom,q6adm.txt @@ -0,0 +1,33 @@ +Qualcomm Audio Device Manager (Q6ADM) binding + +Q6ADM is one of the APR audio service on Q6DSP. +Please refer to qcom,apr.txt for details of the coommon apr service bindings +used by the apr service device. + +- but must contain the following property: + +- compatible: + Usage: required + Value type: <stringlist> + Definition: must be "qcom,q6adm-v<MAJOR-NUMBER>.<MINOR-NUMBER>". + Or "qcom,q6adm" where the version number can be queried + from DSP. + example "qcom,q6adm-v2.0" + + += ADM routing +"routing" subnode of the ADM node represents adm routing specific configuration + +- #sound-dai-cells + Usage: required + Value type: <u32> + Definition: Must be 0 + += EXAMPLE +q6adm@8 { + compatible = "qcom,q6adm"; + reg = <APR_SVC_ADM>; + q6routing: routing { + #sound-dai-cells = <0>; + }; +}; diff --git a/Documentation/devicetree/bindings/sound/qcom,q6afe.txt b/Documentation/devicetree/bindings/sound/qcom,q6afe.txt new file mode 100644 index 000000000000..bdbf87df8c0b --- /dev/null +++ b/Documentation/devicetree/bindings/sound/qcom,q6afe.txt @@ -0,0 +1,172 @@ +Qualcomm Audio Front End (Q6AFE) binding + +AFE is one of the APR audio service on Q6DSP +Please refer to qcom,apr.txt for details of the common apr service bindings +used by all apr services. Must contain the following properties. + +- compatible: + Usage: required + Value type: <stringlist> + Definition: must be "qcom,q6afe-v<MAJOR-NUMBER>.<MINOR-NUMBER>" + Or "qcom,q6afe" where the version number can be queried + from DSP. + example "qcom,q6afe" + += AFE DAIs (Digial Audio Interface) +"dais" subnode of the AFE node. It represents afe dais, each afe dai is a +subnode of "dais" representing board specific dai setup. +"dais" node should have following properties followed by dai children. + +- #sound-dai-cells + Usage: required + Value type: <u32> + Definition: Must be 1 + +- #address-cells + Usage: required + Value type: <u32> + Definition: Must be 1 + +- #size-cells + Usage: required + Value type: <u32> + Definition: Must be 0 + +== AFE DAI is subnode of "dais" and represent a dai, it includes board specific +configuration of each dai. Must contain the following properties. + +- reg + Usage: required + Value type: <u32> + Definition: Must be dai id + +- qcom,sd-lines + Usage: required for mi2s interface + Value type: <prop-encoded-array> + Definition: Must be list of serial data lines used by this dai. + should be one or more of the 1-4 sd lines. + + - qcom,tdm-sync-mode: + Usage: required for tdm interface + Value type: <prop-encoded-array> + Definition: Synchronization mode. + 0 - Short sync bit mode + 1 - Long sync mode + 2 - Short sync slot mode + + - qcom,tdm-sync-src: + Usage: required for tdm interface + Value type: <prop-encoded-array> + Definition: Synchronization source. + 0 - External source + 1 - Internal source + + - qcom,tdm-data-out: + Usage: required for tdm interface + Value type: <prop-encoded-array> + Definition: Data out signal to drive with other masters. + 0 - Disable + 1 - Enable + + - qcom,tdm-invert-sync: + Usage: required for tdm interface + Value type: <prop-encoded-array> + Definition: Invert the sync. + 0 - Normal + 1 - Invert + + - qcom,tdm-data-delay: + Usage: required for tdm interface + Value type: <prop-encoded-array> + Definition: Number of bit clock to delay data + with respect to sync edge. + 0 - 0 bit clock cycle + 1 - 1 bit clock cycle + 2 - 2 bit clock cycle + + - qcom,tdm-data-align: + Usage: required for tdm interface + Value type: <prop-encoded-array> + Definition: Indicate how data is packed + within the slot. For example, 32 slot width in case of + sample bit width is 24. + 0 - MSB + 1 - LSB + += EXAMPLE + +q6afe@4 { + compatible = "qcom,q6afe"; + reg = <APR_SVC_AFE>; + + dais { + #sound-dai-cells = <1>; + #address-cells = <1>; + #size-cells = <0>; + + hdmi@1 { + reg = <1>; + }; + + tdm@24 { + reg = <24>; + qcom,tdm-sync-mode = <1>: + qcom,tdm-sync-src = <1>; + qcom,tdm-data-out = <0>; + qcom,tdm-invert-sync = <1>; + qcom,tdm-data-delay = <1>; + qcom,tdm-data-align = <0>; + + }; + + tdm@25 { + reg = <25>; + qcom,tdm-sync-mode = <1>: + qcom,tdm-sync-src = <1>; + qcom,tdm-data-out = <0>; + qcom,tdm-invert-sync = <1>; + qcom,tdm-data-delay <1>: + qcom,tdm-data-align = <0>; + }; + + prim-mi2s-rx@16 { + reg = <16>; + qcom,sd-lines = <1 3>; + }; + + prim-mi2s-tx@17 { + reg = <17>; + qcom,sd-lines = <2>; + }; + + sec-mi2s-rx@18 { + reg = <18>; + qcom,sd-lines = <1 4>; + }; + + sec-mi2s-tx@19 { + reg = <19>; + qcom,sd-lines = <2>; + }; + + tert-mi2s-rx@20 { + reg = <20>; + qcom,sd-lines = <2 4>; + }; + + tert-mi2s-tx@21 { + reg = <21>; + qcom,sd-lines = <1>; + }; + + quat-mi2s-rx@22 { + reg = <22>; + qcom,sd-lines = <1>; + }; + + quat-mi2s-tx@23 { + reg = <23>; + qcom,sd-lines = <2>; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/sound/qcom,q6asm.txt b/Documentation/devicetree/bindings/sound/qcom,q6asm.txt new file mode 100644 index 000000000000..2178eb91146f --- /dev/null +++ b/Documentation/devicetree/bindings/sound/qcom,q6asm.txt @@ -0,0 +1,33 @@ +Qualcomm Audio Stream Manager (Q6ASM) binding + +Q6ASM is one of the APR audio service on Q6DSP. +Please refer to qcom,apr.txt for details of the common apr service bindings +used by the apr service device. + +- but must contain the following property: + +- compatible: + Usage: required + Value type: <stringlist> + Definition: must be "qcom,q6asm-v<MAJOR-NUMBER>.<MINOR-NUMBER>". + Or "qcom,q6asm" where the version number can be queried + from DSP. + example "qcom,q6asm-v2.0" + += ASM DAIs (Digial Audio Interface) +"dais" subnode of the ASM node represents dai specific configuration + +- #sound-dai-cells + Usage: required + Value type: <u32> + Definition: Must be 1 + += EXAMPLE + +q6asm@7 { + compatible = "qcom,q6asm"; + reg = <APR_SVC_ASM>; + q6asmdai: dais { + #sound-dai-cells = <1>; + }; +}; diff --git a/Documentation/devicetree/bindings/sound/qcom,q6core.txt b/Documentation/devicetree/bindings/sound/qcom,q6core.txt new file mode 100644 index 000000000000..7f36ff8bec18 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/qcom,q6core.txt @@ -0,0 +1,21 @@ +Qualcomm ADSP Core service binding + +Q6CORE is one of the APR audio service on Q6DSP. +Please refer to qcom,apr.txt for details of the common apr service bindings +used by the apr service device. + +- but must contain the following property: + +- compatible: + Usage: required + Value type: <stringlist> + Definition: must be "qcom,q6core-v<MAJOR-NUMBER>.<MINOR-NUMBER>". + Or "qcom,q6core" where the version number can be queried + from DSP. + example "qcom,q6core-v2.0" + += EXAMPLE +q6core@3 { + compatible = "qcom,q6core"; + reg = <APR_SVC_ADSP_CORE>; +}; diff --git a/Documentation/devicetree/bindings/sound/rt274.txt b/Documentation/devicetree/bindings/sound/rt274.txt index e9a6178c78cf..791a1bd767b9 100644 --- a/Documentation/devicetree/bindings/sound/rt274.txt +++ b/Documentation/devicetree/bindings/sound/rt274.txt @@ -26,7 +26,7 @@ Pins on the device (for linking into audio routes) for RT274: Example: -codec: rt274@1c { +rt274: codec@1c { compatible = "realtek,rt274"; reg = <0x1c>; interrupts = <7 IRQ_TYPE_EDGE_FALLING>; diff --git a/Documentation/devicetree/bindings/sound/rt5514.txt b/Documentation/devicetree/bindings/sound/rt5514.txt index 4f33b0d96afe..b25ed08c7a5a 100644 --- a/Documentation/devicetree/bindings/sound/rt5514.txt +++ b/Documentation/devicetree/bindings/sound/rt5514.txt @@ -32,7 +32,7 @@ Pins on the device (for linking into audio routes) for I2C: Example: -codec: rt5514@57 { +rt5514: codec@57 { compatible = "realtek,rt5514"; reg = <0x57>; }; diff --git a/Documentation/devicetree/bindings/sound/rt5616.txt b/Documentation/devicetree/bindings/sound/rt5616.txt index e41085818559..540a4bf252e4 100644 --- a/Documentation/devicetree/bindings/sound/rt5616.txt +++ b/Documentation/devicetree/bindings/sound/rt5616.txt @@ -26,7 +26,7 @@ Pins on the device (for linking into audio routes) for RT5616: Example: -codec: rt5616@1b { +rt5616: codec@1b { compatible = "realtek,rt5616"; reg = <0x1b>; }; diff --git a/Documentation/devicetree/bindings/sound/rt5640.txt b/Documentation/devicetree/bindings/sound/rt5640.txt index 57fe64643050..e40e4893eed8 100644 --- a/Documentation/devicetree/bindings/sound/rt5640.txt +++ b/Documentation/devicetree/bindings/sound/rt5640.txt @@ -22,6 +22,41 @@ Optional properties: - realtek,ldo1-en-gpios : The GPIO that controls the CODEC's LDO1_EN pin. +- realtek,dmic1-data-pin + 0: dmic1 is not used + 1: using IN1P pin as dmic1 data pin + 2: using GPIO3 pin as dmic1 data pin + +- realtek,dmic2-data-pin + 0: dmic2 is not used + 1: using IN1N pin as dmic2 data pin + 2: using GPIO4 pin as dmic2 data pin + +- realtek,jack-detect-source + u32. Valid values: + 0: jack-detect is not used + 1: Use GPIO1 for jack-detect + 2: Use JD1_IN4P for jack-detect + 3: Use JD2_IN4N for jack-detect + 4: Use GPIO2 for jack-detect + 5: Use GPIO3 for jack-detect + 6: Use GPIO4 for jack-detect + +- realtek,jack-detect-not-inverted + bool. Normal jack-detect switches give an inverted signal, set this bool + in the rare case you've a jack-detect switch which is not inverted. + +- realtek,over-current-threshold-microamp + u32, micbias over-current detection threshold in µA, valid values are + 600, 1500 and 2000µA. + +- realtek,over-current-scale-factor + u32, micbias over-current detection scale-factor, valid values are: + 0: Scale current by 0.5 + 1: Scale current by 0.75 + 2: Scale current by 1.0 + 3: Scale current by 1.5 + Pins on the device (for linking into audio routes) for RT5639/RT5640: * DMIC1 diff --git a/Documentation/devicetree/bindings/sound/rt5645.txt b/Documentation/devicetree/bindings/sound/rt5645.txt index 7cee1f518f59..a03f9a872a71 100644 --- a/Documentation/devicetree/bindings/sound/rt5645.txt +++ b/Documentation/devicetree/bindings/sound/rt5645.txt @@ -69,4 +69,4 @@ codec: rt5650@1a { realtek,dmic-en = "true"; realtek,en-jd-func = "true"; realtek,jd-mode = <3>; -};
\ No newline at end of file +}; diff --git a/Documentation/devicetree/bindings/sound/rt5651.txt b/Documentation/devicetree/bindings/sound/rt5651.txt index b85221864cec..a41199a5cd79 100644 --- a/Documentation/devicetree/bindings/sound/rt5651.txt +++ b/Documentation/devicetree/bindings/sound/rt5651.txt @@ -50,7 +50,7 @@ Pins on the device (for linking into audio routes) for RT5651: Example: -codec: rt5651@1a { +rt5651: codec@1a { compatible = "realtek,rt5651"; reg = <0x1a>; realtek,dmic-en = "true"; diff --git a/Documentation/devicetree/bindings/sound/rt5663.txt b/Documentation/devicetree/bindings/sound/rt5663.txt index 497bcfc58b71..23386446c63d 100644 --- a/Documentation/devicetree/bindings/sound/rt5663.txt +++ b/Documentation/devicetree/bindings/sound/rt5663.txt @@ -47,7 +47,7 @@ Pins on the device (for linking into audio routes) for RT5663: Example: -codec: rt5663@12 { +rt5663: codec@12 { compatible = "realtek,rt5663"; reg = <0x12>; interrupts = <7 IRQ_TYPE_EDGE_FALLING>; diff --git a/Documentation/devicetree/bindings/sound/rt5668.txt b/Documentation/devicetree/bindings/sound/rt5668.txt new file mode 100644 index 000000000000..c88b96e7764b --- /dev/null +++ b/Documentation/devicetree/bindings/sound/rt5668.txt @@ -0,0 +1,50 @@ +RT5668B audio CODEC + +This device supports I2C only. + +Required properties: + +- compatible : "realtek,rt5668b" + +- reg : The I2C address of the device. + +Optional properties: + +- interrupts : The CODEC's interrupt output. + +- realtek,dmic1-data-pin + 0: dmic1 is not used + 1: using GPIO2 pin as dmic1 data pin + 2: using GPIO5 pin as dmic1 data pin + +- realtek,dmic1-clk-pin + 0: using GPIO1 pin as dmic1 clock pin + 1: using GPIO3 pin as dmic1 clock pin + +- realtek,jd-src + 0: No JD is used + 1: using JD1 as JD source + +- realtek,ldo1-en-gpios : The GPIO that controls the CODEC's LDO1_EN pin. + +Pins on the device (for linking into audio routes) for RT5668B: + + * DMIC L1 + * DMIC R1 + * IN1P + * HPOL + * HPOR + +Example: + +rt5668 { + compatible = "realtek,rt5668b"; + reg = <0x1a>; + interrupt-parent = <&gpio>; + interrupts = <TEGRA_GPIO(U, 6) GPIO_ACTIVE_HIGH>; + realtek,ldo1-en-gpios = + <&gpio TEGRA_GPIO(R, 2) GPIO_ACTIVE_HIGH>; + realtek,dmic1-data-pin = <1>; + realtek,dmic1-clk-pin = <1>; + realtek,jd-src = <1>; +}; diff --git a/Documentation/devicetree/bindings/sound/sgtl5000.txt b/Documentation/devicetree/bindings/sound/sgtl5000.txt index 9a36c7e2a143..0f214457476f 100644 --- a/Documentation/devicetree/bindings/sound/sgtl5000.txt +++ b/Documentation/devicetree/bindings/sound/sgtl5000.txt @@ -39,7 +39,7 @@ VDDIO 1.8V 2.5V 3.3V Example: -codec: sgtl5000@a { +sgtl5000: codec@a { compatible = "fsl,sgtl5000"; reg = <0x0a>; #sound-dai-cells = <0>; diff --git a/Documentation/devicetree/bindings/sound/simple-card.txt b/Documentation/devicetree/bindings/sound/simple-card.txt index 17c13e74667d..a4c72d09cd45 100644 --- a/Documentation/devicetree/bindings/sound/simple-card.txt +++ b/Documentation/devicetree/bindings/sound/simple-card.txt @@ -86,6 +86,11 @@ Optional CPU/CODEC subnodes properties: in dai startup() and disabled with clk_disable_unprepare() in dai shutdown(). + If a clock is specified and a + multiplication factor is given with + mclk-fs, the clock will be set to the + calculated mclk frequency when the + stream starts. - system-clock-direction-out : specifies clock direction as 'out' on initialization. It is useful for some aCPUs with fixed clocks. diff --git a/Documentation/devicetree/bindings/sound/ti,tas6424.txt b/Documentation/devicetree/bindings/sound/ti,tas6424.txt index 1c4ada0eef4e..eacb54f34188 100644 --- a/Documentation/devicetree/bindings/sound/ti,tas6424.txt +++ b/Documentation/devicetree/bindings/sound/ti,tas6424.txt @@ -6,6 +6,8 @@ Required properties: - compatible: "ti,tas6424" - TAS6424 - reg: I2C slave address - sound-dai-cells: must be equal to 0 + - standby-gpios: GPIO used to shut the TAS6424 down. + - mute-gpios: GPIO used to mute all the outputs Example: diff --git a/Documentation/devicetree/bindings/sound/tscs42xx.txt b/Documentation/devicetree/bindings/sound/tscs42xx.txt index 2ac2f0996697..7eea32e9d078 100644 --- a/Documentation/devicetree/bindings/sound/tscs42xx.txt +++ b/Documentation/devicetree/bindings/sound/tscs42xx.txt @@ -8,9 +8,15 @@ Required Properties: - reg : <0x71> for analog mic <0x69> for digital mic + - clock-names: Must one of the following "mclk1", "xtal", "mclk2" + + - clocks: phandle of the clock that provides the codec sysclk + Example: wookie: codec@69 { compatible = "tempo,tscs42A2"; reg = <0x69>; + clock-names = "xtal"; + clocks = <&audio_xtal>; }; diff --git a/Documentation/devicetree/bindings/sound/tscs454.txt b/Documentation/devicetree/bindings/sound/tscs454.txt new file mode 100644 index 000000000000..3ba3e2d2c206 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/tscs454.txt @@ -0,0 +1,23 @@ +TSCS454 Audio CODEC + +Required Properties: + + - compatible : "tempo,tscs454" + + - reg : <0x69> + + - clock-names: Must one of the following "xtal", "mclk1", "mclk2" + + - clocks: phandle of the clock that provides the codec sysclk + + Note: If clock is not provided then bit clock is assumed + +Example: + +redwood: codec@69 { + #sound-dai-cells = <1>; + compatible = "tempo,tscs454"; + reg = <0x69>; + clock-names = "mclk1"; + clocks = <&audio_mclk>; +}; diff --git a/Documentation/devicetree/bindings/sound/wm8510.txt b/Documentation/devicetree/bindings/sound/wm8510.txt index fa1a32b85577..e6b6cc041f89 100644 --- a/Documentation/devicetree/bindings/sound/wm8510.txt +++ b/Documentation/devicetree/bindings/sound/wm8510.txt @@ -12,7 +12,7 @@ Required properties: Example: -codec: wm8510@1a { +wm8510: codec@1a { compatible = "wlf,wm8510"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8523.txt b/Documentation/devicetree/bindings/sound/wm8523.txt index 04746186b283..f3a6485f4b8a 100644 --- a/Documentation/devicetree/bindings/sound/wm8523.txt +++ b/Documentation/devicetree/bindings/sound/wm8523.txt @@ -10,7 +10,7 @@ Required properties: Example: -codec: wm8523@1a { +wm8523: codec@1a { compatible = "wlf,wm8523"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8524.txt b/Documentation/devicetree/bindings/sound/wm8524.txt index 0f0553563fc1..f6c0c263b135 100644 --- a/Documentation/devicetree/bindings/sound/wm8524.txt +++ b/Documentation/devicetree/bindings/sound/wm8524.txt @@ -10,7 +10,7 @@ Required properties: Example: -codec: wm8524 { +wm8524: codec { compatible = "wlf,wm8524"; wlf,mute-gpios = <&gpio1 8 GPIO_ACTIVE_LOW>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8580.txt b/Documentation/devicetree/bindings/sound/wm8580.txt index 78fce9b14954..ff3f9f5f2111 100644 --- a/Documentation/devicetree/bindings/sound/wm8580.txt +++ b/Documentation/devicetree/bindings/sound/wm8580.txt @@ -10,7 +10,7 @@ Required properties: Example: -codec: wm8580@1a { +wm8580: codec@1a { compatible = "wlf,wm8580"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8711.txt b/Documentation/devicetree/bindings/sound/wm8711.txt index 8ed9998cd23c..c30a1387c4bf 100644 --- a/Documentation/devicetree/bindings/sound/wm8711.txt +++ b/Documentation/devicetree/bindings/sound/wm8711.txt @@ -12,7 +12,7 @@ Required properties: Example: -codec: wm8711@1a { +wm8711: codec@1a { compatible = "wlf,wm8711"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8728.txt b/Documentation/devicetree/bindings/sound/wm8728.txt index a8b5c3668e60..a3608b4c78b9 100644 --- a/Documentation/devicetree/bindings/sound/wm8728.txt +++ b/Documentation/devicetree/bindings/sound/wm8728.txt @@ -12,7 +12,7 @@ Required properties: Example: -codec: wm8728@1a { +wm8728: codec@1a { compatible = "wlf,wm8728"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8731.txt b/Documentation/devicetree/bindings/sound/wm8731.txt index 236690e99b87..f660d9bb0e69 100644 --- a/Documentation/devicetree/bindings/sound/wm8731.txt +++ b/Documentation/devicetree/bindings/sound/wm8731.txt @@ -12,7 +12,7 @@ Required properties: Example: -codec: wm8731@1a { +wm8731: codec@1a { compatible = "wlf,wm8731"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8737.txt b/Documentation/devicetree/bindings/sound/wm8737.txt index 4bc2cea3b140..eda1ec6a7563 100644 --- a/Documentation/devicetree/bindings/sound/wm8737.txt +++ b/Documentation/devicetree/bindings/sound/wm8737.txt @@ -12,7 +12,7 @@ Required properties: Example: -codec: wm8737@1a { +wm8737: codec@1a { compatible = "wlf,wm8737"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8741.txt b/Documentation/devicetree/bindings/sound/wm8741.txt index a13315408719..b69e196c741c 100644 --- a/Documentation/devicetree/bindings/sound/wm8741.txt +++ b/Documentation/devicetree/bindings/sound/wm8741.txt @@ -21,7 +21,7 @@ Optional properties: Example: -codec: wm8741@1a { +wm8741: codec@1a { compatible = "wlf,wm8741"; reg = <0x1a>; diff --git a/Documentation/devicetree/bindings/sound/wm8750.txt b/Documentation/devicetree/bindings/sound/wm8750.txt index 8db239fd5ecd..682f221f6f38 100644 --- a/Documentation/devicetree/bindings/sound/wm8750.txt +++ b/Documentation/devicetree/bindings/sound/wm8750.txt @@ -12,7 +12,7 @@ Required properties: Example: -codec: wm8750@1a { +wm8750: codec@1a { compatible = "wlf,wm8750"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8753.txt b/Documentation/devicetree/bindings/sound/wm8753.txt index 8eee61282105..eca9e5a825a9 100644 --- a/Documentation/devicetree/bindings/sound/wm8753.txt +++ b/Documentation/devicetree/bindings/sound/wm8753.txt @@ -34,7 +34,7 @@ Pins on the device (for linking into audio routes): Example: -codec: wm8753@1a { +wm8753: codec@1a { compatible = "wlf,wm8753"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8770.txt b/Documentation/devicetree/bindings/sound/wm8770.txt index 866e00ca150b..cac762a1105d 100644 --- a/Documentation/devicetree/bindings/sound/wm8770.txt +++ b/Documentation/devicetree/bindings/sound/wm8770.txt @@ -10,7 +10,7 @@ Required properties: Example: -codec: wm8770@1 { +wm8770: codec@1 { compatible = "wlf,wm8770"; reg = <1>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8776.txt b/Documentation/devicetree/bindings/sound/wm8776.txt index 3b9ca49abc2b..01173369c3ed 100644 --- a/Documentation/devicetree/bindings/sound/wm8776.txt +++ b/Documentation/devicetree/bindings/sound/wm8776.txt @@ -12,7 +12,7 @@ Required properties: Example: -codec: wm8776@1a { +wm8776: codec@1a { compatible = "wlf,wm8776"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8804.txt b/Documentation/devicetree/bindings/sound/wm8804.txt index 6fd124b16496..2c1641c17a91 100644 --- a/Documentation/devicetree/bindings/sound/wm8804.txt +++ b/Documentation/devicetree/bindings/sound/wm8804.txt @@ -19,7 +19,7 @@ Optional properties: Example: -codec: wm8804@1a { +wm8804: codec@1a { compatible = "wlf,wm8804"; reg = <0x1a>; }; diff --git a/Documentation/devicetree/bindings/sound/wm8903.txt b/Documentation/devicetree/bindings/sound/wm8903.txt index afc51caf1137..6371c2434afe 100644 --- a/Documentation/devicetree/bindings/sound/wm8903.txt +++ b/Documentation/devicetree/bindings/sound/wm8903.txt @@ -57,7 +57,7 @@ Pins on the device (for linking into audio routes): Example: -codec: wm8903@1a { +wm8903: codec@1a { compatible = "wlf,wm8903"; reg = <0x1a>; interrupts = < 347 >; diff --git a/Documentation/devicetree/bindings/sound/wm8960.txt b/Documentation/devicetree/bindings/sound/wm8960.txt index 2deb8a3da9c5..6d29ac3750ee 100644 --- a/Documentation/devicetree/bindings/sound/wm8960.txt +++ b/Documentation/devicetree/bindings/sound/wm8960.txt @@ -23,7 +23,7 @@ Optional properties: Example: -codec: wm8960@1a { +wm8960: codec@1a { compatible = "wlf,wm8960"; reg = <0x1a>; diff --git a/Documentation/devicetree/bindings/sound/wm8962.txt b/Documentation/devicetree/bindings/sound/wm8962.txt index 7f82b59ec8f9..dcfa9a3369fd 100644 --- a/Documentation/devicetree/bindings/sound/wm8962.txt +++ b/Documentation/devicetree/bindings/sound/wm8962.txt @@ -24,7 +24,7 @@ Optional properties: Example: -codec: wm8962@1a { +wm8962: codec@1a { compatible = "wlf,wm8962"; reg = <0x1a>; diff --git a/Documentation/devicetree/bindings/sound/wm8994.txt b/Documentation/devicetree/bindings/sound/wm8994.txt index 68c4e8d96bed..4a9dead1b7d3 100644 --- a/Documentation/devicetree/bindings/sound/wm8994.txt +++ b/Documentation/devicetree/bindings/sound/wm8994.txt @@ -59,7 +59,7 @@ Optional properties: Example: -codec: wm8994@1a { +wm8994: codec@1a { compatible = "wlf,wm8994"; reg = <0x1a>; diff --git a/Documentation/devicetree/bindings/submitting-patches.txt b/Documentation/devicetree/bindings/submitting-patches.txt index 274058c583dd..de0d6090c0fd 100644 --- a/Documentation/devicetree/bindings/submitting-patches.txt +++ b/Documentation/devicetree/bindings/submitting-patches.txt @@ -6,7 +6,14 @@ I. For patch submitters 0) Normal patch submission rules from Documentation/process/submitting-patches.rst applies. - 1) The Documentation/ portion of the patch should be a separate patch. + 1) The Documentation/ and include/dt-bindings/ portion of the patch should + be a separate patch. The preferred subject prefix for binding patches is: + + "dt-bindings: <binding dir>: ..." + + The 80 characters of the subject are precious. It is recommended to not + use "Documentation" or "doc" because that is implied. All bindings are + docs. Repeating "binding" again should also be avoided. 2) Submit the entire series to the devicetree mailinglist at diff --git a/Documentation/devicetree/bindings/thermal/exynos-thermal.txt b/Documentation/devicetree/bindings/thermal/exynos-thermal.txt index 1b596fd38dc4..b957acff57aa 100644 --- a/Documentation/devicetree/bindings/thermal/exynos-thermal.txt +++ b/Documentation/devicetree/bindings/thermal/exynos-thermal.txt @@ -49,19 +49,6 @@ on the SoC (only first trip points defined in DT will be configured): - samsung,exynos5433-tmu: 8 - samsung,exynos7-tmu: 8 -Following properties are mandatory (depending on SoC): -- samsung,tmu_gain: Gain value for internal TMU operation. -- samsung,tmu_reference_voltage: Value of TMU IP block's reference voltage -- samsung,tmu_noise_cancel_mode: Mode for noise cancellation -- samsung,tmu_efuse_value: Default level of temperature - it is needed when - in factory fusing produced wrong value -- samsung,tmu_min_efuse_value: Minimum temperature fused value -- samsung,tmu_max_efuse_value: Maximum temperature fused value -- samsung,tmu_first_point_trim: First point trimming value -- samsung,tmu_second_point_trim: Second point trimming value -- samsung,tmu_default_temp_offset: Default temperature offset -- samsung,tmu_cal_type: Callibration type - ** Optional properties: - vtmu-supply: This entry is optional and provides the regulator node supplying @@ -78,7 +65,7 @@ Example 1): clocks = <&clock 383>; clock-names = "tmu_apbif"; vtmu-supply = <&tmu_regulator_node>; - #include "exynos4412-tmu-sensor-conf.dtsi" + #thermal-sensor-cells = <0>; }; Example 2): @@ -89,7 +76,7 @@ Example 2): interrupts = <0 58 0>; clocks = <&clock 21>; clock-names = "tmu_apbif"; - #include "exynos5440-tmu-sensor-conf.dtsi" + #thermal-sensor-cells = <0>; }; Example 3): (In case of Exynos5420 "with misplaced TRIMINFO register") @@ -99,7 +86,7 @@ Example 3): (In case of Exynos5420 "with misplaced TRIMINFO register") interrupts = <0 184 0>; clocks = <&clock 318>, <&clock 318>; clock-names = "tmu_apbif", "tmu_triminfo_apbif"; - #include "exynos4412-tmu-sensor-conf.dtsi" + #thermal-sensor-cells = <0>; }; tmu_cpu3: tmu@1006c000 { @@ -108,7 +95,7 @@ Example 3): (In case of Exynos5420 "with misplaced TRIMINFO register") interrupts = <0 185 0>; clocks = <&clock 318>, <&clock 319>; clock-names = "tmu_apbif", "tmu_triminfo_apbif"; - #include "exynos4412-tmu-sensor-conf.dtsi" + #thermal-sensor-cells = <0>; }; tmu_gpu: tmu@100a0000 { @@ -117,7 +104,7 @@ Example 3): (In case of Exynos5420 "with misplaced TRIMINFO register") interrupts = <0 215 0>; clocks = <&clock 319>, <&clock 318>; clock-names = "tmu_apbif", "tmu_triminfo_apbif"; - #include "exynos4412-tmu-sensor-conf.dtsi" + #thermal-sensor-cells = <0>; }; Note: For multi-instance tmu each instance should have an alias correctly diff --git a/Documentation/devicetree/bindings/thermal/rcar-gen3-thermal.txt b/Documentation/devicetree/bindings/thermal/rcar-gen3-thermal.txt index fdf5caa6229b..39e7d4e61a63 100644 --- a/Documentation/devicetree/bindings/thermal/rcar-gen3-thermal.txt +++ b/Documentation/devicetree/bindings/thermal/rcar-gen3-thermal.txt @@ -27,9 +27,9 @@ Example: tsc: thermal@e6198000 { compatible = "renesas,r8a7795-thermal"; - reg = <0 0xe6198000 0 0x68>, - <0 0xe61a0000 0 0x5c>, - <0 0xe61a8000 0 0x5c>; + reg = <0 0xe6198000 0 0x100>, + <0 0xe61a0000 0 0x100>, + <0 0xe61a8000 0 0x100>; interrupts = <GIC_SPI 67 IRQ_TYPE_LEVEL_HIGH>, <GIC_SPI 68 IRQ_TYPE_LEVEL_HIGH>, <GIC_SPI 69 IRQ_TYPE_LEVEL_HIGH>; diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt b/Documentation/devicetree/bindings/thermal/thermal.txt index 1719d47a5e2f..cc553f0952c5 100644 --- a/Documentation/devicetree/bindings/thermal/thermal.txt +++ b/Documentation/devicetree/bindings/thermal/thermal.txt @@ -55,8 +55,7 @@ of heat dissipation). For example a fan's cooling states correspond to the different fan speeds possible. Cooling states are referred to by single unsigned integers, where larger numbers mean greater heat dissipation. The precise set of cooling states associated with a device -(as referred to by the cooling-min-level and cooling-max-level -properties) should be defined in a particular device's binding. +should be defined in a particular device's binding. For more examples of cooling devices, refer to the example sections below. Required properties: @@ -69,15 +68,6 @@ Required properties: See Cooling device maps section below for more details on how consumers refer to cooling devices. -Optional properties: -- cooling-min-level: An integer indicating the smallest - Type: unsigned cooling state accepted. Typically 0. - Size: one cell - -- cooling-max-level: An integer indicating the largest - Type: unsigned cooling state accepted. - Size: one cell - * Trip points The trip node is a node to describe a point in the temperature domain @@ -226,8 +216,6 @@ cpus { 396000 950000 198000 850000 >; - cooling-min-level = <0>; - cooling-max-level = <3>; #cooling-cells = <2>; /* min followed by max */ }; ... @@ -241,8 +229,6 @@ cpus { */ fan0: fan@48 { ... - cooling-min-level = <0>; - cooling-max-level = <9>; #cooling-cells = <2>; /* min followed by max */ }; }; diff --git a/Documentation/devicetree/bindings/nios2/timer.txt b/Documentation/devicetree/bindings/timer/altr,timer-1.0.txt index 904a5846d7ac..904a5846d7ac 100644 --- a/Documentation/devicetree/bindings/nios2/timer.txt +++ b/Documentation/devicetree/bindings/timer/altr,timer-1.0.txt diff --git a/Documentation/devicetree/bindings/arm/arch_timer.txt b/Documentation/devicetree/bindings/timer/arm,arch_timer.txt index 68301b77e854..68301b77e854 100644 --- a/Documentation/devicetree/bindings/arm/arch_timer.txt +++ b/Documentation/devicetree/bindings/timer/arm,arch_timer.txt diff --git a/Documentation/devicetree/bindings/arm/armv7m_systick.txt b/Documentation/devicetree/bindings/timer/arm,armv7m-systick.txt index 7cf4a24601eb..7cf4a24601eb 100644 --- a/Documentation/devicetree/bindings/arm/armv7m_systick.txt +++ b/Documentation/devicetree/bindings/timer/arm,armv7m-systick.txt diff --git a/Documentation/devicetree/bindings/arm/global_timer.txt b/Documentation/devicetree/bindings/timer/arm,global_timer.txt index bdae3a818793..bdae3a818793 100644 --- a/Documentation/devicetree/bindings/arm/global_timer.txt +++ b/Documentation/devicetree/bindings/timer/arm,global_timer.txt diff --git a/Documentation/devicetree/bindings/arm/twd.txt b/Documentation/devicetree/bindings/timer/arm,twd.txt index 383ea19c2bf0..383ea19c2bf0 100644 --- a/Documentation/devicetree/bindings/arm/twd.txt +++ b/Documentation/devicetree/bindings/timer/arm,twd.txt diff --git a/Documentation/devicetree/bindings/powerpc/fsl/gtm.txt b/Documentation/devicetree/bindings/timer/fsl,gtm.txt index 9a33efded4bc..9a33efded4bc 100644 --- a/Documentation/devicetree/bindings/powerpc/fsl/gtm.txt +++ b/Documentation/devicetree/bindings/timer/fsl,gtm.txt diff --git a/Documentation/devicetree/bindings/arm/mrvl/timer.txt b/Documentation/devicetree/bindings/timer/mrvl,mmp-timer.txt index 9a6e251462e7..9a6e251462e7 100644 --- a/Documentation/devicetree/bindings/arm/mrvl/timer.txt +++ b/Documentation/devicetree/bindings/timer/mrvl,mmp-timer.txt diff --git a/Documentation/devicetree/bindings/timer/nuvoton,npcm7xx-timer.txt b/Documentation/devicetree/bindings/timer/nuvoton,npcm7xx-timer.txt new file mode 100644 index 000000000000..ea22dfe485be --- /dev/null +++ b/Documentation/devicetree/bindings/timer/nuvoton,npcm7xx-timer.txt @@ -0,0 +1,21 @@ +Nuvoton NPCM7xx timer + +Nuvoton NPCM7xx have three timer modules, each timer module provides five 24-bit +timer counters. + +Required properties: +- compatible : "nuvoton,npcm750-timer" for Poleg NPCM750. +- reg : Offset and length of the register set for the device. +- interrupts : Contain the timer interrupt with flags for + falling edge. +- clocks : phandle of timer reference clock (usually a 25 MHz clock). + +Example: + +timer@f0008000 { + compatible = "nuvoton,npcm750-timer"; + interrupts = <GIC_SPI 32 IRQ_TYPE_LEVEL_HIGH>; + reg = <0xf0008000 0x50>; + clocks = <&clk NPCM7XX_CLK_TIMER>; +}; + diff --git a/Documentation/devicetree/bindings/timer/nxp,tpm-timer.txt b/Documentation/devicetree/bindings/timer/nxp,tpm-timer.txt index b4aa7ddb5b13..f82087b220f4 100644 --- a/Documentation/devicetree/bindings/timer/nxp,tpm-timer.txt +++ b/Documentation/devicetree/bindings/timer/nxp,tpm-timer.txt @@ -15,7 +15,7 @@ Required properties: - interrupts : Should be the clock event device interrupt. - clocks : The clocks provided by the SoC to drive the timer, must contain an entry for each entry in clock-names. -- clock-names : Must include the following entries: "igp" and "per". +- clock-names : Must include the following entries: "ipg" and "per". Example: tpm5: tpm@40260000 { diff --git a/Documentation/devicetree/bindings/arm/msm/timer.txt b/Documentation/devicetree/bindings/timer/qcom,msm-timer.txt index 5e10c345548f..5e10c345548f 100644 --- a/Documentation/devicetree/bindings/arm/msm/timer.txt +++ b/Documentation/devicetree/bindings/timer/qcom,msm-timer.txt diff --git a/Documentation/devicetree/bindings/arm/spear-timer.txt b/Documentation/devicetree/bindings/timer/st,spear-timer.txt index c0017221cf55..c0017221cf55 100644 --- a/Documentation/devicetree/bindings/arm/spear-timer.txt +++ b/Documentation/devicetree/bindings/timer/st,spear-timer.txt diff --git a/Documentation/devicetree/bindings/c6x/timer64.txt b/Documentation/devicetree/bindings/timer/ti,c64x+timer64.txt index 95911fe70224..95911fe70224 100644 --- a/Documentation/devicetree/bindings/c6x/timer64.txt +++ b/Documentation/devicetree/bindings/timer/ti,c64x+timer64.txt diff --git a/Documentation/devicetree/bindings/arm/omap/timer.txt b/Documentation/devicetree/bindings/timer/ti,timer.txt index d02e27c764ec..d02e27c764ec 100644 --- a/Documentation/devicetree/bindings/arm/omap/timer.txt +++ b/Documentation/devicetree/bindings/timer/ti,timer.txt diff --git a/Documentation/devicetree/bindings/arm/vt8500/via,vt8500-timer.txt b/Documentation/devicetree/bindings/timer/via,vt8500-timer.txt index 901c73f0d8ef..901c73f0d8ef 100644 --- a/Documentation/devicetree/bindings/arm/vt8500/via,vt8500-timer.txt +++ b/Documentation/devicetree/bindings/timer/via,vt8500-timer.txt diff --git a/Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt b/Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt index 0e03344e2e8b..2e9318151df7 100644 --- a/Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt +++ b/Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt @@ -76,6 +76,10 @@ Optional properties: needs to make sure it does not send more than 90% maximum_periodic_data_per_frame. The use case is multiple transactions, but less frame rate. +- mux-controls: The mux control for toggling host/device output of this + controller. It's expected that a mux state of 0 indicates device mode and a + mux state of 1 indicates host mode. +- mux-control-names: Shall be "usb_switch" if mux-controls is specified. i.mx specific properties - fsl,usbmisc: phandler of non-core register device, with one @@ -102,4 +106,6 @@ Example: rx-burst-size-dword = <0x10>; extcon = <0>, <&usb_id>; phy-clkgate-delay-us = <400>; + mux-controls = <&usb_switch>; + mux-control-names = "usb_switch"; }; diff --git a/Documentation/devicetree/bindings/usb/dwc3.txt b/Documentation/devicetree/bindings/usb/dwc3.txt index 0dbd3083e7dd..7f13ebef06cb 100644 --- a/Documentation/devicetree/bindings/usb/dwc3.txt +++ b/Documentation/devicetree/bindings/usb/dwc3.txt @@ -7,6 +7,26 @@ Required properties: - compatible: must be "snps,dwc3" - reg : Address and length of the register set for the device - interrupts: Interrupts used by the dwc3 controller. + - clock-names: should contain "ref", "bus_early", "suspend" + - clocks: list of phandle and clock specifier pairs corresponding to + entries in the clock-names property. + +Exception for clocks: + clocks are optional if the parent node (i.e. glue-layer) is compatible to + one of the following: + "amlogic,meson-axg-dwc3" + "amlogic,meson-gxl-dwc3" + "cavium,octeon-7130-usb-uctl" + "qcom,dwc3" + "samsung,exynos5250-dwusb3" + "samsung,exynos7-dwusb3" + "sprd,sc9860-dwc3" + "st,stih407-dwc3" + "ti,am437x-dwc3" + "ti,dwc3" + "ti,keystone-dwc3" + "rockchip,rk3399-dwc3" + "xlnx,zynqmp-dwc3" Optional properties: - usb-phy : array of phandle for the PHY device. The first element @@ -15,6 +35,7 @@ Optional properties: - phys: from the *Generic PHY* bindings - phy-names: from the *Generic PHY* bindings; supported names are "usb2-phy" or "usb3-phy". + - resets: a single pair of phandle and reset specifier - snps,usb3_lpm_capable: determines if platform is USB3 LPM capable - snps,disable_scramble_quirk: true when SW should disable data scrambling. Only really useful for FPGA builds. diff --git a/Documentation/devicetree/bindings/usb/fcs,fusb302.txt b/Documentation/devicetree/bindings/usb/fcs,fusb302.txt index 472facfa5a71..6087dc7f209e 100644 --- a/Documentation/devicetree/bindings/usb/fcs,fusb302.txt +++ b/Documentation/devicetree/bindings/usb/fcs,fusb302.txt @@ -6,12 +6,6 @@ Required properties : - interrupts : Interrupt specifier Optional properties : -- fcs,max-sink-microvolt : Maximum voltage to negotiate when configured as sink -- fcs,max-sink-microamp : Maximum current to negotiate when configured as sink -- fcs,max-sink-microwatt : Maximum power to negotiate when configured as sink - If this is less then max-sink-microvolt * - max-sink-microamp then the configured current will - be clamped. - fcs,operating-sink-microwatt : Minimum amount of power accepted from a sink when negotiating diff --git a/Documentation/devicetree/bindings/usb/hisilicon,histb-xhci.txt b/Documentation/devicetree/bindings/usb/hisilicon,histb-xhci.txt new file mode 100644 index 000000000000..f4633496b122 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/hisilicon,histb-xhci.txt @@ -0,0 +1,45 @@ +HiSilicon STB xHCI + +The device node for HiSilicon STB xHCI host controller + +Required properties: + - compatible: should be "hisilicon,hi3798cv200-xhci" + - reg: specifies physical base address and size of the registers + - interrupts : interrupt used by the controller + - clocks: a list of phandle + clock-specifier pairs, one for each + entry in clock-names + - clock-names: must contain + "bus": for bus clock + "utmi": for utmi clock + "pipe": for pipe clock + "suspend": for suspend clock + - resets: a list of phandle and reset specifier pairs as listed in + reset-names property. + - reset-names: must contain + "soft": for soft reset + - phys: a list of phandle + phy specifier pairs + - phy-names: must contain at least one of following: + "inno": for inno phy + "combo": for combo phy + +Optional properties: + - usb2-lpm-disable: indicate if we don't want to enable USB2 HW LPM + - usb3-lpm-capable: determines if platform is USB3 LPM capable + - imod-interval-ns: default interrupt moderation interval is 40000ns + +Example: + +xhci0: xchi@f98a0000 { + compatible = "hisilicon,hi3798cv200-xhci"; + reg = <0xf98a0000 0x10000>; + interrupts = <GIC_SPI 69 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&crg HISTB_USB3_BUS_CLK>, + <&crg HISTB_USB3_UTMI_CLK>, + <&crg HISTB_USB3_PIPE_CLK>, + <&crg HISTB_USB3_SUSPEND_CLK>; + clock-names = "bus", "utmi", "pipe", "suspend"; + resets = <&crg 0xb0 12>; + reset-names = "soft"; + phys = <&usb2_phy1_port1 0>, <&combphy0 PHY_TYPE_USB3>; + phy-names = "inno", "combo"; +}; diff --git a/Documentation/devicetree/bindings/usb/qcom,dwc3.txt b/Documentation/devicetree/bindings/usb/qcom,dwc3.txt index bc8a2fa5d2bf..95afdcf3c337 100644 --- a/Documentation/devicetree/bindings/usb/qcom,dwc3.txt +++ b/Documentation/devicetree/bindings/usb/qcom,dwc3.txt @@ -1,54 +1,95 @@ Qualcomm SuperSpeed DWC3 USB SoC controller Required properties: -- compatible: should contain "qcom,dwc3" +- compatible: Compatible list, contains + "qcom,dwc3" + "qcom,msm8996-dwc3" for msm8996 SOC. + "qcom,sdm845-dwc3" for sdm845 SOC. +- reg: Offset and length of register set for QSCRATCH wrapper +- power-domains: specifies a phandle to PM domain provider node - clocks: A list of phandle + clock-specifier pairs for the clocks listed in clock-names -- clock-names: Should contain the following: +- clock-names: Should contain the following: "core" Master/Core clock, have to be >= 125 MHz for SS operation and >= 60MHz for HS operation + "mock_utmi" Mock utmi clock needed for ITP/SOF generation in + host mode. Its frequency should be 19.2MHz. + "sleep" Sleep clock, used for wakeup when USB3 core goes + into low power mode (U3). Optional clocks: - "iface" System bus AXI clock. Not present on all platforms - "sleep" Sleep clock, used when USB3 core goes into low - power mode (U3). + "iface" System bus AXI clock. + Not present on "qcom,msm8996-dwc3" compatible. + "cfg_noc" System Config NOC clock. + Not present on "qcom,msm8996-dwc3" compatible. +- assigned-clocks: Should be: + MOCK_UTMI_CLK + MASTER_CLK +- assigned-clock-rates: Should be: + 19.2Mhz (192000000) for MOCK_UTMI_CLK + >=125Mhz (125000000) for MASTER_CLK in SS mode + >=60Mhz (60000000) for MASTER_CLK in HS mode + +Optional properties: +- resets: Phandle to reset control that resets core and wrapper. +- interrupts: specifies interrupts from controller wrapper used + to wakeup from low power/susepnd state. Must contain + one or more entry for interrupt-names property +- interrupt-names: Must include the following entries: + - "hs_phy_irq": The interrupt that is asserted when a + wakeup event is received on USB2 bus + - "ss_phy_irq": The interrupt that is asserted when a + wakeup event is received on USB3 bus + - "dm_hs_phy_irq" and "dp_hs_phy_irq": Separate + interrupts for any wakeup event on DM and DP lines +- qcom,select-utmi-as-pipe-clk: if present, disable USB3 pipe_clk requirement. + Used when dwc3 operates without SSPHY and only + HS/FS/LS modes are supported. Required child node: A child node must exist to represent the core DWC3 IP block. The name of the node is not important. The content of the node is defined in dwc3.txt. Phy documentation is provided in the following places: -Documentation/devicetree/bindings/phy/qcom-dwc3-usb-phy.txt +Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt - USB3 QMP PHY +Documentation/devicetree/bindings/phy/qcom-qusb2-phy.txt - USB2 QUSB2 PHY Example device nodes: hs_phy: phy@100f8800 { - compatible = "qcom,dwc3-hs-usb-phy"; - reg = <0x100f8800 0x30>; - clocks = <&gcc USB30_0_UTMI_CLK>; - clock-names = "ref"; - #phy-cells = <0>; - + compatible = "qcom,qusb2-v2-phy"; + ... }; ss_phy: phy@100f8830 { - compatible = "qcom,dwc3-ss-usb-phy"; - reg = <0x100f8830 0x30>; - clocks = <&gcc USB30_0_MASTER_CLK>; - clock-names = "ref"; - #phy-cells = <0>; - + compatible = "qcom,qmp-v3-usb3-phy"; + ... }; - usb3_0: usb30@0 { + usb3_0: usb30@a6f8800 { compatible = "qcom,dwc3"; + reg = <0xa6f8800 0x400>; #address-cells = <1>; #size-cells = <1>; - clocks = <&gcc USB30_0_MASTER_CLK>; - clock-names = "core"; - ranges; + interrupts = <0 131 0>, <0 486 0>, <0 488 0>, <0 489 0>; + interrupt-names = "hs_phy_irq", "ss_phy_irq", + "dm_hs_phy_irq", "dp_hs_phy_irq"; + + clocks = <&gcc GCC_USB30_PRIM_MASTER_CLK>, + <&gcc GCC_USB30_PRIM_MOCK_UTMI_CLK>, + <&gcc GCC_USB30_PRIM_SLEEP_CLK>; + clock-names = "core", "mock_utmi", "sleep"; + + assigned-clocks = <&gcc GCC_USB30_PRIM_MOCK_UTMI_CLK>, + <&gcc GCC_USB30_PRIM_MASTER_CLK>; + assigned-clock-rates = <19200000>, <133000000>; + + resets = <&gcc GCC_USB30_PRIM_BCR>; + reset-names = "core_reset"; + power-domains = <&gcc USB30_PRIM_GDSC>; + qcom,select-utmi-as-pipe-clk; dwc3@10000000 { compatible = "snps,dwc3"; diff --git a/Documentation/devicetree/bindings/usb/richtek,rt1711h.txt b/Documentation/devicetree/bindings/usb/richtek,rt1711h.txt new file mode 100644 index 000000000000..09e847e92e5e --- /dev/null +++ b/Documentation/devicetree/bindings/usb/richtek,rt1711h.txt @@ -0,0 +1,17 @@ +Richtek RT1711H TypeC PD Controller. + +Required properties: + - compatible : Must be "richtek,rt1711h". + - reg : Must be 0x4e, it's slave address of RT1711H. + - interrupt-parent : the phandle for the interrupt controller that + provides interrupts for this device. + - interrupts : <a b> where a is the interrupt number and b represents an + encoding of the sense and level information for the interrupt. + +Example : +rt1711h@4e { + compatible = "richtek,rt1711h"; + reg = <0x4e>; + interrupt-parent = <&gpio26>; + interrupts = <0 IRQ_TYPE_LEVEL_LOW>; +}; diff --git a/Documentation/devicetree/bindings/usb/usb-xhci.txt b/Documentation/devicetree/bindings/usb/usb-xhci.txt index c4c00dff4b56..bd1dd316fb23 100644 --- a/Documentation/devicetree/bindings/usb/usb-xhci.txt +++ b/Documentation/devicetree/bindings/usb/usb-xhci.txt @@ -28,7 +28,10 @@ Required properties: - interrupts: one XHCI interrupt should be described here. Optional properties: - - clocks: reference to a clock + - clocks: reference to the clocks + - clock-names: mandatory if there is a second clock, in this case + the name must be "core" for the first clock and "reg" for the + second one - usb2-lpm-disable: indicate if we don't want to enable USB2 HW LPM - usb3-lpm-capable: determines if platform is USB3 LPM capable - quirk-broken-port-ped: set if the controller has broken port disable mechanism diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt index b5f978a4cac6..4b38f3373f43 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.txt +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt @@ -32,6 +32,7 @@ andestech Andes Technology Corporation apm Applied Micro Circuits Corporation (APM) aptina Aptina Imaging arasan Arasan Chip Systems +archermind ArcherMind Technology (Nanjing) Co., Ltd. arctic Arctic Sand aries Aries Embedded GmbH arm ARM Ltd. @@ -47,6 +48,7 @@ auvidea Auvidea GmbH avago Avago Technologies avia avia semiconductor avic Shanghai AVIC Optoelectronics Co., Ltd. +avnet Avnet, Inc. axentia Axentia Technologies AB axis Axis Communications AB bananapi BIPAI KEJI LIMITED @@ -75,6 +77,7 @@ cnxt Conexant Systems, Inc. compulab CompuLab Ltd. cortina Cortina Systems, Inc. cosmic Cosmic Circuits +crane Crane Connectivity Solutions creative Creative Technology Ltd crystalfontz Crystalfontz America, Inc. cubietech Cubietech, Ltd. @@ -182,8 +185,10 @@ karo Ka-Ro electronics GmbH keithkoep Keith & Koep GmbH keymile Keymile GmbH khadas Khadas +kiebackpeter Kieback & Peter GmbH kinetic Kinetic Technologies kingnovel Kingnovel Technology Co., Ltd. +koe Kaohsiung Opto-Electronics Inc. kosagi Sutajio Ko-Usagi PTE Ltd. kyo Kyocera Corporation lacie LaCie @@ -198,11 +203,13 @@ linaro Linaro Limited linksys Belkin International, Inc. (Linksys) linux Linux-specific binding lltc Linear Technology Corporation +logicpd Logic PD, Inc. lsi LSI Corp. (LSI Logic) lwn Liebherr-Werk Nenzing GmbH macnica Macnica Americas marvell Marvell Technology Group Ltd. maxim Maxim Integrated Products +mbvl Mobiveil Inc. mcube mCube meas Measurement Specialties mediatek MediaTek Inc. @@ -317,6 +324,7 @@ sgx SGX Sensortech sharp Sharp Corporation shimafuji Shimafuji Electric, Inc. si-en Si-En Technology Ltd. +sifive SiFive, Inc. sigma Sigma Designs, Inc. sii Seiko Instruments, Inc. sil Silicon Image @@ -392,6 +400,7 @@ vot Vision Optical Technology Co., Ltd. wd Western Digital Corp. wetek WeTek Electronics, limited. wexler Wexler +wi2wi Wi2Wi, Inc. winbond Winbond Electronics corp. winstar Winstar Display Corp. wlf Wolfson Microelectronics diff --git a/Documentation/devicetree/overlay-notes.txt b/Documentation/devicetree/overlay-notes.txt index a4feb6dde8cd..725fb8d255c1 100644 --- a/Documentation/devicetree/overlay-notes.txt +++ b/Documentation/devicetree/overlay-notes.txt @@ -98,6 +98,14 @@ Finally, if you need to remove all overlays in one-go, just call of_overlay_remove_all() which will remove every single one in the correct order. +In addition, there is the option to register notifiers that get called on +overlay operations. See of_overlay_notifier_register/unregister and +enum of_overlay_notify_action for details. + +Note that a notifier callback is not supposed to store pointers to a device +tree node or its content beyond OF_OVERLAY_POST_REMOVE corresponding to the +respective node it received. + Overlay DTS Format ------------------ diff --git a/Documentation/doc-guide/parse-headers.rst b/Documentation/doc-guide/parse-headers.rst index 96a0423d5dba..82a3e43b6864 100644 --- a/Documentation/doc-guide/parse-headers.rst +++ b/Documentation/doc-guide/parse-headers.rst @@ -177,14 +177,14 @@ BUGS **** -Report bugs to Mauro Carvalho Chehab <mchehab@s-opensource.com> +Report bugs to Mauro Carvalho Chehab <mchehab@kernel.org> COPYRIGHT ********* -Copyright (c) 2016 by Mauro Carvalho Chehab <mchehab@s-opensource.com>. +Copyright (c) 2016 by Mauro Carvalho Chehab <mchehab+samsung@kernel.org>. License GPLv2: GNU GPL version 2 <http://gnu.org/licenses/gpl.html>. diff --git a/Documentation/clk.txt b/Documentation/driver-api/clk.rst index 511628bb3d3a..593cca5058b1 100644 --- a/Documentation/clk.txt +++ b/Documentation/driver-api/clk.rst @@ -96,7 +96,7 @@ the operations defined in clk-provider.h:: int (*get_phase)(struct clk_hw *hw); int (*set_phase)(struct clk_hw *hw, int degrees); void (*init)(struct clk_hw *hw); - int (*debug_init)(struct clk_hw *hw, + void (*debug_init)(struct clk_hw *hw, struct dentry *dentry); }; diff --git a/Documentation/driver-api/device_connection.rst b/Documentation/driver-api/device_connection.rst index affbc5566ab0..ba364224c349 100644 --- a/Documentation/driver-api/device_connection.rst +++ b/Documentation/driver-api/device_connection.rst @@ -40,4 +40,4 @@ API --- .. kernel-doc:: drivers/base/devcon.c - : functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove + :functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove diff --git a/Documentation/driver-api/firmware/fallback-mechanisms.rst b/Documentation/driver-api/firmware/fallback-mechanisms.rst index f353783ae0be..d35fed65eae9 100644 --- a/Documentation/driver-api/firmware/fallback-mechanisms.rst +++ b/Documentation/driver-api/firmware/fallback-mechanisms.rst @@ -72,9 +72,12 @@ the firmware requested, and establishes it in the device hierarchy by associating the device used to make the request as the device's parent. The sysfs directory's file attributes are defined and controlled through the new device's class (firmware_class) and group (fw_dev_attr_groups). -This is actually where the original firmware_class.c file name comes from, -as originally the only firmware loading mechanism available was the -mechanism we now use as a fallback mechanism. +This is actually where the original firmware_class module name came from, +given that originally the only firmware loading mechanism available was the +mechanism we now use as a fallback mechanism, which registers a struct class +firmware_class. Because the attributes exposed are part of the module name, the +module name firmware_class cannot be renamed in the future, to ensure backward +compatibility with old userspace. To load firmware using the sysfs interface we expose a loading indicator, and a file upload firmware into: @@ -83,7 +86,7 @@ and a file upload firmware into: * /sys/$DEVPATH/data To upload firmware you will echo 1 onto the loading file to indicate -you are loading firmware. You then cat the firmware into the data file, +you are loading firmware. You then write the firmware into the data file, and you notify the kernel the firmware is ready by echo'ing 0 onto the loading file. @@ -136,7 +139,8 @@ by kobject uevents. This is specially exacerbated due to the fact that most distributions today disable CONFIG_FW_LOADER_USER_HELPER_FALLBACK. Refer to do_firmware_uevent() for details of the kobject event variables -setup. Variables passwdd with a kobject add event: +setup. The variables currently passed to userspace with a "kobject add" +event are: * FIRMWARE=firmware name * TIMEOUT=timeout value diff --git a/Documentation/driver-api/firmware/firmware_cache.rst b/Documentation/driver-api/firmware/firmware_cache.rst index 2210e5bfb332..c2e69d9c6bf1 100644 --- a/Documentation/driver-api/firmware/firmware_cache.rst +++ b/Documentation/driver-api/firmware/firmware_cache.rst @@ -29,8 +29,8 @@ Some implementation details about the firmware cache setup: * If an asynchronous call is used the firmware cache is only set up for a device if if the second argument (uevent) to request_firmware_nowait() is true. When uevent is true it requests that a kobject uevent be sent to - userspace for the firmware request. For details refer to the Fackback - mechanism documented below. + userspace for the firmware request through the sysfs fallback mechanism + if the firmware file is not found. * If the firmware cache is determined to be needed as per the above two criteria the firmware cache is setup by adding a devres entry for the diff --git a/Documentation/driver-api/firmware/request_firmware.rst b/Documentation/driver-api/firmware/request_firmware.rst index cf4516dfbf96..f62bdcbfed5b 100644 --- a/Documentation/driver-api/firmware/request_firmware.rst +++ b/Documentation/driver-api/firmware/request_firmware.rst @@ -17,17 +17,22 @@ an error is returned. request_firmware ---------------- -.. kernel-doc:: drivers/base/firmware_class.c +.. kernel-doc:: drivers/base/firmware_loader/main.c :functions: request_firmware +firmware_request_nowarn +----------------------- +.. kernel-doc:: drivers/base/firmware_loader/main.c + :functions: firmware_request_nowarn + request_firmware_direct ----------------------- -.. kernel-doc:: drivers/base/firmware_class.c +.. kernel-doc:: drivers/base/firmware_loader/main.c :functions: request_firmware_direct request_firmware_into_buf ------------------------- -.. kernel-doc:: drivers/base/firmware_class.c +.. kernel-doc:: drivers/base/firmware_loader/main.c :functions: request_firmware_into_buf Asynchronous firmware requests @@ -41,7 +46,7 @@ in atomic contexts. request_firmware_nowait ----------------------- -.. kernel-doc:: drivers/base/firmware_class.c +.. kernel-doc:: drivers/base/firmware_loader/main.c :functions: request_firmware_nowait Special optimizations on reboot @@ -50,12 +55,12 @@ Special optimizations on reboot Some devices have an optimization in place to enable the firmware to be retained during system reboot. When such optimizations are used the driver author must ensure the firmware is still available on resume from suspend, -this can be done with firmware_request_cache() insted of requesting for the -firmare to be loaded. +this can be done with firmware_request_cache() instead of requesting for the +firmware to be loaded. firmware_request_cache() ------------------------ -.. kernel-doc:: drivers/base/firmware_class.c +------------------------ +.. kernel-doc:: drivers/base/firmware_loader/main.c :functions: firmware_request_cache request firmware API expected driver use diff --git a/Documentation/driver-api/fpga/fpga-bridge.rst b/Documentation/driver-api/fpga/fpga-bridge.rst new file mode 100644 index 000000000000..2c2aaca894bf --- /dev/null +++ b/Documentation/driver-api/fpga/fpga-bridge.rst @@ -0,0 +1,49 @@ +FPGA Bridge +=========== + +API to implement a new FPGA bridge +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: include/linux/fpga/fpga-bridge.h + :functions: fpga_bridge + +.. kernel-doc:: include/linux/fpga/fpga-bridge.h + :functions: fpga_bridge_ops + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_create + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_free + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_register + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_unregister + +API to control an FPGA bridge +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You probably won't need these directly. FPGA regions should handle this. + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: of_fpga_bridge_get + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_get + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_put + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_get_to_list + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: of_fpga_bridge_get_to_list + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_enable + +.. kernel-doc:: drivers/fpga/fpga-bridge.c + :functions: fpga_bridge_disable diff --git a/Documentation/driver-api/fpga/fpga-mgr.rst b/Documentation/driver-api/fpga/fpga-mgr.rst new file mode 100644 index 000000000000..bcf2dd24e179 --- /dev/null +++ b/Documentation/driver-api/fpga/fpga-mgr.rst @@ -0,0 +1,220 @@ +FPGA Manager +============ + +Overview +-------- + +The FPGA manager core exports a set of functions for programming an FPGA with +an image. The API is manufacturer agnostic. All manufacturer specifics are +hidden away in a low level driver which registers a set of ops with the core. +The FPGA image data itself is very manufacturer specific, but for our purposes +it's just binary data. The FPGA manager core won't parse it. + +The FPGA image to be programmed can be in a scatter gather list, a single +contiguous buffer, or a firmware file. Because allocating contiguous kernel +memory for the buffer should be avoided, users are encouraged to use a scatter +gather list instead if possible. + +The particulars for programming the image are presented in a structure (struct +fpga_image_info). This struct contains parameters such as pointers to the +FPGA image as well as image-specific particulars such as whether the image was +built for full or partial reconfiguration. + +How to support a new FPGA device +-------------------------------- + +To add another FPGA manager, write a driver that implements a set of ops. The +probe function calls fpga_mgr_register(), such as:: + + static const struct fpga_manager_ops socfpga_fpga_ops = { + .write_init = socfpga_fpga_ops_configure_init, + .write = socfpga_fpga_ops_configure_write, + .write_complete = socfpga_fpga_ops_configure_complete, + .state = socfpga_fpga_ops_state, + }; + + static int socfpga_fpga_probe(struct platform_device *pdev) + { + struct device *dev = &pdev->dev; + struct socfpga_fpga_priv *priv; + struct fpga_manager *mgr; + int ret; + + priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + /* + * do ioremaps, get interrupts, etc. and save + * them in priv + */ + + mgr = fpga_mgr_create(dev, "Altera SOCFPGA FPGA Manager", + &socfpga_fpga_ops, priv); + if (!mgr) + return -ENOMEM; + + platform_set_drvdata(pdev, mgr); + + ret = fpga_mgr_register(mgr); + if (ret) + fpga_mgr_free(mgr); + + return ret; + } + + static int socfpga_fpga_remove(struct platform_device *pdev) + { + struct fpga_manager *mgr = platform_get_drvdata(pdev); + + fpga_mgr_unregister(mgr); + + return 0; + } + + +The ops will implement whatever device specific register writes are needed to +do the programming sequence for this particular FPGA. These ops return 0 for +success or negative error codes otherwise. + +The programming sequence is:: + 1. .write_init + 2. .write or .write_sg (may be called once or multiple times) + 3. .write_complete + +The .write_init function will prepare the FPGA to receive the image data. The +buffer passed into .write_init will be atmost .initial_header_size bytes long, +if the whole bitstream is not immediately available then the core code will +buffer up at least this much before starting. + +The .write function writes a buffer to the FPGA. The buffer may be contain the +whole FPGA image or may be a smaller chunk of an FPGA image. In the latter +case, this function is called multiple times for successive chunks. This interface +is suitable for drivers which use PIO. + +The .write_sg version behaves the same as .write except the input is a sg_table +scatter list. This interface is suitable for drivers which use DMA. + +The .write_complete function is called after all the image has been written +to put the FPGA into operating mode. + +The ops include a .state function which will read the hardware FPGA manager and +return a code of type enum fpga_mgr_states. It doesn't result in a change in +hardware state. + +How to write an image buffer to a supported FPGA +------------------------------------------------ + +Some sample code:: + + #include <linux/fpga/fpga-mgr.h> + + struct fpga_manager *mgr; + struct fpga_image_info *info; + int ret; + + /* + * Get a reference to FPGA manager. The manager is not locked, so you can + * hold onto this reference without it preventing programming. + * + * This example uses the device node of the manager. Alternatively, use + * fpga_mgr_get(dev) instead if you have the device. + */ + mgr = of_fpga_mgr_get(mgr_node); + + /* struct with information about the FPGA image to program. */ + info = fpga_image_info_alloc(dev); + + /* flags indicates whether to do full or partial reconfiguration */ + info->flags = FPGA_MGR_PARTIAL_RECONFIG; + + /* + * At this point, indicate where the image is. This is pseudo-code; you're + * going to use one of these three. + */ + if (image is in a scatter gather table) { + + info->sgt = [your scatter gather table] + + } else if (image is in a buffer) { + + info->buf = [your image buffer] + info->count = [image buffer size] + + } else if (image is in a firmware file) { + + info->firmware_name = devm_kstrdup(dev, firmware_name, GFP_KERNEL); + + } + + /* Get exclusive control of FPGA manager */ + ret = fpga_mgr_lock(mgr); + + /* Load the buffer to the FPGA */ + ret = fpga_mgr_buf_load(mgr, &info, buf, count); + + /* Release the FPGA manager */ + fpga_mgr_unlock(mgr); + fpga_mgr_put(mgr); + + /* Deallocate the image info if you're done with it */ + fpga_image_info_free(info); + +API for implementing a new FPGA Manager driver +---------------------------------------------- + +.. kernel-doc:: include/linux/fpga/fpga-mgr.h + :functions: fpga_manager + +.. kernel-doc:: include/linux/fpga/fpga-mgr.h + :functions: fpga_manager_ops + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_create + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_free + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_register + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_unregister + +API for programming a FPGA +-------------------------- + +.. kernel-doc:: include/linux/fpga/fpga-mgr.h + :functions: fpga_image_info + +.. kernel-doc:: include/linux/fpga/fpga-mgr.h + :functions: fpga_mgr_states + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_image_info_alloc + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_image_info_free + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: of_fpga_mgr_get + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_get + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_put + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_lock + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_unlock + +.. kernel-doc:: include/linux/fpga/fpga-mgr.h + :functions: fpga_mgr_states + +Note - use :c:func:`fpga_region_program_fpga()` instead of :c:func:`fpga_mgr_load()` + +.. kernel-doc:: drivers/fpga/fpga-mgr.c + :functions: fpga_mgr_load diff --git a/Documentation/driver-api/fpga/fpga-region.rst b/Documentation/driver-api/fpga/fpga-region.rst new file mode 100644 index 000000000000..f89e4a311722 --- /dev/null +++ b/Documentation/driver-api/fpga/fpga-region.rst @@ -0,0 +1,102 @@ +FPGA Region +=========== + +Overview +-------- + +This document is meant to be an brief overview of the FPGA region API usage. A +more conceptual look at regions can be found in the Device Tree binding +document [#f1]_. + +For the purposes of this API document, let's just say that a region associates +an FPGA Manager and a bridge (or bridges) with a reprogrammable region of an +FPGA or the whole FPGA. The API provides a way to register a region and to +program a region. + +Currently the only layer above fpga-region.c in the kernel is the Device Tree +support (of-fpga-region.c) described in [#f1]_. The DT support layer uses regions +to program the FPGA and then DT to handle enumeration. The common region code +is intended to be used by other schemes that have other ways of accomplishing +enumeration after programming. + +An fpga-region can be set up to know the following things: + + * which FPGA manager to use to do the programming + + * which bridges to disable before programming and enable afterwards. + +Additional info needed to program the FPGA image is passed in the struct +fpga_image_info including: + + * pointers to the image as either a scatter-gather buffer, a contiguous + buffer, or the name of firmware file + + * flags indicating specifics such as whether the image if for partial + reconfiguration. + +How to program a FPGA using a region +------------------------------------ + +First, allocate the info struct:: + + info = fpga_image_info_alloc(dev); + if (!info) + return -ENOMEM; + +Set flags as needed, i.e.:: + + info->flags |= FPGA_MGR_PARTIAL_RECONFIG; + +Point to your FPGA image, such as:: + + info->sgt = &sgt; + +Add info to region and do the programming:: + + region->info = info; + ret = fpga_region_program_fpga(region); + +:c:func:`fpga_region_program_fpga()` operates on info passed in the +fpga_image_info (region->info). This function will attempt to: + + * lock the region's mutex + * lock the region's FPGA manager + * build a list of FPGA bridges if a method has been specified to do so + * disable the bridges + * program the FPGA + * re-enable the bridges + * release the locks + +Then you will want to enumerate whatever hardware has appeared in the FPGA. + +How to add a new FPGA region +---------------------------- + +An example of usage can be seen in the probe function of [#f2]_. + +.. [#f1] ../devicetree/bindings/fpga/fpga-region.txt +.. [#f2] ../../drivers/fpga/of-fpga-region.c + +API to program a FGPA +--------------------- + +.. kernel-doc:: drivers/fpga/fpga-region.c + :functions: fpga_region_program_fpga + +API to add a new FPGA region +---------------------------- + +.. kernel-doc:: include/linux/fpga/fpga-region.h + :functions: fpga_region + +.. kernel-doc:: drivers/fpga/fpga-region.c + :functions: fpga_region_create + +.. kernel-doc:: drivers/fpga/fpga-region.c + :functions: fpga_region_free + +.. kernel-doc:: drivers/fpga/fpga-region.c + :functions: fpga_region_register + +.. kernel-doc:: drivers/fpga/fpga-region.c + :functions: fpga_region_unregister diff --git a/Documentation/driver-api/fpga/index.rst b/Documentation/driver-api/fpga/index.rst new file mode 100644 index 000000000000..c51e5ebd544a --- /dev/null +++ b/Documentation/driver-api/fpga/index.rst @@ -0,0 +1,13 @@ +============== +FPGA Subsystem +============== + +:Author: Alan Tull + +.. toctree:: + :maxdepth: 2 + + intro + fpga-mgr + fpga-bridge + fpga-region diff --git a/Documentation/driver-api/fpga/intro.rst b/Documentation/driver-api/fpga/intro.rst new file mode 100644 index 000000000000..51cd81dbb4dc --- /dev/null +++ b/Documentation/driver-api/fpga/intro.rst @@ -0,0 +1,54 @@ +Introduction +============ + +The FPGA subsystem supports reprogramming FPGAs dynamically under +Linux. Some of the core intentions of the FPGA subsystems are: + +* The FPGA subsystem is vendor agnostic. + +* The FPGA subsystem separates upper layers (userspace interfaces and + enumeration) from lower layers that know how to program a specific + FPGA. + +* Code should not be shared between upper and lower layers. This + should go without saying. If that seems necessary, there's probably + framework functionality that that can be added that will benefit + other users. Write the linux-fpga mailing list and maintainers and + seek out a solution that expands the framework for broad reuse. + +* Generally, when adding code, think of the future. Plan for re-use. + +The framework in the kernel is divided into: + +FPGA Manager +------------ + +If you are adding a new FPGA or a new method of programming a FPGA, +this is the subsystem for you. Low level FPGA manager drivers contain +the knowledge of how to program a specific device. This subsystem +includes the framework in fpga-mgr.c and the low level drivers that +are registered with it. + +FPGA Bridge +----------- + +FPGA Bridges prevent spurious signals from going out of a FPGA or a +region of a FPGA during programming. They are disabled before +programming begins and re-enabled afterwards. An FPGA bridge may be +actual hard hardware that gates a bus to a cpu or a soft ("freeze") +bridge in FPGA fabric that surrounds a partial reconfiguration region +of an FPGA. This subsystem includes fpga-bridge.c and the low level +drivers that are registered with it. + +FPGA Region +----------- + +If you are adding a new interface to the FPGA framework, add it on top +of a FPGA region to allow the most reuse of your interface. + +The FPGA Region framework (fpga-region.c) associates managers and +bridges as reconfigurable regions. A region may refer to the whole +FPGA in full reconfiguration or to a partial reconfiguration region. + +The Device Tree FPGA Region support (of-fpga-region.c) handles +reprogramming FPGAs when device tree overlays are applied. diff --git a/Documentation/driver-api/gpio/board.rst b/Documentation/driver-api/gpio/board.rst index 25d62b2e9fd0..2c112553df84 100644 --- a/Documentation/driver-api/gpio/board.rst +++ b/Documentation/driver-api/gpio/board.rst @@ -177,3 +177,19 @@ mapping and is thus transparent to GPIO consumers. A set of functions such as gpiod_set_value() is available to work with the new descriptor-oriented interface. + +Boards using platform data can also hog GPIO lines by defining GPIO hog tables. + +.. code-block:: c + + struct gpiod_hog gpio_hog_table[] = { + GPIO_HOG("gpio.0", 10, "foo", GPIO_ACTIVE_LOW, GPIOD_OUT_HIGH), + { } + }; + +And the table can be added to the board code as follows:: + + gpiod_add_hogs(gpio_hog_table); + +The line will be hogged as soon as the gpiochip is created or - in case the +chip was created earlier - when the hog table is registered. diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst index 505ee906d7d9..cbe0242842d1 100644 --- a/Documentation/driver-api/gpio/driver.rst +++ b/Documentation/driver-api/gpio/driver.rst @@ -44,7 +44,7 @@ common to each controller of that type: - methods to establish GPIO line direction - methods used to access GPIO line values - - method to set electrical configuration to a a given GPIO line + - method to set electrical configuration for a given GPIO line - method to return the IRQ number associated to a given GPIO line - flag saying whether calls to its methods may sleep - optional line names array to identify lines @@ -143,7 +143,7 @@ resistor will make the line tend to high level unless one of the transistors on the rail actively pulls it down. The level on the line will go as high as the VDD on the pull-up resistor, which -may be higher than the level supported by the transistor, achieveing a +may be higher than the level supported by the transistor, achieving a level-shift to the higher VDD. Integrated electronics often have an output driver stage in the form of a CMOS @@ -382,7 +382,7 @@ Real-Time compliance for GPIO IRQ chips Any provider of irqchips needs to be carefully tailored to support Real Time preemption. It is desirable that all irqchips in the GPIO subsystem keep this -in mind and does the proper testing to assure they are real time-enabled. +in mind and do the proper testing to assure they are real time-enabled. So, pay attention on above " RT_FULL:" notes, please. The following is a checklist to follow when preparing a driver for real time-compliance: diff --git a/Documentation/driver-api/gpio/drivers-on-gpio.rst b/Documentation/driver-api/gpio/drivers-on-gpio.rst index 7da0c1dd1f7a..f3a189320e11 100644 --- a/Documentation/driver-api/gpio/drivers-on-gpio.rst +++ b/Documentation/driver-api/gpio/drivers-on-gpio.rst @@ -85,6 +85,10 @@ hardware descriptions such as device tree or ACPI: any other serio bus to the system and makes it possible to connect drivers for e.g. keyboards and other PS/2 protocol based devices. +- cec-gpio: drivers/media/platform/cec-gpio/ is used to interact with a CEC + Consumer Electronics Control bus using only GPIO. It is used to communicate + with devices on the HDMI bus. + Apart from this there are special GPIO drivers in subsystems like MMC/SD to read card detect and write protect GPIO lines, and in the TTY serial subsystem to emulate MCTRL (modem control) signals CTS/RTS by using two GPIO lines. The diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 6d8352c0f354..6d9f2f9fe20e 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -17,7 +17,9 @@ available subsections can be seen below. basics infrastructure pm/index + clk device-io + device_connection dma-buf device_link message-based @@ -34,6 +36,7 @@ available subsections can be seen below. edac scsi libata + target mtdnand miscellaneous w1 @@ -49,6 +52,7 @@ available subsections can be seen below. dmaengine/index slimbus soundwire/index + fpga/index .. only:: subproject and html diff --git a/Documentation/driver-api/infrastructure.rst b/Documentation/driver-api/infrastructure.rst index 6d9ff316b608..bee1b9a1702f 100644 --- a/Documentation/driver-api/infrastructure.rst +++ b/Documentation/driver-api/infrastructure.rst @@ -28,7 +28,7 @@ Device Drivers Base .. kernel-doc:: drivers/base/node.c :internal: -.. kernel-doc:: drivers/base/firmware_class.c +.. kernel-doc:: drivers/base/firmware_loader/main.c :export: .. kernel-doc:: drivers/base/transport_class.c diff --git a/Documentation/driver-api/scsi.rst b/Documentation/driver-api/scsi.rst index 31ad0fed6763..64b231d125e0 100644 --- a/Documentation/driver-api/scsi.rst +++ b/Documentation/driver-api/scsi.rst @@ -334,5 +334,5 @@ todo ~~~~ Parallel (fast/wide/ultra) SCSI, USB, SATA, SAS, Fibre Channel, -FireWire, ATAPI devices, Infiniband, I2O, iSCSI, Parallel ports, +FireWire, ATAPI devices, Infiniband, I2O, Parallel ports, netlink... diff --git a/Documentation/driver-api/soundwire/error_handling.rst b/Documentation/driver-api/soundwire/error_handling.rst new file mode 100644 index 000000000000..aa3a0a23a066 --- /dev/null +++ b/Documentation/driver-api/soundwire/error_handling.rst @@ -0,0 +1,65 @@ +======================== +SoundWire Error Handling +======================== + +The SoundWire PHY was designed with care and errors on the bus are going to +be very unlikely, and if they happen it should be limited to single bit +errors. Examples of this design can be found in the synchronization +mechanism (sync loss after two errors) and short CRCs used for the Bulk +Register Access. + +The errors can be detected with multiple mechanisms: + +1. Bus clash or parity errors: This mechanism relies on low-level detectors + that are independent of the payload and usages, and they cover both control + and audio data. The current implementation only logs such errors. + Improvements could be invalidating an entire programming sequence and + restarting from a known position. In the case of such errors outside of a + control/command sequence, there is no concealment or recovery for audio + data enabled by the SoundWire protocol, the location of the error will also + impact its audibility (most-significant bits will be more impacted in PCM), + and after a number of such errors are detected the bus might be reset. Note + that bus clashes due to programming errors (two streams using the same bit + slots) or electrical issues during the transmit/receive transition cannot + be distinguished, although a recurring bus clash when audio is enabled is a + indication of a bus allocation issue. The interrupt mechanism can also help + identify Slaves which detected a Bus Clash or a Parity Error, but they may + not be responsible for the errors so resetting them individually is not a + viable recovery strategy. + +2. Command status: Each command is associated with a status, which only + covers transmission of the data between devices. The ACK status indicates + that the command was received and will be executed by the end of the + current frame. A NAK indicates that the command was in error and will not + be applied. In case of a bad programming (command sent to non-existent + Slave or to a non-implemented register) or electrical issue, no response + signals the command was ignored. Some Master implementations allow for a + command to be retransmitted several times. If the retransmission fails, + backtracking and restarting the entire programming sequence might be a + solution. Alternatively some implementations might directly issue a bus + reset and re-enumerate all devices. + +3. Timeouts: In a number of cases such as ChannelPrepare or + ClockStopPrepare, the bus driver is supposed to poll a register field until + it transitions to a NotFinished value of zero. The MIPI SoundWire spec 1.1 + does not define timeouts but the MIPI SoundWire DisCo document adds + recommendation on timeouts. If such configurations do not complete, the + driver will return a -ETIMEOUT. Such timeouts are symptoms of a faulty + Slave device and are likely impossible to recover from. + +Errors during global reconfiguration sequences are extremely difficult to +handle: + +1. BankSwitch: An error during the last command issuing a BankSwitch is + difficult to backtrack from. Retransmitting the Bank Switch command may be + possible in a single segment setup, but this can lead to synchronization + problems when enabling multiple bus segments (a command with side effects + such as frame reconfiguration would be handled at different times). A global + hard-reset might be the best solution. + +Note that SoundWire does not provide a mechanism to detect illegal values +written in valid registers. In a number of cases the standard even mentions +that the Slave might behave in implementation-defined ways. The bus +implementation does not provide a recovery mechanism for such errors, Slave +or Master driver implementers are responsible for writing valid values in +valid registers and implement additional range checking if needed. diff --git a/Documentation/driver-api/soundwire/index.rst b/Documentation/driver-api/soundwire/index.rst index 647e94654752..6db026028f27 100644 --- a/Documentation/driver-api/soundwire/index.rst +++ b/Documentation/driver-api/soundwire/index.rst @@ -6,6 +6,9 @@ SoundWire Documentation :maxdepth: 1 summary + stream + error_handling + locking .. only:: subproject diff --git a/Documentation/driver-api/soundwire/locking.rst b/Documentation/driver-api/soundwire/locking.rst new file mode 100644 index 000000000000..253f73555255 --- /dev/null +++ b/Documentation/driver-api/soundwire/locking.rst @@ -0,0 +1,106 @@ +================= +SoundWire Locking +================= + +This document explains locking mechanism of the SoundWire Bus. Bus uses +following locks in order to avoid race conditions in Bus operations on +shared resources. + + - Bus lock + + - Message lock + +Bus lock +======== + +SoundWire Bus lock is a mutex and is part of Bus data structure +(sdw_bus) which is used for every Bus instance. This lock is used to +serialize each of the following operations(s) within SoundWire Bus instance. + + - Addition and removal of Slave(s), changing Slave status. + + - Prepare, Enable, Disable and De-prepare stream operations. + + - Access of Stream data structure. + +Message lock +============ + +SoundWire message transfer lock. This mutex is part of +Bus data structure (sdw_bus). This lock is used to serialize the message +transfers (read/write) within a SoundWire Bus instance. + +Below examples show how locks are acquired. + +Example 1 +--------- + +Message transfer. + + 1. For every message transfer + + a. Acquire Message lock. + + b. Transfer message (Read/Write) to Slave1 or broadcast message on + Bus in case of bank switch. + + c. Release Message lock :: + + +----------+ +---------+ + | | | | + | Bus | | Master | + | | | Driver | + | | | | + +----+-----+ +----+----+ + | | + | bus->ops->xfer_msg() | + <-------------------------------+ a. Acquire Message lock + | | b. Transfer message + | | + +-------------------------------> c. Release Message lock + | return success/error | d. Return success/error + | | + + + + +Example 2 +--------- + +Prepare operation. + + 1. Acquire lock for Bus instance associated with Master 1. + + 2. For every message transfer in Prepare operation + + a. Acquire Message lock. + + b. Transfer message (Read/Write) to Slave1 or broadcast message on + Bus in case of bank switch. + + c. Release Message lock. + + 3. Release lock for Bus instance associated with Master 1 :: + + +----------+ +---------+ + | | | | + | Bus | | Master | + | | | Driver | + | | | | + +----+-----+ +----+----+ + | | + | sdw_prepare_stream() | + <-------------------------------+ 1. Acquire bus lock + | | 2. Perform stream prepare + | | + | | + | bus->ops->xfer_msg() | + <-------------------------------+ a. Acquire Message lock + | | b. Transfer message + | | + +-------------------------------> c. Release Message lock + | return success/error | d. Return success/error + | | + | | + | return success/error | 3. Release bus lock + +-------------------------------> 4. Return success/error + | | + + + diff --git a/Documentation/driver-api/soundwire/stream.rst b/Documentation/driver-api/soundwire/stream.rst new file mode 100644 index 000000000000..29121aa55fb9 --- /dev/null +++ b/Documentation/driver-api/soundwire/stream.rst @@ -0,0 +1,372 @@ +========================= +Audio Stream in SoundWire +========================= + +An audio stream is a logical or virtual connection created between + + (1) System memory buffer(s) and Codec(s) + + (2) DSP memory buffer(s) and Codec(s) + + (3) FIFO(s) and Codec(s) + + (4) Codec(s) and Codec(s) + +which is typically driven by a DMA(s) channel through the data link. An +audio stream contains one or more channels of data. All channels within +stream must have same sample rate and same sample size. + +Assume a stream with two channels (Left & Right) is opened using SoundWire +interface. Below are some ways a stream can be represented in SoundWire. + +Stream Sample in memory (System memory, DSP memory or FIFOs) :: + + ------------------------- + | L | R | L | R | L | R | + ------------------------- + +Example 1: Stereo Stream with L and R channels is rendered from Master to +Slave. Both Master and Slave is using single port. :: + + +---------------+ Clock Signal +---------------+ + | Master +----------------------------------+ Slave | + | Interface | | Interface | + | | | 1 | + | | Data Signal | | + | L + R +----------------------------------+ L + R | + | (Data) | Data Direction | (Data) | + +---------------+ +-----------------------> +---------------+ + + +Example 2: Stereo Stream with L and R channels is captured from Slave to +Master. Both Master and Slave is using single port. :: + + + +---------------+ Clock Signal +---------------+ + | Master +----------------------------------+ Slave | + | Interface | | Interface | + | | | 1 | + | | Data Signal | | + | L + R +----------------------------------+ L + R | + | (Data) | Data Direction | (Data) | + +---------------+ <-----------------------+ +---------------+ + + +Example 3: Stereo Stream with L and R channels is rendered by Master. Each +of the L and R channel is received by two different Slaves. Master and both +Slaves are using single port. :: + + +---------------+ Clock Signal +---------------+ + | Master +---------+------------------------+ Slave | + | Interface | | | Interface | + | | | | 1 | + | | | Data Signal | | + | L + R +---+------------------------------+ L | + | (Data) | | | Data Direction | (Data) | + +---------------+ | | +-------------> +---------------+ + | | + | | + | | +---------------+ + | +----------------------> | Slave | + | | Interface | + | | 2 | + | | | + +----------------------------> | R | + | (Data) | + +---------------+ + + +Example 4: Stereo Stream with L and R channel is rendered by two different +Ports of the Master and is received by only single Port of the Slave +interface. :: + + +--------------------+ + | | + | +--------------+ +----------------+ + | | || | | + | | Data Port || L Channel | | + | | 1 |------------+ | | + | | L Channel || | +-----+----+ | + | | (Data) || | L + R Channel || Data | | + | Master +----------+ | +---+---------> || Port | | + | Interface | | || 1 | | + | +--------------+ | || | | + | | || | +----------+ | + | | Data Port |------------+ | | + | | 2 || R Channel | Slave | + | | R Channel || | Interface | + | | (Data) || | 1 | + | +--------------+ Clock Signal | L + R | + | +---------------------------> | (Data) | + +--------------------+ | | + +----------------+ + +SoundWire Stream Management flow +================================ + +Stream definitions +------------------ + + (1) Current stream: This is classified as the stream on which operation has + to be performed like prepare, enable, disable, de-prepare etc. + + (2) Active stream: This is classified as the stream which is already active + on Bus other than current stream. There can be multiple active streams + on the Bus. + +SoundWire Bus manages stream operations for each stream getting +rendered/captured on the SoundWire Bus. This section explains Bus operations +done for each of the stream allocated/released on Bus. Following are the +stream states maintained by the Bus for each of the audio stream. + + +SoundWire stream states +----------------------- + +Below shows the SoundWire stream states and state transition diagram. :: + + +-----------+ +------------+ +----------+ +----------+ + | ALLOCATED +---->| CONFIGURED +---->| PREPARED +---->| ENABLED | + | STATE | | STATE | | STATE | | STATE | + +-----------+ +------------+ +----------+ +----+-----+ + ^ + | + | + v + +----------+ +------------+ +----+-----+ + | RELEASED |<----------+ DEPREPARED |<-------+ DISABLED | + | STATE | | STATE | | STATE | + +----------+ +------------+ +----------+ + +NOTE: State transition between prepare and deprepare is supported in Spec +but not in the software (subsystem) + +NOTE2: Stream state transition checks need to be handled by caller +framework, for example ALSA/ASoC. No checks for stream transition exist in +SoundWire subsystem. + +Stream State Operations +----------------------- + +Below section explains the operations done by the Bus on Master(s) and +Slave(s) as part of stream state transitions. + +SDW_STREAM_ALLOCATED +~~~~~~~~~~~~~~~~~~~~ + +Allocation state for stream. This is the entry state +of the stream. Operations performed before entering in this state: + + (1) A stream runtime is allocated for the stream. This stream + runtime is used as a reference for all the operations performed + on the stream. + + (2) The resources required for holding stream runtime information are + allocated and initialized. This holds all stream related information + such as stream type (PCM/PDM) and parameters, Master and Slave + interface associated with the stream, stream state etc. + +After all above operations are successful, stream state is set to +``SDW_STREAM_ALLOCATED``. + +Bus implements below API for allocate a stream which needs to be called once +per stream. From ASoC DPCM framework, this stream state maybe linked to +.startup() operation. + + .. code-block:: c + int sdw_alloc_stream(char * stream_name); + + +SDW_STREAM_CONFIGURED +~~~~~~~~~~~~~~~~~~~~~ + +Configuration state of stream. Operations performed before entering in +this state: + + (1) The resources allocated for stream information in SDW_STREAM_ALLOCATED + state are updated here. This includes stream parameters, Master(s) + and Slave(s) runtime information associated with current stream. + + (2) All the Master(s) and Slave(s) associated with current stream provide + the port information to Bus which includes port numbers allocated by + Master(s) and Slave(s) for current stream and their channel mask. + +After all above operations are successful, stream state is set to +``SDW_STREAM_CONFIGURED``. + +Bus implements below APIs for CONFIG state which needs to be called by +the respective Master(s) and Slave(s) associated with stream. These APIs can +only be invoked once by respective Master(s) and Slave(s). From ASoC DPCM +framework, this stream state is linked to .hw_params() operation. + + .. code-block:: c + int sdw_stream_add_master(struct sdw_bus * bus, + struct sdw_stream_config * stream_config, + struct sdw_ports_config * ports_config, + struct sdw_stream_runtime * stream); + + int sdw_stream_add_slave(struct sdw_slave * slave, + struct sdw_stream_config * stream_config, + struct sdw_ports_config * ports_config, + struct sdw_stream_runtime * stream); + + +SDW_STREAM_PREPARED +~~~~~~~~~~~~~~~~~~~ + +Prepare state of stream. Operations performed before entering in this state: + + (1) Bus parameters such as bandwidth, frame shape, clock frequency, + are computed based on current stream as well as already active + stream(s) on Bus. Re-computation is required to accommodate current + stream on the Bus. + + (2) Transport and port parameters of all Master(s) and Slave(s) port(s) are + computed for the current as well as already active stream based on frame + shape and clock frequency computed in step 1. + + (3) Computed Bus and transport parameters are programmed in Master(s) and + Slave(s) registers. The banked registers programming is done on the + alternate bank (bank currently unused). Port(s) are enabled for the + already active stream(s) on the alternate bank (bank currently unused). + This is done in order to not disrupt already active stream(s). + + (4) Once all the values are programmed, Bus initiates switch to alternate + bank where all new values programmed gets into effect. + + (5) Ports of Master(s) and Slave(s) for current stream are prepared by + programming PrepareCtrl register. + +After all above operations are successful, stream state is set to +``SDW_STREAM_PREPARED``. + +Bus implements below API for PREPARE state which needs to be called once per +stream. From ASoC DPCM framework, this stream state is linked to +.prepare() operation. + + .. code-block:: c + int sdw_prepare_stream(struct sdw_stream_runtime * stream); + + +SDW_STREAM_ENABLED +~~~~~~~~~~~~~~~~~~ + +Enable state of stream. The data port(s) are enabled upon entering this state. +Operations performed before entering in this state: + + (1) All the values computed in SDW_STREAM_PREPARED state are programmed + in alternate bank (bank currently unused). It includes programming of + already active stream(s) as well. + + (2) All the Master(s) and Slave(s) port(s) for the current stream are + enabled on alternate bank (bank currently unused) by programming + ChannelEn register. + + (3) Once all the values are programmed, Bus initiates switch to alternate + bank where all new values programmed gets into effect and port(s) + associated with current stream are enabled. + +After all above operations are successful, stream state is set to +``SDW_STREAM_ENABLED``. + +Bus implements below API for ENABLE state which needs to be called once per +stream. From ASoC DPCM framework, this stream state is linked to +.trigger() start operation. + + .. code-block:: c + int sdw_enable_stream(struct sdw_stream_runtime * stream); + +SDW_STREAM_DISABLED +~~~~~~~~~~~~~~~~~~~ + +Disable state of stream. The data port(s) are disabled upon exiting this state. +Operations performed before entering in this state: + + (1) All the Master(s) and Slave(s) port(s) for the current stream are + disabled on alternate bank (bank currently unused) by programming + ChannelEn register. + + (2) All the current configuration of Bus and active stream(s) are programmed + into alternate bank (bank currently unused). + + (3) Once all the values are programmed, Bus initiates switch to alternate + bank where all new values programmed gets into effect and port(s) associated + with current stream are disabled. + +After all above operations are successful, stream state is set to +``SDW_STREAM_DISABLED``. + +Bus implements below API for DISABLED state which needs to be called once +per stream. From ASoC DPCM framework, this stream state is linked to +.trigger() stop operation. + + .. code-block:: c + int sdw_disable_stream(struct sdw_stream_runtime * stream); + + +SDW_STREAM_DEPREPARED +~~~~~~~~~~~~~~~~~~~~~ + +De-prepare state of stream. Operations performed before entering in this +state: + + (1) All the port(s) of Master(s) and Slave(s) for current stream are + de-prepared by programming PrepareCtrl register. + + (2) The payload bandwidth of current stream is reduced from the total + bandwidth requirement of bus and new parameters calculated and + applied by performing bank switch etc. + +After all above operations are successful, stream state is set to +``SDW_STREAM_DEPREPARED``. + +Bus implements below API for DEPREPARED state which needs to be called once +per stream. From ASoC DPCM framework, this stream state is linked to +.trigger() stop operation. + + .. code-block:: c + int sdw_deprepare_stream(struct sdw_stream_runtime * stream); + + +SDW_STREAM_RELEASED +~~~~~~~~~~~~~~~~~~~ + +Release state of stream. Operations performed before entering in this state: + + (1) Release port resources for all Master(s) and Slave(s) port(s) + associated with current stream. + + (2) Release Master(s) and Slave(s) runtime resources associated with + current stream. + + (3) Release stream runtime resources associated with current stream. + +After all above operations are successful, stream state is set to +``SDW_STREAM_RELEASED``. + +Bus implements below APIs for RELEASE state which needs to be called by +all the Master(s) and Slave(s) associated with stream. From ASoC DPCM +framework, this stream state is linked to .hw_free() operation. + + .. code-block:: c + int sdw_stream_remove_master(struct sdw_bus * bus, + struct sdw_stream_runtime * stream); + int sdw_stream_remove_slave(struct sdw_slave * slave, + struct sdw_stream_runtime * stream); + + +The .shutdown() ASoC DPCM operation calls below Bus API to release +stream assigned as part of ALLOCATED state. + +In .shutdown() the data structure maintaining stream state are freed up. + + .. code-block:: c + void sdw_release_stream(struct sdw_stream_runtime * stream); + +Not Supported +============= + +1. A single port with multiple channels supported cannot be used between two +streams or across stream. For example a port with 4 channels cannot be used +to handle 2 independent stereo streams even though it's possible in theory +in SoundWire. diff --git a/Documentation/driver-api/target.rst b/Documentation/driver-api/target.rst new file mode 100644 index 000000000000..4363611dd86d --- /dev/null +++ b/Documentation/driver-api/target.rst @@ -0,0 +1,64 @@ +================================= +target and iSCSI Interfaces Guide +================================= + +Introduction and Overview +========================= + +TBD + +Target core device interfaces +============================= + +.. kernel-doc:: drivers/target/target_core_device.c + :export: + +Target core transport interfaces +================================ + +.. kernel-doc:: drivers/target/target_core_transport.c + :export: + +Target-supported userspace I/O +============================== + +.. kernel-doc:: drivers/target/target_core_user.c + :doc: Userspace I/O + +.. kernel-doc:: include/uapi/linux/target_core_user.h + :doc: Ring Design + +iSCSI helper functions +====================== + +.. kernel-doc:: drivers/scsi/libiscsi.c + :export: + + +iSCSI boot information +====================== + +.. kernel-doc:: drivers/scsi/iscsi_boot_sysfs.c + :export: + + +iSCSI transport class +===================== + +The file drivers/scsi/scsi_transport_iscsi.c defines transport +attributes for the iSCSI class, which sends SCSI packets over TCP/IP +connections. + +.. kernel-doc:: drivers/scsi/scsi_transport_iscsi.c + :export: + + +iSCSI TCP interfaces +==================== + +.. kernel-doc:: drivers/scsi/iscsi_tcp.c + :internal: + +.. kernel-doc:: drivers/scsi/libiscsi_tcp.c + :export: + diff --git a/Documentation/driver-api/uio-howto.rst b/Documentation/driver-api/uio-howto.rst index 92056c20e070..fb2eb73be4a3 100644 --- a/Documentation/driver-api/uio-howto.rst +++ b/Documentation/driver-api/uio-howto.rst @@ -711,7 +711,8 @@ The vmbus device regions are mapped into uio device resources: If a subchannel is created by a request to host, then the uio_hv_generic device driver will create a sysfs binary file for the per-channel ring buffer. -For example: +For example:: + /sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring Further information diff --git a/Documentation/driver-api/usb/dwc3.rst b/Documentation/driver-api/usb/dwc3.rst index c3dc84a50ce5..8b36ff11cef9 100644 --- a/Documentation/driver-api/usb/dwc3.rst +++ b/Documentation/driver-api/usb/dwc3.rst @@ -674,9 +674,8 @@ operations, both of which can be traced. Format is:: __entry->flags & DWC3_EP_ENABLED ? 'E' : 'e', __entry->flags & DWC3_EP_STALL ? 'S' : 's', __entry->flags & DWC3_EP_WEDGE ? 'W' : 'w', - __entry->flags & DWC3_EP_BUSY ? 'B' : 'b', + __entry->flags & DWC3_EP_TRANSFER_STARTED ? 'B' : 'b', __entry->flags & DWC3_EP_PENDING_REQUEST ? 'P' : 'p', - __entry->flags & DWC3_EP_MISSED_ISOC ? 'M' : 'm', __entry->flags & DWC3_EP_END_TRANSFER_PENDING ? 'E' : 'e', __entry->direction ? '<' : '>' ) diff --git a/Documentation/driver-api/usb/typec.rst b/Documentation/driver-api/usb/typec.rst index feb31946490b..48ff58095f11 100644 --- a/Documentation/driver-api/usb/typec.rst +++ b/Documentation/driver-api/usb/typec.rst @@ -210,7 +210,7 @@ If the connector is dual-role capable, there may also be a switch for the data role. USB Type-C Connector Class does not supply separate API for them. The port drivers can use USB Role Class API with those. -Illustration of the muxes behind a connector that supports an alternate mode: +Illustration of the muxes behind a connector that supports an alternate mode:: ------------------------ | Connector | diff --git a/Documentation/features/lib/strncasecmp/arch-support.txt b/Documentation/features/core/cBPF-JIT/arch-support.txt index 4f3a6a0e4e68..90459cdde314 100644 --- a/Documentation/features/lib/strncasecmp/arch-support.txt +++ b/Documentation/features/core/cBPF-JIT/arch-support.txt @@ -1,7 +1,7 @@ # -# Feature name: strncasecmp -# Kconfig: __HAVE_ARCH_STRNCASECMP -# description: arch provides an optimized strncasecmp() function +# Feature name: cBPF-JIT +# Kconfig: HAVE_CBPF_JIT +# description: arch supports cBPF JIT optimizations # ----------------------- | arch |status| @@ -16,14 +16,16 @@ | ia64: | TODO | | m68k: | TODO | | microblaze: | TODO | - | mips: | TODO | + | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | - | powerpc: | TODO | + | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | TODO | diff --git a/Documentation/features/core/BPF-JIT/arch-support.txt b/Documentation/features/core/eBPF-JIT/arch-support.txt index 0b96b4e1e7d4..c90a0382fe66 100644 --- a/Documentation/features/core/BPF-JIT/arch-support.txt +++ b/Documentation/features/core/eBPF-JIT/arch-support.txt @@ -1,7 +1,7 @@ # -# Feature name: BPF-JIT -# Kconfig: HAVE_BPF_JIT -# description: arch supports BPF JIT optimizations +# Feature name: eBPF-JIT +# Kconfig: HAVE_EBPF_JIT +# description: arch supports eBPF JIT optimizations # ----------------------- | arch |status| @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/core/generic-idle-thread/arch-support.txt b/Documentation/features/core/generic-idle-thread/arch-support.txt index 372a2b18a617..0ef6acdb991c 100644 --- a/Documentation/features/core/generic-idle-thread/arch-support.txt +++ b/Documentation/features/core/generic-idle-thread/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | - | openrisc: | TODO | + | openrisc: | ok | | parisc: | ok | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/core/jump-labels/arch-support.txt b/Documentation/features/core/jump-labels/arch-support.txt index ad97217b003b..27cbd63abfd2 100644 --- a/Documentation/features/core/jump-labels/arch-support.txt +++ b/Documentation/features/core/jump-labels/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/core/tracehook/arch-support.txt b/Documentation/features/core/tracehook/arch-support.txt index 36ee7bef5d18..f44c274e40ed 100644 --- a/Documentation/features/core/tracehook/arch-support.txt +++ b/Documentation/features/core/tracehook/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | ok | | nios2: | ok | | openrisc: | ok | | parisc: | ok | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/KASAN/arch-support.txt b/Documentation/features/debug/KASAN/arch-support.txt index f5c99fa576d3..282ecc8ea1da 100644 --- a/Documentation/features/debug/KASAN/arch-support.txt +++ b/Documentation/features/debug/KASAN/arch-support.txt @@ -17,15 +17,17 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | | um: | TODO | | unicore32: | TODO | - | x86: | ok | 64-bit only + | x86: | ok | | xtensa: | ok | ----------------------- diff --git a/Documentation/features/debug/gcov-profile-all/arch-support.txt b/Documentation/features/debug/gcov-profile-all/arch-support.txt index 5170a9934843..01b2b3004e0a 100644 --- a/Documentation/features/debug/gcov-profile-all/arch-support.txt +++ b/Documentation/features/debug/gcov-profile-all/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | ok | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | TODO | diff --git a/Documentation/features/debug/kgdb/arch-support.txt b/Documentation/features/debug/kgdb/arch-support.txt index 13b6e994ae1f..3b4dff22329f 100644 --- a/Documentation/features/debug/kgdb/arch-support.txt +++ b/Documentation/features/debug/kgdb/arch-support.txt @@ -11,16 +11,18 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | h8300: | TODO | + | h8300: | ok | | hexagon: | ok | | ia64: | TODO | | m68k: | TODO | | microblaze: | ok | | mips: | ok | + | nds32: | TODO | | nios2: | ok | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt b/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt index 419bb38820e7..7e963d0ae646 100644 --- a/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt +++ b/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/debug/kprobes/arch-support.txt b/Documentation/features/debug/kprobes/arch-support.txt index 52b3ace0a030..4ada027faf16 100644 --- a/Documentation/features/debug/kprobes/arch-support.txt +++ b/Documentation/features/debug/kprobes/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | ok | | arm: | ok | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/kretprobes/arch-support.txt b/Documentation/features/debug/kretprobes/arch-support.txt index 180d24419518..044e13fcca5d 100644 --- a/Documentation/features/debug/kretprobes/arch-support.txt +++ b/Documentation/features/debug/kretprobes/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | ok | | arm: | ok | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/optprobes/arch-support.txt b/Documentation/features/debug/optprobes/arch-support.txt index 0a1241f45e41..dce7669c918f 100644 --- a/Documentation/features/debug/optprobes/arch-support.txt +++ b/Documentation/features/debug/optprobes/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | - | powerpc: | TODO | + | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/debug/stackprotector/arch-support.txt b/Documentation/features/debug/stackprotector/arch-support.txt index 570019572383..74b89a9c8b3a 100644 --- a/Documentation/features/debug/stackprotector/arch-support.txt +++ b/Documentation/features/debug/stackprotector/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | ok | | sparc: | TODO | diff --git a/Documentation/features/debug/uprobes/arch-support.txt b/Documentation/features/debug/uprobes/arch-support.txt index 0b8d922eb799..1a3f9d3229bf 100644 --- a/Documentation/features/debug/uprobes/arch-support.txt +++ b/Documentation/features/debug/uprobes/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | TODO | | arm: | ok | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,13 +17,15 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | ok | diff --git a/Documentation/features/debug/user-ret-profiler/arch-support.txt b/Documentation/features/debug/user-ret-profiler/arch-support.txt index 13852ae62e9e..1d78d1069a5f 100644 --- a/Documentation/features/debug/user-ret-profiler/arch-support.txt +++ b/Documentation/features/debug/user-ret-profiler/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/io/dma-api-debug/arch-support.txt b/Documentation/features/io/dma-api-debug/arch-support.txt deleted file mode 100644 index e438ed675623..000000000000 --- a/Documentation/features/io/dma-api-debug/arch-support.txt +++ /dev/null @@ -1,31 +0,0 @@ -# -# Feature name: dma-api-debug -# Kconfig: HAVE_DMA_API_DEBUG -# description: arch supports DMA debug facilities -# - ----------------------- - | arch |status| - ----------------------- - | alpha: | TODO | - | arc: | TODO | - | arm: | ok | - | arm64: | ok | - | c6x: | ok | - | h8300: | TODO | - | hexagon: | TODO | - | ia64: | ok | - | m68k: | TODO | - | microblaze: | ok | - | mips: | ok | - | nios2: | TODO | - | openrisc: | TODO | - | parisc: | TODO | - | powerpc: | ok | - | s390: | ok | - | sh: | ok | - | sparc: | ok | - | um: | TODO | - | unicore32: | TODO | - | x86: | ok | - | xtensa: | ok | - ----------------------- diff --git a/Documentation/features/io/dma-contiguous/arch-support.txt b/Documentation/features/io/dma-contiguous/arch-support.txt index 47f64a433df0..30c072d2b67c 100644 --- a/Documentation/features/io/dma-contiguous/arch-support.txt +++ b/Documentation/features/io/dma-contiguous/arch-support.txt @@ -17,11 +17,13 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | - | s390: | TODO | + | riscv: | ok | + | s390: | ok | | sh: | TODO | | sparc: | TODO | | um: | TODO | diff --git a/Documentation/features/io/sg-chain/arch-support.txt b/Documentation/features/io/sg-chain/arch-support.txt index 07f357fadbff..6554f0372c3f 100644 --- a/Documentation/features/io/sg-chain/arch-support.txt +++ b/Documentation/features/io/sg-chain/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/locking/cmpxchg-local/arch-support.txt b/Documentation/features/locking/cmpxchg-local/arch-support.txt index 482a0b09d1f8..51704a2dc8d1 100644 --- a/Documentation/features/locking/cmpxchg-local/arch-support.txt +++ b/Documentation/features/locking/cmpxchg-local/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | TODO | | arm: | TODO | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/locking/lockdep/arch-support.txt b/Documentation/features/locking/lockdep/arch-support.txt index bb35c5ba6286..bd39c5edd460 100644 --- a/Documentation/features/locking/lockdep/arch-support.txt +++ b/Documentation/features/locking/lockdep/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | ok | | mips: | ok | + | nds32: | ok | | nios2: | TODO | - | openrisc: | TODO | + | openrisc: | ok | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/locking/queued-rwlocks/arch-support.txt b/Documentation/features/locking/queued-rwlocks/arch-support.txt index 627e9a6b2db9..da7aff3bee0b 100644 --- a/Documentation/features/locking/queued-rwlocks/arch-support.txt +++ b/Documentation/features/locking/queued-rwlocks/arch-support.txt @@ -9,21 +9,23 @@ | alpha: | TODO | | arc: | TODO | | arm: | TODO | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | | ia64: | TODO | | m68k: | TODO | | microblaze: | TODO | - | mips: | TODO | + | mips: | ok | + | nds32: | TODO | | nios2: | TODO | - | openrisc: | TODO | + | openrisc: | ok | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | ok | diff --git a/Documentation/features/locking/queued-spinlocks/arch-support.txt b/Documentation/features/locking/queued-spinlocks/arch-support.txt index 9edda216cdfb..478e9101322c 100644 --- a/Documentation/features/locking/queued-spinlocks/arch-support.txt +++ b/Documentation/features/locking/queued-spinlocks/arch-support.txt @@ -16,14 +16,16 @@ | ia64: | TODO | | m68k: | TODO | | microblaze: | TODO | - | mips: | TODO | + | mips: | ok | + | nds32: | TODO | | nios2: | TODO | - | openrisc: | TODO | + | openrisc: | ok | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | ok | diff --git a/Documentation/features/locking/rwsem-optimized/arch-support.txt b/Documentation/features/locking/rwsem-optimized/arch-support.txt index 8d9afb10b16e..e54b1f1a8091 100644 --- a/Documentation/features/locking/rwsem-optimized/arch-support.txt +++ b/Documentation/features/locking/rwsem-optimized/arch-support.txt @@ -1,6 +1,6 @@ # # Feature name: rwsem-optimized -# Kconfig: Optimized asm/rwsem.h +# Kconfig: !RWSEM_GENERIC_SPINLOCK # description: arch provides optimized rwsem APIs # ----------------------- @@ -8,8 +8,8 @@ ----------------------- | alpha: | ok | | arc: | TODO | - | arm: | TODO | - | arm64: | TODO | + | arm: | ok | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,14 +17,16 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | - | um: | TODO | + | um: | ok | | unicore32: | TODO | | x86: | ok | | xtensa: | ok | diff --git a/Documentation/features/perf/kprobes-event/arch-support.txt b/Documentation/features/perf/kprobes-event/arch-support.txt index d01239ee34b3..7331402d1887 100644 --- a/Documentation/features/perf/kprobes-event/arch-support.txt +++ b/Documentation/features/perf/kprobes-event/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | TODO | | arm: | ok | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | ok | @@ -17,13 +17,15 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | ok | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | ok | diff --git a/Documentation/features/perf/perf-regs/arch-support.txt b/Documentation/features/perf/perf-regs/arch-support.txt index 458faba5311a..53feeee6cdad 100644 --- a/Documentation/features/perf/perf-regs/arch-support.txt +++ b/Documentation/features/perf/perf-regs/arch-support.txt @@ -17,11 +17,13 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | s390: | TODO | + | riscv: | TODO | + | s390: | ok | | sh: | TODO | | sparc: | TODO | | um: | TODO | diff --git a/Documentation/features/perf/perf-stackdump/arch-support.txt b/Documentation/features/perf/perf-stackdump/arch-support.txt index 545d01c69c88..16164348e0ea 100644 --- a/Documentation/features/perf/perf-stackdump/arch-support.txt +++ b/Documentation/features/perf/perf-stackdump/arch-support.txt @@ -17,11 +17,13 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | s390: | TODO | + | riscv: | TODO | + | s390: | ok | | sh: | TODO | | sparc: | TODO | | um: | TODO | diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt index 85a6c9d4571c..dbdf62907703 100644 --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt @@ -40,10 +40,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/sched/numa-balancing/arch-support.txt b/Documentation/features/sched/numa-balancing/arch-support.txt index 347508863872..c68bb2c2cb62 100644 --- a/Documentation/features/sched/numa-balancing/arch-support.txt +++ b/Documentation/features/sched/numa-balancing/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | .. | | arm: | .. | - | arm64: | .. | + | arm64: | ok | | c6x: | .. | | h8300: | .. | | hexagon: | .. | @@ -17,11 +17,13 @@ | m68k: | .. | | microblaze: | .. | | mips: | TODO | + | nds32: | TODO | | nios2: | .. | | openrisc: | .. | | parisc: | .. | | powerpc: | ok | - | s390: | .. | + | riscv: | TODO | + | s390: | ok | | sh: | .. | | sparc: | TODO | | um: | .. | diff --git a/Documentation/features/scripts/features-refresh.sh b/Documentation/features/scripts/features-refresh.sh new file mode 100755 index 000000000000..9e72d38a0720 --- /dev/null +++ b/Documentation/features/scripts/features-refresh.sh @@ -0,0 +1,98 @@ +# +# Small script that refreshes the kernel feature support status in place. +# + +for F_FILE in Documentation/features/*/*/arch-support.txt; do + F=$(grep "^# Kconfig:" "$F_FILE" | cut -c26-) + + # + # Each feature F is identified by a pair (O, K), where 'O' can + # be either the empty string (for 'nop') or "not" (the logical + # negation operator '!'); other operators are not supported. + # + O="" + K=$F + if [[ "$F" == !* ]]; then + O="not" + K=$(echo $F | sed -e 's/^!//g') + fi + + # + # F := (O, K) is 'valid' iff there is a Kconfig file (for some + # arch) which contains K. + # + # Notice that this definition entails an 'asymmetry' between + # the case 'O = ""' and the case 'O = "not"'. E.g., F may be + # _invalid_ if: + # + # [case 'O = ""'] + # 1) no arch provides support for F, + # 2) K does not exist (e.g., it was renamed/mis-typed); + # + # [case 'O = "not"'] + # 3) all archs provide support for F, + # 4) as in (2). + # + # The rationale for adopting this definition (and, thus, for + # keeping the asymmetry) is: + # + # We want to be able to 'detect' (2) (or (4)). + # + # (1) and (3) may further warn the developers about the fact + # that K can be removed. + # + F_VALID="false" + for ARCH_DIR in arch/*/; do + K_FILES=$(find $ARCH_DIR -name "Kconfig*") + K_GREP=$(grep "$K" $K_FILES) + if [ ! -z "$K_GREP" ]; then + F_VALID="true" + break + fi + done + if [ "$F_VALID" = "false" ]; then + printf "WARNING: '%s' is not a valid Kconfig\n" "$F" + fi + + T_FILE="$F_FILE.tmp" + grep "^#" $F_FILE > $T_FILE + echo " -----------------------" >> $T_FILE + echo " | arch |status|" >> $T_FILE + echo " -----------------------" >> $T_FILE + for ARCH_DIR in arch/*/; do + ARCH=$(echo $ARCH_DIR | sed -e 's/arch//g' | sed -e 's/\///g') + K_FILES=$(find $ARCH_DIR -name "Kconfig*") + K_GREP=$(grep "$K" $K_FILES) + # + # Arch support status values for (O, K) are updated according + # to the following rules. + # + # - ("", K) is 'supported by a given arch', if there is a + # Kconfig file for that arch which contains K; + # + # - ("not", K) is 'supported by a given arch', if there is + # no Kconfig file for that arch which contains K; + # + # - otherwise: preserve the previous status value (if any), + # default to 'not yet supported'. + # + # Notice that, according these rules, invalid features may be + # updated/modified. + # + if [ "$O" = "" ] && [ ! -z "$K_GREP" ]; then + printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE + elif [ "$O" = "not" ] && [ -z "$K_GREP" ]; then + printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE + else + S=$(grep -v "^#" "$F_FILE" | grep " $ARCH:") + if [ ! -z "$S" ]; then + echo "$S" >> $T_FILE + else + printf " |%12s: | TODO |\n" "$ARCH" \ + >> $T_FILE + fi + fi + done + echo " -----------------------" >> $T_FILE + mv $T_FILE $F_FILE +done diff --git a/Documentation/features/seccomp/seccomp-filter/arch-support.txt b/Documentation/features/seccomp/seccomp-filter/arch-support.txt index e4fad58a05e5..d4271b493b41 100644 --- a/Documentation/features/seccomp/seccomp-filter/arch-support.txt +++ b/Documentation/features/seccomp/seccomp-filter/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | - | parisc: | TODO | - | powerpc: | TODO | + | parisc: | ok | + | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/time/arch-tick-broadcast/arch-support.txt b/Documentation/features/time/arch-tick-broadcast/arch-support.txt index 8052904b25fc..83d9e68462bb 100644 --- a/Documentation/features/time/arch-tick-broadcast/arch-support.txt +++ b/Documentation/features/time/arch-tick-broadcast/arch-support.txt @@ -17,12 +17,14 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | - | sh: | TODO | + | sh: | ok | | sparc: | TODO | | um: | TODO | | unicore32: | TODO | diff --git a/Documentation/features/time/clockevents/arch-support.txt b/Documentation/features/time/clockevents/arch-support.txt index 7c76b946297e..3d4908fce6da 100644 --- a/Documentation/features/time/clockevents/arch-support.txt +++ b/Documentation/features/time/clockevents/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | ok | | microblaze: | ok | | mips: | ok | + | nds32: | ok | | nios2: | ok | | openrisc: | ok | - | parisc: | TODO | + | parisc: | ok | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/time/context-tracking/arch-support.txt b/Documentation/features/time/context-tracking/arch-support.txt index 9433b3e523b3..c29974afffaa 100644 --- a/Documentation/features/time/context-tracking/arch-support.txt +++ b/Documentation/features/time/context-tracking/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/time/irq-time-acct/arch-support.txt b/Documentation/features/time/irq-time-acct/arch-support.txt index 212dde0b578c..8d73c463ec27 100644 --- a/Documentation/features/time/irq-time-acct/arch-support.txt +++ b/Documentation/features/time/irq-time-acct/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | .. | - | powerpc: | .. | + | powerpc: | ok | + | riscv: | TODO | | s390: | .. | | sh: | TODO | | sparc: | .. | diff --git a/Documentation/features/time/modern-timekeeping/arch-support.txt b/Documentation/features/time/modern-timekeeping/arch-support.txt index 4074028f72f7..e7c6ea6b8fb3 100644 --- a/Documentation/features/time/modern-timekeeping/arch-support.txt +++ b/Documentation/features/time/modern-timekeeping/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | ok | | mips: | ok | + | nds32: | ok | | nios2: | ok | | openrisc: | ok | | parisc: | ok | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/time/virt-cpuacct/arch-support.txt b/Documentation/features/time/virt-cpuacct/arch-support.txt index a394d8820517..4646457461cf 100644 --- a/Documentation/features/time/virt-cpuacct/arch-support.txt +++ b/Documentation/features/time/virt-cpuacct/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | ok | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/vm/ELF-ASLR/arch-support.txt b/Documentation/features/vm/ELF-ASLR/arch-support.txt index 082f93d5b40e..1f71d090ff2c 100644 --- a/Documentation/features/vm/ELF-ASLR/arch-support.txt +++ b/Documentation/features/vm/ELF-ASLR/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | - | parisc: | TODO | + | parisc: | ok | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/PG_uncached/arch-support.txt b/Documentation/features/vm/PG_uncached/arch-support.txt index 605e0abb756d..fbd5aa463b0a 100644 --- a/Documentation/features/vm/PG_uncached/arch-support.txt +++ b/Documentation/features/vm/PG_uncached/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/THP/arch-support.txt b/Documentation/features/vm/THP/arch-support.txt index 7a8eb0bd5ca8..5d7ecc378f29 100644 --- a/Documentation/features/vm/THP/arch-support.txt +++ b/Documentation/features/vm/THP/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | .. | | microblaze: | .. | | mips: | ok | + | nds32: | TODO | | nios2: | .. | | openrisc: | .. | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | .. | | sparc: | ok | diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt index 35fb99b2b3ea..f7af9678eb66 100644 --- a/Documentation/features/vm/TLB/arch-support.txt +++ b/Documentation/features/vm/TLB/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | .. | | microblaze: | .. | | mips: | TODO | + | nds32: | TODO | | nios2: | .. | | openrisc: | .. | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/huge-vmap/arch-support.txt b/Documentation/features/vm/huge-vmap/arch-support.txt index ed8b943ad8fc..d0713ccc7117 100644 --- a/Documentation/features/vm/huge-vmap/arch-support.txt +++ b/Documentation/features/vm/huge-vmap/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/ioremap_prot/arch-support.txt b/Documentation/features/vm/ioremap_prot/arch-support.txt index 589947bdf0a8..8527601a3739 100644 --- a/Documentation/features/vm/ioremap_prot/arch-support.txt +++ b/Documentation/features/vm/ioremap_prot/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | ok | | sparc: | TODO | diff --git a/Documentation/features/vm/numa-memblock/arch-support.txt b/Documentation/features/vm/numa-memblock/arch-support.txt index 8b8bea0318a0..1a988052cd24 100644 --- a/Documentation/features/vm/numa-memblock/arch-support.txt +++ b/Documentation/features/vm/numa-memblock/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | .. | | arm: | .. | - | arm64: | .. | + | arm64: | ok | | c6x: | .. | | h8300: | .. | | hexagon: | .. | @@ -17,10 +17,12 @@ | m68k: | .. | | microblaze: | ok | | mips: | ok | + | nds32: | TODO | | nios2: | .. | | openrisc: | .. | | parisc: | .. | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/vm/pte_special/arch-support.txt b/Documentation/features/vm/pte_special/arch-support.txt index 055004f467d2..a8378424bc98 100644 --- a/Documentation/features/vm/pte_special/arch-support.txt +++ b/Documentation/features/vm/pte_special/arch-support.txt @@ -1,6 +1,6 @@ # # Feature name: pte_special -# Kconfig: __HAVE_ARCH_PTE_SPECIAL +# Kconfig: ARCH_HAS_PTE_SPECIAL # description: arch supports the pte_special()/pte_mkspecial() VM APIs # ----------------------- @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index b7bd6c9009cc..0937bade1099 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -10,8 +10,8 @@ afs.txt - info and examples for the distributed AFS (Andrew File System) fs. affs.txt - info and mount options for the Amiga Fast File System. -autofs4-mount-control.txt - - info on device control operations for autofs4 module. +autofs-mount-control.txt + - info on device control operations for autofs module. automount-support.txt - information about filesystem automount support. befs.txt @@ -89,8 +89,6 @@ locks.txt - info on file locking implementations, flock() vs. fcntl(), etc. mandatory-locking.txt - info on the Linux implementation of Sys V mandatory file locking. -ncpfs.txt - - info on Novell Netware(tm) filesystem using NCP protocol. nfs/ - nfs-related documentation. nilfs2.txt diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 75d2d57e2c44..2c391338c675 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -69,31 +69,31 @@ prototypes: locking rules: all may block - i_mutex(inode) -lookup: yes -create: yes -link: yes (both) -mknod: yes -symlink: yes -mkdir: yes -unlink: yes (both) -rmdir: yes (both) (see below) -rename: yes (all) (see below) + i_rwsem(inode) +lookup: shared +create: exclusive +link: exclusive (both) +mknod: exclusive +symlink: exclusive +mkdir: exclusive +unlink: exclusive (both) +rmdir: exclusive (both)(see below) +rename: exclusive (all) (see below) readlink: no get_link: no -setattr: yes +setattr: exclusive permission: no (may not block if called in rcu-walk mode) get_acl: no getattr: no listxattr: no fiemap: no update_time: no -atomic_open: yes +atomic_open: exclusive tmpfile: no - Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on -victim. + Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem + exclusive on victim. cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. See Documentation/filesystems/directory-locking for more detailed discussion @@ -111,10 +111,10 @@ prototypes: locking rules: all may block - i_mutex(inode) + i_rwsem(inode) list: no get: no -set: yes +set: exclusive --------------------------- super_operations --------------------------- prototypes: @@ -217,14 +217,14 @@ prototypes: locking rules: All except set_page_dirty and freepage may block - PageLocked(page) i_mutex + PageLocked(page) i_rwsem writepage: yes, unlocks (see below) readpage: yes, unlocks writepages: set_page_dirty no readpages: -write_begin: locks the page yes -write_end: yes, unlocks yes +write_begin: locks the page exclusive +write_end: yes, unlocks exclusive bmap: invalidatepage: yes releasepage: yes @@ -439,7 +439,10 @@ prototypes: ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); int (*iterate) (struct file *, struct dir_context *); - unsigned int (*poll) (struct file *, struct poll_table_struct *); + int (*iterate_shared) (struct file *, struct dir_context *); + __poll_t (*poll) (struct file *, struct poll_table_struct *); + struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t); + __poll_t (*poll_mask) (struct file *, __poll_t); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); @@ -470,7 +473,7 @@ prototypes: }; locking rules: - All may block. + All except for ->poll_mask may block. ->llseek() locking has moved from llseek to the individual llseek implementations. If your fs is not using generic_file_llseek, you @@ -480,6 +483,10 @@ mutex or just to use i_size_read() instead. Note: this does not protect the file->f_pos against concurrent modifications since this is something the userspace has to take care about. +->iterate() is called with i_rwsem exclusive. + +->iterate_shared() is called with i_rwsem at least shared. + ->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags. Most instances call fasync_helper(), which does that maintenance, so it's not normally something one needs to worry about. Return values > 0 will be @@ -498,6 +505,9 @@ in sys_read() and friends. the lease within the individual filesystem to record the result of the operation +->poll_mask can be called with or without the waitqueue lock for the waitqueue +returned from ->get_poll_head. + --------------------------- dquot_operations ------------------------------- prototypes: int (*write_dquot) (struct dquot *); diff --git a/Documentation/filesystems/autofs4-mount-control.txt b/Documentation/filesystems/autofs-mount-control.txt index e5177cb31a04..45edad6933cc 100644 --- a/Documentation/filesystems/autofs4-mount-control.txt +++ b/Documentation/filesystems/autofs-mount-control.txt @@ -1,5 +1,5 @@ -Miscellaneous Device control operations for the autofs4 kernel module +Miscellaneous Device control operations for the autofs kernel module ==================================================================== The problem @@ -164,7 +164,7 @@ possibility for future development due to the requirements of the message bus architecture. -autofs4 Miscellaneous Device mount control interface +autofs Miscellaneous Device mount control interface ==================================================== The control interface is opening a device node, typically /dev/autofs. @@ -244,7 +244,7 @@ The device node ioctl operations implemented by this interface are: AUTOFS_DEV_IOCTL_VERSION ------------------------ -Get the major and minor version of the autofs4 device ioctl kernel module +Get the major and minor version of the autofs device ioctl kernel module implementation. It requires an initialized struct autofs_dev_ioctl as an input parameter and sets the version information in the passed in structure. It returns 0 on success or the error -EINVAL if a version mismatch is @@ -254,7 +254,7 @@ detected. AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD ------------------------------------------------------------------ -Get the major and minor version of the autofs4 protocol version understood +Get the major and minor version of the autofs protocol version understood by loaded module. This call requires an initialized struct autofs_dev_ioctl with the ioctlfd field set to a valid autofs mount point descriptor and sets the requested version number in version field of struct args_protover @@ -404,4 +404,3 @@ type is also given we are looking for a particular autofs mount and if a match isn't found a fail is returned. If the the located path is the root of a mount 1 is returned along with the super magic of the mount or 0 otherwise. - diff --git a/Documentation/filesystems/autofs4.txt b/Documentation/filesystems/autofs.txt index f10dd590f69f..373ad25852d3 100644 --- a/Documentation/filesystems/autofs4.txt +++ b/Documentation/filesystems/autofs.txt @@ -30,15 +30,15 @@ key advantages: Context ------- -The "autofs4" filesystem module is only one part of an autofs system. +The "autofs" filesystem module is only one part of an autofs system. There also needs to be a user-space program which looks up names and mounts filesystems. This will often be the "automount" program, -though other tools including "systemd" can make use of "autofs4". +though other tools including "systemd" can make use of "autofs". This document describes only the kernel module and the interactions required with any user-space program. Subsequent text refers to this as the "automount daemon" or simply "the daemon". -"autofs4" is a Linux kernel module with provides the "autofs" +"autofs" is a Linux kernel module with provides the "autofs" filesystem type. Several "autofs" filesystems can be mounted and they can each be managed separately, or all managed by the same daemon. @@ -215,7 +215,7 @@ of expiry. The VFS also supports "expiry" of mounts using the MNT_EXPIRE flag to the `umount` system call. Unmounting with MNT_EXPIRE will fail unless a previous attempt had been made, and the filesystem has been inactive -and untouched since that previous attempt. autofs4 does not depend on +and untouched since that previous attempt. autofs does not depend on this but has its own internal tracking of whether filesystems were recently used. This allows individual names in the autofs directory to expire separately. @@ -415,7 +415,7 @@ which can be used to communicate directly with the autofs filesystem. It requires CAP_SYS_ADMIN for access. The `ioctl`s that can be used on this device are described in a separate -document `autofs4-mount-control.txt`, and are summarized briefly here. +document `autofs-mount-control.txt`, and are summarized briefly here. Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure: struct autofs_dev_ioctl { diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt index 7eb762eb3136..b0afd3d55eaf 100644 --- a/Documentation/filesystems/automount-support.txt +++ b/Documentation/filesystems/automount-support.txt @@ -9,7 +9,7 @@ also be requested by userspace. IN-KERNEL AUTOMOUNTING ====================== -See section "Mount Traps" of Documentation/filesystems/autofs4.txt +See section "Mount Traps" of Documentation/filesystems/autofs.txt Then from userspace, you can just do something like: diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index cfbc18f0d9c9..48b424de85bb 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -191,11 +191,21 @@ Currently, the following pairs of encryption modes are supported: - AES-256-XTS for contents and AES-256-CTS-CBC for filenames - AES-128-CBC for contents and AES-128-CTS-CBC for filenames +- Speck128/256-XTS for contents and Speck128/256-CTS-CBC for filenames It is strongly recommended to use AES-256-XTS for contents encryption. AES-128-CBC was added only for low-powered embedded devices with crypto accelerators such as CAAM or CESA that do not support XTS. +Similarly, Speck128/256 support was only added for older or low-end +CPUs which cannot do AES fast enough -- especially ARM CPUs which have +NEON instructions but not the Cryptography Extensions -- and for which +it would not otherwise be feasible to use encryption at all. It is +not recommended to use Speck on CPUs that have AES instructions. +Speck support is only available if it has been enabled in the crypto +API via CONFIG_CRYPTO_SPECK. Also, on ARM platforms, to get +acceptable performance CONFIG_CRYPTO_SPECK_NEON must be enabled. + New encryption modes can be added relatively easily, without changes to individual filesystems. However, authenticated encryption (AE) modes are not currently supported because of the difficulty of dealing diff --git a/Documentation/filesystems/fuse-io.txt b/Documentation/filesystems/fuse-io.txt new file mode 100644 index 000000000000..07b8f73f100f --- /dev/null +++ b/Documentation/filesystems/fuse-io.txt @@ -0,0 +1,38 @@ +Fuse supports the following I/O modes: + +- direct-io +- cached + + write-through + + writeback-cache + +The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the +FUSE_OPEN reply. + +In direct-io mode the page cache is completely bypassed for reads and writes. +No read-ahead takes place. Shared mmap is disabled. + +In cached mode reads may be satisfied from the page cache, and data may be +read-ahead by the kernel to fill the cache. The cache is always kept consistent +after any writes to the file. All mmap modes are supported. + +The cached mode has two sub modes controlling how writes are handled. The +write-through mode is the default and is supported on all kernels. The +writeback-cache mode may be selected by the FUSE_WRITEBACK_CACHE flag in the +FUSE_INIT reply. + +In write-through mode each write is immediately sent to userspace as one or more +WRITE requests, as well as updating any cached pages (and caching previously +uncached, but fully written pages). No READ requests are ever sent for writes, +so when an uncached page is partially written, the page is discarded. + +In writeback-cache mode (enabled by the FUSE_WRITEBACK_CACHE flag) writes go to +the cache only, which means that the write(2) syscall can often complete very +fast. Dirty pages are written back implicitly (background writeback or page +reclaim on memory pressure) or explicitly (invoked by close(2), fsync(2) and +when the last ref to the file is being released on munmap(2)). This mode +assumes that all changes to the filesystem go through the FUSE kernel module +(size and atime/ctime/mtime attributes are kept up-to-date by the kernel), so +it's generally not suitable for network filesystems. If a partial page is +written, then the page needs to be first read from userspace. This means, that +even for files opened for O_WRONLY it is possible that READ requests will be +generated by the kernel. diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt deleted file mode 100644 index 5af164f4b37b..000000000000 --- a/Documentation/filesystems/ncpfs.txt +++ /dev/null @@ -1,12 +0,0 @@ -The ncpfs filesystem understands the NCP protocol, designed by the -Novell Corporation for their NetWare(tm) product. NCP is functionally -similar to the NFS used in the TCP/IP community. -To mount a NetWare filesystem, you need a special mount program, which -can be found in the ncpfs package. The home site for ncpfs is -ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors -will have it as well. - -Related products are linware and mars_nwe, which will give Linux partial -NetWare server functionality. - -mars_nwe can be found on ftp.gwdg.de/pub/linux/misc/ncpfs. diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index 5efae00f6c7f..d2963123eb1c 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -5,6 +5,7 @@ Written 1996 by Gero Kuhlmann <gero@gkminix.han.de> Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz> Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org> Updated 2006 by Horms <horms@verge.net.au> +Updated 2018 by Chris Novakovic <chris@chrisn.me.uk> @@ -79,7 +80,7 @@ nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: - <dns0-ip>:<dns1-ip> + <dns0-ip>:<dns1-ip>:<ntp0-ip> This parameter tells the kernel how to configure IP addresses of devices and also how to set up the IP routing table. It was originally called @@ -110,6 +111,9 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: will not be triggered if it is missing and NFS root is not in operation. + Value is exported to /proc/net/pnp with the prefix "bootserver " + (see below). + Default: Determined using autoconfiguration. The address of the autoconfiguration server is used. @@ -123,10 +127,13 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: Default: Determined using autoconfiguration. - <hostname> Name of the client. May be supplied by autoconfiguration, - but its absence will not trigger autoconfiguration. - If specified and DHCP is used, the user provided hostname will - be carried in the DHCP request to hopefully update DNS record. + <hostname> Name of the client. If a '.' character is present, anything + before the first '.' is used as the client's hostname, and anything + after it is used as its NIS domain name. May be supplied by + autoconfiguration, but its absence will not trigger autoconfiguration. + If specified and DHCP is used, the user-provided hostname (and NIS + domain name, if present) will be carried in the DHCP request; this + may cause a DNS record to be created or updated for the client. Default: Client IP address is used in ASCII notation. @@ -162,12 +169,55 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: Default: any - <dns0-ip> IP address of first nameserver. - Value gets exported by /proc/net/pnp which is often linked - on embedded systems by /etc/resolv.conf. + <dns0-ip> IP address of primary nameserver. + Value is exported to /proc/net/pnp with the prefix "nameserver " + (see below). + + Default: None if not using autoconfiguration; determined + automatically if using autoconfiguration. + + <dns1-ip> IP address of secondary nameserver. + See <dns0-ip>. + + <ntp0-ip> IP address of a Network Time Protocol (NTP) server. + Value is exported to /proc/net/ipconfig/ntp_servers, but is + otherwise unused (see below). + + Default: None if not using autoconfiguration; determined + automatically if using autoconfiguration. + + After configuration (whether manual or automatic) is complete, two files + are created in the following format; lines are omitted if their respective + value is empty following configuration: + + - /proc/net/pnp: + + #PROTO: <DHCP|BOOTP|RARP|MANUAL> (depending on configuration method) + domain <dns-domain> (if autoconfigured, the DNS domain) + nameserver <dns0-ip> (primary name server IP) + nameserver <dns1-ip> (secondary name server IP) + nameserver <dns2-ip> (tertiary name server IP) + bootserver <server-ip> (NFS server IP) + + - /proc/net/ipconfig/ntp_servers: + + <ntp0-ip> (NTP server IP) + <ntp1-ip> (NTP server IP) + <ntp2-ip> (NTP server IP) + + <dns-domain> and <dns2-ip> (in /proc/net/pnp) and <ntp1-ip> and <ntp2-ip> + (in /proc/net/ipconfig/ntp_servers) are requested during autoconfiguration; + they cannot be specified as part of the "ip=" kernel command line parameter. + + Because the "domain" and "nameserver" options are recognised by DNS + resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems + that use an NFS root filesystem. - <dns1-ip> IP address of second nameserver. - Same as above. + Note that the kernel will not synchronise the system time with any NTP + servers it discovers; this is the responsibility of a user space process + (e.g. an initrd/initramfs script that passes the IP addresses listed in + /proc/net/ipconfig/ntp_servers to an NTP client before mounting the real + root filesystem if it is on NFS). nfsrootdebug diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt index 961b287ef323..72615a2c0752 100644 --- a/Documentation/filesystems/overlayfs.txt +++ b/Documentation/filesystems/overlayfs.txt @@ -429,11 +429,12 @@ This verification may cause significant overhead in some cases. Testsuite --------- -There's testsuite developed by David Howells at: +There's a testsuite originally developed by David Howells and currently +maintained by Amir Goldstein at: - git://git.infradead.org/users/dhowells/unionmount-testsuite.git + https://github.com/amir73il/unionmount-testsuite.git Run as root: # cd unionmount-testsuite - # ./run --ov + # ./run --ov --verify diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.md index 1933ef734e63..e2edd45c4bc0 100644 --- a/Documentation/filesystems/path-lookup.md +++ b/Documentation/filesystems/path-lookup.md @@ -460,7 +460,7 @@ this retry process in the next article. Automount points are locations in the filesystem where an attempt to lookup a name can trigger changes to how that lookup should be handled, in particular by mounting a filesystem there. These are -covered in greater detail in autofs4.txt in the Linux documentation +covered in greater detail in autofs.txt in the Linux documentation tree, but a few notes specifically related to path lookup are in order here. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2a84bb334894..520f6a84cf50 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -515,7 +515,8 @@ guarantees: The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG bits on both physical and virtual pages associated with a process, and the -soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). +soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst +for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -536,7 +537,8 @@ Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using -/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. +/proc/kpagecount. For detailed explanation, see +Documentation/admin-guide/mm/pagemap.rst. The /proc/pid/numa_maps is an extension based on maps, showing the memory locality and binding policy, as well as the memory usage (in pages) of @@ -564,7 +566,7 @@ address policy mapping details Where: "address" is the starting address for the mapping; -"policy" reports the NUMA memory policy set for the mapping (see vm/numa_memory_policy.txt); +"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst); "mapping details" summarizes mapping data such as mapping type, page usage counters, node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page size, in KB, that is backing the mapping up. diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index a85355cf85f4..d06e9a59a9f4 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -105,8 +105,9 @@ policy for the file will revert to "default" policy. NUMA memory allocation policies have optional flags that can be used in conjunction with their modes. These optional flags can be specified when tmpfs is mounted by appending them to the mode before the NodeList. -See Documentation/vm/numa_memory_policy.txt for a list of all available -memory allocation policy mode flags and their effect on memory policy. +See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of +all available memory allocation policy mode flags and their effect on +memory policy. =static is equivalent to MPOL_F_STATIC_NODES =relative is equivalent to MPOL_F_RELATIVE_NODES diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 5fd325df59e2..829a7b7857a4 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -856,7 +856,9 @@ struct file_operations { ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); int (*iterate) (struct file *, struct dir_context *); - unsigned int (*poll) (struct file *, struct poll_table_struct *); + __poll_t (*poll) (struct file *, struct poll_table_struct *); + struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t); + __poll_t (*poll_mask) (struct file *, __poll_t); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); @@ -901,6 +903,17 @@ otherwise noted. activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls + get_poll_head: Returns the struct wait_queue_head that callers can + wait on. Callers need to check the returned events using ->poll_mask + once woken. Can return NULL to indicate polling is not supported, + or any error code using the ERR_PTR convention to indicate that a + grave error occured and ->poll_mask shall not be called. + + poll_mask: return the mask of EPOLL* values describing the file descriptor + state. Called either before going to sleep on the waitqueue returned by + get_poll_head, or after it has been woken. If ->get_poll_head and + ->poll_mask are implemented ->poll does not need to be implement. + unlocked_ioctl: called by the ioctl(2) system call. compat_ioctl: called by the ioctl(2) system call when 32 bit system calls diff --git a/Documentation/fpga/fpga-mgr.txt b/Documentation/fpga/fpga-mgr.txt deleted file mode 100644 index cc6413ed6fc9..000000000000 --- a/Documentation/fpga/fpga-mgr.txt +++ /dev/null @@ -1,199 +0,0 @@ -FPGA Manager Core - -Alan Tull 2015 - -Overview -======== - -The FPGA manager core exports a set of functions for programming an FPGA with -an image. The API is manufacturer agnostic. All manufacturer specifics are -hidden away in a low level driver which registers a set of ops with the core. -The FPGA image data itself is very manufacturer specific, but for our purposes -it's just binary data. The FPGA manager core won't parse it. - -The FPGA image to be programmed can be in a scatter gather list, a single -contiguous buffer, or a firmware file. Because allocating contiguous kernel -memory for the buffer should be avoided, users are encouraged to use a scatter -gather list instead if possible. - -The particulars for programming the image are presented in a structure (struct -fpga_image_info). This struct contains parameters such as pointers to the -FPGA image as well as image-specific particulars such as whether the image was -built for full or partial reconfiguration. - -API Functions: -============== - -To program the FPGA: --------------------- - - int fpga_mgr_load(struct fpga_manager *mgr, - struct fpga_image_info *info); - -Load the FPGA from an image which is indicated in the info. If successful, -the FPGA ends up in operating mode. Return 0 on success or a negative error -code. - -To allocate or free a struct fpga_image_info: ---------------------------------------------- - - struct fpga_image_info *fpga_image_info_alloc(struct device *dev); - - void fpga_image_info_free(struct fpga_image_info *info); - -To get/put a reference to a FPGA manager: ------------------------------------------ - - struct fpga_manager *of_fpga_mgr_get(struct device_node *node); - struct fpga_manager *fpga_mgr_get(struct device *dev); - void fpga_mgr_put(struct fpga_manager *mgr); - -Given a DT node or device, get a reference to a FPGA manager. This pointer -can be saved until you are ready to program the FPGA. fpga_mgr_put releases -the reference. - - -To get exclusive control of a FPGA manager: -------------------------------------------- - - int fpga_mgr_lock(struct fpga_manager *mgr); - void fpga_mgr_unlock(struct fpga_manager *mgr); - -The user should call fpga_mgr_lock and verify that it returns 0 before -attempting to program the FPGA. Likewise, the user should call -fpga_mgr_unlock when done programming the FPGA. - - -To register or unregister the low level FPGA-specific driver: -------------------------------------------------------------- - - int fpga_mgr_register(struct device *dev, const char *name, - const struct fpga_manager_ops *mops, - void *priv); - - void fpga_mgr_unregister(struct device *dev); - -Use of these two functions is described below in "How To Support a new FPGA -device." - - -How to write an image buffer to a supported FPGA -================================================ -#include <linux/fpga/fpga-mgr.h> - -struct fpga_manager *mgr; -struct fpga_image_info *info; -int ret; - -/* - * Get a reference to FPGA manager. The manager is not locked, so you can - * hold onto this reference without it preventing programming. - * - * This example uses the device node of the manager. Alternatively, use - * fpga_mgr_get(dev) instead if you have the device. - */ -mgr = of_fpga_mgr_get(mgr_node); - -/* struct with information about the FPGA image to program. */ -info = fpga_image_info_alloc(dev); - -/* flags indicates whether to do full or partial reconfiguration */ -info->flags = FPGA_MGR_PARTIAL_RECONFIG; - -/* - * At this point, indicate where the image is. This is pseudo-code; you're - * going to use one of these three. - */ -if (image is in a scatter gather table) { - - info->sgt = [your scatter gather table] - -} else if (image is in a buffer) { - - info->buf = [your image buffer] - info->count = [image buffer size] - -} else if (image is in a firmware file) { - - info->firmware_name = devm_kstrdup(dev, firmware_name, GFP_KERNEL); - -} - -/* Get exclusive control of FPGA manager */ -ret = fpga_mgr_lock(mgr); - -/* Load the buffer to the FPGA */ -ret = fpga_mgr_buf_load(mgr, &info, buf, count); - -/* Release the FPGA manager */ -fpga_mgr_unlock(mgr); -fpga_mgr_put(mgr); - -/* Deallocate the image info if you're done with it */ -fpga_image_info_free(info); - -How to support a new FPGA device -================================ -To add another FPGA manager, write a driver that implements a set of ops. The -probe function calls fpga_mgr_register(), such as: - -static const struct fpga_manager_ops socfpga_fpga_ops = { - .write_init = socfpga_fpga_ops_configure_init, - .write = socfpga_fpga_ops_configure_write, - .write_complete = socfpga_fpga_ops_configure_complete, - .state = socfpga_fpga_ops_state, -}; - -static int socfpga_fpga_probe(struct platform_device *pdev) -{ - struct device *dev = &pdev->dev; - struct socfpga_fpga_priv *priv; - int ret; - - priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL); - if (!priv) - return -ENOMEM; - - /* ... do ioremaps, get interrupts, etc. and save - them in priv... */ - - return fpga_mgr_register(dev, "Altera SOCFPGA FPGA Manager", - &socfpga_fpga_ops, priv); -} - -static int socfpga_fpga_remove(struct platform_device *pdev) -{ - fpga_mgr_unregister(&pdev->dev); - - return 0; -} - - -The ops will implement whatever device specific register writes are needed to -do the programming sequence for this particular FPGA. These ops return 0 for -success or negative error codes otherwise. - -The programming sequence is: - 1. .write_init - 2. .write or .write_sg (may be called once or multiple times) - 3. .write_complete - -The .write_init function will prepare the FPGA to receive the image data. The -buffer passed into .write_init will be atmost .initial_header_size bytes long, -if the whole bitstream is not immediately available then the core code will -buffer up at least this much before starting. - -The .write function writes a buffer to the FPGA. The buffer may be contain the -whole FPGA image or may be a smaller chunk of an FPGA image. In the latter -case, this function is called multiple times for successive chunks. This interface -is suitable for drivers which use PIO. - -The .write_sg version behaves the same as .write except the input is a sg_table -scatter list. This interface is suitable for drivers which use DMA. - -The .write_complete function is called after all the image has been written -to put the FPGA into operating mode. - -The ops include a .state function which will read the hardware FPGA manager and -return a code of type enum fpga_mgr_states. It doesn't result in a change in -hardware state. diff --git a/Documentation/fpga/fpga-region.txt b/Documentation/fpga/fpga-region.txt deleted file mode 100644 index 139a02ba1ff6..000000000000 --- a/Documentation/fpga/fpga-region.txt +++ /dev/null @@ -1,95 +0,0 @@ -FPGA Regions - -Alan Tull 2017 - -CONTENTS - - Introduction - - The FPGA region API - - Usage example - -Introduction -============ - -This document is meant to be an brief overview of the FPGA region API usage. A -more conceptual look at regions can be found in [1]. - -For the purposes of this API document, let's just say that a region associates -an FPGA Manager and a bridge (or bridges) with a reprogrammable region of an -FPGA or the whole FPGA. The API provides a way to register a region and to -program a region. - -Currently the only layer above fpga-region.c in the kernel is the Device Tree -support (of-fpga-region.c) described in [1]. The DT support layer uses regions -to program the FPGA and then DT to handle enumeration. The common region code -is intended to be used by other schemes that have other ways of accomplishing -enumeration after programming. - -An fpga-region can be set up to know the following things: -* which FPGA manager to use to do the programming -* which bridges to disable before programming and enable afterwards. - -Additional info needed to program the FPGA image is passed in the struct -fpga_image_info [2] including: -* pointers to the image as either a scatter-gather buffer, a contiguous - buffer, or the name of firmware file -* flags indicating specifics such as whether the image if for partial - reconfiguration. - -=================== -The FPGA region API -=================== - -To register or unregister a region: ------------------------------------ - - int fpga_region_register(struct device *dev, - struct fpga_region *region); - int fpga_region_unregister(struct fpga_region *region); - -An example of usage can be seen in the probe function of [3] - -To program an FPGA: -------------------- - int fpga_region_program_fpga(struct fpga_region *region); - -This function operates on info passed in the fpga_image_info -(region->info). - -This function will attempt to: - * lock the region's mutex - * lock the region's FPGA manager - * build a list of FPGA bridges if a method has been specified to do so - * disable the bridges - * program the FPGA - * re-enable the bridges - * release the locks - -============= -Usage example -============= - -First, allocate the info struct: - - info = fpga_image_info_alloc(dev); - if (!info) - return -ENOMEM; - -Set flags as needed, i.e. - - info->flags |= FPGA_MGR_PARTIAL_RECONFIG; - -Point to your FPGA image, such as: - - info->sgt = &sgt; - -Add info to region and do the programming: - - region->info = info; - ret = fpga_region_program_fpga(region); - -Then enumerate whatever hardware has appeared in the FPGA. - --- -[1] ../devicetree/bindings/fpga/fpga-region.txt -[2] ./fpga-mgr.txt -[3] ../../drivers/fpga/of-fpga-region.c diff --git a/Documentation/fpga/overview.txt b/Documentation/fpga/overview.txt deleted file mode 100644 index 0f1236e7e675..000000000000 --- a/Documentation/fpga/overview.txt +++ /dev/null @@ -1,23 +0,0 @@ -Linux kernel FPGA support - -Alan Tull 2017 - -The main point of this project has been to separate the out the upper layers -that know when to reprogram a FPGA from the lower layers that know how to -reprogram a specific FPGA device. The intention is to make this manufacturer -agnostic, understanding that of course the FPGA images are very device specific -themselves. - -The framework in the kernel includes: -* low level FPGA manager drivers that know how to program a specific device -* the fpga-mgr framework they are registered with -* low level FPGA bridge drivers for hard/soft bridges which are intended to - be disable during FPGA programming -* the fpga-bridge framework they are registered with -* the fpga-region framework which associates and controls managers and bridges - as reconfigurable regions -* the of-fpga-region support for reprogramming FPGAs when device tree overlays - are applied. - -I would encourage you the user to add code that creates FPGA regions rather -that trying to control managers and bridges separately. diff --git a/Documentation/gpu/drivers.rst b/Documentation/gpu/drivers.rst index e8c84419a2a1..f982558fc25d 100644 --- a/Documentation/gpu/drivers.rst +++ b/Documentation/gpu/drivers.rst @@ -10,8 +10,10 @@ GPU Driver Documentation tegra tinydrm tve200 + v3d vc4 bridge/dw-hdmi + xen-front .. only:: subproject and html diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst index 41dc881b00dc..055df45596c1 100644 --- a/Documentation/gpu/i915.rst +++ b/Documentation/gpu/i915.rst @@ -58,6 +58,12 @@ Intel GVT-g Host Support(vGPU device model) .. kernel-doc:: drivers/gpu/drm/i915/intel_gvt.c :internal: +Workarounds +----------- + +.. kernel-doc:: drivers/gpu/drm/i915/intel_workarounds.c + :doc: Hardware workarounds + Display Hardware Handling ========================= @@ -249,6 +255,103 @@ Memory Management and Command Submission This sections covers all things related to the GEM implementation in the i915 driver. +Intel GPU Basics +---------------- + +An Intel GPU has multiple engines. There are several engine types. + +- RCS engine is for rendering 3D and performing compute, this is named + `I915_EXEC_RENDER` in user space. +- BCS is a blitting (copy) engine, this is named `I915_EXEC_BLT` in user + space. +- VCS is a video encode and decode engine, this is named `I915_EXEC_BSD` + in user space +- VECS is video enhancement engine, this is named `I915_EXEC_VEBOX` in user + space. +- The enumeration `I915_EXEC_DEFAULT` does not refer to specific engine; + instead it is to be used by user space to specify a default rendering + engine (for 3D) that may or may not be the same as RCS. + +The Intel GPU family is a family of integrated GPU's using Unified +Memory Access. For having the GPU "do work", user space will feed the +GPU batch buffers via one of the ioctls `DRM_IOCTL_I915_GEM_EXECBUFFER2` +or `DRM_IOCTL_I915_GEM_EXECBUFFER2_WR`. Most such batchbuffers will +instruct the GPU to perform work (for example rendering) and that work +needs memory from which to read and memory to which to write. All memory +is encapsulated within GEM buffer objects (usually created with the ioctl +`DRM_IOCTL_I915_GEM_CREATE`). An ioctl providing a batchbuffer for the GPU +to create will also list all GEM buffer objects that the batchbuffer reads +and/or writes. For implementation details of memory management see +`GEM BO Management Implementation Details`_. + +The i915 driver allows user space to create a context via the ioctl +`DRM_IOCTL_I915_GEM_CONTEXT_CREATE` which is identified by a 32-bit +integer. Such a context should be viewed by user-space as -loosely- +analogous to the idea of a CPU process of an operating system. The i915 +driver guarantees that commands issued to a fixed context are to be +executed so that writes of a previously issued command are seen by +reads of following commands. Actions issued between different contexts +(even if from the same file descriptor) are NOT given that guarantee +and the only way to synchronize across contexts (even from the same +file descriptor) is through the use of fences. At least as far back as +Gen4, also have that a context carries with it a GPU HW context; +the HW context is essentially (most of atleast) the state of a GPU. +In addition to the ordering guarantees, the kernel will restore GPU +state via HW context when commands are issued to a context, this saves +user space the need to restore (most of atleast) the GPU state at the +start of each batchbuffer. The non-deprecated ioctls to submit batchbuffer +work can pass that ID (in the lower bits of drm_i915_gem_execbuffer2::rsvd1) +to identify what context to use with the command. + +The GPU has its own memory management and address space. The kernel +driver maintains the memory translation table for the GPU. For older +GPUs (i.e. those before Gen8), there is a single global such translation +table, a global Graphics Translation Table (GTT). For newer generation +GPUs each context has its own translation table, called Per-Process +Graphics Translation Table (PPGTT). Of important note, is that although +PPGTT is named per-process it is actually per context. When user space +submits a batchbuffer, the kernel walks the list of GEM buffer objects +used by the batchbuffer and guarantees that not only is the memory of +each such GEM buffer object resident but it is also present in the +(PP)GTT. If the GEM buffer object is not yet placed in the (PP)GTT, +then it is given an address. Two consequences of this are: the kernel +needs to edit the batchbuffer submitted to write the correct value of +the GPU address when a GEM BO is assigned a GPU address and the kernel +might evict a different GEM BO from the (PP)GTT to make address room +for another GEM BO. Consequently, the ioctls submitting a batchbuffer +for execution also include a list of all locations within buffers that +refer to GPU-addresses so that the kernel can edit the buffer correctly. +This process is dubbed relocation. + +GEM BO Management Implementation Details +---------------------------------------- + +.. kernel-doc:: drivers/gpu/drm/i915/i915_vma.h + :doc: Virtual Memory Address + +Buffer Object Eviction +---------------------- + +This section documents the interface functions for evicting buffer +objects to make space available in the virtual gpu address spaces. Note +that this is mostly orthogonal to shrinking buffer objects caches, which +has the goal to make main memory (shared with the gpu through the +unified memory architecture) available. + +.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_evict.c + :internal: + +Buffer Object Memory Shrinking +------------------------------ + +This section documents the interface function for shrinking memory usage +of buffer object caches. Shrinking is used to make main memory +available. Note that this is mostly orthogonal to evicting buffer +objects, which has the goal to make space in gpu virtual address spaces. + +.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_shrinker.c + :internal: + Batchbuffer Parsing ------------------- @@ -267,6 +370,12 @@ Batchbuffer Pools .. kernel-doc:: drivers/gpu/drm/i915/i915_gem_batch_pool.c :internal: +User Batchbuffer Execution +-------------------------- + +.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_execbuffer.c + :doc: User command execution + Logical Rings, Logical Ring Contexts and Execlists -------------------------------------------------- @@ -312,28 +421,14 @@ Object Tiling IOCTLs .. kernel-doc:: drivers/gpu/drm/i915/i915_gem_tiling.c :doc: buffer object tiling -Buffer Object Eviction ----------------------- - -This section documents the interface functions for evicting buffer -objects to make space available in the virtual gpu address spaces. Note -that this is mostly orthogonal to shrinking buffer objects caches, which -has the goal to make main memory (shared with the gpu through the -unified memory architecture) available. - -.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_evict.c - :internal: - -Buffer Object Memory Shrinking ------------------------------- +WOPCM +===== -This section documents the interface function for shrinking memory usage -of buffer object caches. Shrinking is used to make main memory -available. Note that this is mostly orthogonal to evicting buffer -objects, which has the goal to make space in gpu virtual address spaces. +WOPCM Layout +------------ -.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_shrinker.c - :internal: +.. kernel-doc:: drivers/gpu/drm/i915/intel_wopcm.c + :doc: WOPCM Layout GuC === @@ -359,6 +454,12 @@ GuC Firmware Layout .. kernel-doc:: drivers/gpu/drm/i915/intel_guc_fwif.h :doc: GuC Firmware Layout +GuC Address Space +----------------- + +.. kernel-doc:: drivers/gpu/drm/i915/intel_guc.c + :doc: GuC Address Space + Tracing ======= diff --git a/Documentation/gpu/kms-properties.csv b/Documentation/gpu/kms-properties.csv index 6b28b014cb7d..07ed22ea3bd6 100644 --- a/Documentation/gpu/kms-properties.csv +++ b/Documentation/gpu/kms-properties.csv @@ -98,5 +98,4 @@ radeon,DVI-I,“coherent”,RANGE,"Min=0, Max=1",Connector,TBD ,,"""underscan vborder""",RANGE,"Min=0, Max=128",Connector,TBD ,Audio,“audio”,ENUM,"{ ""off"", ""on"", ""auto"" }",Connector,TBD ,FMT Dithering,“dither”,ENUM,"{ ""off"", ""on"" }",Connector,TBD -rcar-du,Generic,"""alpha""",RANGE,"Min=0, Max=255",Plane,TBD ,,"""colorkey""",RANGE,"Min=0, Max=0x01ffffff",Plane,TBD diff --git a/Documentation/gpu/todo.rst b/Documentation/gpu/todo.rst index f4d0b3476d9c..a7c150d6b63f 100644 --- a/Documentation/gpu/todo.rst +++ b/Documentation/gpu/todo.rst @@ -212,6 +212,24 @@ probably use drm_fb_helper_fbdev_teardown(). Contact: Maintainer of the driver you plan to convert +Clean up mmap forwarding +------------------------ + +A lot of drivers forward gem mmap calls to dma-buf mmap for imported buffers. +And also a lot of them forward dma-buf mmap to the gem mmap implementations. +Would be great to refactor this all into a set of small common helpers. + +Contact: Daniel Vetter + +Put a reservation_object into drm_gem_object +-------------------------------------------- + +This would remove the need for the ->gem_prime_res_obj callback. It would also +allow us to implement generic helpers for waiting for a bo, allowing for quite a +bit of refactoring in the various wait ioctl implementations. + +Contact: Daniel Vetter + idr_init_base() --------------- diff --git a/Documentation/gpu/xen-front.rst b/Documentation/gpu/xen-front.rst new file mode 100644 index 000000000000..d988da7d1983 --- /dev/null +++ b/Documentation/gpu/xen-front.rst @@ -0,0 +1,31 @@ +==================================================== + drm/xen-front Xen para-virtualized frontend driver +==================================================== + +This frontend driver implements Xen para-virtualized display +according to the display protocol described at +include/xen/interface/io/displif.h + +Driver modes of operation in terms of display buffers used +========================================================== + +.. kernel-doc:: drivers/gpu/drm/xen/xen_drm_front.h + :doc: Driver modes of operation in terms of display buffers used + +Buffers allocated by the frontend driver +---------------------------------------- + +.. kernel-doc:: drivers/gpu/drm/xen/xen_drm_front.h + :doc: Buffers allocated by the frontend driver + +Buffers allocated by the backend +-------------------------------- + +.. kernel-doc:: drivers/gpu/drm/xen/xen_drm_front.h + :doc: Buffers allocated by the backend + +Driver limitations +================== + +.. kernel-doc:: drivers/gpu/drm/xen/xen_drm_front.h + :doc: Driver limitations diff --git a/Documentation/hwmon/hwmon-kernel-api.txt b/Documentation/hwmon/hwmon-kernel-api.txt index 53a806696c64..eb7a78aebb38 100644 --- a/Documentation/hwmon/hwmon-kernel-api.txt +++ b/Documentation/hwmon/hwmon-kernel-api.txt @@ -71,7 +71,8 @@ hwmon_device_register_with_info is the most comprehensive and preferred means to register a hardware monitoring device. It creates the standard sysfs attributes in the hardware monitoring core, letting the driver focus on reading from and writing to the chip instead of having to bother with sysfs attributes. -Its parameters are described in more detail below. +The parent device parameter cannot be NULL with non-NULL chip info. Its +parameters are described in more detail below. devm_hwmon_device_register_with_info is similar to hwmon_device_register_with_info. However, it is device managed, meaning the diff --git a/Documentation/hwmon/ltc2990 b/Documentation/hwmon/ltc2990 index c25211e90bdc..3ed68f676c0f 100644 --- a/Documentation/hwmon/ltc2990 +++ b/Documentation/hwmon/ltc2990 @@ -8,6 +8,7 @@ Supported chips: Datasheet: http://www.linear.com/product/ltc2990 Author: Mike Looijmans <mike.looijmans@topic.nl> + Tom Levens <tom.levens@cern.ch> Description @@ -16,10 +17,8 @@ Description LTC2990 is a Quad I2C Voltage, Current and Temperature Monitor. The chip's inputs can measure 4 voltages, or two inputs together (1+2 and 3+4) can be combined to measure a differential voltage, which is typically used to -measure current through a series resistor, or a temperature. - -This driver currently uses the 2x differential mode only. In order to support -other modes, the driver will need to be expanded. +measure current through a series resistor, or a temperature with an external +diode. Usage Notes @@ -32,12 +31,19 @@ devices explicitly. Sysfs attributes ---------------- +in0_input Voltage at Vcc pin in millivolt (range 2.5V to 5V) +temp1_input Internal chip temperature in millidegrees Celcius + +A subset of the following attributes are visible, depending on the measurement +mode of the chip. + +in[1-4]_input Voltage at V[1-4] pin in millivolt +temp2_input External temperature sensor TR1 in millidegrees Celcius +temp3_input External temperature sensor TR2 in millidegrees Celcius +curr1_input Current in mA across V1-V2 assuming a 1mOhm sense resistor +curr2_input Current in mA across V3-V4 assuming a 1mOhm sense resistor + The "curr*_input" measurements actually report the voltage drop across the input pins in microvolts. This is equivalent to the current through a 1mOhm sense resistor. Divide the reported value by the actual sense resistor value in mOhm to get the actual value. - -in0_input Voltage at Vcc pin in millivolt (range 2.5V to 5V) -temp1_input Internal chip temperature in millidegrees Celcius -curr1_input Current in mA across v1-v2 assuming a 1mOhm sense resistor. -curr2_input Current in mA across v3-v4 assuming a 1mOhm sense resistor. diff --git a/Documentation/i2c/busses/i2c-ocores b/Documentation/i2c/busses/i2c-ocores index c269aaa2f26a..9e1dfe7553ad 100644 --- a/Documentation/i2c/busses/i2c-ocores +++ b/Documentation/i2c/busses/i2c-ocores @@ -2,7 +2,7 @@ Kernel driver i2c-ocores Supported adapters: * OpenCores.org I2C controller by Richard Herveille (see datasheet link) - Datasheet: http://www.opencores.org/projects.cgi/web/i2c/overview + https://opencores.org/project/i2c/overview Author: Peter Korsgaard <jacmet@sunsite.dk> diff --git a/Documentation/i2c/dev-interface b/Documentation/i2c/dev-interface index d04e6e4964ee..fbed645ccd75 100644 --- a/Documentation/i2c/dev-interface +++ b/Documentation/i2c/dev-interface @@ -9,8 +9,8 @@ i2c adapters present on your system at a given time. i2cdetect is part of the i2c-tools package. I2C device files are character device files with major device number 89 -and a minor device number corresponding to the number assigned as -explained above. They should be called "i2c-%d" (i2c-0, i2c-1, ..., +and a minor device number corresponding to the number assigned as +explained above. They should be called "i2c-%d" (i2c-0, i2c-1, ..., i2c-10, ...). All 256 minor device numbers are reserved for i2c. @@ -23,11 +23,6 @@ First, you need to include these two headers: #include <linux/i2c-dev.h> #include <i2c/smbus.h> -(Please note that there are two files named "i2c-dev.h" out there. One is -distributed with the Linux kernel and the other one is included in the -source tree of i2c-tools. They used to be different in content but since 2012 -they're identical. You should use "linux/i2c-dev.h"). - Now, you have to decide which adapter you want to access. You should inspect /sys/class/i2c-dev/ or run "i2cdetect -l" to decide this. Adapter numbers are assigned somewhat dynamically, so you can not @@ -38,7 +33,7 @@ Next thing, open the device file, as follows: int file; int adapter_nr = 2; /* probably dynamically determined */ char filename[20]; - + snprintf(filename, 19, "/dev/i2c-%d", adapter_nr); file = open(filename, O_RDWR); if (file < 0) { @@ -72,8 +67,10 @@ the device supports them. Both are illustrated below. /* res contains the read word */ } - /* Using I2C Write, equivalent of - i2c_smbus_write_word_data(file, reg, 0x6543) */ + /* + * Using I2C Write, equivalent of + * i2c_smbus_write_word_data(file, reg, 0x6543) + */ buf[0] = reg; buf[1] = 0x43; buf[2] = 0x65; @@ -140,14 +137,14 @@ ioctl(file, I2C_RDWR, struct i2c_rdwr_ioctl_data *msgset) set in each message, overriding the values set with the above ioctl's. ioctl(file, I2C_SMBUS, struct i2c_smbus_ioctl_data *args) - Not meant to be called directly; instead, use the access functions - below. + If possible, use the provided i2c_smbus_* methods described below instead + of issuing direct ioctls. You can do plain i2c transactions by using read(2) and write(2) calls. You do not need to pass the address byte; instead, set it through ioctl I2C_SLAVE before you try to access the device. -You can do SMBus level transactions (see documentation file smbus-protocol +You can do SMBus level transactions (see documentation file smbus-protocol for details) through the following functions: __s32 i2c_smbus_write_quick(int file, __u8 value); __s32 i2c_smbus_read_byte(int file); @@ -158,7 +155,7 @@ for details) through the following functions: __s32 i2c_smbus_write_word_data(int file, __u8 command, __u16 value); __s32 i2c_smbus_process_call(int file, __u8 command, __u16 value); __s32 i2c_smbus_read_block_data(int file, __u8 command, __u8 *values); - __s32 i2c_smbus_write_block_data(int file, __u8 command, __u8 length, + __s32 i2c_smbus_write_block_data(int file, __u8 command, __u8 length, __u8 *values); All these transactions return -1 on failure; you can read errno to see what happened. The 'write' transactions return 0 on success; the @@ -166,10 +163,9 @@ what happened. The 'write' transactions return 0 on success; the returns the number of values read. The block buffers need not be longer than 32 bytes. -The above functions are all inline functions, that resolve to calls to -the i2c_smbus_access function, that on its turn calls a specific ioctl -with the data in a specific format. Read the source code if you -want to know what happens behind the screens. +The above functions are made available by linking against the libi2c library, +which is provided by the i2c-tools project. See: +https://git.kernel.org/pub/scm/utils/i2c-tools/i2c-tools.git/. Implementation details diff --git a/Documentation/index.rst b/Documentation/index.rst index 3b99ab931d41..fdc585703498 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -45,7 +45,7 @@ the kernel interface as seen by application developers. .. toctree:: :maxdepth: 2 - userspace-api/index + userspace-api/index Introduction to kernel development @@ -89,6 +89,7 @@ needed). sound/index crypto/index filesystems/index + vm/index Architecture-specific documentation ----------------------------------- diff --git a/Documentation/ioctl/botching-up-ioctls.txt b/Documentation/ioctl/botching-up-ioctls.txt index d02cfb48901c..883fb034bd04 100644 --- a/Documentation/ioctl/botching-up-ioctls.txt +++ b/Documentation/ioctl/botching-up-ioctls.txt @@ -73,7 +73,9 @@ will have a second iteration or at least an extension for any given interface. future extensions is going right down the gutters since someone will submit an ioctl struct with random stack garbage in the yet unused parts. Which then bakes in the ABI that those fields can never be used for anything else - but garbage. + but garbage. This is also the reason why you must explicitly pad all + structures, even if you never use them in an array - the padding the compiler + might insert could contain garbage. * Have simple testcases for all of the above. diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 84bb74dcae12..480c8609dc58 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -151,7 +151,7 @@ Code Seq#(hex) Include File Comments 'J' 00-1F drivers/scsi/gdth_ioctl.h 'K' all linux/kd.h 'L' 00-1F linux/loop.h conflict! -'L' 10-1F drivers/scsi/mpt2sas/mpt2sas_ctl.h conflict! +'L' 10-1F drivers/scsi/mpt3sas/mpt3sas_ctl.h conflict! 'L' 20-2F linux/lightnvm.h 'L' E0-FF linux/ppdd.h encrypted disk device driver <http://linux01.gwdg.de/~alatham/ppdd.html> @@ -217,7 +217,6 @@ Code Seq#(hex) Include File Comments 'd' 02-40 pcmcia/ds.h conflict! 'd' F0-FF linux/digi1.h 'e' all linux/digi1.h conflict! -'e' 00-1F drivers/net/irda/irtty-sir.h conflict! 'f' 00-1F linux/ext2_fs.h conflict! 'f' 00-1F linux/ext3_fs.h conflict! 'f' 00-0F fs/jfs/jfs_dinode.h conflict! @@ -247,7 +246,6 @@ Code Seq#(hex) Include File Comments 'm' all linux/synclink.h conflict! 'm' 00-19 drivers/message/fusion/mptctl.h conflict! 'm' 00 drivers/scsi/megaraid/megaraid_ioctl.h conflict! -'m' 00-1F net/irda/irmod.h conflict! 'n' 00-7F linux/ncp_fs.h and fs/ncpfs/ioctl.c 'n' 80-8F uapi/linux/nilfs2_api.h NILFS2 'n' E0-FF linux/matroxfb.h matroxfb @@ -298,7 +296,8 @@ Code Seq#(hex) Include File Comments 0x90 00 drivers/cdrom/sbpcd.h 0x92 00-0F drivers/usb/mon/mon_bin.c 0x93 60-7F linux/auto_fs.h -0x94 all fs/btrfs/ioctl.h +0x94 all fs/btrfs/ioctl.h Btrfs filesystem + and linux/fs.h some lifted to vfs/generic 0x97 00-7F fs/ceph/ioctl.h Ceph file system 0x99 00-0F 537-Addinboard driver <mailto:buk@buks.ipn.de> @@ -329,6 +328,7 @@ Code Seq#(hex) Include File Comments 0xCA 80-BF uapi/scsi/cxlflash_ioctl.h 0xCB 00-1F CBM serial IEC bus in development: <mailto:michael.klein@puffin.lb.shuttle.de> +0xCC 00-0F drivers/misc/ibmvmc.h pseries VMC driver 0xCD 01 linux/reiserfs_fs.h 0xCF 02 fs/cifs/ioctl.c 0xDB 00-0F drivers/char/mwave/mwavepub.h diff --git a/Documentation/kbuild/kconfig-language.txt b/Documentation/kbuild/kconfig-language.txt index f5b9493f04ad..0e966e8f9ec7 100644 --- a/Documentation/kbuild/kconfig-language.txt +++ b/Documentation/kbuild/kconfig-language.txt @@ -198,14 +198,6 @@ applicable everywhere (see syntax). enables the third modular state for all config symbols. At most one symbol may have the "modules" option set. - - "env"=<value> - This imports the environment variable into Kconfig. It behaves like - a default, except that the value comes from the environment, this - also means that the behaviour when mixing it with normal defaults is - undefined at this point. The symbol is currently not exported back - to the build environment (if this is desired, it can be done via - another symbol). - - "allnoconfig_y" This declares the symbol as one that should have the value y when using "allnoconfig". Used for symbols that hide other symbols. diff --git a/Documentation/kbuild/kconfig-macro-language.txt b/Documentation/kbuild/kconfig-macro-language.txt new file mode 100644 index 000000000000..07da2ea68dce --- /dev/null +++ b/Documentation/kbuild/kconfig-macro-language.txt @@ -0,0 +1,242 @@ +Concept +------- + +The basic idea was inspired by Make. When we look at Make, we notice sort of +two languages in one. One language describes dependency graphs consisting of +targets and prerequisites. The other is a macro language for performing textual +substitution. + +There is clear distinction between the two language stages. For example, you +can write a makefile like follows: + + APP := foo + SRC := foo.c + CC := gcc + + $(APP): $(SRC) + $(CC) -o $(APP) $(SRC) + +The macro language replaces the variable references with their expanded form, +and handles as if the source file were input like follows: + + foo: foo.c + gcc -o foo foo.c + +Then, Make analyzes the dependency graph and determines the targets to be +updated. + +The idea is quite similar in Kconfig - it is possible to describe a Kconfig +file like this: + + CC := gcc + + config CC_HAS_FOO + def_bool $(shell, $(srctree)/scripts/gcc-check-foo.sh $(CC)) + +The macro language in Kconfig processes the source file into the following +intermediate: + + config CC_HAS_FOO + def_bool y + +Then, Kconfig moves onto the evaluation stage to resolve inter-symbol +dependency as explained in kconfig-language.txt. + + +Variables +--------- + +Like in Make, a variable in Kconfig works as a macro variable. A macro +variable is expanded "in place" to yield a text string that may then be +expanded further. To get the value of a variable, enclose the variable name in +$( ). The parentheses are required even for single-letter variable names; $X is +a syntax error. The curly brace form as in ${CC} is not supported either. + +There are two types of variables: simply expanded variables and recursively +expanded variables. + +A simply expanded variable is defined using the := assignment operator. Its +righthand side is expanded immediately upon reading the line from the Kconfig +file. + +A recursively expanded variable is defined using the = assignment operator. +Its righthand side is simply stored as the value of the variable without +expanding it in any way. Instead, the expansion is performed when the variable +is used. + +There is another type of assignment operator; += is used to append text to a +variable. The righthand side of += is expanded immediately if the lefthand +side was originally defined as a simple variable. Otherwise, its evaluation is +deferred. + +The variable reference can take parameters, in the following form: + + $(name,arg1,arg2,arg3) + +You can consider the parameterized reference as a function. (more precisely, +"user-defined function" in contrast to "built-in function" listed below). + +Useful functions must be expanded when they are used since the same function is +expanded differently if different parameters are passed. Hence, a user-defined +function is defined using the = assignment operator. The parameters are +referenced within the body definition with $(1), $(2), etc. + +In fact, recursively expanded variables and user-defined functions are the same +internally. (In other words, "variable" is "function with zero argument".) +When we say "variable" in a broad sense, it includes "user-defined function". + + +Built-in functions +------------------ + +Like Make, Kconfig provides several built-in functions. Every function takes a +particular number of arguments. + +In Make, every built-in function takes at least one argument. Kconfig allows +zero argument for built-in functions, such as $(fileno), $(lineno). You could +consider those as "built-in variable", but it is just a matter of how we call +it after all. Let's say "built-in function" here to refer to natively supported +functionality. + +Kconfig currently supports the following built-in functions. + + - $(shell,command) + + The "shell" function accepts a single argument that is expanded and passed + to a subshell for execution. The standard output of the command is then read + and returned as the value of the function. Every newline in the output is + replaced with a space. Any trailing newlines are deleted. The standard error + is not returned, nor is any program exit status. + + - $(info,text) + + The "info" function takes a single argument and prints it to stdout. + It evaluates to an empty string. + + - $(warning-if,condition,text) + + The "warning-if" function takes two arguments. If the condition part is "y", + the text part is sent to stderr. The text is prefixed with the name of the + current Kconfig file and the current line number. + + - $(error-if,condition,text) + + The "error-if" function is similar to "warning-if", but it terminates the + parsing immediately if the condition part is "y". + + - $(filename) + + The 'filename' takes no argument, and $(filename) is expanded to the file + name being parsed. + + - $(lineno) + + The 'lineno' takes no argument, and $(lineno) is expanded to the line number + being parsed. + + +Make vs Kconfig +--------------- + +Kconfig adopts Make-like macro language, but the function call syntax is +slightly different. + +A function call in Make looks like this: + + $(func-name arg1,arg2,arg3) + +The function name and the first argument are separated by at least one +whitespace. Then, leading whitespaces are trimmed from the first argument, +while whitespaces in the other arguments are kept. You need to use a kind of +trick to start the first parameter with spaces. For example, if you want +to make "info" function print " hello", you can write like follows: + + empty := + space := $(empty) $(empty) + $(info $(space)$(space)hello) + +Kconfig uses only commas for delimiters, and keeps all whitespaces in the +function call. Some people prefer putting a space after each comma delimiter: + + $(func-name, arg1, arg2, arg3) + +In this case, "func-name" will receive " arg1", " arg2", " arg3". The presence +of leading spaces may matter depending on the function. The same applies to +Make - for example, $(subst .c, .o, $(sources)) is a typical mistake; it +replaces ".c" with " .o". + +In Make, a user-defined function is referenced by using a built-in function, +'call', like this: + + $(call my-func,arg1,arg2,arg3) + +Kconfig invokes user-defined functions and built-in functions in the same way. +The omission of 'call' makes the syntax shorter. + +In Make, some functions treat commas verbatim instead of argument separators. +For example, $(shell echo hello, world) runs the command "echo hello, world". +Likewise, $(info hello, world) prints "hello, world" to stdout. You could say +this is _useful_ inconsistency. + +In Kconfig, for simpler implementation and grammatical consistency, commas that +appear in the $( ) context are always delimiters. It means + + $(shell, echo hello, world) + +is an error because it is passing two parameters where the 'shell' function +accepts only one. To pass commas in arguments, you can use the following trick: + + comma := , + $(shell, echo hello$(comma) world) + + +Caveats +------- + +A variable (or function) cannot be expanded across tokens. So, you cannot use +a variable as a shorthand for an expression that consists of multiple tokens. +The following works: + + RANGE_MIN := 1 + RANGE_MAX := 3 + + config FOO + int "foo" + range $(RANGE_MIN) $(RANGE_MAX) + +But, the following does not work: + + RANGES := 1 3 + + config FOO + int "foo" + range $(RANGES) + +A variable cannot be expanded to any keyword in Kconfig. The following does +not work: + + MY_TYPE := tristate + + config FOO + $(MY_TYPE) "foo" + default y + +Obviously from the design, $(shell command) is expanded in the textual +substitution phase. You cannot pass symbols to the 'shell' function. +The following does not work as expected. + + config ENDIAN_FLAG + string + default "-mbig-endian" if CPU_BIG_ENDIAN + default "-mlittle-endian" if CPU_LITTLE_ENDIAN + + config CC_HAS_ENDIAN_FLAG + def_bool $(shell $(srctree)/scripts/gcc-check-flag ENDIAN_FLAG) + +Instead, you can do like follows so that any function call is statically +expanded. + + config CC_HAS_ENDIAN_FLAG + bool + default $(shell $(srctree)/scripts/gcc-check-flag -mbig-endian) if CPU_BIG_ENDIAN + default $(shell $(srctree)/scripts/gcc-check-flag -mlittle-endian) if CPU_LITTLE_ENDIAN diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt index 1ae2de758c08..2d7ed09dbd59 100644 --- a/Documentation/livepatch/livepatch.txt +++ b/Documentation/livepatch/livepatch.txt @@ -429,30 +429,6 @@ See Documentation/ABI/testing/sysfs-kernel-livepatch for more details. The current Livepatch implementation has several limitations: - - + The patch must not change the semantic of the patched functions. - - The current implementation guarantees only that either the old - or the new function is called. The functions are patched one - by one. It means that the patch must _not_ change the semantic - of the function. - - - + Data structures can not be patched. - - There is no support to version data structures or anyhow migrate - one structure into another. Also the simple consistency model does - not allow to switch more functions atomically. - - Once there is more complex consistency mode, it will be possible to - use some workarounds. For example, it will be possible to use a hole - for a new member because the data structure is aligned. Or it will - be possible to use an existing member for something else. - - There are no plans to add more generic support for modified structures - at the moment. - - + Only functions that can be traced could be patched. Livepatch is based on the dynamic ftrace. In particular, functions diff --git a/Documentation/livepatch/shadow-vars.txt b/Documentation/livepatch/shadow-vars.txt index 89c66634d600..ecc09a7be5dd 100644 --- a/Documentation/livepatch/shadow-vars.txt +++ b/Documentation/livepatch/shadow-vars.txt @@ -34,9 +34,13 @@ meta-data and shadow-data: - data[] - storage for shadow data It is important to note that the klp_shadow_alloc() and -klp_shadow_get_or_alloc() calls, described below, store a *copy* of the -data that the functions are provided. Callers should provide whatever -mutual exclusion is required of the shadow data. +klp_shadow_get_or_alloc() are zeroing the variable by default. +They also allow to call a custom constructor function when a non-zero +value is needed. Callers should provide whatever mutual exclusion +is required. + +Note that the constructor is called under klp_shadow_lock spinlock. It allows +to do actions that can be done only once when a new variable is allocated. * klp_shadow_get() - retrieve a shadow variable data pointer - search hashtable for <obj, id> pair @@ -47,7 +51,7 @@ mutual exclusion is required of the shadow data. - WARN and return NULL - if <obj, id> doesn't already exist - allocate a new shadow variable - - copy data into the new shadow variable + - initialize the variable using a custom constructor and data when provided - add <obj, id> to the global hashtable * klp_shadow_get_or_alloc() - get existing or alloc a new shadow variable @@ -56,16 +60,20 @@ mutual exclusion is required of the shadow data. - return existing shadow variable - if <obj, id> doesn't already exist - allocate a new shadow variable - - copy data into the new shadow variable + - initialize the variable using a custom constructor and data when provided - add <obj, id> pair to the global hashtable * klp_shadow_free() - detach and free a <obj, id> shadow variable - find and remove a <obj, id> reference from global hashtable - - if found, free shadow variable + - if found + - call destructor function if defined + - free shadow variable * klp_shadow_free_all() - detach and free all <*, id> shadow variables - find and remove any <*, id> references from global hashtable - - if found, free shadow variable + - if found + - call destructor function if defined + - free shadow variable 2. Use cases @@ -107,7 +115,8 @@ struct sta_info *sta_info_alloc(struct ieee80211_sub_if_data *sdata, sta = kzalloc(sizeof(*sta) + hw->sta_data_size, gfp); /* Attach a corresponding shadow variable, then initialize it */ - ps_lock = klp_shadow_alloc(sta, PS_LOCK, NULL, sizeof(*ps_lock), gfp); + ps_lock = klp_shadow_alloc(sta, PS_LOCK, sizeof(*ps_lock), gfp, + NULL, NULL); if (!ps_lock) goto shadow_fail; spin_lock_init(ps_lock); @@ -131,7 +140,7 @@ variable: void sta_info_free(struct ieee80211_local *local, struct sta_info *sta) { - klp_shadow_free(sta, PS_LOCK); + klp_shadow_free(sta, PS_LOCK, NULL); kfree(sta); ... @@ -148,16 +157,24 @@ shadow variables to parents already in-flight. For commit 1d147bfa6429, a good spot to allocate a shadow spinlock is inside ieee80211_sta_ps_deliver_wakeup(): +int ps_lock_shadow_ctor(void *obj, void *shadow_data, void *ctor_data) +{ + spinlock_t *lock = shadow_data; + + spin_lock_init(lock); + return 0; +} + #define PS_LOCK 1 void ieee80211_sta_ps_deliver_wakeup(struct sta_info *sta) { - DEFINE_SPINLOCK(ps_lock_fallback); spinlock_t *ps_lock; /* sync with ieee80211_tx_h_unicast_ps_buf */ ps_lock = klp_shadow_get_or_alloc(sta, PS_LOCK, - &ps_lock_fallback, sizeof(ps_lock_fallback), - GFP_ATOMIC); + sizeof(*ps_lock), GFP_ATOMIC, + ps_lock_shadow_ctor, NULL); + if (ps_lock) spin_lock(ps_lock); ... diff --git a/Documentation/media/kapi/cec-core.rst b/Documentation/media/kapi/cec-core.rst index a9f53f069a2d..1d989c544370 100644 --- a/Documentation/media/kapi/cec-core.rst +++ b/Documentation/media/kapi/cec-core.rst @@ -246,7 +246,10 @@ CEC_TX_STATUS_LOW_DRIVE: CEC_TX_STATUS_ERROR: some unspecified error occurred: this can be one of ARB_LOST or LOW_DRIVE if the hardware cannot differentiate or something - else entirely. + else entirely. Some hardware only supports OK and FAIL as the + result of a transmit, i.e. there is no way to differentiate + between the different possible errors. In that case map FAIL + to CEC_TX_STATUS_NACK and not to CEC_TX_STATUS_ERROR. CEC_TX_STATUS_MAX_RETRIES: could not transmit the message after trying multiple times. diff --git a/Documentation/media/uapi/cec/cec-ioc-receive.rst b/Documentation/media/uapi/cec/cec-ioc-receive.rst index bdad4b197bcd..e964074cd15b 100644 --- a/Documentation/media/uapi/cec/cec-ioc-receive.rst +++ b/Documentation/media/uapi/cec/cec-ioc-receive.rst @@ -231,26 +231,32 @@ View On' messages from initiator 0xf ('Unregistered') to destination 0 ('TV'). - ``CEC_TX_STATUS_OK`` - 0x01 - The message was transmitted successfully. This is mutually - exclusive with :ref:`CEC_TX_STATUS_MAX_RETRIES <CEC-TX-STATUS-MAX-RETRIES>`. Other bits can still - be set if earlier attempts met with failure before the transmit - was eventually successful. + exclusive with :ref:`CEC_TX_STATUS_MAX_RETRIES <CEC-TX-STATUS-MAX-RETRIES>`. + Other bits can still be set if earlier attempts met with failure before + the transmit was eventually successful. * .. _`CEC-TX-STATUS-ARB-LOST`: - ``CEC_TX_STATUS_ARB_LOST`` - 0x02 - - CEC line arbitration was lost. + - CEC line arbitration was lost, i.e. another transmit started at the + same time with a higher priority. Optional status, not all hardware + can detect this error condition. * .. _`CEC-TX-STATUS-NACK`: - ``CEC_TX_STATUS_NACK`` - 0x04 - - Message was not acknowledged. + - Message was not acknowledged. Note that some hardware cannot tell apart + a 'Not Acknowledged' status from other error conditions, i.e. the result + of a transmit is just OK or FAIL. In that case this status will be + returned when the transmit failed. * .. _`CEC-TX-STATUS-LOW-DRIVE`: - ``CEC_TX_STATUS_LOW_DRIVE`` - 0x08 - Low drive was detected on the CEC bus. This indicates that a follower detected an error on the bus and requests a - retransmission. + retransmission. Optional status, not all hardware can detect this + error condition. * .. _`CEC-TX-STATUS-ERROR`: - ``CEC_TX_STATUS_ERROR`` @@ -258,14 +264,14 @@ View On' messages from initiator 0xf ('Unregistered') to destination 0 ('TV'). - Some error occurred. This is used for any errors that do not fit ``CEC_TX_STATUS_ARB_LOST`` or ``CEC_TX_STATUS_LOW_DRIVE``, either because the hardware could not tell which error occurred, or because the hardware - tested for other conditions besides those two. + tested for other conditions besides those two. Optional status. * .. _`CEC-TX-STATUS-MAX-RETRIES`: - ``CEC_TX_STATUS_MAX_RETRIES`` - 0x20 - The transmit failed after one or more retries. This status bit is - mutually exclusive with :ref:`CEC_TX_STATUS_OK <CEC-TX-STATUS-OK>`. Other bits can still - be set to explain which failures were seen. + mutually exclusive with :ref:`CEC_TX_STATUS_OK <CEC-TX-STATUS-OK>`. + Other bits can still be set to explain which failures were seen. .. tabularcolumns:: |p{5.6cm}|p{0.9cm}|p{11.0cm}| diff --git a/Documentation/media/uapi/dvb/dvbapi.rst b/Documentation/media/uapi/dvb/dvbapi.rst index 18c86b3a3af1..89ddca38626f 100644 --- a/Documentation/media/uapi/dvb/dvbapi.rst +++ b/Documentation/media/uapi/dvb/dvbapi.rst @@ -62,7 +62,7 @@ Authors: - Original author of the Digital TV API documentation. -- Carvalho Chehab, Mauro <m.chehab@kernel.org> +- Carvalho Chehab, Mauro <mchehab+samsung@kernel.org> - Ported document to Docbook XML, addition of DVBv5 API, documentation gaps fix. diff --git a/Documentation/media/uapi/rc/keytable.c.rst b/Documentation/media/uapi/rc/keytable.c.rst index e6ce1e3f5a78..217237f93b37 100644 --- a/Documentation/media/uapi/rc/keytable.c.rst +++ b/Documentation/media/uapi/rc/keytable.c.rst @@ -7,7 +7,7 @@ file: uapi/v4l/keytable.c /* keytable.c - This program allows checking/replacing keys at IR - Copyright (C) 2006-2009 Mauro Carvalho Chehab <mchehab@infradead.org> + Copyright (C) 2006-2009 Mauro Carvalho Chehab <mchehab@kernel.org> This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by diff --git a/Documentation/media/uapi/rc/lirc-dev-intro.rst b/Documentation/media/uapi/rc/lirc-dev-intro.rst index 698e4f80270e..11516c8bff62 100644 --- a/Documentation/media/uapi/rc/lirc-dev-intro.rst +++ b/Documentation/media/uapi/rc/lirc-dev-intro.rst @@ -18,7 +18,7 @@ Example dmesg output upon a driver registering w/LIRC: .. code-block:: none $ dmesg |grep lirc_dev - rc rc0: lirc_dev: driver mceusb registered at minor = 0 + rc rc0: lirc_dev: driver mceusb registered at minor = 0, raw IR receiver, raw IR transmitter What you should see for a chardev: diff --git a/Documentation/media/uapi/rc/lirc-set-rec-timeout.rst b/Documentation/media/uapi/rc/lirc-set-rec-timeout.rst index b3e16bbdbc90..a833a6a4c25a 100644 --- a/Documentation/media/uapi/rc/lirc-set-rec-timeout.rst +++ b/Documentation/media/uapi/rc/lirc-set-rec-timeout.rst @@ -1,19 +1,23 @@ .. -*- coding: utf-8; mode: rst -*- .. _lirc_set_rec_timeout: +.. _lirc_get_rec_timeout: -************************** -ioctl LIRC_SET_REC_TIMEOUT -************************** +*************************************************** +ioctl LIRC_GET_REC_TIMEOUT and LIRC_SET_REC_TIMEOUT +*************************************************** Name ==== -LIRC_SET_REC_TIMEOUT - sets the integer value for IR inactivity timeout. +LIRC_GET_REC_TIMEOUT/LIRC_SET_REC_TIMEOUT - Get/set the integer value for IR inactivity timeout. Synopsis ======== +.. c:function:: int ioctl( int fd, LIRC_GET_REC_TIMEOUT, __u32 *timeout ) + :name: LIRC_GET_REC_TIMEOUT + .. c:function:: int ioctl( int fd, LIRC_SET_REC_TIMEOUT, __u32 *timeout ) :name: LIRC_SET_REC_TIMEOUT @@ -30,7 +34,7 @@ Arguments Description =========== -Sets the integer value for IR inactivity timeout. +Get and set the integer value for IR inactivity timeout. If supported by the hardware, setting it to 0 disables all hardware timeouts and data should be reported as soon as possible. If the exact value diff --git a/Documentation/media/uapi/v4l/common.rst b/Documentation/media/uapi/v4l/common.rst index 13f2ed3fc5a6..5f93e71122ef 100644 --- a/Documentation/media/uapi/v4l/common.rst +++ b/Documentation/media/uapi/v4l/common.rst @@ -41,6 +41,6 @@ applicable to all devices. extended-controls format planar-apis - crop selection-api + crop streaming-par diff --git a/Documentation/media/uapi/v4l/crop.rst b/Documentation/media/uapi/v4l/crop.rst index 182565b9ace4..45e8a895a320 100644 --- a/Documentation/media/uapi/v4l/crop.rst +++ b/Documentation/media/uapi/v4l/crop.rst @@ -2,9 +2,18 @@ .. _crop: -************************************* -Image Cropping, Insertion and Scaling -************************************* +***************************************************** +Image Cropping, Insertion and Scaling -- the CROP API +***************************************************** + +.. note:: + + The CROP API is mostly superseded by the newer :ref:`SELECTION API + <selection-api>`. The new API should be preferred in most cases, + with the exception of pixel aspect ratio detection, which is + implemented by :ref:`VIDIOC_CROPCAP <VIDIOC_CROPCAP>` and has no + equivalent in the SELECTION API. See :ref:`selection-vs-crop` for a + comparison of the two APIs. Some video capture devices can sample a subsection of the picture and shrink or enlarge it to an image of arbitrary size. We call these @@ -42,10 +51,9 @@ where applicable) will be fixed in this case. .. note:: - All capture and output devices must support the - :ref:`VIDIOC_CROPCAP <VIDIOC_CROPCAP>` ioctl such that applications - can determine if scaling takes place. - + All capture and output devices that support the CROP or SELECTION + API will also support the :ref:`VIDIOC_CROPCAP <VIDIOC_CROPCAP>` + ioctl. Cropping Structures =================== diff --git a/Documentation/media/uapi/v4l/selection-api-005.rst b/Documentation/media/uapi/v4l/selection-api-005.rst deleted file mode 100644 index 5b47a28ac6d7..000000000000 --- a/Documentation/media/uapi/v4l/selection-api-005.rst +++ /dev/null @@ -1,34 +0,0 @@ -.. -*- coding: utf-8; mode: rst -*- - -******************************** -Comparison with old cropping API -******************************** - -The selection API was introduced to cope with deficiencies of previous -:ref:`API <crop>`, that was designed to control simple capture -devices. Later the cropping API was adopted by video output drivers. The -ioctls are used to select a part of the display were the video signal is -inserted. It should be considered as an API abuse because the described -operation is actually the composing. The selection API makes a clear -distinction between composing and cropping operations by setting the -appropriate targets. The V4L2 API lacks any support for composing to and -cropping from an image inside a memory buffer. The application could -configure a capture device to fill only a part of an image by abusing -V4L2 API. Cropping a smaller image from a larger one is achieved by -setting the field ``bytesperline`` at struct -:c:type:`v4l2_pix_format`. -Introducing an image offsets could be done by modifying field ``m_userptr`` -at struct -:c:type:`v4l2_buffer` before calling -:ref:`VIDIOC_QBUF`. Those operations should be avoided because they are not -portable (endianness), and do not work for macroblock and Bayer formats -and mmap buffers. The selection API deals with configuration of buffer -cropping/composing in a clear, intuitive and portable way. Next, with -the selection API the concepts of the padded target and constraints -flags are introduced. Finally, struct :c:type:`v4l2_crop` -and struct :c:type:`v4l2_cropcap` have no reserved -fields. Therefore there is no way to extend their functionality. The new -struct :c:type:`v4l2_selection` provides a lot of place -for future extensions. Driver developers are encouraged to implement -only selection API. The former cropping API would be simulated using the -new one. diff --git a/Documentation/media/uapi/v4l/selection-api-004.rst b/Documentation/media/uapi/v4l/selection-api-configuration.rst index d782cd5b2117..0a4ddc2d71db 100644 --- a/Documentation/media/uapi/v4l/selection-api-004.rst +++ b/Documentation/media/uapi/v4l/selection-api-configuration.rst @@ -41,7 +41,7 @@ The driver may further adjust the requested size and/or position according to hardware limitations. Each capture device has a default source rectangle, given by the -``V4L2_SEL_TGT_CROP_DEFAULT`` target. This rectangle shall over what the +``V4L2_SEL_TGT_CROP_DEFAULT`` target. This rectangle shall cover what the driver writer considers the complete picture. Drivers shall set the active crop rectangle to the default when the driver is first loaded, but not later. diff --git a/Documentation/media/uapi/v4l/selection-api-006.rst b/Documentation/media/uapi/v4l/selection-api-examples.rst index 67e0e9aed9e8..67e0e9aed9e8 100644 --- a/Documentation/media/uapi/v4l/selection-api-006.rst +++ b/Documentation/media/uapi/v4l/selection-api-examples.rst diff --git a/Documentation/media/uapi/v4l/selection-api-002.rst b/Documentation/media/uapi/v4l/selection-api-intro.rst index 09ca93f91bf7..09ca93f91bf7 100644 --- a/Documentation/media/uapi/v4l/selection-api-002.rst +++ b/Documentation/media/uapi/v4l/selection-api-intro.rst diff --git a/Documentation/media/uapi/v4l/selection-api-003.rst b/Documentation/media/uapi/v4l/selection-api-targets.rst index bf7e76dfbdf9..bf7e76dfbdf9 100644 --- a/Documentation/media/uapi/v4l/selection-api-003.rst +++ b/Documentation/media/uapi/v4l/selection-api-targets.rst diff --git a/Documentation/media/uapi/v4l/selection-api-vs-crop-api.rst b/Documentation/media/uapi/v4l/selection-api-vs-crop-api.rst new file mode 100644 index 000000000000..e7455fb1e572 --- /dev/null +++ b/Documentation/media/uapi/v4l/selection-api-vs-crop-api.rst @@ -0,0 +1,39 @@ +.. -*- coding: utf-8; mode: rst -*- + +.. _selection-vs-crop: + +******************************** +Comparison with old cropping API +******************************** + +The selection API was introduced to cope with deficiencies of the +older :ref:`CROP API <crop>`, that was designed to control simple +capture devices. Later the cropping API was adopted by video output +drivers. The ioctls are used to select a part of the display were the +video signal is inserted. It should be considered as an API abuse +because the described operation is actually the composing. The +selection API makes a clear distinction between composing and cropping +operations by setting the appropriate targets. + +The CROP API lacks any support for composing to and cropping from an +image inside a memory buffer. The application could configure a +capture device to fill only a part of an image by abusing V4L2 +API. Cropping a smaller image from a larger one is achieved by setting +the field ``bytesperline`` at struct :c:type:`v4l2_pix_format`. +Introducing an image offsets could be done by modifying field +``m_userptr`` at struct :c:type:`v4l2_buffer` before calling +:ref:`VIDIOC_QBUF <VIDIOC_QBUF>`. Those operations should be avoided +because they are not portable (endianness), and do not work for +macroblock and Bayer formats and mmap buffers. + +The selection API deals with configuration of buffer +cropping/composing in a clear, intuitive and portable way. Next, with +the selection API the concepts of the padded target and constraints +flags are introduced. Finally, struct :c:type:`v4l2_crop` and struct +:c:type:`v4l2_cropcap` have no reserved fields. Therefore there is no +way to extend their functionality. The new struct +:c:type:`v4l2_selection` provides a lot of place for future +extensions. + +Driver developers are encouraged to implement only selection API. The +former cropping API would be simulated using the new one. diff --git a/Documentation/media/uapi/v4l/selection-api.rst b/Documentation/media/uapi/v4l/selection-api.rst index 81ea52d785b9..390233f704a3 100644 --- a/Documentation/media/uapi/v4l/selection-api.rst +++ b/Documentation/media/uapi/v4l/selection-api.rst @@ -2,15 +2,15 @@ .. _selection-api: -API for cropping, composing and scaling -======================================= +Cropping, composing and scaling -- the SELECTION API +==================================================== .. toctree:: :maxdepth: 1 - selection-api-002 - selection-api-003 - selection-api-004 - selection-api-005 - selection-api-006 + selection-api-intro.rst + selection-api-targets.rst + selection-api-configuration.rst + selection-api-vs-crop-api.rst + selection-api-examples.rst diff --git a/Documentation/media/uapi/v4l/selection.svg b/Documentation/media/uapi/v4l/selection.svg index a93e3b59786d..911062bd2844 100644 --- a/Documentation/media/uapi/v4l/selection.svg +++ b/Documentation/media/uapi/v4l/selection.svg @@ -1128,11 +1128,11 @@ </text> </g> <text transform="matrix(.96106 0 0 1.0405 48.571 195.53)" x="2438.062" y="1368.429" enable-background="new" font-size="50" style="line-height:125%"> - <tspan x="2438.062" y="1368.429">COMPOSE_BONDS</tspan> + <tspan x="2438.062" y="1368.429">COMPOSE_BOUNDS</tspan> </text> <g font-size="40"> <text transform="translate(48.571 195.53)" x="8.082" y="1438.896" enable-background="new" style="line-height:125%"> - <tspan x="8.082" y="1438.896" font-size="50">CROP_BONDS</tspan> + <tspan x="8.082" y="1438.896" font-size="50">CROP_BOUNDS</tspan> </text> <text transform="translate(48.571 195.53)" x="1455.443" y="-26.808" enable-background="new" style="line-height:125%"> <tspan x="1455.443" y="-26.808" font-size="50">overscan area</tspan> diff --git a/Documentation/media/uapi/v4l/v4l2.rst b/Documentation/media/uapi/v4l/v4l2.rst index 2128717299b3..b89e5621ae69 100644 --- a/Documentation/media/uapi/v4l/v4l2.rst +++ b/Documentation/media/uapi/v4l/v4l2.rst @@ -45,7 +45,7 @@ Authors, in alphabetical order: - Subdev selections API. -- Carvalho Chehab, Mauro <m.chehab@kernel.org> +- Carvalho Chehab, Mauro <mchehab+samsung@kernel.org> - Documented libv4l, designed and added v4l2grab example, Remote Controller chapter. diff --git a/Documentation/media/uapi/v4l/v4l2grab.c.rst b/Documentation/media/uapi/v4l/v4l2grab.c.rst index 5aabd0b7b089..f0d0ab6abd41 100644 --- a/Documentation/media/uapi/v4l/v4l2grab.c.rst +++ b/Documentation/media/uapi/v4l/v4l2grab.c.rst @@ -6,7 +6,7 @@ file: media/v4l/v4l2grab.c .. code-block:: c /* V4L2 video picture grabber - Copyright (C) 2009 Mauro Carvalho Chehab <mchehab@infradead.org> + Copyright (C) 2009 Mauro Carvalho Chehab <mchehab@kernel.org> This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by diff --git a/Documentation/media/v4l-drivers/cx23885-cardlist.rst b/Documentation/media/v4l-drivers/cx23885-cardlist.rst index 3129ef04ddd3..8c24df8e0423 100644 --- a/Documentation/media/v4l-drivers/cx23885-cardlist.rst +++ b/Documentation/media/v4l-drivers/cx23885-cardlist.rst @@ -186,7 +186,7 @@ cx23885 cards list * - 43 - Hauppauge ImpactVCB-e - - 0070:7133 + - 0070:7133, 0070:7137 * - 44 - DViCO FusionHDTV DVB-T Dual Express2 @@ -243,3 +243,19 @@ cx23885 cards list * - 57 - Hauppauge WinTV-QuadHD-ATSC - 0070:6a18, 0070:6b18 + + * - 58 + - Hauppauge WinTV-HVR-1265(161111) + - 0070:2a18 + + * - 59 + - Hauppauge WinTV-Starburst2 + - 0070:f02a + + * - 60 + - Hauppauge WinTV-QuadHD-DVB(885) + - + + * - 61 + - Hauppauge WinTV-QuadHD-ATSC(885) + - diff --git a/Documentation/media/v4l-drivers/em28xx-cardlist.rst b/Documentation/media/v4l-drivers/em28xx-cardlist.rst index ec938c08f43d..dfe882ca945f 100644 --- a/Documentation/media/v4l-drivers/em28xx-cardlist.rst +++ b/Documentation/media/v4l-drivers/em28xx-cardlist.rst @@ -391,7 +391,7 @@ EM28xx cards list * - 94 - PCTV tripleStick (292e) - em28178 - - 2013:025f, 2040:0264 + - 2013:025f, 2013:0264, 2040:0264, 2040:8264, 2040:8268, 2040:8268 * - 95 - Leadtek VC100 - em2861 @@ -411,12 +411,16 @@ EM28xx cards list * - 99 - Hauppauge WinTV-dualHD DVB - em28174 - - 2040:0265 + - 2040:0265, 2040:8265 * - 100 - Hauppauge WinTV-dualHD 01595 ATSC/QAM - em28174 - - 2040:026d + - 2040:026d, 2040:826d * - 101 - Terratec Cinergy H6 rev. 2 - em2884 - 0ccd:10b2 + * - 102 + - :ZOLID HYBRID TV STICK + - em2882 + - diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 6dafc8085acc..a02d6bbfc9d0 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1920,9 +1920,6 @@ There are some more advanced barrier functions: /* assign ownership */ desc->status = DEVICE_OWN; - /* force memory to sync before notifying device via MMIO */ - wmb(); - /* notify device of new descriptors */ writel(DESC_NOTIFY, doorbell); } @@ -1930,11 +1927,15 @@ There are some more advanced barrier functions: The dma_rmb() allows us guarantee the device has released ownership before we read the data from the descriptor, and the dma_wmb() allows us to guarantee the data is written to the descriptor before the device - can see it now has ownership. The wmb() is needed to guarantee that the - cache coherent memory writes have completed before attempting a write to - the cache incoherent MMIO region. + can see it now has ownership. Note that, when using writel(), a prior + wmb() is not needed to guarantee that the cache coherent memory writes + have completed before writing to the MMIO region. The cheaper + writel_relaxed() does not provide this guarantee and must not be used + here. - See Documentation/DMA-API.txt for more information on consistent memory. + See the subsection "Kernel I/O barrier effects" for more information on + relaxed I/O accessors and the Documentation/DMA-API.txt file for more + information on consistent memory. MMIO WRITE BARRIER @@ -2903,7 +2904,7 @@ is discarded from the CPU's cache and reloaded. To deal with this, the appropriate part of the kernel must invalidate the overlapping bits of the cache on each CPU. -See Documentation/cachetlb.txt for more information on cache management. +See Documentation/core-api/cachetlb.rst for more information on cache management. CACHE COHERENCY VS MMIO @@ -3083,7 +3084,7 @@ CIRCULAR BUFFERS Memory barriers can be used to implement circular buffering without the need of a lock to serialise the producer with the consumer. See: - Documentation/circular-buffers.txt + Documentation/core-api/circular-buffers.rst for details. diff --git a/Documentation/misc-devices/ibmvmc.rst b/Documentation/misc-devices/ibmvmc.rst new file mode 100644 index 000000000000..46ded79554d4 --- /dev/null +++ b/Documentation/misc-devices/ibmvmc.rst @@ -0,0 +1,226 @@ +.. SPDX-License-Identifier: GPL-2.0+ +====================================================== +IBM Virtual Management Channel Kernel Driver (IBMVMC) +====================================================== + +:Authors: + Dave Engebretsen <engebret@us.ibm.com>, + Adam Reznechek <adreznec@linux.vnet.ibm.com>, + Steven Royer <seroyer@linux.vnet.ibm.com>, + Bryant G. Ly <bryantly@linux.vnet.ibm.com>, + +Introduction +============ + +Note: Knowledge of virtualization technology is required to understand +this document. + +A good reference document would be: + +https://openpowerfoundation.org/wp-content/uploads/2016/05/LoPAPR_DRAFT_v11_24March2016_cmt1.pdf + +The Virtual Management Channel (VMC) is a logical device which provides an +interface between the hypervisor and a management partition. This interface +is like a message passing interface. This management partition is intended +to provide an alternative to systems that use a Hardware Management +Console (HMC) - based system management. + +The primary hardware management solution that is developed by IBM relies +on an appliance server named the Hardware Management Console (HMC), +packaged as an external tower or rack-mounted personal computer. In a +Power Systems environment, a single HMC can manage multiple POWER +processor-based systems. + +Management Application +---------------------- + +In the management partition, a management application exists which enables +a system administrator to configure the system’s partitioning +characteristics via a command line interface (CLI) or Representational +State Transfer Application (REST API's). + +The management application runs on a Linux logical partition on a +POWER8 or newer processor-based server that is virtualized by PowerVM. +System configuration, maintenance, and control functions which +traditionally require an HMC can be implemented in the management +application using a combination of HMC to hypervisor interfaces and +existing operating system methods. This tool provides a subset of the +functions implemented by the HMC and enables basic partition configuration. +The set of HMC to hypervisor messages supported by the management +application component are passed to the hypervisor over a VMC interface, +which is defined below. + +The VMC enables the management partition to provide basic partitioning +functions: + +- Logical Partitioning Configuration +- Start, and stop actions for individual partitions +- Display of partition status +- Management of virtual Ethernet +- Management of virtual Storage +- Basic system management + +Virtual Management Channel (VMC) +-------------------------------- + +A logical device, called the Virtual Management Channel (VMC), is defined +for communicating between the management application and the hypervisor. It +basically creates the pipes that enable virtualization management +software. This device is presented to a designated management partition as +a virtual device. + +This communication device uses Command/Response Queue (CRQ) and the +Remote Direct Memory Access (RDMA) interfaces. A three-way handshake is +defined that must take place to establish that both the hypervisor and +management partition sides of the channel are running prior to +sending/receiving any of the protocol messages. + +This driver also utilizes Transport Event CRQs. CRQ messages are sent +when the hypervisor detects one of the peer partitions has abnormally +terminated, or one side has called H_FREE_CRQ to close their CRQ. +Two new classes of CRQ messages are introduced for the VMC device. VMC +Administrative messages are used for each partition using the VMC to +communicate capabilities to their partner. HMC Interface messages are used +for the actual flow of HMC messages between the management partition and +the hypervisor. As most HMC messages far exceed the size of a CRQ buffer, +a virtual DMA (RMDA) of the HMC message data is done prior to each HMC +Interface CRQ message. Only the management partition drives RDMA +operations; hypervisors never directly cause the movement of message data. + + +Terminology +----------- +RDMA + Remote Direct Memory Access is DMA transfer from the server to its + client or from the server to its partner partition. DMA refers + to both physical I/O to and from memory operations and to memory + to memory move operations. +CRQ + Command/Response Queue a facility which is used to communicate + between partner partitions. Transport events which are signaled + from the hypervisor to partition are also reported in this queue. + +Example Management Partition VMC Driver Interface +================================================= + +This section provides an example for the management application +implementation where a device driver is used to interface to the VMC +device. This driver consists of a new device, for example /dev/ibmvmc, +which provides interfaces to open, close, read, write, and perform +ioctl’s against the VMC device. + +VMC Interface Initialization +---------------------------- + +The device driver is responsible for initializing the VMC when the driver +is loaded. It first creates and initializes the CRQ. Next, an exchange of +VMC capabilities is performed to indicate the code version and number of +resources available in both the management partition and the hypervisor. +Finally, the hypervisor requests that the management partition create an +initial pool of VMC buffers, one buffer for each possible HMC connection, +which will be used for management application session initialization. +Prior to completion of this initialization sequence, the device returns +EBUSY to open() calls. EIO is returned for all open() failures. + +:: + + Management Partition Hypervisor + CRQ INIT + ----------------------------------------> + CRQ INIT COMPLETE + <---------------------------------------- + CAPABILITIES + ----------------------------------------> + CAPABILITIES RESPONSE + <---------------------------------------- + ADD BUFFER (HMC IDX=0,1,..) _ + <---------------------------------------- | + ADD BUFFER RESPONSE | - Perform # HMCs Iterations + ----------------------------------------> - + +VMC Interface Open +------------------ + +After the basic VMC channel has been initialized, an HMC session level +connection can be established. The application layer performs an open() to +the VMC device and executes an ioctl() against it, indicating the HMC ID +(32 bytes of data) for this session. If the VMC device is in an invalid +state, EIO will be returned for the ioctl(). The device driver creates a +new HMC session value (ranging from 1 to 255) and HMC index value (starting +at index 0 and ranging to 254) for this HMC ID. The driver then does an +RDMA of the HMC ID to the hypervisor, and then sends an Interface Open +message to the hypervisor to establish the session over the VMC. After the +hypervisor receives this information, it sends Add Buffer messages to the +management partition to seed an initial pool of buffers for the new HMC +connection. Finally, the hypervisor sends an Interface Open Response +message, to indicate that it is ready for normal runtime messaging. The +following illustrates this VMC flow: + +:: + + Management Partition Hypervisor + RDMA HMC ID + ----------------------------------------> + Interface Open + ----------------------------------------> + Add Buffer _ + <---------------------------------------- | + Add Buffer Response | - Perform N Iterations + ----------------------------------------> - + Interface Open Response + <---------------------------------------- + +VMC Interface Runtime +--------------------- + +During normal runtime, the management application and the hypervisor +exchange HMC messages via the Signal VMC message and RDMA operations. When +sending data to the hypervisor, the management application performs a +write() to the VMC device, and the driver RDMA’s the data to the hypervisor +and then sends a Signal Message. If a write() is attempted before VMC +device buffers have been made available by the hypervisor, or no buffers +are currently available, EBUSY is returned in response to the write(). A +write() will return EIO for all other errors, such as an invalid device +state. When the hypervisor sends a message to the management, the data is +put into a VMC buffer and an Signal Message is sent to the VMC driver in +the management partition. The driver RDMA’s the buffer into the partition +and passes the data up to the appropriate management application via a +read() to the VMC device. The read() request blocks if there is no buffer +available to read. The management application may use select() to wait for +the VMC device to become ready with data to read. + +:: + + Management Partition Hypervisor + MSG RDMA + ----------------------------------------> + SIGNAL MSG + ----------------------------------------> + SIGNAL MSG + <---------------------------------------- + MSG RDMA + <---------------------------------------- + +VMC Interface Close +------------------- + +HMC session level connections are closed by the management partition when +the application layer performs a close() against the device. This action +results in an Interface Close message flowing to the hypervisor, which +causes the session to be terminated. The device driver must free any +storage allocated for buffers for this HMC connection. + +:: + + Management Partition Hypervisor + INTERFACE CLOSE + ----------------------------------------> + INTERFACE CLOSE RESPONSE + <---------------------------------------- + +Additional Information +====================== + +For more information on the documentation for CRQ Messages, VMC Messages, +HMC interface Buffers, and signal messages please refer to the Linux on +Power Architecture Platform Reference. Section F. diff --git a/Documentation/networking/6lowpan.txt b/Documentation/networking/6lowpan.txt index a7dc7e939c7a..2e5a939d7e6f 100644 --- a/Documentation/networking/6lowpan.txt +++ b/Documentation/networking/6lowpan.txt @@ -24,10 +24,10 @@ enum lowpan_lltypes. Example to evaluate the private usually you can do: -static inline sturct lowpan_priv_foobar * +static inline struct lowpan_priv_foobar * lowpan_foobar_priv(struct net_device *dev) { - return (sturct lowpan_priv_foobar *)lowpan_priv(dev)->priv; + return (struct lowpan_priv_foobar *)lowpan_priv(dev)->priv; } switch (dev->type) { diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst new file mode 100644 index 000000000000..ff929cfab4f4 --- /dev/null +++ b/Documentation/networking/af_xdp.rst @@ -0,0 +1,312 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== +AF_XDP +====== + +Overview +======== + +AF_XDP is an address family that is optimized for high performance +packet processing. + +This document assumes that the reader is familiar with BPF and XDP. If +not, the Cilium project has an excellent reference guide at +http://cilium.readthedocs.io/en/latest/bpf/. + +Using the XDP_REDIRECT action from an XDP program, the program can +redirect ingress frames to other XDP enabled netdevs, using the +bpf_redirect_map() function. AF_XDP sockets enable the possibility for +XDP programs to redirect frames to a memory buffer in a user-space +application. + +An AF_XDP socket (XSK) is created with the normal socket() +syscall. Associated with each XSK are two rings: the RX ring and the +TX ring. A socket can receive packets on the RX ring and it can send +packets on the TX ring. These rings are registered and sized with the +setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory +to have at least one of these rings for each socket. An RX or TX +descriptor ring points to a data buffer in a memory area called a +UMEM. RX and TX can share the same UMEM so that a packet does not have +to be copied between RX and TX. Moreover, if a packet needs to be kept +for a while due to a possible retransmit, the descriptor that points +to that packet can be changed to point to another and reused right +away. This again avoids copying data. + +The UMEM consists of a number of equally sized chunks. A descriptor in +one of the rings references a frame by referencing its addr. The addr +is simply an offset within the entire UMEM region. The user space +allocates memory for this UMEM using whatever means it feels is most +appropriate (malloc, mmap, huge pages, etc). This memory area is then +registered with the kernel using the new setsockopt XDP_UMEM_REG. The +UMEM also has two rings: the FILL ring and the COMPLETION ring. The +fill ring is used by the application to send down addr for the kernel +to fill in with RX packet data. References to these frames will then +appear in the RX ring once each packet has been received. The +completion ring, on the other hand, contains frame addr that the +kernel has transmitted completely and can now be used again by user +space, for either TX or RX. Thus, the frame addrs appearing in the +completion ring are addrs that were previously transmitted using the +TX ring. In summary, the RX and FILL rings are used for the RX path +and the TX and COMPLETION rings are used for the TX path. + +The socket is then finally bound with a bind() call to a device and a +specific queue id on that device, and it is not until bind is +completed that traffic starts to flow. + +The UMEM can be shared between processes, if desired. If a process +wants to do this, it simply skips the registration of the UMEM and its +corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind +call and submits the XSK of the process it would like to share UMEM +with as well as its own newly created XSK socket. The new process will +then receive frame addr references in its own RX ring that point to +this shared UMEM. Note that since the ring structures are +single-consumer / single-producer (for performance reasons), the new +process has to create its own socket with associated RX and TX rings, +since it cannot share this with the other process. This is also the +reason that there is only one set of FILL and COMPLETION rings per +UMEM. It is the responsibility of a single process to handle the UMEM. + +How is then packets distributed from an XDP program to the XSKs? There +is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The +user-space application can place an XSK at an arbitrary place in this +map. The XDP program can then redirect a packet to a specific index in +this map and at this point XDP validates that the XSK in that map was +indeed bound to that device and ring number. If not, the packet is +dropped. If the map is empty at that index, the packet is also +dropped. This also means that it is currently mandatory to have an XDP +program loaded (and one XSK in the XSKMAP) to be able to get any +traffic to user space through the XSK. + +AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the +driver does not have support for XDP, or XDP_SKB is explicitly chosen +when loading the XDP program, XDP_SKB mode is employed that uses SKBs +together with the generic XDP support and copies out the data to user +space. A fallback mode that works for any network device. On the other +hand, if the driver has support for XDP, it will be used by the AF_XDP +code to provide better performance, but there is still a copy of the +data into user space. + +Concepts +======== + +In order to use an AF_XDP socket, a number of associated objects need +to be setup. + +Jonathan Corbet has also written an excellent article on LWN, +"Accelerating networking with AF_XDP". It can be found at +https://lwn.net/Articles/750845/. + +UMEM +---- + +UMEM is a region of virtual contiguous memory, divided into +equal-sized frames. An UMEM is associated to a netdev and a specific +queue id of that netdev. It is created and configured (chunk size, +headroom, start address and size) by using the XDP_UMEM_REG setsockopt +system call. A UMEM is bound to a netdev and queue id, via the bind() +system call. + +An AF_XDP is socket linked to a single UMEM, but one UMEM can have +multiple AF_XDP sockets. To share an UMEM created via one socket A, +the next socket B can do this by setting the XDP_SHARED_UMEM flag in +struct sockaddr_xdp member sxdp_flags, and passing the file descriptor +of A to struct sockaddr_xdp member sxdp_shared_umem_fd. + +The UMEM has two single-producer/single-consumer rings, that are used +to transfer ownership of UMEM frames between the kernel and the +user-space application. + +Rings +----- + +There are a four different kind of rings: Fill, Completion, RX and +TX. All rings are single-producer/single-consumer, so the user-space +application need explicit synchronization of multiple +processes/threads are reading/writing to them. + +The UMEM uses two rings: Fill and Completion. Each socket associated +with the UMEM must have an RX queue, TX queue or both. Say, that there +is a setup with four sockets (all doing TX and RX). Then there will be +one Fill ring, one Completion ring, four TX rings and four RX rings. + +The rings are head(producer)/tail(consumer) based rings. A producer +writes the data ring at the index pointed out by struct xdp_ring +producer member, and increasing the producer index. A consumer reads +the data ring at the index pointed out by struct xdp_ring consumer +member, and increasing the consumer index. + +The rings are configured and created via the _RING setsockopt system +calls and mmapped to user-space using the appropriate offset to mmap() +(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and +XDP_UMEM_PGOFF_COMPLETION_RING). + +The size of the rings need to be of size power of two. + +UMEM Fill Ring +~~~~~~~~~~~~~~ + +The Fill ring is used to transfer ownership of UMEM frames from +user-space to kernel-space. The UMEM addrs are passed in the ring. As +an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has +16 chunks and can pass addrs between 0 and 64k. + +Frames passed to the kernel are used for the ingress path (RX rings). + +The user application produces UMEM addrs to this ring. Note that the +kernel will mask the incoming addr. E.g. for a chunk size of 2k, the +log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050 +and 3000 refers to the same chunk. + + +UMEM Completetion Ring +~~~~~~~~~~~~~~~~~~~~~~ + +The Completion Ring is used transfer ownership of UMEM frames from +kernel-space to user-space. Just like the Fill ring, UMEM indicies are +used. + +Frames passed from the kernel to user-space are frames that has been +sent (TX ring) and can be used by user-space again. + +The user application consumes UMEM addrs from this ring. + + +RX Ring +~~~~~~~ + +The RX ring is the receiving side of a socket. Each entry in the ring +is a struct xdp_desc descriptor. The descriptor contains UMEM offset +(addr) and the length of the data (len). + +If no frames have been passed to kernel via the Fill ring, no +descriptors will (or can) appear on the RX ring. + +The user application consumes struct xdp_desc descriptors from this +ring. + +TX Ring +~~~~~~~ + +The TX ring is used to send frames. The struct xdp_desc descriptor is +filled (index, length and offset) and passed into the ring. + +To start the transfer a sendmsg() system call is required. This might +be relaxed in the future. + +The user application produces struct xdp_desc descriptors to this +ring. + +XSKMAP / BPF_MAP_TYPE_XSKMAP +---------------------------- + +On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that +is used in conjunction with bpf_redirect_map() to pass the ingress +frame to a socket. + +The user application inserts the socket into the map, via the bpf() +system call. + +Note that if an XDP program tries to redirect to a socket that does +not match the queue configuration and netdev, the frame will be +dropped. E.g. an AF_XDP socket is bound to netdev eth0 and +queue 17. Only the XDP program executing for eth0 and queue 17 will +successfully pass data to the socket. Please refer to the sample +application (samples/bpf/) in for an example. + +Usage +===== + +In order to use AF_XDP sockets there are two parts needed. The +user-space application and the XDP program. For a complete setup and +usage example, please refer to the sample application. The user-space +side is xdpsock_user.c and the XDP side xdpsock_kern.c. + +Naive ring dequeue and enqueue could look like this:: + + // struct xdp_rxtx_ring { + // __u32 *producer; + // __u32 *consumer; + // struct xdp_desc *desc; + // }; + + // struct xdp_umem_ring { + // __u32 *producer; + // __u32 *consumer; + // __u64 *desc; + // }; + + // typedef struct xdp_rxtx_ring RING; + // typedef struct xdp_umem_ring RING; + + // typedef struct xdp_desc RING_TYPE; + // typedef __u64 RING_TYPE; + + int dequeue_one(RING *ring, RING_TYPE *item) + { + __u32 entries = *ring->producer - *ring->consumer; + + if (entries == 0) + return -1; + + // read-barrier! + + *item = ring->desc[*ring->consumer & (RING_SIZE - 1)]; + (*ring->consumer)++; + return 0; + } + + int enqueue_one(RING *ring, const RING_TYPE *item) + { + u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer); + + if (free_entries == 0) + return -1; + + ring->desc[*ring->producer & (RING_SIZE - 1)] = *item; + + // write-barrier! + + (*ring->producer)++; + return 0; + } + + +For a more optimized version, please refer to the sample application. + +Sample application +================== + +There is a xdpsock benchmarking/test application included that +demonstrates how to use AF_XDP sockets with both private and shared +UMEMs. Say that you would like your UDP traffic from port 4242 to end +up in queue 16, that we will enable AF_XDP on. Here, we use ethtool +for this:: + + ethtool -N p3p2 rx-flow-hash udp4 fn + ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ + action 16 + +Running the rxdrop benchmark in XDP_DRV mode can then be done +using:: + + samples/bpf/xdpsock -i p3p2 -q 16 -r -N + +For XDP_SKB mode, use the switch "-S" instead of "-N" and all options +can be displayed with "-h", as usual. + +Credits +======= + +- Björn Töpel (AF_XDP core) +- Magnus Karlsson (AF_XDP core) +- Alexander Duyck +- Alexei Starovoitov +- Daniel Borkmann +- Jesper Dangaard Brouer +- John Fastabend +- Jonathan Corbet (LWN coverage) +- Michael S. Tsirkin +- Qi Z Zhang +- Willem de Bruijn + diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 9ba04c0bab8d..c13214d073a4 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -140,7 +140,7 @@ bonding module at load time, or are specified via sysfs. Module options may be given as command line arguments to the insmod or modprobe command, but are usually specified in either the -/etc/modrobe.d/*.conf configuration files, or in a distro-specific +/etc/modprobe.d/*.conf configuration files, or in a distro-specific configuration file (some of which are detailed in the next section). Details on bonding support for sysfs is provided in the diff --git a/Documentation/networking/e100.txt b/Documentation/networking/e100.rst index 54810b82c01a..d4d837027925 100644 --- a/Documentation/networking/e100.txt +++ b/Documentation/networking/e100.rst @@ -1,7 +1,7 @@ Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters ============================================================== -March 15, 2011 +June 1, 2018 Contents ======== @@ -36,16 +36,9 @@ Channel Bonding documentation can be found in the Linux kernel source: Identifying Your Adapter ======================== -For more information on how to identify your adapter, go to the Adapter & -Driver ID Guide at: - - http://support.intel.com/support/network/adapter/pro100/21397.htm - -For the latest Intel network drivers for Linux, refer to the following -website. In the search field, enter your adapter name or type, or use the -networking link on the left to search for your adapter: - - http://downloadfinder.intel.com/scripts-df/support_intel.asp +For information on how to identify your adapter, and for the latest Intel +network drivers, refer to the Intel Support website: +http://www.intel.com/support Driver Configuration Parameters =============================== @@ -57,22 +50,26 @@ Rx Descriptors: Number of receive descriptors. A receive descriptor is a data structure that describes a receive buffer and its attributes to the network controller. The data in the descriptor is used by the controller to write data from the controller to host memory. In the 3.x.x driver the valid range - for this parameter is 64-256. The default value is 64. This parameter can be - changed using the command: + for this parameter is 64-256. The default value is 256. This parameter can be + changed using the command:: - ethtool -G eth? rx n, where n is the number of desired rx descriptors. + ethtool -G eth? rx n + + Where n is the number of desired Rx descriptors. Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a data structure that describes a transmit buffer and its attributes to the network controller. The data in the descriptor is used by the controller to read data from the host memory to the controller. In the 3.x.x driver the valid - range for this parameter is 64-256. The default value is 64. This parameter - can be changed using the command: + range for this parameter is 64-256. The default value is 128. This parameter + can be changed using the command:: + + ethtool -G eth? tx n - ethtool -G eth? tx n, where n is the number of desired tx descriptors. + Where n is the number of desired Tx descriptors. Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by - default. The ethtool utility can be used as follows to force speed/duplex. + default. The ethtool utility can be used as follows to force speed/duplex.:: ethtool -s eth? autoneg off speed {10|100} duplex {full|half} @@ -81,7 +78,7 @@ Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by Event Log Message Level: The driver uses the message level flag to log events to syslog. The message level can be set at driver load time. It can also be - set using the command: + set using the command:: ethtool -s eth? msglvl n @@ -112,9 +109,9 @@ Additional Configurations --------------------- In order to see link messages and other Intel driver information on your console, you must set the dmesg level up to six. This can be done by - entering the following on the command line before loading the e100 driver: + entering the following on the command line before loading the e100 driver:: - dmesg -n 8 + dmesg -n 6 If you wish to see all messages issued by the driver, including debug messages, set the dmesg level to eight. @@ -146,7 +143,8 @@ Additional Configurations NAPI (Rx polling mode) is supported in the e100 driver. - See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI. + See https://wiki.linuxfoundation.org/networking/napi for more information + on NAPI. Multiple Interfaces on Same Ethernet Broadcast Network ------------------------------------------------------ @@ -160,7 +158,7 @@ Additional Configurations If you have multiple interfaces in a server, either turn on ARP filtering by - (1) entering: echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter + (1) entering:: echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter (this only works if your kernel's version is higher than 2.4.5), or (2) installing the interfaces in separate broadcast domains (either @@ -169,15 +167,11 @@ Additional Configurations Support ======= - For general information, go to the Intel support website at: +http://www.intel.com/support/ - http://support.intel.com - - or the Intel Wired Networking project hosted by Sourceforge at: - - http://sourceforge.net/projects/e1000 - -If an issue is identified with the released source code on the supported -kernel with a supported adapter, email the specific information related to the -issue to e1000-devel@lists.sourceforge.net. +or the Intel Wired Networking project hosted by Sourceforge at: +http://sourceforge.net/projects/e1000 +If an issue is identified with the released source code on a supported kernel +with a supported adapter, email the specific information related to the issue +to e1000-devel@lists.sf.net. diff --git a/Documentation/networking/e1000.txt b/Documentation/networking/e1000.rst index 1f6ed848363d..616848940e63 100644 --- a/Documentation/networking/e1000.txt +++ b/Documentation/networking/e1000.rst @@ -154,7 +154,7 @@ NOTE: When e1000 is loaded with default settings and multiple adapters are in use simultaneously, the CPU utilization may increase non- linearly. In order to limit the CPU utilization without impacting the overall throughput, we recommend that you load the driver as - follows: + follows:: modprobe e1000 InterruptThrottleRate=3000,3000,3000 @@ -167,8 +167,8 @@ NOTE: When e1000 is loaded with default settings and multiple adapters RxDescriptors ------------- -Valid Range: 80-256 for 82542 and 82543-based adapters - 80-4096 for all other supported adapters +Valid Range: 48-256 for 82542 and 82543-based adapters + 48-4096 for all other supported adapters Default Value: 256 This value specifies the number of receive buffer descriptors allocated @@ -230,8 +230,8 @@ speed. Duplex should also be set when Speed is set to either 10 or 100. TxDescriptors ------------- -Valid Range: 80-256 for 82542 and 82543-based adapters - 80-4096 for all other supported adapters +Valid Range: 48-256 for 82542 and 82543-based adapters + 48-4096 for all other supported adapters Default Value: 256 This value is the number of transmit descriptors allocated by the driver. @@ -242,41 +242,10 @@ NOTE: Depending on the available system resources, the request for a higher number of transmit descriptors may be denied. In this case, use a lower number. -TxDescriptorStep ----------------- -Valid Range: 1 (use every Tx Descriptor) - 4 (use every 4th Tx Descriptor) - -Default Value: 1 (use every Tx Descriptor) - -On certain non-Intel architectures, it has been observed that intense TX -traffic bursts of short packets may result in an improper descriptor -writeback. If this occurs, the driver will report a "TX Timeout" and reset -the adapter, after which the transmit flow will restart, though data may -have stalled for as much as 10 seconds before it resumes. - -The improper writeback does not occur on the first descriptor in a system -memory cache-line, which is typically 32 bytes, or 4 descriptors long. - -Setting TxDescriptorStep to a value of 4 will ensure that all TX descriptors -are aligned to the start of a system memory cache line, and so this problem -will not occur. - -NOTES: Setting TxDescriptorStep to 4 effectively reduces the number of - TxDescriptors available for transmits to 1/4 of the normal allocation. - This has a possible negative performance impact, which may be - compensated for by allocating more descriptors using the TxDescriptors - module parameter. - - There are other conditions which may result in "TX Timeout", which will - not be resolved by the use of the TxDescriptorStep parameter. As the - issue addressed by this parameter has never been observed on Intel - Architecture platforms, it should not be used on Intel platforms. - TxIntDelay ---------- Valid Range: 0-65535 (0=off) -Default Value: 64 +Default Value: 8 This value delays the generation of transmit interrupts in units of 1.024 microseconds. Transmit interrupt reduction can improve CPU @@ -288,7 +257,7 @@ TxAbsIntDelay ------------- (This parameter is supported only on 82540, 82545 and later adapters.) Valid Range: 0-65535 (0=off) -Default Value: 64 +Default Value: 32 This value, in units of 1.024 microseconds, limits the delay in which a transmit interrupt is generated. Useful only if TxIntDelay is non-zero, @@ -310,7 +279,7 @@ Copybreak --------- Valid Range: 0-xxxxxxx (0=off) Default Value: 256 -Usage: insmod e1000.ko copybreak=128 +Usage: modprobe e1000.ko copybreak=128 Driver copies all packets below or equaling this size to a fresh RX buffer before handing it up the stack. @@ -328,14 +297,6 @@ Default Value: 0 (disabled) Allows PHY to turn off in lower power states. The user can turn off this parameter in supported chipsets. -KumeranLockLoss ---------------- -Valid Range: 0-1 -Default Value: 1 (enabled) - -This workaround skips resetting the PHY at shutdown for the initial -silicon releases of ICH8 systems. - Speed and Duplex Configuration ============================== @@ -397,12 +358,12 @@ Additional Configurations ------------ Jumbo Frames support is enabled by changing the MTU to a value larger than the default of 1500. Use the ifconfig command to increase the MTU size. - For example: + For example:: ifconfig eth<x> mtu 9000 up This setting is not saved across reboots. It can be made permanent if - you add: + you add:: MTU=9000 diff --git a/Documentation/networking/failover.rst b/Documentation/networking/failover.rst new file mode 100644 index 000000000000..f0c8483cdbf5 --- /dev/null +++ b/Documentation/networking/failover.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======== +FAILOVER +======== + +Overview +======== + +The failover module provides a generic interface for paravirtual drivers +to register a netdev and a set of ops with a failover instance. The ops +are used as event handlers that get called to handle netdev register/ +unregister/link change/name change events on slave pci ethernet devices +with the same mac address as the failover netdev. + +This enables paravirtual drivers to use a VF as an accelerated low latency +datapath. It also allows live migration of VMs with direct attached VFs by +failing over to the paravirtual datapath when the VF is unplugged. diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index a4508ec1816b..e6b4ebb2b243 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -169,7 +169,7 @@ access to BPF code as well. BPF engine and instruction set ------------------------------ -Under tools/net/ there's a small helper tool called bpf_asm which can +Under tools/bpf/ there's a small helper tool called bpf_asm which can be used to write low-level filters for example scenarios mentioned in the previous section. Asm-like syntax mentioned here has been implemented in bpf_asm and will be used for further explanations (instead of dealing with @@ -359,7 +359,7 @@ $ ./bpf_asm -c foo In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF filters that might not be obvious at first, it's good to test filters before attaching to a live system. For that purpose, there's a small tool called -bpf_dbg under tools/net/ in the kernel source directory. This debugger allows +bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows for testing BPF filters against given pcap files, single stepping through the BPF code on the pcap's packets and to do BPF machine register dumps. @@ -483,7 +483,13 @@ Example output from dmesg: [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 -In the kernel source tree under tools/net/, there's bpf_jit_disasm for +When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and +setting any other value than that will return in failure. This is even the case for +setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log +is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the +generally recommended approach instead. + +In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for generating disassembly out of the kernel log's hexdump: # ./bpf_jit_disasm @@ -1136,6 +1142,7 @@ into a register from memory, the register's top 56 bits are known zero, while the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; 0x1ff), because of potential carries. + Besides arithmetic, the register state can also be updated by conditional branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' @@ -1144,14 +1151,16 @@ BPF_JSGE) would instead update the signed minimum/maximum values. Information from the signed and unsigned bounds can be combined; for instance if a value is first tested < 8 and then tested s> 4, the verifier will conclude that the value is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. + PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all pointers sharing that same variable offset. This is important for packet range -checks: after adding some variable to a packet pointer, if you then copy it to -another register and (say) add a constant 4, both registers will share the same -'id' but one will have a fixed offset of +4. Then if it is bounds-checked and -found to be less than a PTR_TO_PACKET_END, the other register is now known to -have a safe range of at least 4 bytes. See 'Direct packet access', below, for -more on PTR_TO_PACKET ranges. +checks: after adding a variable to a packet pointer register A, if you then copy +it to another register B and then add a constant 4 to A, both registers will +share the same 'id' but the A will have a fixed offset of +4. Then if A is +bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is +now known to have a safe range of at least 4 bytes. See 'Direct packet access', +below, for more on PTR_TO_PACKET ranges. + The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of the pointer returned from a map lookup. This means that when one copy is checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. diff --git a/Documentation/networking/gtp.txt b/Documentation/networking/gtp.txt index 0d9c18f05ec6..6966bbec1ecb 100644 --- a/Documentation/networking/gtp.txt +++ b/Documentation/networking/gtp.txt @@ -67,7 +67,7 @@ Don't be confused by terminology: The GTP User Plane goes through kernel accelerated path, while the GTP Control Plane goes to Userspace :) -The official homepge of the module is at +The official homepage of the module is at https://osmocom.org/projects/linux-kernel-gtp-u/wiki == Userspace Programs with Linux Kernel GTP-U support == @@ -120,7 +120,7 @@ If yo have questions regarding how to use the Kernel GTP module from your own software, or want to contribute to the code, please use the osmocom-net-grps mailing list for related discussion. The list can be reached at osmocom-net-gprs@lists.osmocom.org and the mailman -interface for managign your subscription is at +interface for managing your subscription is at https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs == Issue Tracker == diff --git a/Documentation/networking/ila.txt b/Documentation/networking/ila.txt index 78df879abd26..a17dac9dc915 100644 --- a/Documentation/networking/ila.txt +++ b/Documentation/networking/ila.txt @@ -121,7 +121,7 @@ three options to deal with this: - checksum neutral mapping When an address is translated the difference can be offset - elsewhere in a part of the packet that is covered by the + elsewhere in a part of the packet that is covered by the checksum. The low order sixteen bits of the identifier are used. This method is preferred since it doesn't require parsing a packet beyond the IP header and in most cases the diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index f204eaff657d..fec8588a588e 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -6,9 +6,12 @@ Contents: .. toctree:: :maxdepth: 2 + af_xdp batman-adv can dpaa2/index + e100 + e1000 kapi z8530book msg_zerocopy diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 5dc1a040a2f1..ce8fbf5aa63c 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -26,7 +26,7 @@ ip_no_pmtu_disc - INTEGER discarded. Outgoing frames are handled the same as in mode 1, implicitly setting IP_PMTUDISC_DONT on every created socket. - Mode 3 is a hardend pmtu discover mode. The kernel will only + Mode 3 is a hardened pmtu discover mode. The kernel will only accept fragmentation-needed errors if the underlying protocol can verify them besides a plain socket lookup. Current protocols for which pmtu events will be honored are TCP, SCTP @@ -449,8 +449,10 @@ tcp_recovery - INTEGER features. RACK: 0x1 enables the RACK loss detection for fast detection of lost - retransmissions and tail drops. + retransmissions and tail drops. It also subsumes and disables + RFC6675 recovery for SACK connections. RACK: 0x2 makes RACK's reordering window static (min_rtt/4). + RACK: 0x4 disables RACK's DUPACK threshold heuristic Default: 0x1 @@ -523,6 +525,19 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max tcp_sack - BOOLEAN Enable select acknowledgments (SACKS). +tcp_comp_sack_delay_ns - LONG INTEGER + TCP tries to reduce number of SACK sent, using a timer + based on 5% of SRTT, capped by this sysctl, in nano seconds. + The default is 1ms, based on TSO autosizing period. + + Default : 1,000,000 ns (1 ms) + +tcp_comp_sack_nr - INTEGER + Max numer of SACK that can be compressed. + Using 0 disables SACK compression. + + Detault : 44 + tcp_slow_start_after_idle - BOOLEAN If set, provide RFC2861 behavior and time out the congestion window after an idle period. An idle period is defined at @@ -652,11 +667,15 @@ tcp_tso_win_divisor - INTEGER building larger TSO frames. Default: 3 -tcp_tw_reuse - BOOLEAN - Allow to reuse TIME-WAIT sockets for new connections when it is - safe from protocol viewpoint. Default value is 0. +tcp_tw_reuse - INTEGER + Enable reuse of TIME-WAIT sockets for new connections when it is + safe from protocol viewpoint. + 0 - disable + 1 - global enable + 2 - enable for loopback traffic only It should not be changed without advice/request of technical experts. + Default: 2 tcp_window_scaling - BOOLEAN Enable window scaling as defined in RFC1323. @@ -1390,26 +1409,26 @@ mld_qrv - INTEGER Default: 2 (as specified by RFC3810 9.1) Minimum: 1 (as specified by RFC6636 4.5) -max_dst_opts_cnt - INTEGER +max_dst_opts_number - INTEGER Maximum number of non-padding TLVs allowed in a Destination options extension header. If this value is less than zero then unknown options are disallowed and the number of known TLVs allowed is the absolute value of this number. Default: 8 -max_hbh_opts_cnt - INTEGER +max_hbh_opts_number - INTEGER Maximum number of non-padding TLVs allowed in a Hop-by-Hop options extension header. If this value is less than zero then unknown options are disallowed and the number of known TLVs allowed is the absolute value of this number. Default: 8 -max dst_opts_len - INTEGER +max_dst_opts_length - INTEGER Maximum length allowed for a Destination options extension header. Default: INT_MAX (unlimited) -max hbh_opts_len - INTEGER +max_hbh_length - INTEGER Maximum length allowed for a Hop-by-Hop options extension header. Default: INT_MAX (unlimited) @@ -1428,6 +1447,19 @@ ip6frag_low_thresh - INTEGER ip6frag_time - INTEGER Time in seconds to keep an IPv6 fragment in memory. +IPv6 Segment Routing: + +seg6_flowlabel - INTEGER + Controls the behaviour of computing the flowlabel of outer + IPv6 header in case of SR T.encaps + + -1 set flowlabel to zero. + 0 copy flowlabel from Inner packet in case of Inner IPv6 + (Set flowlabel to 0 in case IPv4/L2) + 1 Compute the flowlabel using seg6_make_flowlabel() + + Default is 0. + conf/default/*: Change the interface-specific default settings. @@ -2126,18 +2158,3 @@ max_dgram_qlen - INTEGER Default: 10 - -UNDOCUMENTED: - -/proc/sys/net/irda/* - fast_poll_increase FIXME - warn_noreply_time FIXME - discovery_slots FIXME - slot_timeout FIXME - max_baud_rate FIXME - discovery_timeout FIXME - lap_keepalive_time FIXME - max_noreply_time FIXME - max_tx_data_size FIXME - max_tx_window FIXME - min_tx_turn_time FIXME diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt index 8dbc08b7e431..ba794b7e51be 100644 --- a/Documentation/networking/ipsec.txt +++ b/Documentation/networking/ipsec.txt @@ -25,8 +25,8 @@ Quote from RFC3173: is implementation dependent. Current IPComp implementation is indeed by the book, while as in practice -when sending non-compressed packet to the peer(whether or not packet len -is smaller than the threshold or the compressed len is large than original +when sending non-compressed packet to the peer (whether or not packet len +is smaller than the threshold or the compressed len is larger than original packet len), the packet is dropped when checking the policy as this packet matches the selector but not coming from any XFRM layer, i.e., with no security path. Such naked packet will not eventually make it to upper layer. diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt index 812ef003e0a8..27a38e50c287 100644 --- a/Documentation/networking/ipvlan.txt +++ b/Documentation/networking/ipvlan.txt @@ -73,11 +73,11 @@ mode to make conn-tracking work. This is the default option. To configure the IPvlan port in this mode, user can choose to either add this option on the command-line or don't specify anything. This is the traditional mode where slaves can cross-talk among -themseleves apart from talking through the master device. +themselves apart from talking through the master device. 5.2 private: If this option is added to the command-line, the port is set in private -mode. i.e. port wont allow cross communication between slaves. +mode. i.e. port won't allow cross communication between slaves. 5.3 vepa: If this is added to the command-line, the port is set in VEPA mode. diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt index 9a513295b07c..b773a5278ac4 100644 --- a/Documentation/networking/kcm.txt +++ b/Documentation/networking/kcm.txt @@ -1,4 +1,4 @@ -Kernel Connection Mulitplexor +Kernel Connection Multiplexor ----------------------------- Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based @@ -31,7 +31,7 @@ KCM implements an NxM multiplexor in the kernel as diagrammed below: KCM sockets ----------- -The KCM sockets provide the user interface to the muliplexor. All the KCM sockets +The KCM sockets provide the user interface to the multiplexor. All the KCM sockets bound to a multiplexor are considered to have equivalent function, and I/O operations in different sockets may be done in parallel without the need for synchronization between threads in userspace. @@ -199,7 +199,7 @@ while. Example use: BFP programs for message delineation ------------------------------------ -BPF programs can be compiled using the BPF LLVM backend. For exmple, +BPF programs can be compiled using the BPF LLVM backend. For example, the BPF program for parsing Thrift is: #include "bpf.h" /* for __sk_buff */ @@ -222,7 +222,7 @@ messages. The kernel provides necessary assurances that messages are sent and received atomically. This relieves much of the burden applications have in mapping a message based protocol onto the TCP stream. KCM also make application layer messages a unit of work in the kernel for the purposes of -steerng and scheduling, which in turn allows a simpler networking model in +steering and scheduling, which in turn allows a simpler networking model in multithreaded applications. Configurations @@ -272,7 +272,7 @@ on the socket thus waking up the application thread. When the application sees the error (which may just be a disconnect) it should unattach the socket from KCM and then close it. It is assumed that once an error is posted on the TCP socket the data stream is unrecoverable (i.e. an error -may have occurred in the middle of receiving a messssge). +may have occurred in the middle of receiving a message). TCP connection monitoring ------------------------- diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst new file mode 100644 index 000000000000..70ca2f5800c4 --- /dev/null +++ b/Documentation/networking/net_failover.rst @@ -0,0 +1,116 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +NET_FAILOVER +============ + +Overview +======== + +The net_failover driver provides an automated failover mechanism via APIs +to create and destroy a failover master netdev and mananges a primary and +standby slave netdevs that get registered via the generic failover +infrastructrure. + +The failover netdev acts a master device and controls 2 slave devices. The +original paravirtual interface is registered as 'standby' slave netdev and +a passthru/vf device with the same MAC gets registered as 'primary' slave +netdev. Both 'standby' and 'failover' netdevs are associated with the same +'pci' device. The user accesses the network interface via 'failover' netdev. +The 'failover' netdev chooses 'primary' netdev as default for transmits when +it is available with link up and running. + +This can be used by paravirtual drivers to enable an alternate low latency +datapath. It also enables hypervisor controlled live migration of a VM with +direct attached VF by failing over to the paravirtual datapath when the VF +is unplugged. + +virtio-net accelerated datapath: STANDBY mode +============================================= + +net_failover enables hypervisor controlled accelerated datapath to virtio-net +enabled VMs in a transparent manner with no/minimal guest userspace chanages. + +To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY +feature on the virtio-net interface and assign the same MAC address to both +virtio-net and VF interfaces. + +Here is an example XML snippet that shows such configuration. + + <interface type='network'> + <mac address='52:54:00:00:12:53'/> + <source network='enp66s0f0_br'/> + <target dev='tap01'/> + <model type='virtio'/> + <driver name='vhost' queues='4'/> + <link state='down'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/> + </interface> + <interface type='hostdev' managed='yes'> + <mac address='52:54:00:00:12:53'/> + <source> + <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/> + </interface> + +Booting a VM with the above configuration will result in the following 3 +netdevs created in the VM. + +4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 + link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff + inet 192.168.12.53/24 brd 192.168.12.255 scope global dynamic ens10 + valid_lft 42482sec preferred_lft 42482sec + inet6 fe80::97d8:db2:8c10:b6d6/64 scope link + valid_lft forever preferred_lft forever +5: ens10nsby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ens10 state UP group default qlen 1000 + link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff +7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000 + link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff + +ens10 is the 'failover' master netdev, ens10nsby and ens11 are the slave +'standby' and 'primary' netdevs respectively. + +Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode +================================================================== + +net_failover also enables hypervisor controlled live migration to be supported +with VMs that have direct attached SR-IOV VF devices by automatic failover to +the paravirtual datapath when the VF is unplugged. + +Here is a sample script that shows the steps to initiate live migration on +the source hypervisor. + +# cat vf_xml +<interface type='hostdev' managed='yes'> + <mac address='52:54:00:00:12:53'/> + <source> + <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/> +</interface> + +# Source Hypervisor +#!/bin/bash + +DOMAIN=fedora27-tap01 +PF=enp66s0f0 +VF_NUM=5 +TAP_IF=tap01 +VF_XML= + +MAC=52:54:00:00:12:53 +ZERO_MAC=00:00:00:00:00:00 + +virsh domif-setlink $DOMAIN $TAP_IF up +bridge fdb del $MAC dev $PF master +virsh detach-device $DOMAIN $VF_XML +ip link set $PF vf $VF_NUM mac $ZERO_MAC + +virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system + +# Destination Hypervisor +#!/bin/bash + +virsh attach-device $DOMAIN $VF_XML +virsh domif-setlink $DOMAIN $TAP_IF down diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt index 2a3278d5cf35..fa951b820b25 100644 --- a/Documentation/networking/netdev-FAQ.txt +++ b/Documentation/networking/netdev-FAQ.txt @@ -179,6 +179,15 @@ A: No. See above answer. In short, if you think it really belongs in dash marker line as described in Documentation/process/submitting-patches.rst to temporarily embed that information into the patch that you send. +Q: Are all networking bug fixes backported to all stable releases? + +A: Due to capacity, Dave could only take care of the backports for the last + 2 stable releases. For earlier stable releases, each stable branch maintainer + is supposed to take care of them. If you find any patch is missing from an + earlier stable branch, please notify stable@vger.kernel.org with either a + commit ID or a formal patch backported, and CC Dave and other relevant + networking developers. + Q: Someone said that the comment style and coding convention is different for the networking content. Is this true? diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt index c77f9d57eb91..c4a54c162547 100644 --- a/Documentation/networking/netdev-features.txt +++ b/Documentation/networking/netdev-features.txt @@ -113,6 +113,13 @@ whatever headers there might be. NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). + * Transmit UDP segmentation offload + +NETIF_F_GSO_UDP_GSO_L4 accepts a single UDP header with a payload that exceeds +gso_size. On segmentation, it segments the payload on gso_size boundaries and +replicates the network and UDP headers (fixing up the last one if less than +gso_size). + * Transmit DMA from high memory On platforms where this is relevant, NETIF_F_HIGHDMA signals that diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt index 433b6724797a..1669dc2419fd 100644 --- a/Documentation/networking/nf_conntrack-sysctl.txt +++ b/Documentation/networking/nf_conntrack-sysctl.txt @@ -156,7 +156,7 @@ nf_conntrack_timestamp - BOOLEAN nf_conntrack_udp_timeout - INTEGER (seconds) default 30 -nf_conntrack_udp_timeout_stream2 - INTEGER (seconds) +nf_conntrack_udp_timeout_stream - INTEGER (seconds) default 180 This extended timeout will be used in case there is an UDP stream diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt index 091d20273dcb..61daf4b39600 100644 --- a/Documentation/networking/ppp_generic.txt +++ b/Documentation/networking/ppp_generic.txt @@ -300,12 +300,6 @@ unattached instance are: The ioctl calls available on an instance of /dev/ppp attached to a channel are: -* PPPIOCDETACH detaches the instance from the channel. This ioctl is - deprecated since the same effect can be achieved by closing the - instance. In order to prevent possible races this ioctl will fail - with an EINVAL error if more than one file descriptor refers to this - instance (i.e. as a result of dup(), dup2() or fork()). - * PPPIOCCONNECT connects this channel to a PPP interface. The argument should point to an int containing the interface unit number. It will return an EINVAL error if the channel is already diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt index 31abd04b9572..6f55eb960a6d 100644 --- a/Documentation/power/suspend-and-cpuhotplug.txt +++ b/Documentation/power/suspend-and-cpuhotplug.txt @@ -168,7 +168,7 @@ update on the CPUs, as discussed below: [Please bear in mind that the kernel requests the microcode images from userspace, using the request_firmware() function defined in -drivers/base/firmware_class.c] +drivers/base/firmware_loader/main.c] a. When all the CPUs are identical: diff --git a/Documentation/process/2.Process.rst b/Documentation/process/2.Process.rst index ce5561bb3f8e..a9c46dd0706b 100644 --- a/Documentation/process/2.Process.rst +++ b/Documentation/process/2.Process.rst @@ -18,17 +18,17 @@ major kernel release happening every two or three months. The recent release history looks like this: ====== ================= - 2.6.38 March 14, 2011 - 2.6.37 January 4, 2011 - 2.6.36 October 20, 2010 - 2.6.35 August 1, 2010 - 2.6.34 May 15, 2010 - 2.6.33 February 24, 2010 + 4.11 April 30, 2017 + 4.12 July 2, 2017 + 4.13 September 3, 2017 + 4.14 November 12, 2017 + 4.15 January 28, 2018 + 4.16 April 1, 2018 ====== ================= -Every 2.6.x release is a major kernel release with new features, internal -API changes, and more. A typical 2.6 release can contain nearly 10,000 -changesets with changes to several hundred thousand lines of code. 2.6 is +Every 4.x release is a major kernel release with new features, internal +API changes, and more. A typical 4.x release contain about 13,000 +changesets with changes to several hundred thousand lines of code. 4.x is thus the leading edge of Linux kernel development; the kernel uses a rolling development model which is continually integrating major changes. @@ -70,20 +70,19 @@ will get up to somewhere between -rc6 and -rc9 before the kernel is considered to be sufficiently stable and the final 2.6.x release is made. At that point the whole process starts over again. -As an example, here is how the 2.6.38 development cycle went (all dates in -2011): +As an example, here is how the 4.16 development cycle went (all dates in +2018): ============== =============================== - January 4 2.6.37 stable release - January 18 2.6.38-rc1, merge window closes - January 21 2.6.38-rc2 - February 1 2.6.38-rc3 - February 7 2.6.38-rc4 - February 15 2.6.38-rc5 - February 21 2.6.38-rc6 - March 1 2.6.38-rc7 - March 7 2.6.38-rc8 - March 14 2.6.38 stable release + January 28 4.15 stable release + February 11 4.16-rc1, merge window closes + February 18 4.16-rc2 + February 25 4.16-rc3 + March 4 4.16-rc4 + March 11 4.16-rc5 + March 18 4.16-rc6 + March 25 4.16-rc7 + April 1 4.17 stable release ============== =============================== How do the developers decide when to close the development cycle and create @@ -99,37 +98,42 @@ release is made. In the real world, this kind of perfection is hard to achieve; there are just too many variables in a project of this size. There comes a point where delaying the final release just makes the problem worse; the pile of changes waiting for the next merge window will grow -larger, creating even more regressions the next time around. So most 2.6.x +larger, creating even more regressions the next time around. So most 4.x kernels go out with a handful of known regressions though, hopefully, none of them are serious. Once a stable release is made, its ongoing maintenance is passed off to the "stable team," currently consisting of Greg Kroah-Hartman. The stable team -will release occasional updates to the stable release using the 2.6.x.y +will release occasional updates to the stable release using the 4.x.y numbering scheme. To be considered for an update release, a patch must (1) fix a significant bug, and (2) already be merged into the mainline for the next development kernel. Kernels will typically receive stable updates for a little more than one development cycle past their initial release. So, -for example, the 2.6.36 kernel's history looked like: +for example, the 4.13 kernel's history looked like: ============== =============================== - October 10 2.6.36 stable release - November 22 2.6.36.1 - December 9 2.6.36.2 - January 7 2.6.36.3 - February 17 2.6.36.4 + September 3 4.13 stable release + September 13 4.13.1 + September 20 4.13.2 + September 27 4.13.3 + October 5 4.13.4 + October 12 4.13.5 + ... ... + November 24 4.13.16 ============== =============================== -2.6.36.4 was the final stable update for the 2.6.36 release. +4.13.16 was the final stable update of the 4.13 release. Some kernels are designated "long term" kernels; they will receive support for a longer period. As of this writing, the current long term kernels and their maintainers are: - ====== ====================== =========================== - 2.6.27 Willy Tarreau (Deep-frozen stable kernel) - 2.6.32 Greg Kroah-Hartman - 2.6.35 Andi Kleen (Embedded flag kernel) + ====== ====================== ============================== + 3.16 Ben Hutchings (very long-term stable kernel) + 4.1 Sasha Levin + 4.4 Greg Kroah-Hartman (very long-term stable kernel) + 4.9 Greg Kroah-Hartman + 4.14 Greg Kroah-Hartman ====== ====================== =========================== The selection of a kernel for long-term support is purely a matter of a diff --git a/Documentation/process/5.Posting.rst b/Documentation/process/5.Posting.rst index c209d70da66f..c418c5d6cae4 100644 --- a/Documentation/process/5.Posting.rst +++ b/Documentation/process/5.Posting.rst @@ -10,8 +10,8 @@ of conventions and procedures which are used in the posting of patches; following them will make life much easier for everybody involved. This document will attempt to cover these expectations in reasonable detail; more information can also be found in the files process/submitting-patches.rst, -process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel documentation -directory. +process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel +documentation directory. When to post @@ -198,8 +198,8 @@ pass it to diff with the "-X" option. The tags mentioned above are used to describe how various developers have been associated with the development of this patch. They are described in -detail in the process/submitting-patches.rst document; what follows here is a brief -summary. Each of these lines has the format: +detail in the process/submitting-patches.rst document; what follows here is a +brief summary. Each of these lines has the format: :: @@ -210,8 +210,8 @@ The tags in common use are: - Signed-off-by: this is a developer's certification that he or she has the right to submit the patch for inclusion into the kernel. It is an agreement to the Developer's Certificate of Origin, the full text of - which can be found in Documentation/process/submitting-patches.rst. Code without a - proper signoff cannot be merged into the mainline. + which can be found in Documentation/process/submitting-patches.rst. Code + without a proper signoff cannot be merged into the mainline. - Co-developed-by: states that the patch was also created by another developer along with the original author. This is useful at times when multiple @@ -226,8 +226,8 @@ The tags in common use are: it to work. - Reviewed-by: the named developer has reviewed the patch for correctness; - see the reviewer's statement in Documentation/process/submitting-patches.rst for more - detail. + see the reviewer's statement in Documentation/process/submitting-patches.rst + for more detail. - Reported-by: names a user who reported a problem which is fixed by this patch; this tag is used to give credit to the (often underappreciated) diff --git a/Documentation/process/index.rst b/Documentation/process/index.rst index 1c9fe657ed01..37bd0628b6ee 100644 --- a/Documentation/process/index.rst +++ b/Documentation/process/index.rst @@ -52,6 +52,7 @@ lack of a better place. adding-syscalls magic-number volatile-considered-harmful + clang-format .. only:: subproject and html diff --git a/Documentation/process/magic-number.rst b/Documentation/process/magic-number.rst index 00cecf1fcba9..633be1043690 100644 --- a/Documentation/process/magic-number.rst +++ b/Documentation/process/magic-number.rst @@ -157,8 +157,5 @@ memory management. See ``include/sound/sndmagic.h`` for complete list of them. M OSS sound drivers have their magic numbers constructed from the soundcard PCI ID - these are not listed here as well. -IrDA subsystem also uses large number of own magic numbers, see -``include/net/irda/irda.h`` for a complete list of them. - HFS is another larger user of magic numbers - you can find them in ``fs/hfs/hfs.h``. diff --git a/Documentation/process/maintainer-pgp-guide.rst b/Documentation/process/maintainer-pgp-guide.rst index b453561a7148..aff9b1a4d77b 100644 --- a/Documentation/process/maintainer-pgp-guide.rst +++ b/Documentation/process/maintainer-pgp-guide.rst @@ -219,7 +219,7 @@ Our goal is to protect your master key by moving it to offline media, so if you only have a combined **[SC]** key, then you should create a separate signing subkey:: - $ gpg --quick-add-key [fpr] ed25519 sign + $ gpg --quick-addkey [fpr] ed25519 sign Remember to tell the keyservers about this change, so others can pull down your new subkey:: @@ -450,11 +450,18 @@ functionality. There are several options available: others. If you want to use ECC keys, your best bet among commercially available devices is the Nitrokey Start. +.. note:: + + If you are listed in MAINTAINERS or have an account at kernel.org, + you `qualify for a free Nitrokey Start`_ courtesy of The Linux + Foundation. + .. _`Nitrokey Start`: https://shop.nitrokey.com/shop/product/nitrokey-start-6 .. _`Nitrokey Pro`: https://shop.nitrokey.com/shop/product/nitrokey-pro-3 .. _`Yubikey 4`: https://www.yubico.com/product/yubikey-4-series/ .. _Gnuk: http://www.fsij.org/doc-gnuk/ .. _`LWN has a good review`: https://lwn.net/Articles/736231/ +.. _`qualify for a free Nitrokey Start`: https://www.kernel.org/nitrokey-digital-tokens-for-kernel-developers.html Configure your smartcard device ------------------------------- @@ -482,7 +489,7 @@ there are no convenient command-line switches:: You should set the user PIN (1), Admin PIN (3), and the Reset Code (4). Please make sure to record and store these in a safe place -- especially the Admin PIN and the Reset Code (which allows you to completely wipe -the smartcard). You so rarely need to use the Admin PIN, that you will +the smartcard). You so rarely need to use the Admin PIN, that you will inevitably forget what it is if you do not record it. Getting back to the main card menu, you can also set other values (such @@ -494,6 +501,12 @@ additionally leak information about your smartcard should you lose it. Despite having the name "PIN", neither the user PIN nor the admin PIN on the card need to be numbers. +.. warning:: + + Some devices may require that you move the subkeys onto the device + before you can change the passphrase. Please check the documentation + provided by the device manufacturer. + Move the subkeys to your smartcard ---------------------------------- @@ -655,6 +668,20 @@ want to import these changes back into your regular working directory:: $ gpg --export | gpg --homedir ~/.gnupg --import $ unset GNUPGHOME +Using gpg-agent over ssh +~~~~~~~~~~~~~~~~~~~~~~~~ + +You can forward your gpg-agent over ssh if you need to sign tags or +commits on a remote system. Please refer to the instructions provided +on the GnuPG wiki: + +- `Agent Forwarding over SSH`_ + +It works more smoothly if you can modify the sshd server settings on the +remote end. + +.. _`Agent Forwarding over SSH`: https://wiki.gnupg.org/AgentForwarding + Using PGP with Git ================== @@ -692,6 +719,7 @@ should be used (``[fpr]`` is the fingerprint of your key):: tell git to always use it instead of the legacy ``gpg`` from version 1:: $ git config --global gpg.program gpg2 + $ git config --global gpgv.program gpgv2 How to work with signed tags ---------------------------- @@ -731,6 +759,13 @@ If you are verifying someone else's git tag, then you will need to import their PGP key. Please refer to the ":ref:`verify_identities`" section below. +.. note:: + + If you get "``gpg: Can't check signature: unknown pubkey + algorithm``" error, you need to tell git to use gpgv2 for + verification, so it properly processes signatures made by ECC keys. + See instructions at the start of this section. + Configure git to always sign annotated tags ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst index f7152ed565e5..908bb55be407 100644 --- a/Documentation/process/submitting-patches.rst +++ b/Documentation/process/submitting-patches.rst @@ -761,7 +761,7 @@ requests, especially from new, unknown developers. If in doubt you can use the pull request as the cover letter for a normal posting of the patch series, giving the maintainer the option of using either. -A pull request should have [GIT] or [PULL] in the subject line. The +A pull request should have [GIT PULL] in the subject line. The request itself should include the repository name and the branch of interest on a single line; it should look something like:: diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt index 8ce78f82ae23..b14e03ff3528 100644 --- a/Documentation/scheduler/sched-deadline.txt +++ b/Documentation/scheduler/sched-deadline.txt @@ -49,7 +49,7 @@ CONTENTS 2.1 Main algorithm ------------------ - SCHED_DEADLINE uses three parameters, named "runtime", "period", and + SCHED_DEADLINE [18] uses three parameters, named "runtime", "period", and "deadline", to schedule tasks. A SCHED_DEADLINE task should receive "runtime" microseconds of execution time every "period" microseconds, and these "runtime" microseconds are available within "deadline" microseconds @@ -117,6 +117,10 @@ CONTENTS scheduling deadline = scheduling deadline + period remaining runtime = remaining runtime + runtime + The SCHED_FLAG_DL_OVERRUN flag in sched_attr's sched_flags field allows a task + to get informed about runtime overruns through the delivery of SIGXCPU + signals. + 2.2 Bandwidth reclaiming ------------------------ @@ -279,6 +283,19 @@ CONTENTS running_bw is incremented. +2.3 Energy-aware scheduling +------------------------ + + When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the + GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum + value that still allows to meet the deadlines. This behavior is currently + implemented only for ARM architectures. + + A particular care must be taken in case the time needed for changing frequency + is of the same order of magnitude of the reservation period. In such cases, + setting a fixed CPU frequency results in a lower amount of deadline misses. + + 3. Scheduling Real-Time Tasks ============================= @@ -505,6 +522,12 @@ CONTENTS 17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, 2016. + 18 - J. Lelli, C. Scordino, L. Abeni, D. Faggioli, Deadline scheduling in the + Linux kernel, Software: Practice and Experience, 46(6): 821-839, June + 2016. + 19 - C. Scordino, L. Abeni, J. Lelli, Energy-Aware Real-Time Scheduling in + the Linux Kernel, 33rd ACM/SIGAPP Symposium On Applied Computing (SAC + 2018), Pau, France, April 2018. 4. Bandwidth management diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt index 11e447bdb3a5..1b7436932a2b 100644 --- a/Documentation/scsi/scsi_eh.txt +++ b/Documentation/scsi/scsi_eh.txt @@ -82,24 +82,13 @@ function 1. invokes optional hostt->eh_timed_out() callback. Return value can be one of - - BLK_EH_HANDLED - This indicates that eh_timed_out() dealt with the timeout. - The command is passed back to the block layer and completed - via __blk_complete_requests(). - - *NOTE* After returning BLK_EH_HANDLED the SCSI layer is - assumed to be finished with the command, and no other - functions from the SCSI layer will be called. So this - should typically only be returned if the eh_timed_out() - handler raced with normal completion. - - BLK_EH_RESET_TIMER This indicates that more time is required to finish the command. Timer is restarted. This action is counted as a retry and only allowed scmd->allowed + 1(!) times. Once the - limit is reached, action for BLK_EH_NOT_HANDLED is taken instead. + limit is reached, action for BLK_EH_DONE is taken instead. - - BLK_EH_NOT_HANDLED + - BLK_EH_DONE eh_timed_out() callback did not handle the command. Step #2 is taken. diff --git a/Documentation/security/index.rst b/Documentation/security/index.rst index 298a94a33f05..85492bfca530 100644 --- a/Documentation/security/index.rst +++ b/Documentation/security/index.rst @@ -9,5 +9,7 @@ Security Documentation IMA-templates keys/index LSM + LSM-sctp + SELinux-sctp self-protection tpm/index diff --git a/Documentation/sound/alsa-configuration.rst b/Documentation/sound/alsa-configuration.rst index aed6b4fb8e46..4d83c1c0ca04 100644 --- a/Documentation/sound/alsa-configuration.rst +++ b/Documentation/sound/alsa-configuration.rst @@ -1062,7 +1062,7 @@ output (with ``--no-upload`` option) to kernel bugzilla or alsa-devel ML (see the section `Links and Addresses`_). ``power_save`` and ``power_save_controller`` options are for power-saving -mode. See powersave.txt for details. +mode. See powersave.rst for details. Note 2: If you get click noises on output, try the module option ``position_fix=1`` or ``2``. ``position_fix=1`` will use the SD_LPIB @@ -1133,7 +1133,7 @@ line_outs_monitor enable_monitor Enable Analog Out on Channel 63/64 by default. -See hdspm.txt for details. +See hdspm.rst for details. Module snd-ice1712 ------------------ @@ -2224,6 +2224,13 @@ quirk_alias Quirk alias list, pass strings like ``0123abcd:5678beef``, which applies the existing quirk for the device 5678:beef to a new device 0123:abcd. +use_vmalloc + Use vmalloc() for allocations of the PCM buffers (default: yes). + For architectures with non-coherent memory like ARM or MIPS, the + mmap access may give inconsistent results with vmalloc'ed + buffers. If mmap is used on such architectures, turn off this + option, so that the DMA-coherent buffers are allocated and used + instead. This module supports multiple devices, autoprobe and hotplugging. diff --git a/Documentation/sound/hd-audio/models.rst b/Documentation/sound/hd-audio/models.rst index 1fee5a4f6660..7c2d37571af0 100644 --- a/Documentation/sound/hd-audio/models.rst +++ b/Documentation/sound/hd-audio/models.rst @@ -263,6 +263,8 @@ hp-dock HP dock support mute-led-gpio Mute LED control via GPIO +hp-mic-fix + Fix for headset mic pin on HP boxes STAC9200 ======== diff --git a/Documentation/sound/soc/codec.rst b/Documentation/sound/soc/codec.rst index f87612b94812..8a9737eb7597 100644 --- a/Documentation/sound/soc/codec.rst +++ b/Documentation/sound/soc/codec.rst @@ -139,7 +139,7 @@ DAPM description ---------------- The Dynamic Audio Power Management description describes the codec power components and their relationships and registers to the ASoC core. -Please read dapm.txt for details of building the description. +Please read dapm.rst for details of building the description. Please also see the examples in other codec drivers. @@ -179,12 +179,12 @@ i.e. static int wm8974_mute(struct snd_soc_dai *dai, int mute) { - struct snd_soc_codec *codec = dai->codec; - u16 mute_reg = snd_soc_read(codec, WM8974_DAC) & 0xffbf; + struct snd_soc_component *component = dai->component; + u16 mute_reg = snd_soc_component_read32(component, WM8974_DAC) & 0xffbf; if (mute) - snd_soc_write(codec, WM8974_DAC, mute_reg | 0x40); + snd_soc_component_write(component, WM8974_DAC, mute_reg | 0x40); else - snd_soc_write(codec, WM8974_DAC, mute_reg); + snd_soc_component_write(component, WM8974_DAC, mute_reg); return 0; } diff --git a/Documentation/sound/soc/platform.rst b/Documentation/sound/soc/platform.rst index d5574904d981..c1badea53d3d 100644 --- a/Documentation/sound/soc/platform.rst +++ b/Documentation/sound/soc/platform.rst @@ -23,30 +23,26 @@ The platform DMA driver optionally supports the following ALSA operations:- }; The platform driver exports its DMA functionality via struct -snd_soc_platform_driver:- +snd_soc_component_driver:- :: - struct snd_soc_platform_driver { - char *name; + struct snd_soc_component_driver { + const char *name; - int (*probe)(struct platform_device *pdev); - int (*remove)(struct platform_device *pdev); - int (*suspend)(struct platform_device *pdev, struct snd_soc_cpu_dai *cpu_dai); - int (*resume)(struct platform_device *pdev, struct snd_soc_cpu_dai *cpu_dai); + ... + int (*probe)(struct snd_soc_component *); + void (*remove)(struct snd_soc_component *); + int (*suspend)(struct snd_soc_component *); + int (*resume)(struct snd_soc_component *); /* pcm creation and destruction */ - int (*pcm_new)(struct snd_card *, struct snd_soc_codec_dai *, struct snd_pcm *); + int (*pcm_new)(struct snd_soc_pcm_runtime *); void (*pcm_free)(struct snd_pcm *); - /* - * For platform caused delay reporting. - * Optional. - */ - snd_pcm_sframes_t (*delay)(struct snd_pcm_substream *, - struct snd_soc_dai *); - - /* platform stream ops */ - struct snd_pcm_ops *pcm_ops; + ... + const struct snd_pcm_ops *ops; + const struct snd_compr_ops *compr_ops; + ... }; Please refer to the ALSA driver documentation for details of audio DMA. @@ -66,7 +62,7 @@ Each SoC DAI driver must provide the following features:- 4. SYSCLK configuration 5. Suspend and resume (optional) -Please see codec.txt for a description of items 1 - 4. +Please see codec.rst for a description of items 1 - 4. SoC DSP Drivers diff --git a/Documentation/sphinx/parse-headers.pl b/Documentation/sphinx/parse-headers.pl index a958d8b5e99d..d410f47567e9 100755 --- a/Documentation/sphinx/parse-headers.pl +++ b/Documentation/sphinx/parse-headers.pl @@ -387,11 +387,11 @@ tree for more details. =head1 BUGS -Report bugs to Mauro Carvalho Chehab <mchehab@s-opensource.com> +Report bugs to Mauro Carvalho Chehab <mchehab@kernel.org> =head1 COPYRIGHT -Copyright (c) 2016 by Mauro Carvalho Chehab <mchehab@s-opensource.com>. +Copyright (c) 2016 by Mauro Carvalho Chehab <mchehab+samsung@kernel.org>. License GPLv2: GNU GPL version 2 <http://gnu.org/licenses/gpl.html>. diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt index 5992602469d8..9ecde517728c 100644 --- a/Documentation/sysctl/net.txt +++ b/Documentation/sysctl/net.txt @@ -45,6 +45,7 @@ through bpf(2) and passing a verifier in the kernel, a JIT will then translate these BPF proglets into native CPU instructions. There are two flavors of JITs, the newer eBPF JIT currently supported on: - x86_64 + - x86_32 - arm64 - arm32 - ppc64 diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 17256f2ad919..697ef8c225df 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -515,7 +515,7 @@ nr_hugepages Change the minimum size of the hugepage pool. -See Documentation/vm/hugetlbpage.txt +See Documentation/admin-guide/mm/hugetlbpage.rst ============================================================== @@ -524,7 +524,7 @@ nr_overcommit_hugepages Change the maximum size of the hugepage pool. The maximum is nr_hugepages + nr_overcommit_hugepages. -See Documentation/vm/hugetlbpage.txt +See Documentation/admin-guide/mm/hugetlbpage.rst ============================================================== @@ -667,7 +667,7 @@ and don't use much of it. The default value is 0. -See Documentation/vm/overcommit-accounting and +See Documentation/vm/overcommit-accounting.rst and mm/mmap.c::__vm_enough_memory() for more information. ============================================================== diff --git a/Documentation/trace/coresight-cpu-debug.txt b/Documentation/trace/coresight-cpu-debug.txt index 2b9b51cd501e..89ab09e78e8d 100644 --- a/Documentation/trace/coresight-cpu-debug.txt +++ b/Documentation/trace/coresight-cpu-debug.txt @@ -177,11 +177,11 @@ Here is an example of the debugging output format: ARM external debug module: coresight-cpu-debug 850000.debug: CPU[0]: coresight-cpu-debug 850000.debug: EDPRSR: 00000001 (Power:On DLK:Unlock) -coresight-cpu-debug 850000.debug: EDPCSR: [<ffff00000808e9bc>] handle_IPI+0x174/0x1d8 +coresight-cpu-debug 850000.debug: EDPCSR: handle_IPI+0x174/0x1d8 coresight-cpu-debug 850000.debug: EDCIDSR: 00000000 coresight-cpu-debug 850000.debug: EDVIDSR: 90000000 (State:Non-secure Mode:EL1/0 Width:64bits VMID:0) coresight-cpu-debug 852000.debug: CPU[1]: coresight-cpu-debug 852000.debug: EDPRSR: 00000001 (Power:On DLK:Unlock) -coresight-cpu-debug 852000.debug: EDPCSR: [<ffff0000087fab34>] debug_notifier_call+0x23c/0x358 +coresight-cpu-debug 852000.debug: EDPCSR: debug_notifier_call+0x23c/0x358 coresight-cpu-debug 852000.debug: EDCIDSR: 00000000 coresight-cpu-debug 852000.debug: EDVIDSR: 90000000 (State:Non-secure Mode:EL1/0 Width:64bits VMID:0) diff --git a/Documentation/trace/coresight.txt b/Documentation/trace/coresight.txt index 6f0120c3a4f1..1d74ad0202b6 100644 --- a/Documentation/trace/coresight.txt +++ b/Documentation/trace/coresight.txt @@ -187,13 +187,19 @@ that can be performed on them (see "struct coresight_ops"). The specific to that component only. "Implementation defined" customisations are expected to be accessed and controlled using those entries. -Last but not least, "struct module *owner" is expected to be set to reflect -the information carried in "THIS_MODULE". How to use the tracer modules ----------------------------- -Before trace collection can start, a coresight sink needs to be identify. +There are two ways to use the Coresight framework: 1) using the perf cmd line +tools and 2) interacting directly with the Coresight devices using the sysFS +interface. Preference is given to the former as using the sysFS interface +requires a deep understanding of the Coresight HW. The following sections +provide details on using both methods. + +1) Using the sysFS interface: + +Before trace collection can start, a coresight sink needs to be identified. There is no limit on the amount of sinks (nor sources) that can be enabled at any given moment. As a generic operation, all device pertaining to the sink class will have an "active" entry in sysfs: @@ -298,42 +304,48 @@ Instruction 13570831 0x8026B584 E28DD00C false ADD Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc} Timestamp Timestamp: 17107041535 -How to use the STM module -------------------------- +2) Using perf framework: -Using the System Trace Macrocell module is the same as the tracers - the only -difference is that clients are driving the trace capture rather -than the program flow through the code. +Coresight tracers are represented using the Perf framework's Performance +Monitoring Unit (PMU) abstraction. As such the perf framework takes charge of +controlling when tracing gets enabled based on when the process of interest is +scheduled. When configured in a system, Coresight PMUs will be listed when +queried by the perf command line tool: -As with any other CoreSight component, specifics about the STM tracer can be -found in sysfs with more information on each entry being found in [1]: + linaro@linaro-nano:~$ ./perf list pmu -root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm -enable_source hwevent_select port_enable subsystem uevent -hwevent_enable mgmt port_select traceid -root@genericarmv8:~# + List of pre-defined events (to be used in -e): -Like any other source a sink needs to be identified and the STM enabled before -being used: + cs_etm// [Kernel PMU event] -root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink -root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source + linaro@linaro-nano:~$ -From there user space applications can request and use channels using the devfs -interface provided for that purpose by the generic STM API: +Regardless of the number of tracers available in a system (usually equal to the +amount of processor cores), the "cs_etm" PMU will be listed only once. -root@genericarmv8:~# ls -l /dev/20100000.stm -crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm -root@genericarmv8:~# +A Coresight PMU works the same way as any other PMU, i.e the name of the PMU is +listed along with configuration options within forward slashes '/'. Since a +Coresight system will typically have more than one sink, the name of the sink to +work with needs to be specified as an event option. Names for sink to choose +from are listed in sysFS under ($SYSFS)/bus/coresight/devices: -Details on how to use the generic STM API can be found here [2]. + root@linaro-nano:~# ls /sys/bus/coresight/devices/ + 20010000.etf 20040000.funnel 20100000.stm 22040000.etm + 22140000.etm 230c0000.funnel 23240000.etm 20030000.tpiu + 20070000.etr 20120000.replicator 220c0000.funnel + 23040000.etm 23140000.etm 23340000.etm -[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm -[2]. Documentation/trace/stm.txt + root@linaro-nano:~# perf record -e cs_etm/@20070000.etr/u --per-thread program +The syntax within the forward slashes '/' is important. The '@' character +tells the parser that a sink is about to be specified and that this is the sink +to use for the trace session. -Using perf tools ----------------- +More information on the above and other example on how to use Coresight with +the perf tools can be found in the "HOWTO.md" file of the openCSD gitHub +repository [3]. + +2.1) AutoFDO analysis using the perf tools: perf can be used to record and analyze trace of programs. @@ -381,3 +393,38 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto $ taskset -c 2 ./sort_autofdo Bubble sorting array of 30000 elements 5806 ms + + +How to use the STM module +------------------------- + +Using the System Trace Macrocell module is the same as the tracers - the only +difference is that clients are driving the trace capture rather +than the program flow through the code. + +As with any other CoreSight component, specifics about the STM tracer can be +found in sysfs with more information on each entry being found in [1]: + +root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm +enable_source hwevent_select port_enable subsystem uevent +hwevent_enable mgmt port_select traceid +root@genericarmv8:~# + +Like any other source a sink needs to be identified and the STM enabled before +being used: + +root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink +root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source + +From there user space applications can request and use channels using the devfs +interface provided for that purpose by the generic STM API: + +root@genericarmv8:~# ls -l /dev/20100000.stm +crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm +root@genericarmv8:~# + +Details on how to use the generic STM API can be found here [2]. + +[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm +[2]. Documentation/trace/stm.txt +[3]. https://github.com/Linaro/perf-opencsd diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst index a5ea2cb0082b..1afae55dc55c 100644 --- a/Documentation/trace/events.rst +++ b/Documentation/trace/events.rst @@ -338,10 +338,14 @@ used for conditionally invoking triggers. The syntax for event triggers is roughly based on the syntax for set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands' -section of Documentation/trace/ftrace.txt), but there are major +section of Documentation/trace/ftrace.rst), but there are major differences and the implementation isn't currently tied to it in any way, so beware about making generalizations between the two. +Note: Writing into trace_marker (See Documentation/trace/ftrace.rst) + can also enable triggers that are written into + /sys/kernel/tracing/events/ftrace/print/trigger + 6.1 Expression syntax --------------------- diff --git a/Documentation/trace/ftrace-uses.rst b/Documentation/trace/ftrace-uses.rst index 998a60a93015..00283b6dd101 100644 --- a/Documentation/trace/ftrace-uses.rst +++ b/Documentation/trace/ftrace-uses.rst @@ -12,7 +12,7 @@ Written for: 4.14 Introduction ============ -The ftrace infrastructure was originially created to attach callbacks to the +The ftrace infrastructure was originally created to attach callbacks to the beginning of functions in order to record and trace the flow of the kernel. But callbacks to the start of a function can have other use cases. Either for live kernel patching, or for security monitoring. This document describes @@ -30,7 +30,7 @@ The ftrace context This requires extra care to what can be done inside a callback. A callback can be called outside the protective scope of RCU. -The ftrace infrastructure has some protections agains recursions and RCU +The ftrace infrastructure has some protections against recursions and RCU but one must still be very careful how they use the callbacks. diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index e45f0786f3f9..a20d34955333 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -224,6 +224,8 @@ of ftrace. Here is a list of some of the key files: has a side effect of enabling or disabling specific functions to be traced. Echoing names of functions into this file will limit the trace to only those functions. + This influences the tracers "function" and "function_graph" + and thus also function profiling (see "function_profile_enabled"). The functions listed in "available_filter_functions" are what can be written into this file. @@ -265,6 +267,8 @@ of ftrace. Here is a list of some of the key files: Functions listed in this file will cause the function graph tracer to only trace these functions and the functions that they call. (See the section "dynamic ftrace" for more details). + Note, set_ftrace_filter and set_ftrace_notrace still affects + what functions are being traced. set_graph_notrace: @@ -277,7 +281,8 @@ of ftrace. Here is a list of some of the key files: This lists the functions that ftrace has processed and can trace. These are the function names that you can pass to - "set_ftrace_filter" or "set_ftrace_notrace". + "set_ftrace_filter", "set_ftrace_notrace", + "set_graph_function", or "set_graph_notrace". (See the section "dynamic ftrace" below for more details.) dyn_ftrace_total_info: @@ -461,9 +466,17 @@ of ftrace. Here is a list of some of the key files: and ticks at the same rate as the hardware clocksource. boot: - Same as mono. Used to be a separate clock which accounted - for the time spent in suspend while CLOCK_MONOTONIC did - not. + This is the boot clock (CLOCK_BOOTTIME) and is based on the + fast monotonic clock, but also accounts for time spent in + suspend. Since the clock access is designed for use in + tracing in the suspend path, some side effects are possible + if clock is accessed after the suspend time is accounted before + the fast mono clock is updated. In this case, the clock update + appears to happen slightly sooner than it normally would have. + Also on 32-bit systems, it's possible that the 64-bit boot offset + sees a partial update. These effects are rare and post + processing should be able to handle them. See comments in the + ktime_get_boot_fast_ns() function for more information. To set a clock, simply echo the clock name into this file:: @@ -499,6 +512,11 @@ of ftrace. Here is a list of some of the key files: trace_fd = open("trace_marker", WR_ONLY); + Note: Writing into the trace_marker file can also initiate triggers + that are written into /sys/kernel/tracing/events/ftrace/print/trigger + See "Event triggers" in Documentation/trace/events.rst and an + example in Documentation/trace/histogram.rst (Section 3.) + trace_marker_raw: This is similar to trace_marker above, but is meant for for binary data diff --git a/Documentation/trace/histogram.txt b/Documentation/trace/histogram.txt index 6e05510afc28..b13771cb12c1 100644 --- a/Documentation/trace/histogram.txt +++ b/Documentation/trace/histogram.txt @@ -1604,7 +1604,6 @@ Entries: 7 Dropped: 0 - 2.2 Inter-event hist triggers ----------------------------- @@ -1993,3 +1992,547 @@ hist trigger specification. Hits: 12970 Entries: 2 Dropped: 0 + +3. User space creating a trigger +-------------------------------- + +Writing into /sys/kernel/tracing/trace_marker writes into the ftrace +ring buffer. This can also act like an event, by writing into the trigger +file located in /sys/kernel/tracing/events/ftrace/print/ + +Modifying cyclictest to write into the trace_marker file before it sleeps +and after it wakes up, something like this: + +static void traceputs(char *str) +{ + /* tracemark_fd is the trace_marker file descriptor */ + if (tracemark_fd < 0) + return; + /* write the tracemark message */ + write(tracemark_fd, str, strlen(str)); +} + +And later add something like: + + traceputs("start"); + clock_nanosleep(...); + traceputs("end"); + +We can make a histogram from this: + + # cd /sys/kernel/tracing + # echo 'latency u64 lat' > synthetic_events + # echo 'hist:keys=common_pid:ts0=common_timestamp.usecs if buf == "start"' > events/ftrace/print/trigger + # echo 'hist:keys=common_pid:lat=common_timestamp.usecs-$ts0:onmatch(ftrace.print).latency($lat) if buf == "end"' >> events/ftrace/print/trigger + # echo 'hist:keys=lat,common_pid:sort=lat' > events/synthetic/latency/trigger + +The above created a synthetic event called "latency" and two histograms +against the trace_marker, one gets triggered when "start" is written into the +trace_marker file and the other when "end" is written. If the pids match, then +it will call the "latency" synthetic event with the calculated latency as its +parameter. Finally, a histogram is added to the latency synthetic event to +record the calculated latency along with the pid. + +Now running cyclictest with: + + # ./cyclictest -p80 -d0 -i250 -n -a -t --tracemark -b 1000 + + -p80 : run threads at priority 80 + -d0 : have all threads run at the same interval + -i250 : start the interval at 250 microseconds (all threads will do this) + -n : sleep with nanosleep + -a : affine all threads to a separate CPU + -t : one thread per available CPU + --tracemark : enable trace mark writing + -b 1000 : stop if any latency is greater than 1000 microseconds + +Note, the -b 1000 is used just to make --tracemark available. + +Then we can see the histogram created by this with: + + # cat events/synthetic/latency/hist +# event histogram +# +# trigger info: hist:keys=lat,common_pid:vals=hitcount:sort=lat:size=2048 [active] +# + +{ lat: 107, common_pid: 2039 } hitcount: 1 +{ lat: 122, common_pid: 2041 } hitcount: 1 +{ lat: 166, common_pid: 2039 } hitcount: 1 +{ lat: 174, common_pid: 2039 } hitcount: 1 +{ lat: 194, common_pid: 2041 } hitcount: 1 +{ lat: 196, common_pid: 2036 } hitcount: 1 +{ lat: 197, common_pid: 2038 } hitcount: 1 +{ lat: 198, common_pid: 2039 } hitcount: 1 +{ lat: 199, common_pid: 2039 } hitcount: 1 +{ lat: 200, common_pid: 2041 } hitcount: 1 +{ lat: 201, common_pid: 2039 } hitcount: 2 +{ lat: 202, common_pid: 2038 } hitcount: 1 +{ lat: 202, common_pid: 2043 } hitcount: 1 +{ lat: 203, common_pid: 2039 } hitcount: 1 +{ lat: 203, common_pid: 2036 } hitcount: 1 +{ lat: 203, common_pid: 2041 } hitcount: 1 +{ lat: 206, common_pid: 2038 } hitcount: 2 +{ lat: 207, common_pid: 2039 } hitcount: 1 +{ lat: 207, common_pid: 2036 } hitcount: 1 +{ lat: 208, common_pid: 2040 } hitcount: 1 +{ lat: 209, common_pid: 2043 } hitcount: 1 +{ lat: 210, common_pid: 2039 } hitcount: 1 +{ lat: 211, common_pid: 2039 } hitcount: 4 +{ lat: 212, common_pid: 2043 } hitcount: 1 +{ lat: 212, common_pid: 2039 } hitcount: 2 +{ lat: 213, common_pid: 2039 } hitcount: 1 +{ lat: 214, common_pid: 2038 } hitcount: 1 +{ lat: 214, common_pid: 2039 } hitcount: 2 +{ lat: 214, common_pid: 2042 } hitcount: 1 +{ lat: 215, common_pid: 2039 } hitcount: 1 +{ lat: 217, common_pid: 2036 } hitcount: 1 +{ lat: 217, common_pid: 2040 } hitcount: 1 +{ lat: 217, common_pid: 2039 } hitcount: 1 +{ lat: 218, common_pid: 2039 } hitcount: 6 +{ lat: 219, common_pid: 2039 } hitcount: 9 +{ lat: 220, common_pid: 2039 } hitcount: 11 +{ lat: 221, common_pid: 2039 } hitcount: 5 +{ lat: 221, common_pid: 2042 } hitcount: 1 +{ lat: 222, common_pid: 2039 } hitcount: 7 +{ lat: 223, common_pid: 2036 } hitcount: 1 +{ lat: 223, common_pid: 2039 } hitcount: 3 +{ lat: 224, common_pid: 2039 } hitcount: 4 +{ lat: 224, common_pid: 2037 } hitcount: 1 +{ lat: 224, common_pid: 2036 } hitcount: 2 +{ lat: 225, common_pid: 2039 } hitcount: 5 +{ lat: 225, common_pid: 2042 } hitcount: 1 +{ lat: 226, common_pid: 2039 } hitcount: 7 +{ lat: 226, common_pid: 2036 } hitcount: 4 +{ lat: 227, common_pid: 2039 } hitcount: 6 +{ lat: 227, common_pid: 2036 } hitcount: 12 +{ lat: 227, common_pid: 2043 } hitcount: 1 +{ lat: 228, common_pid: 2039 } hitcount: 7 +{ lat: 228, common_pid: 2036 } hitcount: 14 +{ lat: 229, common_pid: 2039 } hitcount: 9 +{ lat: 229, common_pid: 2036 } hitcount: 8 +{ lat: 229, common_pid: 2038 } hitcount: 1 +{ lat: 230, common_pid: 2039 } hitcount: 11 +{ lat: 230, common_pid: 2036 } hitcount: 6 +{ lat: 230, common_pid: 2043 } hitcount: 1 +{ lat: 230, common_pid: 2042 } hitcount: 2 +{ lat: 231, common_pid: 2041 } hitcount: 1 +{ lat: 231, common_pid: 2036 } hitcount: 6 +{ lat: 231, common_pid: 2043 } hitcount: 1 +{ lat: 231, common_pid: 2039 } hitcount: 8 +{ lat: 232, common_pid: 2037 } hitcount: 1 +{ lat: 232, common_pid: 2039 } hitcount: 6 +{ lat: 232, common_pid: 2040 } hitcount: 2 +{ lat: 232, common_pid: 2036 } hitcount: 5 +{ lat: 232, common_pid: 2043 } hitcount: 1 +{ lat: 233, common_pid: 2036 } hitcount: 5 +{ lat: 233, common_pid: 2039 } hitcount: 11 +{ lat: 234, common_pid: 2039 } hitcount: 4 +{ lat: 234, common_pid: 2038 } hitcount: 2 +{ lat: 234, common_pid: 2043 } hitcount: 2 +{ lat: 234, common_pid: 2036 } hitcount: 11 +{ lat: 234, common_pid: 2040 } hitcount: 1 +{ lat: 235, common_pid: 2037 } hitcount: 2 +{ lat: 235, common_pid: 2036 } hitcount: 8 +{ lat: 235, common_pid: 2043 } hitcount: 2 +{ lat: 235, common_pid: 2039 } hitcount: 5 +{ lat: 235, common_pid: 2042 } hitcount: 2 +{ lat: 235, common_pid: 2040 } hitcount: 4 +{ lat: 235, common_pid: 2041 } hitcount: 1 +{ lat: 236, common_pid: 2036 } hitcount: 7 +{ lat: 236, common_pid: 2037 } hitcount: 1 +{ lat: 236, common_pid: 2041 } hitcount: 5 +{ lat: 236, common_pid: 2039 } hitcount: 3 +{ lat: 236, common_pid: 2043 } hitcount: 9 +{ lat: 236, common_pid: 2040 } hitcount: 7 +{ lat: 237, common_pid: 2037 } hitcount: 1 +{ lat: 237, common_pid: 2040 } hitcount: 1 +{ lat: 237, common_pid: 2036 } hitcount: 9 +{ lat: 237, common_pid: 2039 } hitcount: 3 +{ lat: 237, common_pid: 2043 } hitcount: 8 +{ lat: 237, common_pid: 2042 } hitcount: 2 +{ lat: 237, common_pid: 2041 } hitcount: 2 +{ lat: 238, common_pid: 2043 } hitcount: 10 +{ lat: 238, common_pid: 2040 } hitcount: 1 +{ lat: 238, common_pid: 2037 } hitcount: 9 +{ lat: 238, common_pid: 2038 } hitcount: 1 +{ lat: 238, common_pid: 2039 } hitcount: 1 +{ lat: 238, common_pid: 2042 } hitcount: 3 +{ lat: 238, common_pid: 2036 } hitcount: 7 +{ lat: 239, common_pid: 2041 } hitcount: 1 +{ lat: 239, common_pid: 2043 } hitcount: 11 +{ lat: 239, common_pid: 2037 } hitcount: 11 +{ lat: 239, common_pid: 2038 } hitcount: 6 +{ lat: 239, common_pid: 2036 } hitcount: 7 +{ lat: 239, common_pid: 2040 } hitcount: 1 +{ lat: 239, common_pid: 2042 } hitcount: 9 +{ lat: 240, common_pid: 2037 } hitcount: 29 +{ lat: 240, common_pid: 2043 } hitcount: 15 +{ lat: 240, common_pid: 2040 } hitcount: 44 +{ lat: 240, common_pid: 2039 } hitcount: 1 +{ lat: 240, common_pid: 2041 } hitcount: 2 +{ lat: 240, common_pid: 2038 } hitcount: 1 +{ lat: 240, common_pid: 2036 } hitcount: 10 +{ lat: 240, common_pid: 2042 } hitcount: 13 +{ lat: 241, common_pid: 2036 } hitcount: 21 +{ lat: 241, common_pid: 2041 } hitcount: 36 +{ lat: 241, common_pid: 2037 } hitcount: 34 +{ lat: 241, common_pid: 2042 } hitcount: 14 +{ lat: 241, common_pid: 2040 } hitcount: 94 +{ lat: 241, common_pid: 2039 } hitcount: 12 +{ lat: 241, common_pid: 2038 } hitcount: 2 +{ lat: 241, common_pid: 2043 } hitcount: 28 +{ lat: 242, common_pid: 2040 } hitcount: 109 +{ lat: 242, common_pid: 2041 } hitcount: 506 +{ lat: 242, common_pid: 2039 } hitcount: 155 +{ lat: 242, common_pid: 2042 } hitcount: 21 +{ lat: 242, common_pid: 2037 } hitcount: 52 +{ lat: 242, common_pid: 2043 } hitcount: 21 +{ lat: 242, common_pid: 2036 } hitcount: 16 +{ lat: 242, common_pid: 2038 } hitcount: 156 +{ lat: 243, common_pid: 2037 } hitcount: 46 +{ lat: 243, common_pid: 2039 } hitcount: 40 +{ lat: 243, common_pid: 2042 } hitcount: 119 +{ lat: 243, common_pid: 2041 } hitcount: 611 +{ lat: 243, common_pid: 2036 } hitcount: 69 +{ lat: 243, common_pid: 2038 } hitcount: 784 +{ lat: 243, common_pid: 2040 } hitcount: 323 +{ lat: 243, common_pid: 2043 } hitcount: 14 +{ lat: 244, common_pid: 2043 } hitcount: 35 +{ lat: 244, common_pid: 2042 } hitcount: 305 +{ lat: 244, common_pid: 2039 } hitcount: 8 +{ lat: 244, common_pid: 2040 } hitcount: 4515 +{ lat: 244, common_pid: 2038 } hitcount: 371 +{ lat: 244, common_pid: 2037 } hitcount: 31 +{ lat: 244, common_pid: 2036 } hitcount: 114 +{ lat: 244, common_pid: 2041 } hitcount: 3396 +{ lat: 245, common_pid: 2036 } hitcount: 700 +{ lat: 245, common_pid: 2041 } hitcount: 2772 +{ lat: 245, common_pid: 2037 } hitcount: 268 +{ lat: 245, common_pid: 2039 } hitcount: 472 +{ lat: 245, common_pid: 2038 } hitcount: 2758 +{ lat: 245, common_pid: 2042 } hitcount: 3833 +{ lat: 245, common_pid: 2040 } hitcount: 3105 +{ lat: 245, common_pid: 2043 } hitcount: 645 +{ lat: 246, common_pid: 2038 } hitcount: 3451 +{ lat: 246, common_pid: 2041 } hitcount: 142 +{ lat: 246, common_pid: 2037 } hitcount: 5101 +{ lat: 246, common_pid: 2040 } hitcount: 68 +{ lat: 246, common_pid: 2043 } hitcount: 5099 +{ lat: 246, common_pid: 2039 } hitcount: 5608 +{ lat: 246, common_pid: 2042 } hitcount: 3723 +{ lat: 246, common_pid: 2036 } hitcount: 4738 +{ lat: 247, common_pid: 2042 } hitcount: 312 +{ lat: 247, common_pid: 2043 } hitcount: 2385 +{ lat: 247, common_pid: 2041 } hitcount: 452 +{ lat: 247, common_pid: 2038 } hitcount: 792 +{ lat: 247, common_pid: 2040 } hitcount: 78 +{ lat: 247, common_pid: 2036 } hitcount: 2375 +{ lat: 247, common_pid: 2039 } hitcount: 1834 +{ lat: 247, common_pid: 2037 } hitcount: 2655 +{ lat: 248, common_pid: 2037 } hitcount: 36 +{ lat: 248, common_pid: 2042 } hitcount: 11 +{ lat: 248, common_pid: 2038 } hitcount: 122 +{ lat: 248, common_pid: 2036 } hitcount: 135 +{ lat: 248, common_pid: 2039 } hitcount: 26 +{ lat: 248, common_pid: 2041 } hitcount: 503 +{ lat: 248, common_pid: 2043 } hitcount: 66 +{ lat: 248, common_pid: 2040 } hitcount: 46 +{ lat: 249, common_pid: 2037 } hitcount: 29 +{ lat: 249, common_pid: 2038 } hitcount: 1 +{ lat: 249, common_pid: 2043 } hitcount: 29 +{ lat: 249, common_pid: 2039 } hitcount: 8 +{ lat: 249, common_pid: 2042 } hitcount: 56 +{ lat: 249, common_pid: 2040 } hitcount: 27 +{ lat: 249, common_pid: 2041 } hitcount: 11 +{ lat: 249, common_pid: 2036 } hitcount: 27 +{ lat: 250, common_pid: 2038 } hitcount: 1 +{ lat: 250, common_pid: 2036 } hitcount: 30 +{ lat: 250, common_pid: 2040 } hitcount: 19 +{ lat: 250, common_pid: 2043 } hitcount: 22 +{ lat: 250, common_pid: 2042 } hitcount: 20 +{ lat: 250, common_pid: 2041 } hitcount: 1 +{ lat: 250, common_pid: 2039 } hitcount: 6 +{ lat: 250, common_pid: 2037 } hitcount: 48 +{ lat: 251, common_pid: 2037 } hitcount: 43 +{ lat: 251, common_pid: 2039 } hitcount: 1 +{ lat: 251, common_pid: 2036 } hitcount: 12 +{ lat: 251, common_pid: 2042 } hitcount: 2 +{ lat: 251, common_pid: 2041 } hitcount: 1 +{ lat: 251, common_pid: 2043 } hitcount: 15 +{ lat: 251, common_pid: 2040 } hitcount: 3 +{ lat: 252, common_pid: 2040 } hitcount: 1 +{ lat: 252, common_pid: 2036 } hitcount: 12 +{ lat: 252, common_pid: 2037 } hitcount: 21 +{ lat: 252, common_pid: 2043 } hitcount: 14 +{ lat: 253, common_pid: 2037 } hitcount: 21 +{ lat: 253, common_pid: 2039 } hitcount: 2 +{ lat: 253, common_pid: 2036 } hitcount: 9 +{ lat: 253, common_pid: 2043 } hitcount: 6 +{ lat: 253, common_pid: 2040 } hitcount: 1 +{ lat: 254, common_pid: 2036 } hitcount: 8 +{ lat: 254, common_pid: 2043 } hitcount: 3 +{ lat: 254, common_pid: 2041 } hitcount: 1 +{ lat: 254, common_pid: 2042 } hitcount: 1 +{ lat: 254, common_pid: 2039 } hitcount: 1 +{ lat: 254, common_pid: 2037 } hitcount: 12 +{ lat: 255, common_pid: 2043 } hitcount: 1 +{ lat: 255, common_pid: 2037 } hitcount: 2 +{ lat: 255, common_pid: 2036 } hitcount: 2 +{ lat: 255, common_pid: 2039 } hitcount: 8 +{ lat: 256, common_pid: 2043 } hitcount: 1 +{ lat: 256, common_pid: 2036 } hitcount: 4 +{ lat: 256, common_pid: 2039 } hitcount: 6 +{ lat: 257, common_pid: 2039 } hitcount: 5 +{ lat: 257, common_pid: 2036 } hitcount: 4 +{ lat: 258, common_pid: 2039 } hitcount: 5 +{ lat: 258, common_pid: 2036 } hitcount: 2 +{ lat: 259, common_pid: 2036 } hitcount: 7 +{ lat: 259, common_pid: 2039 } hitcount: 7 +{ lat: 260, common_pid: 2036 } hitcount: 8 +{ lat: 260, common_pid: 2039 } hitcount: 6 +{ lat: 261, common_pid: 2036 } hitcount: 5 +{ lat: 261, common_pid: 2039 } hitcount: 7 +{ lat: 262, common_pid: 2039 } hitcount: 5 +{ lat: 262, common_pid: 2036 } hitcount: 5 +{ lat: 263, common_pid: 2039 } hitcount: 7 +{ lat: 263, common_pid: 2036 } hitcount: 7 +{ lat: 264, common_pid: 2039 } hitcount: 9 +{ lat: 264, common_pid: 2036 } hitcount: 9 +{ lat: 265, common_pid: 2036 } hitcount: 5 +{ lat: 265, common_pid: 2039 } hitcount: 1 +{ lat: 266, common_pid: 2036 } hitcount: 1 +{ lat: 266, common_pid: 2039 } hitcount: 3 +{ lat: 267, common_pid: 2036 } hitcount: 1 +{ lat: 267, common_pid: 2039 } hitcount: 3 +{ lat: 268, common_pid: 2036 } hitcount: 1 +{ lat: 268, common_pid: 2039 } hitcount: 6 +{ lat: 269, common_pid: 2036 } hitcount: 1 +{ lat: 269, common_pid: 2043 } hitcount: 1 +{ lat: 269, common_pid: 2039 } hitcount: 2 +{ lat: 270, common_pid: 2040 } hitcount: 1 +{ lat: 270, common_pid: 2039 } hitcount: 6 +{ lat: 271, common_pid: 2041 } hitcount: 1 +{ lat: 271, common_pid: 2039 } hitcount: 5 +{ lat: 272, common_pid: 2039 } hitcount: 10 +{ lat: 273, common_pid: 2039 } hitcount: 8 +{ lat: 274, common_pid: 2039 } hitcount: 2 +{ lat: 275, common_pid: 2039 } hitcount: 1 +{ lat: 276, common_pid: 2039 } hitcount: 2 +{ lat: 276, common_pid: 2037 } hitcount: 1 +{ lat: 276, common_pid: 2038 } hitcount: 1 +{ lat: 277, common_pid: 2039 } hitcount: 1 +{ lat: 277, common_pid: 2042 } hitcount: 1 +{ lat: 278, common_pid: 2039 } hitcount: 1 +{ lat: 279, common_pid: 2039 } hitcount: 4 +{ lat: 279, common_pid: 2043 } hitcount: 1 +{ lat: 280, common_pid: 2039 } hitcount: 3 +{ lat: 283, common_pid: 2036 } hitcount: 2 +{ lat: 284, common_pid: 2039 } hitcount: 1 +{ lat: 284, common_pid: 2043 } hitcount: 1 +{ lat: 288, common_pid: 2039 } hitcount: 1 +{ lat: 289, common_pid: 2039 } hitcount: 1 +{ lat: 300, common_pid: 2039 } hitcount: 1 +{ lat: 384, common_pid: 2039 } hitcount: 1 + +Totals: + Hits: 67625 + Entries: 278 + Dropped: 0 + +Note, the writes are around the sleep, so ideally they will all be of 250 +microseconds. If you are wondering how there are several that are under +250 microseconds, that is because the way cyclictest works, is if one +iteration comes in late, the next one will set the timer to wake up less that +250. That is, if an iteration came in 50 microseconds late, the next wake up +will be at 200 microseconds. + +But this could easily be done in userspace. To make this even more +interesting, we can mix the histogram between events that happened in the +kernel with trace_marker. + + # cd /sys/kernel/tracing + # echo 'latency u64 lat' > synthetic_events + # echo 'hist:keys=pid:ts0=common_timestamp.usecs' > events/sched/sched_waking/trigger + # echo 'hist:keys=common_pid:lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).latency($lat) if buf == "end"' > events/ftrace/print/trigger + # echo 'hist:keys=lat,common_pid:sort=lat' > events/synthetic/latency/trigger + +The difference this time is that instead of using the trace_marker to start +the latency, the sched_waking event is used, matching the common_pid for the +trace_marker write with the pid that is being woken by sched_waking. + +After running cyclictest again with the same parameters, we now have: + + # cat events/synthetic/latency/hist +# event histogram +# +# trigger info: hist:keys=lat,common_pid:vals=hitcount:sort=lat:size=2048 [active] +# + +{ lat: 7, common_pid: 2302 } hitcount: 640 +{ lat: 7, common_pid: 2299 } hitcount: 42 +{ lat: 7, common_pid: 2303 } hitcount: 18 +{ lat: 7, common_pid: 2305 } hitcount: 166 +{ lat: 7, common_pid: 2306 } hitcount: 1 +{ lat: 7, common_pid: 2301 } hitcount: 91 +{ lat: 7, common_pid: 2300 } hitcount: 17 +{ lat: 8, common_pid: 2303 } hitcount: 8296 +{ lat: 8, common_pid: 2304 } hitcount: 6864 +{ lat: 8, common_pid: 2305 } hitcount: 9464 +{ lat: 8, common_pid: 2301 } hitcount: 9213 +{ lat: 8, common_pid: 2306 } hitcount: 6246 +{ lat: 8, common_pid: 2302 } hitcount: 8797 +{ lat: 8, common_pid: 2299 } hitcount: 8771 +{ lat: 8, common_pid: 2300 } hitcount: 8119 +{ lat: 9, common_pid: 2305 } hitcount: 1519 +{ lat: 9, common_pid: 2299 } hitcount: 2346 +{ lat: 9, common_pid: 2303 } hitcount: 2841 +{ lat: 9, common_pid: 2301 } hitcount: 1846 +{ lat: 9, common_pid: 2304 } hitcount: 3861 +{ lat: 9, common_pid: 2302 } hitcount: 1210 +{ lat: 9, common_pid: 2300 } hitcount: 2762 +{ lat: 9, common_pid: 2306 } hitcount: 4247 +{ lat: 10, common_pid: 2299 } hitcount: 16 +{ lat: 10, common_pid: 2306 } hitcount: 333 +{ lat: 10, common_pid: 2303 } hitcount: 16 +{ lat: 10, common_pid: 2304 } hitcount: 168 +{ lat: 10, common_pid: 2302 } hitcount: 240 +{ lat: 10, common_pid: 2301 } hitcount: 28 +{ lat: 10, common_pid: 2300 } hitcount: 95 +{ lat: 10, common_pid: 2305 } hitcount: 18 +{ lat: 11, common_pid: 2303 } hitcount: 5 +{ lat: 11, common_pid: 2305 } hitcount: 8 +{ lat: 11, common_pid: 2306 } hitcount: 221 +{ lat: 11, common_pid: 2302 } hitcount: 76 +{ lat: 11, common_pid: 2304 } hitcount: 26 +{ lat: 11, common_pid: 2300 } hitcount: 125 +{ lat: 11, common_pid: 2299 } hitcount: 2 +{ lat: 12, common_pid: 2305 } hitcount: 3 +{ lat: 12, common_pid: 2300 } hitcount: 6 +{ lat: 12, common_pid: 2306 } hitcount: 90 +{ lat: 12, common_pid: 2302 } hitcount: 4 +{ lat: 12, common_pid: 2303 } hitcount: 1 +{ lat: 12, common_pid: 2304 } hitcount: 122 +{ lat: 13, common_pid: 2300 } hitcount: 12 +{ lat: 13, common_pid: 2301 } hitcount: 1 +{ lat: 13, common_pid: 2306 } hitcount: 32 +{ lat: 13, common_pid: 2302 } hitcount: 5 +{ lat: 13, common_pid: 2305 } hitcount: 1 +{ lat: 13, common_pid: 2303 } hitcount: 1 +{ lat: 13, common_pid: 2304 } hitcount: 61 +{ lat: 14, common_pid: 2303 } hitcount: 4 +{ lat: 14, common_pid: 2306 } hitcount: 5 +{ lat: 14, common_pid: 2305 } hitcount: 4 +{ lat: 14, common_pid: 2304 } hitcount: 62 +{ lat: 14, common_pid: 2302 } hitcount: 19 +{ lat: 14, common_pid: 2300 } hitcount: 33 +{ lat: 14, common_pid: 2299 } hitcount: 1 +{ lat: 14, common_pid: 2301 } hitcount: 4 +{ lat: 15, common_pid: 2305 } hitcount: 1 +{ lat: 15, common_pid: 2302 } hitcount: 25 +{ lat: 15, common_pid: 2300 } hitcount: 11 +{ lat: 15, common_pid: 2299 } hitcount: 5 +{ lat: 15, common_pid: 2301 } hitcount: 1 +{ lat: 15, common_pid: 2304 } hitcount: 8 +{ lat: 15, common_pid: 2303 } hitcount: 1 +{ lat: 15, common_pid: 2306 } hitcount: 6 +{ lat: 16, common_pid: 2302 } hitcount: 31 +{ lat: 16, common_pid: 2306 } hitcount: 3 +{ lat: 16, common_pid: 2300 } hitcount: 5 +{ lat: 17, common_pid: 2302 } hitcount: 6 +{ lat: 17, common_pid: 2303 } hitcount: 1 +{ lat: 18, common_pid: 2304 } hitcount: 1 +{ lat: 18, common_pid: 2302 } hitcount: 8 +{ lat: 18, common_pid: 2299 } hitcount: 1 +{ lat: 18, common_pid: 2301 } hitcount: 1 +{ lat: 19, common_pid: 2303 } hitcount: 4 +{ lat: 19, common_pid: 2304 } hitcount: 5 +{ lat: 19, common_pid: 2302 } hitcount: 4 +{ lat: 19, common_pid: 2299 } hitcount: 3 +{ lat: 19, common_pid: 2306 } hitcount: 1 +{ lat: 19, common_pid: 2300 } hitcount: 4 +{ lat: 19, common_pid: 2305 } hitcount: 5 +{ lat: 20, common_pid: 2299 } hitcount: 2 +{ lat: 20, common_pid: 2302 } hitcount: 3 +{ lat: 20, common_pid: 2305 } hitcount: 1 +{ lat: 20, common_pid: 2300 } hitcount: 2 +{ lat: 20, common_pid: 2301 } hitcount: 2 +{ lat: 20, common_pid: 2303 } hitcount: 3 +{ lat: 21, common_pid: 2305 } hitcount: 1 +{ lat: 21, common_pid: 2299 } hitcount: 5 +{ lat: 21, common_pid: 2303 } hitcount: 4 +{ lat: 21, common_pid: 2302 } hitcount: 7 +{ lat: 21, common_pid: 2300 } hitcount: 1 +{ lat: 21, common_pid: 2301 } hitcount: 5 +{ lat: 21, common_pid: 2304 } hitcount: 2 +{ lat: 22, common_pid: 2302 } hitcount: 5 +{ lat: 22, common_pid: 2303 } hitcount: 1 +{ lat: 22, common_pid: 2306 } hitcount: 3 +{ lat: 22, common_pid: 2301 } hitcount: 2 +{ lat: 22, common_pid: 2300 } hitcount: 1 +{ lat: 22, common_pid: 2299 } hitcount: 1 +{ lat: 22, common_pid: 2305 } hitcount: 1 +{ lat: 22, common_pid: 2304 } hitcount: 1 +{ lat: 23, common_pid: 2299 } hitcount: 1 +{ lat: 23, common_pid: 2306 } hitcount: 2 +{ lat: 23, common_pid: 2302 } hitcount: 6 +{ lat: 24, common_pid: 2302 } hitcount: 3 +{ lat: 24, common_pid: 2300 } hitcount: 1 +{ lat: 24, common_pid: 2306 } hitcount: 2 +{ lat: 24, common_pid: 2305 } hitcount: 1 +{ lat: 24, common_pid: 2299 } hitcount: 1 +{ lat: 25, common_pid: 2300 } hitcount: 1 +{ lat: 25, common_pid: 2302 } hitcount: 4 +{ lat: 26, common_pid: 2302 } hitcount: 2 +{ lat: 27, common_pid: 2305 } hitcount: 1 +{ lat: 27, common_pid: 2300 } hitcount: 1 +{ lat: 27, common_pid: 2302 } hitcount: 3 +{ lat: 28, common_pid: 2306 } hitcount: 1 +{ lat: 28, common_pid: 2302 } hitcount: 4 +{ lat: 29, common_pid: 2302 } hitcount: 1 +{ lat: 29, common_pid: 2300 } hitcount: 2 +{ lat: 29, common_pid: 2306 } hitcount: 1 +{ lat: 29, common_pid: 2304 } hitcount: 1 +{ lat: 30, common_pid: 2302 } hitcount: 4 +{ lat: 31, common_pid: 2302 } hitcount: 6 +{ lat: 32, common_pid: 2302 } hitcount: 1 +{ lat: 33, common_pid: 2299 } hitcount: 1 +{ lat: 33, common_pid: 2302 } hitcount: 3 +{ lat: 34, common_pid: 2302 } hitcount: 2 +{ lat: 35, common_pid: 2302 } hitcount: 1 +{ lat: 35, common_pid: 2304 } hitcount: 1 +{ lat: 36, common_pid: 2302 } hitcount: 4 +{ lat: 37, common_pid: 2302 } hitcount: 6 +{ lat: 38, common_pid: 2302 } hitcount: 2 +{ lat: 39, common_pid: 2302 } hitcount: 2 +{ lat: 39, common_pid: 2304 } hitcount: 1 +{ lat: 40, common_pid: 2304 } hitcount: 2 +{ lat: 40, common_pid: 2302 } hitcount: 5 +{ lat: 41, common_pid: 2304 } hitcount: 1 +{ lat: 41, common_pid: 2302 } hitcount: 8 +{ lat: 42, common_pid: 2302 } hitcount: 6 +{ lat: 42, common_pid: 2304 } hitcount: 1 +{ lat: 43, common_pid: 2302 } hitcount: 3 +{ lat: 43, common_pid: 2304 } hitcount: 4 +{ lat: 44, common_pid: 2302 } hitcount: 6 +{ lat: 45, common_pid: 2302 } hitcount: 5 +{ lat: 46, common_pid: 2302 } hitcount: 5 +{ lat: 47, common_pid: 2302 } hitcount: 7 +{ lat: 48, common_pid: 2301 } hitcount: 1 +{ lat: 48, common_pid: 2302 } hitcount: 9 +{ lat: 49, common_pid: 2302 } hitcount: 3 +{ lat: 50, common_pid: 2302 } hitcount: 1 +{ lat: 50, common_pid: 2301 } hitcount: 1 +{ lat: 51, common_pid: 2302 } hitcount: 2 +{ lat: 51, common_pid: 2301 } hitcount: 1 +{ lat: 61, common_pid: 2302 } hitcount: 1 +{ lat: 110, common_pid: 2302 } hitcount: 1 + +Totals: + Hits: 89565 + Entries: 158 + Dropped: 0 + +This doesn't tell us any information about how late cyclictest may have +woken up, but it does show us a nice histogram of how long it took from +the time that cyclictest was woken to the time it made it into user space. diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt index 0a0930ab4156..921739d00f69 100644 --- a/Documentation/translations/ko_KR/memory-barriers.txt +++ b/Documentation/translations/ko_KR/memory-barriers.txt @@ -36,6 +36,9 @@ Documentation/memory-barriers.txt 부분도 있고, 의도하진 않았지만 사람에 의해 쓰였다보니 불완전한 부분도 있습니다. 이 문서는 리눅스에서 제공하는 다양한 메모리 배리어들을 사용하기 위한 안내서입니다만, 뭔가 이상하다 싶으면 (그런게 많을 겁니다) 질문을 부탁드립니다. +일부 이상한 점들은 공식적인 메모리 일관성 모델과 tools/memory-model/ 에 있는 +관련 문서를 참고해서 해결될 수 있을 겁니다. 그러나, 이 메모리 모델조차도 그 +관리자들의 의견의 집합으로 봐야지, 절대 옳은 예언자로 신봉해선 안될 겁니다. 다시 말하지만, 이 문서는 리눅스가 하드웨어에 기대하는 사항에 대한 명세서가 아닙니다. @@ -77,7 +80,7 @@ Documentation/memory-barriers.txt - 메모리 배리어의 종류. - 메모리 배리어에 대해 가정해선 안될 것. - - 데이터 의존성 배리어. + - 데이터 의존성 배리어 (역사적). - 컨트롤 의존성. - SMP 배리어 짝맞추기. - 메모리 배리어 시퀀스의 예. @@ -255,17 +258,20 @@ CPU 에게 기대할 수 있는 최소한의 보장사항 몇가지가 있습니 (*) 어떤 CPU 든, 의존성이 존재하는 메모리 액세스들은 해당 CPU 자신에게 있어서는 순서대로 메모리 시스템에 수행 요청됩니다. 즉, 다음에 대해서: - Q = READ_ONCE(P); smp_read_barrier_depends(); D = READ_ONCE(*Q); + Q = READ_ONCE(P); D = READ_ONCE(*Q); CPU 는 다음과 같은 메모리 오퍼레이션 시퀀스를 수행 요청합니다: Q = LOAD P, D = LOAD *Q - 그리고 그 시퀀스 내에서의 순서는 항상 지켜집니다. 대부분의 시스템에서 - smp_read_barrier_depends() 는 아무일도 안하지만 DEC Alpha 에서는 - 명시적으로 사용되어야 합니다. 보통의 경우에는 smp_read_barrier_depends() - 를 직접 사용하는 대신 rcu_dereference() 같은 것들을 사용해야 함을 - 알아두세요. + 그리고 그 시퀀스 내에서의 순서는 항상 지켜집니다. 하지만, DEC Alpha 에서 + READ_ONCE() 는 메모리 배리어 명령도 내게 되어 있어서, DEC Alpha CPU 는 + 다음과 같은 메모리 오퍼레이션들을 내놓게 됩니다: + + Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER + + DEC Alpha 에서 수행되든 아니든, READ_ONCE() 는 컴파일러로부터의 악영향 + 또한 제거합니다. (*) 특정 CPU 내에서 겹치는 영역의 메모리에 행해지는 로드와 스토어 들은 해당 CPU 안에서는 순서가 바뀌지 않은 것으로 보여집니다. 즉, 다음에 대해서: @@ -421,8 +427,8 @@ CPU 에게 기대할 수 있는 최소한의 보장사항 몇가지가 있습니 데이터 의존성 배리어는 읽기 배리어의 보다 완화된 형태입니다. 두개의 로드 오퍼레이션이 있고 두번째 것이 첫번째 것의 결과에 의존하고 있을 때(예: 두번째 로드가 참조할 주소를 첫번째 로드가 읽는 경우), 두번째 로드가 읽어올 - 데이터는 첫번째 로드에 의해 그 주소가 얻어지기 전에 업데이트 되어 있음을 - 보장하기 위해서 데이터 의존성 배리어가 필요할 수 있습니다. + 데이터는 첫번째 로드에 의해 그 주소가 얻어진 뒤에 업데이트 됨을 보장하기 + 위해서 데이터 의존성 배리어가 필요할 수 있습니다. 데이터 의존성 배리어는 상호 의존적인 로드 오퍼레이션들 사이의 부분적 순서 세우기입니다; 스토어 오퍼레이션들이나 독립적인 로드들, 또는 중복되는 @@ -570,8 +576,14 @@ ACQUIRE 는 해당 오퍼레이션의 로드 부분에만 적용되고 RELEASE Documentation/DMA-API.txt -데이터 의존성 배리어 --------------------- +데이터 의존성 배리어 (역사적) +----------------------------- + +리눅스 커널 v4.15 기준으로, smp_read_barrier_depends() 가 READ_ONCE() 에 +추가되었는데, 이는 이 섹션에 주의를 기울여야 하는 사람들은 DEC Alpha 아키텍쳐 +전용 코드를 만드는 사람들과 READ_ONCE() 자체를 만드는 사람들 뿐임을 의미합니다. +그런 분들을 위해, 그리고 역사에 관심 있는 분들을 위해, 여기 데이터 의존성 +배리어에 대한 이야기를 적습니다. 데이터 의존성 배리어의 사용에 있어 지켜야 하는 사항들은 약간 미묘하고, 데이터 의존성 배리어가 사용되어야 하는 상황도 항상 명백하지는 않습니다. 설명을 위해 @@ -1787,7 +1799,7 @@ CPU 메모리 배리어 범용 mb() smp_mb() 쓰기 wmb() smp_wmb() 읽기 rmb() smp_rmb() - 데이터 의존성 read_barrier_depends() smp_read_barrier_depends() + 데이터 의존성 READ_ONCE() 데이터 의존성 배리어를 제외한 모든 메모리 배리어는 컴파일러 배리어를 @@ -2796,8 +2808,9 @@ CPU 2 는 C/D 를 갖습니다)가 병렬로 연결되어 있는 시스템을 여기에 개입하기 위해선, 데이터 의존성 배리어나 읽기 배리어를 로드 오퍼레이션들 -사이에 넣어야 합니다. 이렇게 함으로써 캐시가 다음 요청을 처리하기 전에 일관성 -큐를 처리하도록 강제하게 됩니다. +사이에 넣어야 합니다 (v4.15 부터는 READ_ONCE() 매크로에 의해 무조건적으로 +그렇게 됩니다). 이렇게 함으로써 캐시가 다음 요청을 처리하기 전에 일관성 큐를 +처리하도록 강제하게 됩니다. CPU 1 CPU 2 COMMENT =============== =============== ======================================= @@ -2826,7 +2839,10 @@ CPU 2 는 C/D 를 갖습니다)가 병렬로 연결되어 있는 시스템을 다른 CPU 들도 분할된 캐시를 가지고 있을 수 있지만, 그런 CPU 들은 평범한 메모리 액세스를 위해서도 이 분할된 캐시들 사이의 조정을 해야만 합니다. Alpha 는 가장 약한 메모리 순서 시맨틱 (semantic) 을 선택함으로써 메모리 배리어가 명시적으로 -사용되지 않았을 때에는 그런 조정이 필요하지 않게 했습니다. +사용되지 않았을 때에는 그런 조정이 필요하지 않게 했으며, 이는 Alpha 가 당시에 +더 높은 CPU 클락 속도를 가질 수 있게 했습니다. 하지만, (다시 말하건대, v4.15 +이후부터는) Alpha 아키텍쳐 전용 코드와 READ_ONCE() 매크로 내부에서를 제외하고는 +smp_read_barrier_depends() 가 사용되지 않아야 함을 알아두시기 바랍니다. 캐시 일관성 VS DMA @@ -2846,7 +2862,7 @@ CPU 의 캐시에서 RAM 으로 쓰여지는 더티 캐시 라인에 의해 덮 문제를 해결하기 위해선, 커널의 적절한 부분에서 각 CPU 의 캐시 안의 문제가 되는 비트들을 무효화 시켜야 합니다. -캐시 관리에 대한 더 많은 정보를 위해선 Documentation/cachetlb.txt 를 +캐시 관리에 대한 더 많은 정보를 위해선 Documentation/core-api/cachetlb.rst 를 참고하세요. @@ -2988,7 +3004,9 @@ Alpha CPU 의 일부 버전은 분할된 데이터 캐시를 가지고 있어서 메모리 일관성 시스템과 함께 두개의 캐시를 동기화 시켜서, 포인터 변경과 새로운 데이터의 발견을 올바른 순서로 일어나게 하기 때문입니다. -리눅스 커널의 메모리 배리어 모델은 Alpha 에 기초해서 정의되었습니다. +리눅스 커널의 메모리 배리어 모델은 Alpha 에 기초해서 정의되었습니다만, v4.15 +부터는 리눅스 커널이 READ_ONCE() 내에 smp_read_barrier_depends() 를 추가해서 +Alpha 의 메모리 모델로의 영향력이 크게 줄어들긴 했습니다. 위의 "캐시 일관성" 서브섹션을 참고하세요. @@ -3023,7 +3041,7 @@ smp_mb() 가 아니라 virt_mb() 를 사용해야 합니다. 동기화에 락을 사용하지 않고 구현하는데에 사용될 수 있습니다. 더 자세한 내용을 위해선 다음을 참고하세요: - Documentation/circular-buffers.txt + Documentation/core-api/circular-buffers.rst ========= diff --git a/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt b/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt index 698660b7f21f..c77c0f060864 100644 --- a/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt +++ b/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt @@ -6,7 +6,7 @@ communicating in English you can also ask the Chinese maintainer for help. Contact the Chinese maintainer if this translation is outdated or if there is a problem with the translation. -Maintainer: Mauro Carvalho Chehab <mchehab@infradead.org> +Maintainer: Mauro Carvalho Chehab <mchehab@kernel.org> Chinese maintainer: Fu Wei <tekkamanninja@gmail.com> --------------------------------------------------------------------- Documentation/video4linux/v4l2-framework.txt 的中文翻译 @@ -14,7 +14,7 @@ Documentation/video4linux/v4l2-framework.txt 的中文翻译 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻 译存在问题,请联系中文版维护者。 -英文版维护者: Mauro Carvalho Chehab <mchehab@infradead.org> +英文版维护者: Mauro Carvalho Chehab <mchehab@kernel.org> 中文版维护者: 傅炜 Fu Wei <tekkamanninja@gmail.com> 中文版翻译者: 傅炜 Fu Wei <tekkamanninja@gmail.com> 中文版校译者: 傅炜 Fu Wei <tekkamanninja@gmail.com> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index 7b2eb1b7d4ca..a3233da7fa88 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -19,6 +19,7 @@ place where this information is gathered. no_new_privs seccomp_filter unshare + spec_ctrl .. only:: subproject and html diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst index 099c412951d6..82a468bc7560 100644 --- a/Documentation/userspace-api/seccomp_filter.rst +++ b/Documentation/userspace-api/seccomp_filter.rst @@ -207,13 +207,6 @@ directory. Here's a description of each file in that directory: to the file do not need to be in ordered form but reads from the file will be ordered in the same way as the actions_avail sysctl. - It is important to note that the value of ``actions_logged`` does not - prevent certain actions from being logged when the audit subsystem is - configured to audit a task. If the action is not found in - ``actions_logged`` list, the final decision on whether to audit the - action for that task is ultimately left up to the audit subsystem to - decide for all seccomp return values other than ``SECCOMP_RET_ALLOW``. - The ``allow`` string is not accepted in the ``actions_logged`` sysctl as it is not possible to log ``SECCOMP_RET_ALLOW`` actions. Attempting to write ``allow`` to the sysctl will result in an EINVAL being diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst new file mode 100644 index 000000000000..32f3d55c54b7 --- /dev/null +++ b/Documentation/userspace-api/spec_ctrl.rst @@ -0,0 +1,94 @@ +=================== +Speculation Control +=================== + +Quite some CPUs have speculation-related misfeatures which are in +fact vulnerabilities causing data leaks in various forms even across +privilege domains. + +The kernel provides mitigation for such vulnerabilities in various +forms. Some of these mitigations are compile-time configurable and some +can be supplied on the kernel command line. + +There is also a class of mitigations which are very expensive, but they can +be restricted to a certain set of processes or tasks in controlled +environments. The mechanism to control these mitigations is via +:manpage:`prctl(2)`. + +There are two prctl options which are related to this: + + * PR_GET_SPECULATION_CTRL + + * PR_SET_SPECULATION_CTRL + +PR_GET_SPECULATION_CTRL +----------------------- + +PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature +which is selected with arg2 of prctl(2). The return value uses bits 0-3 with +the following meaning: + +==== ===================== =================================================== +Bit Define Description +==== ===================== =================================================== +0 PR_SPEC_PRCTL Mitigation can be controlled per task by + PR_SET_SPECULATION_CTRL. +1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is + disabled. +2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is + enabled. +3 PR_SPEC_FORCE_DISABLE Same as PR_SPEC_DISABLE, but cannot be undone. A + subsequent prctl(..., PR_SPEC_ENABLE) will fail. +==== ===================== =================================================== + +If all bits are 0 the CPU is not affected by the speculation misfeature. + +If PR_SPEC_PRCTL is set, then the per-task control of the mitigation is +available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation +misfeature will fail. + +PR_SET_SPECULATION_CTRL +----------------------- + +PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which +is selected by arg2 of :manpage:`prctl(2)` per task. arg3 is used to hand +in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE or +PR_SPEC_FORCE_DISABLE. + +Common error codes +------------------ +======= ================================================================= +Value Meaning +======= ================================================================= +EINVAL The prctl is not implemented by the architecture or unused + prctl(2) arguments are not 0. + +ENODEV arg2 is selecting a not supported speculation misfeature. +======= ================================================================= + +PR_SET_SPECULATION_CTRL error codes +----------------------------------- +======= ================================================================= +Value Meaning +======= ================================================================= +0 Success + +ERANGE arg3 is incorrect, i.e. it's neither PR_SPEC_ENABLE nor + PR_SPEC_DISABLE nor PR_SPEC_FORCE_DISABLE. + +ENXIO Control of the selected speculation misfeature is not possible. + See PR_GET_SPECULATION_CTRL. + +EPERM Speculation was disabled with PR_SPEC_FORCE_DISABLE and caller + tried to enable it again. +======= ================================================================= + +Speculation misfeature controls +------------------------------- +- PR_SPEC_STORE_BYPASS: Speculative Store Bypass + + Invocations: + * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_ENABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_FORCE_DISABLE, 0, 0); diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index ef6a5111eaa1..f1a4d3c3ba0b 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -252,15 +252,14 @@ into VFIO core. When devices are bound and unbound to the driver, the driver should call vfio_add_group_dev() and vfio_del_group_dev() respectively:: - extern int vfio_add_group_dev(struct iommu_group *iommu_group, - struct device *dev, + extern int vfio_add_group_dev(struct device *dev, const struct vfio_device_ops *ops, void *device_data); extern void *vfio_del_group_dev(struct device *dev); vfio_add_group_dev() indicates to the core to begin tracking the -specified iommu_group and register the specified dev as owned by +iommu_group of the specified dev and register the dev as owned by a VFIO bus driver. The driver provides an ops structure for callbacks similar to a file operations structure:: diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 1c7958b57fe9..758bf403a169 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1960,6 +1960,9 @@ ARM 32-bit VFP control registers have the following id bit patterns: ARM 64-bit FP registers have the following id bit patterns: 0x4030 0000 0012 0 <regno:12> +ARM firmware pseudo-registers have the following bit pattern: + 0x4030 0000 0014 <regno:16> + arm64 registers are mapped using the lower 32 bits. The upper 16 of that is the register group type, or coprocessor number: @@ -1976,6 +1979,9 @@ arm64 CCSIDR registers are demultiplexed by CSSELR value: arm64 system registers have the following id bit patterns: 0x6030 0000 0013 <op0:2> <op1:3> <crn:4> <crm:4> <op2:3> +arm64 firmware pseudo-registers have the following bit pattern: + 0x6030 0000 0014 <regno:16> + MIPS registers are mapped using the lower 32 bits. The upper 16 of that is the register group type: @@ -2510,7 +2516,8 @@ Possible features: and execute guest code when KVM_RUN is called. - KVM_ARM_VCPU_EL1_32BIT: Starts the CPU in a 32bit mode. Depends on KVM_CAP_ARM_EL1_32BIT (arm64 only). - - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 for the CPU. + - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 (or a future revision + backward compatible with v0.2) for the CPU. Depends on KVM_CAP_ARM_PSCI_0_2. - KVM_ARM_VCPU_PMU_V3: Emulate PMUv3 for the CPU. Depends on KVM_CAP_ARM_PMU_V3. diff --git a/Documentation/virtual/kvm/arm/psci.txt b/Documentation/virtual/kvm/arm/psci.txt new file mode 100644 index 000000000000..aafdab887b04 --- /dev/null +++ b/Documentation/virtual/kvm/arm/psci.txt @@ -0,0 +1,30 @@ +KVM implements the PSCI (Power State Coordination Interface) +specification in order to provide services such as CPU on/off, reset +and power-off to the guest. + +The PSCI specification is regularly updated to provide new features, +and KVM implements these updates if they make sense from a virtualization +point of view. + +This means that a guest booted on two different versions of KVM can +observe two different "firmware" revisions. This could cause issues if +a given guest is tied to a particular PSCI revision (unlikely), or if +a migration causes a different PSCI version to be exposed out of the +blue to an unsuspecting guest. + +In order to remedy this situation, KVM exposes a set of "firmware +pseudo-registers" that can be manipulated using the GET/SET_ONE_REG +interface. These registers can be saved/restored by userspace, and set +to a convenient value if required. + +The following register is defined: + +* KVM_REG_ARM_PSCI_VERSION: + + - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set + (and thus has already been initialized) + - Returns the current PSCI version on GET_ONE_REG (defaulting to the + highest PSCI version implemented by KVM and compatible with v0.2) + - Allows any PSCI version implemented by KVM and compatible with + v0.2 to be set with SET_ONE_REG + - Affects the whole VM (even if the register view is per-vcpu) diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt index d4f33eb805dd..ab022dcd0911 100644 --- a/Documentation/virtual/kvm/cpuid.txt +++ b/Documentation/virtual/kvm/cpuid.txt @@ -72,8 +72,8 @@ KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side flag || value || meaning ================================================================================== -KVM_HINTS_DEDICATED || 0 || guest checks this feature bit to - || || determine if there is vCPU pinning - || || and there is no vCPU over-commitment, +KVM_HINTS_REALTIME || 0 || guest checks this feature bit to + || || determine that vCPUs are never + || || preempted for an unlimited time, || || allowing optimizations ---------------------------------------------------------------------------------- diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index 0278f2c85efb..f4a4f3e884cf 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -1,62 +1,50 @@ 00-INDEX - this file. -active_mm.txt +active_mm.rst - An explanation from Linus about tsk->active_mm vs tsk->mm. -balance +balance.rst - various information on memory balancing. -cleancache.txt +cleancache.rst - Intro to cleancache and page-granularity victim cache. -frontswap.txt +frontswap.rst - Outline frontswap, part of the transcendent memory frontend. -highmem.txt +highmem.rst - Outline of highmem and common issues. -hmm.txt +hmm.rst - Documentation of heterogeneous memory management -hugetlbpage.txt - - a brief summary of hugetlbpage support in the Linux kernel. -hugetlbfs_reserv.txt +hugetlbfs_reserv.rst - A brief overview of hugetlbfs reservation design/implementation. -hwpoison.txt +hwpoison.rst - explains what hwpoison is -idle_page_tracking.txt - - description of the idle page tracking feature. -ksm.txt +ksm.rst - how to use the Kernel Samepage Merging feature. -mmu_notifier.txt +mmu_notifier.rst - a note about clearing pte/pmd and mmu notifications -numa +numa.rst - information about NUMA specific code in the Linux vm. -numa_memory_policy.txt - - documentation of concepts and APIs of the 2.6 memory policy support. -overcommit-accounting +overcommit-accounting.rst - description of the Linux kernels overcommit handling modes. -page_frags +page_frags.rst - description of page fragments allocator -page_migration +page_migration.rst - description of page migration in NUMA systems. -pagemap.txt - - pagemap, from the userspace perspective -page_owner.txt +page_owner.rst - tracking about who allocated each page -remap_file_pages.txt +remap_file_pages.rst - a note about remap_file_pages() system call -slub.txt +slub.rst - a short users guide for SLUB. -soft-dirty.txt - - short explanation for soft-dirty PTEs -split_page_table_lock +split_page_table_lock.rst - Separate per-table lock to improve scalability of the old page_table_lock. -swap_numa.txt +swap_numa.rst - automatic binding of swap device to numa node -transhuge.txt +transhuge.rst - Transparent Hugepage Support, alternative way of using hugepages. -unevictable-lru.txt +unevictable-lru.rst - Unevictable LRU infrastructure -userfaultfd.txt - - description of userfaultfd system call z3fold.txt - outline of z3fold allocator for storing compressed pages -zsmalloc.txt +zsmalloc.rst - outline of zsmalloc allocator for storing compressed pages -zswap.txt +zswap.rst - Intro to compressed cache for swap pages diff --git a/Documentation/vm/active_mm.rst b/Documentation/vm/active_mm.rst new file mode 100644 index 000000000000..c84471b180f8 --- /dev/null +++ b/Documentation/vm/active_mm.rst @@ -0,0 +1,91 @@ +.. _active_mm: + +========= +Active MM +========= + +:: + + List: linux-kernel + Subject: Re: active_mm + From: Linus Torvalds <torvalds () transmeta ! com> + Date: 1999-07-30 21:36:24 + + Cc'd to linux-kernel, because I don't write explanations all that often, + and when I do I feel better about more people reading them. + + On Fri, 30 Jul 1999, David Mosberger wrote: + > + > Is there a brief description someplace on how "mm" vs. "active_mm" in + > the task_struct are supposed to be used? (My apologies if this was + > discussed on the mailing lists---I just returned from vacation and + > wasn't able to follow linux-kernel for a while). + + Basically, the new setup is: + + - we have "real address spaces" and "anonymous address spaces". The + difference is that an anonymous address space doesn't care about the + user-level page tables at all, so when we do a context switch into an + anonymous address space we just leave the previous address space + active. + + The obvious use for a "anonymous address space" is any thread that + doesn't need any user mappings - all kernel threads basically fall into + this category, but even "real" threads can temporarily say that for + some amount of time they are not going to be interested in user space, + and that the scheduler might as well try to avoid wasting time on + switching the VM state around. Currently only the old-style bdflush + sync does that. + + - "tsk->mm" points to the "real address space". For an anonymous process, + tsk->mm will be NULL, for the logical reason that an anonymous process + really doesn't _have_ a real address space at all. + + - however, we obviously need to keep track of which address space we + "stole" for such an anonymous user. For that, we have "tsk->active_mm", + which shows what the currently active address space is. + + The rule is that for a process with a real address space (ie tsk->mm is + non-NULL) the active_mm obviously always has to be the same as the real + one. + + For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the + "borrowed" mm while the anonymous process is running. When the + anonymous process gets scheduled away, the borrowed address space is + returned and cleared. + + To support all that, the "struct mm_struct" now has two counters: a + "mm_users" counter that is how many "real address space users" there are, + and a "mm_count" counter that is the number of "lazy" users (ie anonymous + users) plus one if there are any real users. + + Usually there is at least one real user, but it could be that the real + user exited on another CPU while a lazy user was still active, so you do + actually get cases where you have a address space that is _only_ used by + lazy users. That is often a short-lived state, because once that thread + gets scheduled away in favour of a real thread, the "zombie" mm gets + released because "mm_users" becomes zero. + + Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any + more. "init_mm" should be considered just a "lazy context when no other + context is available", and in fact it is mainly used just at bootup when + no real VM has yet been created. So code that used to check + + if (current->mm == &init_mm) + + should generally just do + + if (!current->mm) + + instead (which makes more sense anyway - the test is basically one of "do + we have a user context", and is generally done by the page fault handler + and things like that). + + Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago, + because it slightly changes the interfaces to accommodate the alpha (who + would have thought it, but the alpha actually ends up having one of the + ugliest context switch codes - unlike the other architectures where the MM + and register state is separate, the alpha PALcode joins the two, and you + need to switch both together). + + (From http://marc.info/?l=linux-kernel&m=93337278602211&w=2) diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt deleted file mode 100644 index dbf45817405f..000000000000 --- a/Documentation/vm/active_mm.txt +++ /dev/null @@ -1,83 +0,0 @@ -List: linux-kernel -Subject: Re: active_mm -From: Linus Torvalds <torvalds () transmeta ! com> -Date: 1999-07-30 21:36:24 - -Cc'd to linux-kernel, because I don't write explanations all that often, -and when I do I feel better about more people reading them. - -On Fri, 30 Jul 1999, David Mosberger wrote: -> -> Is there a brief description someplace on how "mm" vs. "active_mm" in -> the task_struct are supposed to be used? (My apologies if this was -> discussed on the mailing lists---I just returned from vacation and -> wasn't able to follow linux-kernel for a while). - -Basically, the new setup is: - - - we have "real address spaces" and "anonymous address spaces". The - difference is that an anonymous address space doesn't care about the - user-level page tables at all, so when we do a context switch into an - anonymous address space we just leave the previous address space - active. - - The obvious use for a "anonymous address space" is any thread that - doesn't need any user mappings - all kernel threads basically fall into - this category, but even "real" threads can temporarily say that for - some amount of time they are not going to be interested in user space, - and that the scheduler might as well try to avoid wasting time on - switching the VM state around. Currently only the old-style bdflush - sync does that. - - - "tsk->mm" points to the "real address space". For an anonymous process, - tsk->mm will be NULL, for the logical reason that an anonymous process - really doesn't _have_ a real address space at all. - - - however, we obviously need to keep track of which address space we - "stole" for such an anonymous user. For that, we have "tsk->active_mm", - which shows what the currently active address space is. - - The rule is that for a process with a real address space (ie tsk->mm is - non-NULL) the active_mm obviously always has to be the same as the real - one. - - For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the - "borrowed" mm while the anonymous process is running. When the - anonymous process gets scheduled away, the borrowed address space is - returned and cleared. - -To support all that, the "struct mm_struct" now has two counters: a -"mm_users" counter that is how many "real address space users" there are, -and a "mm_count" counter that is the number of "lazy" users (ie anonymous -users) plus one if there are any real users. - -Usually there is at least one real user, but it could be that the real -user exited on another CPU while a lazy user was still active, so you do -actually get cases where you have a address space that is _only_ used by -lazy users. That is often a short-lived state, because once that thread -gets scheduled away in favour of a real thread, the "zombie" mm gets -released because "mm_users" becomes zero. - -Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any -more. "init_mm" should be considered just a "lazy context when no other -context is available", and in fact it is mainly used just at bootup when -no real VM has yet been created. So code that used to check - - if (current->mm == &init_mm) - -should generally just do - - if (!current->mm) - -instead (which makes more sense anyway - the test is basically one of "do -we have a user context", and is generally done by the page fault handler -and things like that). - -Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago, -because it slightly changes the interfaces to accommodate the alpha (who -would have thought it, but the alpha actually ends up having one of the -ugliest context switch codes - unlike the other architectures where the MM -and register state is separate, the alpha PALcode joins the two, and you -need to switch both together). - -(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2) diff --git a/Documentation/vm/balance b/Documentation/vm/balance.rst index 964595481af6..6a1fadf3e173 100644 --- a/Documentation/vm/balance +++ b/Documentation/vm/balance.rst @@ -1,3 +1,9 @@ +.. _balance: + +================ +Memory Balancing +================ + Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com> Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as @@ -62,11 +68,11 @@ for non-sleepable allocations. Second, the HIGHMEM zone is also balanced, so as to give a fighting chance for replace_with_highmem() to get a HIGHMEM page, as well as to ensure that HIGHMEM allocations do not fall back into regular zone. This also makes sure that HIGHMEM pages -are not leaked (for example, in situations where a HIGHMEM page is in +are not leaked (for example, in situations where a HIGHMEM page is in the swapcache but is not being used by anyone) kswapd also needs to know about the zones it should balance. kswapd is -primarily needed in a situation where balancing can not be done, +primarily needed in a situation where balancing can not be done, probably because all allocation requests are coming from intr context and all process contexts are sleeping. For 2.3, kswapd does not really need to balance the highmem zone, since intr context does not request @@ -89,7 +95,8 @@ pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set. (Good) Ideas that I have heard: + 1. Dynamic experience should influence balancing: number of failed requests -for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net) + for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net) 2. Implement a replace_with_highmem()-like replace_with_regular() to preserve -dma pages. (lkd@tantalophile.demon.co.uk) + dma pages. (lkd@tantalophile.demon.co.uk) diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.rst index e4b49df7a048..68cba9131c31 100644 --- a/Documentation/vm/cleancache.txt +++ b/Documentation/vm/cleancache.rst @@ -1,4 +1,11 @@ -MOTIVATION +.. _cleancache: + +========== +Cleancache +========== + +Motivation +========== Cleancache is a new optional feature provided by the VFS layer that potentially dramatically increases page cache effectiveness for @@ -21,9 +28,10 @@ Transcendent memory "drivers" for cleancache are currently implemented in Xen (using hypervisor memory) and zcache (using in-kernel compressed memory) and other implementations are in development. -FAQs are included below. +:ref:`FAQs <faq>` are included below. -IMPLEMENTATION OVERVIEW +Implementation Overview +======================= A cleancache "backend" that provides transcendent memory registers itself to the kernel's cleancache "frontend" by calling cleancache_register_ops, @@ -80,22 +88,33 @@ different Linux threads are simultaneously putting and invalidating a page with the same handle, the results are indeterminate. Callers must lock the page to ensure serial behavior. -CLEANCACHE PERFORMANCE METRICS +Cleancache Performance Metrics +============================== If properly configured, monitoring of cleancache is done via debugfs in -the /sys/kernel/debug/cleancache directory. The effectiveness of cleancache +the `/sys/kernel/debug/cleancache` directory. The effectiveness of cleancache can be measured (across all filesystems) with: -succ_gets - number of gets that were successful -failed_gets - number of gets that failed -puts - number of puts attempted (all "succeed") -invalidates - number of invalidates attempted +``succ_gets`` + number of gets that were successful + +``failed_gets`` + number of gets that failed + +``puts`` + number of puts attempted (all "succeed") + +``invalidates`` + number of invalidates attempted A backend implementation may provide additional metrics. +.. _faq: + FAQ +=== -1) Where's the value? (Andrew Morton) +* Where's the value? (Andrew Morton) Cleancache provides a significant performance benefit to many workloads in many environments with negligible overhead by improving the @@ -137,8 +156,8 @@ device that stores pages of data in a compressed state. And the proposed "RAMster" driver shares RAM across multiple physical systems. -2) Why does cleancache have its sticky fingers so deep inside the - filesystems and VFS? (Andrew Morton and Christoph Hellwig) +* Why does cleancache have its sticky fingers so deep inside the + filesystems and VFS? (Andrew Morton and Christoph Hellwig) The core hooks for cleancache in VFS are in most cases a single line and the minimum set are placed precisely where needed to maintain @@ -168,9 +187,9 @@ filesystems in the future. The total impact of the hooks to existing fs and mm files is only about 40 lines added (not counting comments and blank lines). -3) Why not make cleancache asynchronous and batched so it can - more easily interface with real devices with DMA instead - of copying each individual page? (Minchan Kim) +* Why not make cleancache asynchronous and batched so it can more + easily interface with real devices with DMA instead of copying each + individual page? (Minchan Kim) The one-page-at-a-time copy semantics simplifies the implementation on both the frontend and backend and also allows the backend to @@ -182,8 +201,8 @@ are avoided. While the interface seems odd for a "real device" or for real kernel-addressable RAM, it makes perfect sense for transcendent memory. -4) Why is non-shared cleancache "exclusive"? And where is the - page "invalidated" after a "get"? (Minchan Kim) +* Why is non-shared cleancache "exclusive"? And where is the + page "invalidated" after a "get"? (Minchan Kim) The main reason is to free up space in transcendent memory and to avoid unnecessary cleancache_invalidate calls. If you want inclusive, @@ -193,7 +212,7 @@ be easily extended to add a "get_no_invalidate" call. The invalidate is done by the cleancache backend implementation. -5) What's the performance impact? +* What's the performance impact? Performance analysis has been presented at OLS'09 and LCA'10. Briefly, performance gains can be significant on most workloads, @@ -206,7 +225,7 @@ single-core systems with slow memory-copy speeds, cleancache has little value, but in newer multicore machines, especially consolidated/virtualized machines, it has great value. -6) How do I add cleancache support for filesystem X? (Boaz Harrash) +* How do I add cleancache support for filesystem X? (Boaz Harrash) Filesystems that are well-behaved and conform to certain restrictions can utilize cleancache simply by making a call to @@ -217,26 +236,26 @@ not enable the optional cleancache. Some points for a filesystem to consider: -- The FS should be block-device-based (e.g. a ram-based FS such - as tmpfs should not enable cleancache) -- To ensure coherency/correctness, the FS must ensure that all - file removal or truncation operations either go through VFS or - add hooks to do the equivalent cleancache "invalidate" operations -- To ensure coherency/correctness, either inode numbers must - be unique across the lifetime of the on-disk file OR the - FS must provide an "encode_fh" function. -- The FS must call the VFS superblock alloc and deactivate routines - or add hooks to do the equivalent cleancache calls done there. -- To maximize performance, all pages fetched from the FS should - go through the do_mpag_readpage routine or the FS should add - hooks to do the equivalent (cf. btrfs) -- Currently, the FS blocksize must be the same as PAGESIZE. This - is not an architectural restriction, but no backends currently - support anything different. -- A clustered FS should invoke the "shared_init_fs" cleancache - hook to get best performance for some backends. - -7) Why not use the KVA of the inode as the key? (Christoph Hellwig) + - The FS should be block-device-based (e.g. a ram-based FS such + as tmpfs should not enable cleancache) + - To ensure coherency/correctness, the FS must ensure that all + file removal or truncation operations either go through VFS or + add hooks to do the equivalent cleancache "invalidate" operations + - To ensure coherency/correctness, either inode numbers must + be unique across the lifetime of the on-disk file OR the + FS must provide an "encode_fh" function. + - The FS must call the VFS superblock alloc and deactivate routines + or add hooks to do the equivalent cleancache calls done there. + - To maximize performance, all pages fetched from the FS should + go through the do_mpag_readpage routine or the FS should add + hooks to do the equivalent (cf. btrfs) + - Currently, the FS blocksize must be the same as PAGESIZE. This + is not an architectural restriction, but no backends currently + support anything different. + - A clustered FS should invoke the "shared_init_fs" cleancache + hook to get best performance for some backends. + +* Why not use the KVA of the inode as the key? (Christoph Hellwig) If cleancache would use the inode virtual address instead of inode/filehandle, the pool id could be eliminated. But, this @@ -251,7 +270,7 @@ of cleancache would be lost because the cache of pages in cleanache is potentially much larger than the kernel pagecache and is most useful if the pages survive inode cache removal. -8) Why is a global variable required? +* Why is a global variable required? The cleancache_enabled flag is checked in all of the frequently-used cleancache hooks. The alternative is a function call to check a static @@ -262,14 +281,14 @@ global variable allows cleancache to be enabled by default at compile time, but have insignificant performance impact when cleancache remains disabled at runtime. -9) Does cleanache work with KVM? +* Does cleanache work with KVM? The memory model of KVM is sufficiently different that a cleancache backend may have less value for KVM. This remains to be tested, especially in an overcommitted system. -10) Does cleancache work in userspace? It sounds useful for - memory hungry caches like web browsers. (Jamie Lokier) +* Does cleancache work in userspace? It sounds useful for + memory hungry caches like web browsers. (Jamie Lokier) No plans yet, though we agree it sounds useful, at least for apps that bypass the page cache (e.g. O_DIRECT). diff --git a/Documentation/vm/conf.py b/Documentation/vm/conf.py new file mode 100644 index 000000000000..3b0b601af558 --- /dev/null +++ b/Documentation/vm/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Linux Memory Management Documentation" + +tags.add("subproject") + +latex_documents = [ + ('index', 'memory-management.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.rst index c71a019be600..1979f430c1c5 100644 --- a/Documentation/vm/frontswap.txt +++ b/Documentation/vm/frontswap.rst @@ -1,13 +1,20 @@ +.. _frontswap: + +========= +Frontswap +========= + Frontswap provides a "transcendent memory" interface for swap pages. In some environments, dramatic performance savings may be obtained because swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. -(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" +(Note, frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends" and the only necessary changes to the core kernel for transcendent memory; all other supporting code -- the "backends" -- is implemented as drivers. -See the LWN.net article "Transcendent memory in a nutshell" for a detailed -overview of frontswap and related kernel parts: -https://lwn.net/Articles/454795/ ) +See the LWN.net article `Transcendent memory in a nutshell`_ +for a detailed overview of frontswap and related kernel parts) + +.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/ Frontswap is so named because it can be thought of as the opposite of a "backing" store for a swap device. The storage is assumed to be @@ -50,19 +57,27 @@ or the store fails AND the page is invalidated. This ensures stale data may never be obtained from frontswap. If properly configured, monitoring of frontswap is done via debugfs in -the /sys/kernel/debug/frontswap directory. The effectiveness of +the `/sys/kernel/debug/frontswap` directory. The effectiveness of frontswap can be measured (across all swap devices) with: -failed_stores - how many store attempts have failed -loads - how many loads were attempted (all should succeed) -succ_stores - how many store attempts have succeeded -invalidates - how many invalidates were attempted +``failed_stores`` + how many store attempts have failed + +``loads`` + how many loads were attempted (all should succeed) + +``succ_stores`` + how many store attempts have succeeded + +``invalidates`` + how many invalidates were attempted A backend implementation may provide additional metrics. FAQ +=== -1) Where's the value? +* Where's the value? When a workload starts swapping, performance falls through the floor. Frontswap significantly increases performance in many such workloads by @@ -117,8 +132,8 @@ A KVM implementation is underway and has been RFC'ed to lkml. And, using frontswap, investigation is also underway on the use of NVM as a memory extension technology. -2) Sure there may be performance advantages in some situations, but - what's the space/time overhead of frontswap? +* Sure there may be performance advantages in some situations, but + what's the space/time overhead of frontswap? If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into nothingness and the only overhead is a few extra bytes per swapon'ed @@ -148,8 +163,8 @@ pressure that can potentially outweigh the other advantages. A backend, such as zcache, must implement policies to carefully (but dynamically) manage memory limits to ensure this doesn't happen. -3) OK, how about a quick overview of what this frontswap patch does - in terms that a kernel hacker can grok? +* OK, how about a quick overview of what this frontswap patch does + in terms that a kernel hacker can grok? Let's assume that a frontswap "backend" has registered during kernel initialization; this registration indicates that this @@ -188,9 +203,9 @@ and (potentially) a swap device write are replaced by a "frontswap backend store" and (possibly) a "frontswap backend loads", which are presumably much faster. -4) Can't frontswap be configured as a "special" swap device that is - just higher priority than any real swap device (e.g. like zswap, - or maybe swap-over-nbd/NFS)? +* Can't frontswap be configured as a "special" swap device that is + just higher priority than any real swap device (e.g. like zswap, + or maybe swap-over-nbd/NFS)? No. First, the existing swap subsystem doesn't allow for any kind of swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, @@ -240,9 +255,9 @@ installation, frontswap is useless. Swapless portable devices can still use frontswap but a backend for such devices must configure some kind of "ghost" swap device and ensure that it is never used. -5) Why this weird definition about "duplicate stores"? If a page - has been previously successfully stored, can't it always be - successfully overwritten? +* Why this weird definition about "duplicate stores"? If a page + has been previously successfully stored, can't it always be + successfully overwritten? Nearly always it can, but no, sometimes it cannot. Consider an example where data is compressed and the original 4K page has been compressed @@ -254,7 +269,7 @@ the old data and ensure that it is no longer accessible. Since the swap subsystem then writes the new data to the read swap device, this is the correct course of action to ensure coherency. -6) What is frontswap_shrink for? +* What is frontswap_shrink for? When the (non-frontswap) swap subsystem swaps out a page to a real swap device, that page is only taking up low-value pre-allocated disk @@ -267,7 +282,7 @@ to "repatriate" pages sent to a remote machine back to the local machine; this is driven using the frontswap_shrink mechanism when memory pressure subsides. -7) Why does the frontswap patch create the new include file swapfile.h? +* Why does the frontswap patch create the new include file swapfile.h? The frontswap code depends on some swap-subsystem-internal data structures that have, over the years, moved back and forth between diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.rst index 4324d24ffacd..0f69a9fec34d 100644 --- a/Documentation/vm/highmem.txt +++ b/Documentation/vm/highmem.rst @@ -1,25 +1,14 @@ +.. _highmem: - ==================== - HIGH MEMORY HANDLING - ==================== +==================== +High Memory Handling +==================== By: Peter Zijlstra <a.p.zijlstra@chello.nl> -Contents: - - (*) What is high memory? - - (*) Temporary virtual mappings. - - (*) Using kmap_atomic. - - (*) Cost of temporary mappings. - - (*) i386 PAE. +.. contents:: :local: - -==================== -WHAT IS HIGH MEMORY? +What Is High Memory? ==================== High memory (highmem) is used when the size of physical memory approaches or @@ -38,7 +27,7 @@ kernel entry/exit. This means the available virtual memory space (4GiB on i386) has to be divided between user and kernel space. The traditional split for architectures using this approach is 3:1, 3GiB for -userspace and the top 1GiB for kernel space: +userspace and the top 1GiB for kernel space:: +--------+ 0xffffffff | Kernel | @@ -58,40 +47,38 @@ and user maps. Some hardware (like some ARMs), however, have limited virtual space when they use mm context tags. -========================== -TEMPORARY VIRTUAL MAPPINGS +Temporary Virtual Mappings ========================== The kernel contains several ways of creating temporary mappings: - (*) vmap(). This can be used to make a long duration mapping of multiple - physical pages into a contiguous virtual space. It needs global - synchronization to unmap. +* vmap(). This can be used to make a long duration mapping of multiple + physical pages into a contiguous virtual space. It needs global + synchronization to unmap. - (*) kmap(). This permits a short duration mapping of a single page. It needs - global synchronization, but is amortized somewhat. It is also prone to - deadlocks when using in a nested fashion, and so it is not recommended for - new code. +* kmap(). This permits a short duration mapping of a single page. It needs + global synchronization, but is amortized somewhat. It is also prone to + deadlocks when using in a nested fashion, and so it is not recommended for + new code. - (*) kmap_atomic(). This permits a very short duration mapping of a single - page. Since the mapping is restricted to the CPU that issued it, it - performs well, but the issuing task is therefore required to stay on that - CPU until it has finished, lest some other task displace its mappings. +* kmap_atomic(). This permits a very short duration mapping of a single + page. Since the mapping is restricted to the CPU that issued it, it + performs well, but the issuing task is therefore required to stay on that + CPU until it has finished, lest some other task displace its mappings. - kmap_atomic() may also be used by interrupt contexts, since it is does not - sleep and the caller may not sleep until after kunmap_atomic() is called. + kmap_atomic() may also be used by interrupt contexts, since it is does not + sleep and the caller may not sleep until after kunmap_atomic() is called. - It may be assumed that k[un]map_atomic() won't fail. + It may be assumed that k[un]map_atomic() won't fail. -================= -USING KMAP_ATOMIC +Using kmap_atomic ================= When and where to use kmap_atomic() is straightforward. It is used when code wants to access the contents of a page that might be allocated from high memory (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two -functions, and they can be used in a manner similar to the following: +functions, and they can be used in a manner similar to the following:: /* Find the page of interest. */ struct page *page = find_get_page(mapping, offset); @@ -109,7 +96,7 @@ Note that the kunmap_atomic() call takes the result of the kmap_atomic() call not the argument. If you need to map two pages because you want to copy from one page to -another you need to keep the kmap_atomic calls strictly nested, like: +another you need to keep the kmap_atomic calls strictly nested, like:: vaddr1 = kmap_atomic(page1); vaddr2 = kmap_atomic(page2); @@ -120,8 +107,7 @@ another you need to keep the kmap_atomic calls strictly nested, like: kunmap_atomic(vaddr1); -========================== -COST OF TEMPORARY MAPPINGS +Cost of Temporary Mappings ========================== The cost of creating temporary mappings can be quite high. The arch has to @@ -136,25 +122,24 @@ If CONFIG_MMU is not set, then there can be no temporary mappings and no highmem. In such a case, the arithmetic approach will also be used. -======== i386 PAE ======== The i386 arch, under some circumstances, will permit you to stick up to 64GiB of RAM into your 32-bit machine. This has a number of consequences: - (*) Linux needs a page-frame structure for each page in the system and the - pageframes need to live in the permanent mapping, which means: +* Linux needs a page-frame structure for each page in the system and the + pageframes need to live in the permanent mapping, which means: - (*) you can have 896M/sizeof(struct page) page-frames at most; with struct - page being 32-bytes that would end up being something in the order of 112G - worth of pages; the kernel, however, needs to store more than just - page-frames in that memory... +* you can have 896M/sizeof(struct page) page-frames at most; with struct + page being 32-bytes that would end up being something in the order of 112G + worth of pages; the kernel, however, needs to store more than just + page-frames in that memory... - (*) PAE makes your page tables larger - which slows the system down as more - data has to be accessed to traverse in TLB fills and the like. One - advantage is that PAE has more PTE bits and can provide advanced features - like NX and PAT. +* PAE makes your page tables larger - which slows the system down as more + data has to be accessed to traverse in TLB fills and the like. One + advantage is that PAE has more PTE bits and can provide advanced features + like NX and PAT. The general recommendation is that you don't use more than 8GiB on a 32-bit machine - although more might work for you and your workload, you're pretty diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.rst index 2d1d6f69e91b..cdf3911582c8 100644 --- a/Documentation/vm/hmm.txt +++ b/Documentation/vm/hmm.rst @@ -1,4 +1,8 @@ +.. hmm: + +===================================== Heterogeneous Memory Management (HMM) +===================================== Provide infrastructure and helpers to integrate non-conventional memory (device memory like GPU on board memory) into regular kernel path, with the cornerstone @@ -6,10 +10,10 @@ of this being specialized struct page for such memory (see sections 5 to 7 of this document). HMM also provides optional helpers for SVM (Share Virtual Memory), i.e., -allowing a device to transparently access program address coherently with the -CPU meaning that any valid pointer on the CPU is also a valid pointer for the -device. This is becoming mandatory to simplify the use of advanced hetero- -geneous computing where GPU, DSP, or FPGA are used to perform various +allowing a device to transparently access program address coherently with +the CPU meaning that any valid pointer on the CPU is also a valid pointer +for the device. This is becoming mandatory to simplify the use of advanced +heterogeneous computing where GPU, DSP, or FPGA are used to perform various computations on behalf of a process. This document is divided as follows: in the first section I expose the problems @@ -21,19 +25,10 @@ fifth section deals with how device memory is represented inside the kernel. Finally, the last section presents a new migration helper that allows lever- aging the device DMA engine. +.. contents:: :local: -1) Problems of using a device specific memory allocator: -2) I/O bus, device memory characteristics -3) Shared address space and migration -4) Address space mirroring implementation and API -5) Represent and manage device memory from core kernel point of view -6) Migration to and from device memory -7) Memory cgroup (memcg) and rss accounting - - -------------------------------------------------------------------------------- - -1) Problems of using a device specific memory allocator: +Problems of using a device specific memory allocator +==================================================== Devices with a large amount of on board memory (several gigabytes) like GPUs have historically managed their memory through dedicated driver specific APIs. @@ -77,9 +72,8 @@ are only do-able with a shared address space. It is also more reasonable to use a shared address space for all other patterns. -------------------------------------------------------------------------------- - -2) I/O bus, device memory characteristics +I/O bus, device memory characteristics +====================================== I/O buses cripple shared address spaces due to a few limitations. Most I/O buses only allow basic memory access from device to main memory; even cache @@ -109,9 +103,8 @@ access any memory but we must also permit any memory to be migrated to device memory while device is using it (blocking CPU access while it happens). -------------------------------------------------------------------------------- - -3) Shared address space and migration +Shared address space and migration +================================== HMM intends to provide two main features. First one is to share the address space by duplicating the CPU page table in the device page table so the same @@ -148,23 +141,23 @@ ages device memory by migrating the part of the data set that is actively being used by the device. -------------------------------------------------------------------------------- - -4) Address space mirroring implementation and API +Address space mirroring implementation and API +============================================== Address space mirroring's main objective is to allow duplication of a range of CPU page table into a device page table; HMM helps keep both synchronized. A device driver that wants to mirror a process address space must start with the -registration of an hmm_mirror struct: +registration of an hmm_mirror struct:: int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm); int hmm_mirror_register_locked(struct hmm_mirror *mirror, struct mm_struct *mm); + The locked variant is to be used when the driver is already holding mmap_sem of the mm in write mode. The mirror struct has a set of callbacks that are used -to propagate CPU page tables: +to propagate CPU page tables:: struct hmm_mirror_ops { /* sync_cpu_device_pagetables() - synchronize page tables @@ -193,10 +186,10 @@ The device driver must perform the update action to the range (mark range read only, or fully unmap, ...). The device must be done with the update before the driver callback returns. - When the device driver wants to populate a range of virtual addresses, it can -use either: - int hmm_vma_get_pfns(struct vm_area_struct *vma, +use either:: + + int hmm_vma_get_pfns(struct vm_area_struct *vma, struct hmm_range *range, unsigned long start, unsigned long end, @@ -221,7 +214,7 @@ provides a set of flags to help the driver identify special CPU page table entries. Locking with the update() callback is the most important aspect the driver must -respect in order to keep things properly synchronized. The usage pattern is: +respect in order to keep things properly synchronized. The usage pattern is:: int driver_populate_range(...) { @@ -262,9 +255,8 @@ report commands as executed is serialized (there is no point in doing this concurrently). -------------------------------------------------------------------------------- - -5) Represent and manage device memory from core kernel point of view +Represent and manage device memory from core kernel point of view +================================================================= Several different designs were tried to support device memory. First one used a device specific data structure to keep information about migrated memory and @@ -280,14 +272,14 @@ unaware of the difference. We only need to make sure that no one ever tries to map those pages from the CPU side. HMM provides a set of helpers to register and hotplug device memory as a new -region needing a struct page. This is offered through a very simple API: +region needing a struct page. This is offered through a very simple API:: struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, struct device *device, unsigned long size); void hmm_devmem_remove(struct hmm_devmem *devmem); -The hmm_devmem_ops is where most of the important things are: +The hmm_devmem_ops is where most of the important things are:: struct hmm_devmem_ops { void (*free)(struct hmm_devmem *devmem, struct page *page); @@ -306,13 +298,12 @@ which it cannot do. This second callback must trigger a migration back to system memory. -------------------------------------------------------------------------------- - -6) Migration to and from device memory +Migration to and from device memory +=================================== Because the CPU cannot access device memory, migration must use the device DMA engine to perform copy from and to device memory. For this we need a new -migration helper: +migration helper:: int migrate_vma(const struct migrate_vma_ops *ops, struct vm_area_struct *vma, @@ -331,7 +322,7 @@ migration might be for a range of addresses the device is actively accessing. The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy()) controls destination memory allocation and copy operation. Second one is there -to allow the device driver to perform cleanup operations after migration. +to allow the device driver to perform cleanup operations after migration:: struct migrate_vma_ops { void (*alloc_and_copy)(struct vm_area_struct *vma, @@ -365,9 +356,8 @@ bandwidth but this is considered as a rare event and a price that we are willing to pay to keep all the code simpler. -------------------------------------------------------------------------------- - -7) Memory cgroup (memcg) and rss accounting +Memory cgroup (memcg) and rss accounting +======================================== For now device memory is accounted as any regular page in rss counters (either anonymous if device page is used for anonymous, file if device page is used for diff --git a/Documentation/vm/hugetlbfs_reserv.txt b/Documentation/vm/hugetlbfs_reserv.rst index 9aca09a76bed..9d200762114f 100644 --- a/Documentation/vm/hugetlbfs_reserv.txt +++ b/Documentation/vm/hugetlbfs_reserv.rst @@ -1,6 +1,13 @@ -Hugetlbfs Reservation Overview ------------------------------- -Huge pages as described at 'Documentation/vm/hugetlbpage.txt' are typically +.. _hugetlbfs_reserve: + +===================== +Hugetlbfs Reservation +===================== + +Overview +======== + +Huge pages as described at :ref:`hugetlbpage` are typically preallocated for application use. These huge pages are instantiated in a task's address space at page fault time if the VMA indicates huge pages are to be used. If no huge page exists at page fault time, the task is sent @@ -17,47 +24,55 @@ describe how huge page reserve processing is done in the v4.10 kernel. Audience --------- +======== This description is primarily targeted at kernel developers who are modifying hugetlbfs code. The Data Structures -------------------- +=================== + resv_huge_pages This is a global (per-hstate) count of reserved huge pages. Reserved huge pages are only available to the task which reserved them. Therefore, the number of huge pages generally available is computed - as (free_huge_pages - resv_huge_pages). + as (``free_huge_pages - resv_huge_pages``). Reserve Map - A reserve map is described by the structure: - struct resv_map { - struct kref refs; - spinlock_t lock; - struct list_head regions; - long adds_in_progress; - struct list_head region_cache; - long region_cache_count; - }; + A reserve map is described by the structure:: + + struct resv_map { + struct kref refs; + spinlock_t lock; + struct list_head regions; + long adds_in_progress; + struct list_head region_cache; + long region_cache_count; + }; + There is one reserve map for each huge page mapping in the system. The regions list within the resv_map describes the regions within - the mapping. A region is described as: - struct file_region { - struct list_head link; - long from; - long to; - }; + the mapping. A region is described as:: + + struct file_region { + struct list_head link; + long from; + long to; + }; + The 'from' and 'to' fields of the file region structure are huge page indices into the mapping. Depending on the type of mapping, a region in the reserv_map may indicate reservations exist for the range, or reservations do not exist. Flags for MAP_PRIVATE Reservations These are stored in the bottom bits of the reservation map pointer. - #define HPAGE_RESV_OWNER (1UL << 0) Indicates this task is the - owner of the reservations associated with the mapping. - #define HPAGE_RESV_UNMAPPED (1UL << 1) Indicates task originally - mapping this range (and creating reserves) has unmapped a - page from this task (the child) due to a failed COW. + + ``#define HPAGE_RESV_OWNER (1UL << 0)`` + Indicates this task is the owner of the reservations + associated with the mapping. + ``#define HPAGE_RESV_UNMAPPED (1UL << 1)`` + Indicates task originally mapping this range (and creating + reserves) has unmapped a page from this task (the child) + due to a failed COW. Page Flags The PagePrivate page flag is used to indicate that a huge page reservation must be restored when the huge page is freed. More @@ -65,12 +80,14 @@ Page Flags Reservation Map Location (Private or Shared) --------------------------------------------- +============================================ + A huge page mapping or segment is either private or shared. If private, it is typically only available to a single address space (task). If shared, it can be mapped into multiple address spaces (tasks). The location and semantics of the reservation map is significantly different for two types of mappings. Location differences are: + - For private mappings, the reservation map hangs off the the VMA structure. Specifically, vma->vm_private_data. This reserve map is created at the time the mapping (mmap(MAP_PRIVATE)) is created. @@ -82,15 +99,15 @@ of mappings. Location differences are: Creating Reservations ---------------------- +===================== Reservations are created when a huge page backed shared memory segment is created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB). -These operations result in a call to the routine hugetlb_reserve_pages() +These operations result in a call to the routine hugetlb_reserve_pages():: -int hugetlb_reserve_pages(struct inode *inode, - long from, long to, - struct vm_area_struct *vma, - vm_flags_t vm_flags) + int hugetlb_reserve_pages(struct inode *inode, + long from, long to, + struct vm_area_struct *vma, + vm_flags_t vm_flags) The first thing hugetlb_reserve_pages() does is check for the NORESERVE flag was specified in either the shmget() or mmap() call. If NORESERVE @@ -105,6 +122,7 @@ the 'from' and 'to' arguments have been adjusted by this offset. One of the big differences between PRIVATE and SHARED mappings is the way in which reservations are represented in the reservation map. + - For shared mappings, an entry in the reservation map indicates a reservation exists or did exist for the corresponding page. As reservations are consumed, the reservation map is not modified. @@ -121,12 +139,13 @@ to indicate this VMA owns the reservations. The reservation map is consulted to determine how many huge page reservations are needed for the current mapping/segment. For private mappings, this is always the value (to - from). However, for shared mappings it is possible that some reservations may already exist within the range (to - from). See the -section "Reservation Map Modifications" for details on how this is accomplished. +section :ref:`Reservation Map Modifications <resv_map_modifications>` +for details on how this is accomplished. The mapping may be associated with a subpool. If so, the subpool is consulted to ensure there is sufficient space for the mapping. It is possible that the subpool has set aside reservations that can be used for the mapping. See the -section "Subpool Reservations" for more details. +section :ref:`Subpool Reservations <sub_pool_resv>` for more details. After consulting the reservation map and subpool, the number of needed new reservations is known. The routine hugetlb_acct_memory() is called to check @@ -135,9 +154,11 @@ calls into routines that potentially allocate and adjust surplus page counts. However, within those routines the code is simply checking to ensure there are enough free huge pages to accommodate the reservation. If there are, the global reservation count resv_huge_pages is adjusted something like the -following. +following:: + if (resv_needed <= (resv_huge_pages - free_huge_pages)) resv_huge_pages += resv_needed; + Note that the global lock hugetlb_lock is held when checking and adjusting these counters. @@ -152,14 +173,18 @@ If hugetlb_reserve_pages() was successful, the global reservation count and reservation map associated with the mapping will be modified as required to ensure reservations exist for the range 'from' - 'to'. +.. _consume_resv: Consuming Reservations/Allocating a Huge Page ---------------------------------------------- +============================================= + Reservations are consumed when huge pages associated with the reservations are allocated and instantiated in the corresponding mapping. The allocation -is performed within the routine alloc_huge_page(). -struct page *alloc_huge_page(struct vm_area_struct *vma, - unsigned long addr, int avoid_reserve) +is performed within the routine alloc_huge_page():: + + struct page *alloc_huge_page(struct vm_area_struct *vma, + unsigned long addr, int avoid_reserve) + alloc_huge_page is passed a VMA pointer and a virtual address, so it can consult the reservation map to determine if a reservation exists. In addition, alloc_huge_page takes the argument avoid_reserve which indicates reserves @@ -170,8 +195,9 @@ page are being allocated. The helper routine vma_needs_reservation() is called to determine if a reservation exists for the address within the mapping(vma). See the section -"Reservation Map Helper Routines" for detailed information on what this -routine does. The value returned from vma_needs_reservation() is generally +:ref:`Reservation Map Helper Routines <resv_map_helpers>` for detailed +information on what this routine does. +The value returned from vma_needs_reservation() is generally 0 or 1. 0 if a reservation exists for the address, 1 if no reservation exists. If a reservation does not exist, and there is a subpool associated with the mapping the subpool is consulted to determine if it contains reservations. @@ -180,21 +206,25 @@ However, in every case the avoid_reserve argument overrides the use of a reservation for the allocation. After determining whether a reservation exists and can be used for the allocation, the routine dequeue_huge_page_vma() is called. This routine takes two arguments related to reservations: + - avoid_reserve, this is the same value/argument passed to alloc_huge_page() - chg, even though this argument is of type long only the values 0 or 1 are passed to dequeue_huge_page_vma. If the value is 0, it indicates a reservation exists (see the section "Memory Policy and Reservations" for possible issues). If the value is 1, it indicates a reservation does not exist and the page must be taken from the global free pool if possible. + The free lists associated with the memory policy of the VMA are searched for a free page. If a page is found, the value free_huge_pages is decremented when the page is removed from the free list. If there was a reservation -associated with the page, the following adjustments are made: +associated with the page, the following adjustments are made:: + SetPagePrivate(page); /* Indicates allocating this page consumed * a reservation, and if an error is * encountered such that the page must be * freed, the reservation will be restored. */ resv_huge_pages--; /* Decrement the global reservation count */ + Note, if no huge page can be found that satisfies the VMA's memory policy an attempt will be made to allocate one using the buddy allocator. This brings up the issue of surplus huge pages and overcommit which is beyond @@ -222,12 +252,14 @@ mapping. In such cases, the reservation count and subpool free page count will be off by one. This rare condition can be identified by comparing the return value from vma_needs_reservation and vma_commit_reservation. If such a race is detected, the subpool and global reserve counts are adjusted to -compensate. See the section "Reservation Map Helper Routines" for more +compensate. See the section +:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more information on these routines. Instantiate Huge Pages ----------------------- +====================== + After huge page allocation, the page is typically added to the page tables of the allocating task. Before this, pages in a shared mapping are added to the page cache and pages in private mappings are added to an anonymous @@ -237,7 +269,8 @@ to the global reservation count (resv_huge_pages). Freeing Huge Pages ------------------- +================== + Huge page freeing is performed by the routine free_huge_page(). This routine is the destructor for hugetlbfs compound pages. As a result, it is only passed a pointer to the page struct. When a huge page is freed, reservation @@ -247,7 +280,8 @@ on an error path where a global reserve count must be restored. The page->private field points to any subpool associated with the page. If the PagePrivate flag is set, it indicates the global reserve count should -be adjusted (see the section "Consuming Reservations/Allocating a Huge Page" +be adjusted (see the section +:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>` for information on how these are set). The routine first calls hugepage_subpool_put_pages() for the page. If this @@ -259,9 +293,11 @@ Therefore, the global resv_huge_pages counter is incremented in this case. If the PagePrivate flag was set in the page, the global resv_huge_pages counter will always be incremented. +.. _sub_pool_resv: Subpool Reservations --------------------- +==================== + There is a struct hstate associated with each huge page size. The hstate tracks all huge pages of the specified size. A subpool represents a subset of pages within a hstate that is associated with a mounted hugetlbfs @@ -295,7 +331,8 @@ the global pools. COW and Reservations --------------------- +==================== + Since shared mappings all point to and use the same underlying pages, the biggest reservation concern for COW is private mappings. In this case, two tasks can be pointing at the same previously allocated page. One task @@ -326,30 +363,36 @@ faults on a non-present page. But, the original owner of the mapping/reservation will behave as expected. +.. _resv_map_modifications: + Reservation Map Modifications ------------------------------ +============================= + The following low level routines are used to make modifications to a reservation map. Typically, these routines are not called directly. Rather, a reservation map helper routine is called which calls one of these low level routines. These low level routines are fairly well documented in the source -code (mm/hugetlb.c). These routines are: -long region_chg(struct resv_map *resv, long f, long t); -long region_add(struct resv_map *resv, long f, long t); -void region_abort(struct resv_map *resv, long f, long t); -long region_count(struct resv_map *resv, long f, long t); +code (mm/hugetlb.c). These routines are:: + + long region_chg(struct resv_map *resv, long f, long t); + long region_add(struct resv_map *resv, long f, long t); + void region_abort(struct resv_map *resv, long f, long t); + long region_count(struct resv_map *resv, long f, long t); Operations on the reservation map typically involve two operations: + 1) region_chg() is called to examine the reserve map and determine how many pages in the specified range [f, t) are NOT currently represented. The calling code performs global checks and allocations to determine if there are enough huge pages for the operation to succeed. -2a) If the operation can succeed, region_add() is called to actually modify - the reservation map for the same range [f, t) previously passed to - region_chg(). -2b) If the operation can not succeed, region_abort is called for the same range - [f, t) to abort the operation. +2) + a) If the operation can succeed, region_add() is called to actually modify + the reservation map for the same range [f, t) previously passed to + region_chg(). + b) If the operation can not succeed, region_abort is called for the same + range [f, t) to abort the operation. Note that this is a two step process where region_add() and region_abort() are guaranteed to succeed after a prior call to region_chg() for the same @@ -371,6 +414,7 @@ and make the appropriate adjustments. The routine region_del() is called to remove regions from a reservation map. It is typically called in the following situations: + - When a file in the hugetlbfs filesystem is being removed, the inode will be released and the reservation map freed. Before freeing the reservation map, all the individual file_region structures must be freed. In this case @@ -384,6 +428,7 @@ It is typically called in the following situations: removed, region_del() is called to remove the corresponding entry from the reservation map. In this case, region_del is passed the range [page_idx, page_idx + 1). + In every case, region_del() will return the number of pages removed from the reservation map. In VERY rare cases, region_del() can fail. This can only happen in the hole punch case where it has to split an existing file_region @@ -403,9 +448,11 @@ outstanding (outstanding = (end - start) - region_count(resv, start, end)). Since the mapping is going away, the subpool and global reservation counts are decremented by the number of outstanding reservations. +.. _resv_map_helpers: Reservation Map Helper Routines -------------------------------- +=============================== + Several helper routines exist to query and modify the reservation maps. These routines are only interested with reservations for a specific huge page, so they just pass in an address instead of a range. In addition, @@ -414,32 +461,40 @@ or shared) and the location of the reservation map (inode or VMA) can be determined. These routines simply call the underlying routines described in the section "Reservation Map Modifications". However, they do take into account the 'opposite' meaning of reservation map entries for private and -shared mappings and hide this detail from the caller. +shared mappings and hide this detail from the caller:: + + long vma_needs_reservation(struct hstate *h, + struct vm_area_struct *vma, + unsigned long addr) -long vma_needs_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) This routine calls region_chg() for the specified page. If no reservation -exists, 1 is returned. If a reservation exists, 0 is returned. +exists, 1 is returned. If a reservation exists, 0 is returned:: + + long vma_commit_reservation(struct hstate *h, + struct vm_area_struct *vma, + unsigned long addr) -long vma_commit_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) This calls region_add() for the specified page. As in the case of region_chg and region_add, this routine is to be called after a previous call to vma_needs_reservation. It will add a reservation entry for the page. It returns 1 if the reservation was added and 0 if not. The return value should be compared with the return value of the previous call to vma_needs_reservation. An unexpected difference indicates the reservation -map was modified between calls. +map was modified between calls:: + + void vma_end_reservation(struct hstate *h, + struct vm_area_struct *vma, + unsigned long addr) -void vma_end_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) This calls region_abort() for the specified page. As in the case of region_chg and region_abort, this routine is to be called after a previous call to vma_needs_reservation. It will abort/end the in progress reservation add -operation. +operation:: + + long vma_add_reservation(struct hstate *h, + struct vm_area_struct *vma, + unsigned long addr) -long vma_add_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) This is a special wrapper routine to help facilitate reservation cleanup on error paths. It is only called from the routine restore_reserve_on_error(). This routine is used in conjunction with vma_needs_reservation in an attempt @@ -453,8 +508,10 @@ be done on error paths. Reservation Cleanup in Error Paths ----------------------------------- -As mentioned in the section "Reservation Map Helper Routines", reservation +================================== + +As mentioned in the section +:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation map modifications are performed in two steps. First vma_needs_reservation is called before a page is allocated. If the allocation is successful, then vma_commit_reservation is called. If not, vma_end_reservation is called. @@ -494,13 +551,14 @@ so that a reservation will not be leaked when the huge page is freed. Reservations and Memory Policy ------------------------------- +============================== Per-node huge page lists existed in struct hstate when git was first used to manage Linux code. The concept of reservations was added some time later. When reservations were added, no attempt was made to take memory policy into account. While cpusets are not exactly the same as memory policy, this comment in hugetlb_acct_memory sums up the interaction between reservations -and cpusets/memory policy. +and cpusets/memory policy:: + /* * When cpuset is configured, it breaks the strict hugetlb page * reservation as the accounting is done on a global variable. Such @@ -525,5 +583,13 @@ of cpusets or memory policy there is no guarantee that huge pages will be available on the required nodes. This is true even if there are a sufficient number of global reservations. +Hugetlbfs regression testing +============================ + +The most complete set of hugetlb tests are in the libhugetlbfs repository. +If you modify any hugetlb related code, use the libhugetlbfs test suite +to check for regressions. In addition, if you add any new hugetlb +functionality, please add appropriate tests to libhugetlbfs. +-- Mike Kravetz, 7 April 2017 diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.rst index e912d7eee769..09bd24a92784 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.rst @@ -1,7 +1,14 @@ +.. hwpoison: + +======== +hwpoison +======== + What is hwpoison? +================= Upcoming Intel CPUs have support for recovering from some memory errors -(``MCA recovery''). This requires the OS to declare a page "poisoned", +(``MCA recovery``). This requires the OS to declare a page "poisoned", kill the processes associated with it and avoid using it in the future. This patchkit implements the necessary infrastructure in the VM. @@ -46,9 +53,10 @@ address. This in theory allows other applications to handle memory failures too. The expection is that near all applications won't do that, but some very specialized ones might. ---- +Failure recovery modes +====================== -There are two (actually three) modi memory failure recovery can be in: +There are two (actually three) modes memory failure recovery can be in: vm.memory_failure_recovery sysctl set to zero: All memory failures cause a panic. Do not attempt recovery. @@ -67,9 +75,8 @@ late kill This is best for memory error unaware applications and default Note some pages are always handled as late kill. ---- - -User control: +User control +============ vm.memory_failure_recovery See sysctl.txt @@ -79,11 +86,19 @@ vm.memory_failure_early_kill PR_MCE_KILL Set early/late kill mode/revert to system default - arg1: PR_MCE_KILL_CLEAR: Revert to system default - arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode - PR_MCE_KILL_EARLY: Early kill - PR_MCE_KILL_LATE: Late kill - PR_MCE_KILL_DEFAULT: Use system global default + + arg1: PR_MCE_KILL_CLEAR: + Revert to system default + arg1: PR_MCE_KILL_SET: + arg2 defines thread specific mode + + PR_MCE_KILL_EARLY: + Early kill + PR_MCE_KILL_LATE: + Late kill + PR_MCE_KILL_DEFAULT + Use system global default + Note that if you want to have a dedicated thread which handles the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise, @@ -92,77 +107,64 @@ PR_MCE_KILL PR_MCE_KILL_GET return current mode +Testing +======= ---- - -Testing: - -madvise(MADV_HWPOISON, ....) - (as root) - Poison a page in the process for testing - +* madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the + process for testing -hwpoison-inject module through debugfs +* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/`` -/sys/kernel/debug/hwpoison/ + corrupt-pfn + Inject hwpoison fault at PFN echoed into this file. This does + some early filtering to avoid corrupted unintended pages in test suites. -corrupt-pfn + unpoison-pfn + Software-unpoison page at PFN echoed into this file. This way + a page can be reused again. This only works for Linux + injected failures, not for real memory failures. -Inject hwpoison fault at PFN echoed into this file. This does -some early filtering to avoid corrupted unintended pages in test suites. + Note these injection interfaces are not stable and might change between + kernel versions -unpoison-pfn + corrupt-filter-dev-major, corrupt-filter-dev-minor + Only handle memory failures to pages associated with the file + system defined by block device major/minor. -1U is the + wildcard value. This should be only used for testing with + artificial injection. -Software-unpoison page at PFN echoed into this file. This -way a page can be reused again. -This only works for Linux injected failures, not for real -memory failures. + corrupt-filter-memcg + Limit injection to pages owned by memgroup. Specified by inode + number of the memcg. -Note these injection interfaces are not stable and might change between -kernel versions + Example:: -corrupt-filter-dev-major -corrupt-filter-dev-minor + mkdir /sys/fs/cgroup/mem/hwpoison -Only handle memory failures to pages associated with the file system defined -by block device major/minor. -1U is the wildcard value. -This should be only used for testing with artificial injection. + usemem -m 100 -s 1000 & + echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks -corrupt-filter-memcg + memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') + echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg -Limit injection to pages owned by memgroup. Specified by inode number -of the memcg. + page-types -p `pidof init` --hwpoison # shall do nothing + page-types -p `pidof usemem` --hwpoison # poison its pages -Example: - mkdir /sys/fs/cgroup/mem/hwpoison + corrupt-filter-flags-mask, corrupt-filter-flags-value + When specified, only poison pages if ((page_flags & mask) == + value). This allows stress testing of many kinds of + pages. The page_flags are the same as in /proc/kpageflags. The + flag bits are defined in include/linux/kernel-page-flags.h and + documented in Documentation/admin-guide/mm/pagemap.rst - usemem -m 100 -s 1000 & - echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks +* Architecture specific MCE injector - memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') - echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg + x86 has mce-inject, mce-test - page-types -p `pidof init` --hwpoison # shall do nothing - page-types -p `pidof usemem` --hwpoison # poison its pages + Some portable hwpoison test programs in mce-test, see below. -corrupt-filter-flags-mask -corrupt-filter-flags-value - -When specified, only poison pages if ((page_flags & mask) == value). -This allows stress testing of many kinds of pages. The page_flags -are the same as in /proc/kpageflags. The flag bits are defined in -include/linux/kernel-page-flags.h and documented in -Documentation/vm/pagemap.txt - -Architecture specific MCE injector - -x86 has mce-inject, mce-test - -Some portable hwpoison test programs in mce-test, see blow. - ---- - -References: +References +========== http://halobates.de/mce-lc09-2.pdf Overview presentation from LinuxCon 09 @@ -174,14 +176,11 @@ git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git x86 specific injector ---- - -Limitations: - +Limitations +=========== - Not all page types are supported and never will. Most kernel internal -objects cannot be recovered, only LRU pages for now. + objects cannot be recovered, only LRU pages for now. - Right now hugepage support is missing. --- Andi Kleen, Oct 2009 - diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst new file mode 100644 index 000000000000..c4ded22197ca --- /dev/null +++ b/Documentation/vm/index.rst @@ -0,0 +1,50 @@ +===================================== +Linux Memory Management Documentation +===================================== + +This is a collection of documents about Linux memory management (mm) subsystem. + +User guides for MM features +=========================== + +The following documents provide guides for controlling and tuning +various features of the Linux memory management + +.. toctree:: + :maxdepth: 1 + + swap_numa + zswap + +Kernel developers MM documentation +================================== + +The below documents describe MM internals with different level of +details ranging from notes and mailing list responses to elaborate +descriptions of data structures and algorithms. + +.. toctree:: + :maxdepth: 1 + + active_mm + balance + cleancache + frontswap + highmem + hmm + hwpoison + hugetlbfs_reserv + ksm + mmu_notifier + numa + overcommit-accounting + page_migration + page_frags + page_owner + remap_file_pages + slub + split_page_table_lock + transhuge + unevictable-lru + z3fold + zsmalloc diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst new file mode 100644 index 000000000000..d32016d9be2c --- /dev/null +++ b/Documentation/vm/ksm.rst @@ -0,0 +1,87 @@ +.. _ksm: + +======================= +Kernel Samepage Merging +======================= + +KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, +added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation, +and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ + +The userspace interface of KSM is described in :ref:`Documentation/admin-guide/mm/ksm.rst <admin_guide_ksm>` + +Design +====== + +Overview +-------- + +.. kernel-doc:: mm/ksm.c + :DOC: Overview + +Reverse mapping +--------------- +KSM maintains reverse mapping information for KSM pages in the stable +tree. + +If a KSM page is shared between less than ``max_page_sharing`` VMAs, +the node of the stable tree that represents such KSM page points to a +list of :c:type:`struct rmap_item` and the ``page->mapping`` of the +KSM page points to the stable tree node. + +When the sharing passes this threshold, KSM adds a second dimension to +the stable tree. The tree node becomes a "chain" that links one or +more "dups". Each "dup" keeps reverse mapping information for a KSM +page with ``page->mapping`` pointing to that "dup". + +Every "chain" and all "dups" linked into a "chain" enforce the +invariant that they represent the same write protected memory content, +even if each "dup" will be pointed by a different KSM page copy of +that content. + +This way the stable tree lookup computational complexity is unaffected +if compared to an unlimited list of reverse mappings. It is still +enforced that there cannot be KSM page content duplicates in the +stable tree itself. + +The deduplication limit enforced by ``max_page_sharing`` is required +to avoid the virtual memory rmap lists to grow too large. The rmap +walk has O(N) complexity where N is the number of rmap_items +(i.e. virtual mappings) that are sharing the page, which is in turn +capped by ``max_page_sharing``. So this effectively spreads the linear +O(N) computational complexity from rmap walk context over different +KSM pages. The ksmd walk over the stable_node "chains" is also O(N), +but N is the number of stable_node "dups", not the number of +rmap_items, so it has not a significant impact on ksmd performance. In +practice the best stable_node "dup" candidate will be kept and found +at the head of the "dups" list. + +High values of ``max_page_sharing`` result in faster memory merging +(because there will be fewer stable_node dups queued into the +stable_node chain->hlist to check for pruning) and higher +deduplication factor at the expense of slower worst case for rmap +walks for any KSM page which can happen during swapping, compaction, +NUMA balancing and page migration. + +The ``stable_node_dups/stable_node_chains`` ratio is also affected by the +``max_page_sharing`` tunable, and an high ratio may indicate fragmentation +in the stable_node dups, which could be solved by introducing +fragmentation algorithms in ksmd which would refile rmap_items from +one stable_node dup to another stable_node dup, in order to free up +stable_node "dups" with few rmap_items in them, but that may increase +the ksmd CPU usage and possibly slowdown the readonly computations on +the KSM pages of the applications. + +The whole list of stable_node "dups" linked in the stable_node +"chains" is scanned periodically in order to prune stale stable_nodes. +The frequency of such scans is defined by +``stable_node_chains_prune_millisecs`` sysfs tunable. + +Reference +--------- +.. kernel-doc:: mm/ksm.c + :functions: mm_slot ksm_scan stable_node rmap_item + +-- +Izik Eidus, +Hugh Dickins, 17 Nov 2009 diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt deleted file mode 100644 index 6686bd267dc9..000000000000 --- a/Documentation/vm/ksm.txt +++ /dev/null @@ -1,178 +0,0 @@ -How to use the Kernel Samepage Merging feature ----------------------------------------------- - -KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, -added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, -and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ - -The KSM daemon ksmd periodically scans those areas of user memory which -have been registered with it, looking for pages of identical content which -can be replaced by a single write-protected page (which is automatically -copied if a process later wants to update its content). - -KSM was originally developed for use with KVM (where it was known as -Kernel Shared Memory), to fit more virtual machines into physical memory, -by sharing the data common between them. But it can be useful to any -application which generates many instances of the same data. - -KSM only merges anonymous (private) pages, never pagecache (file) pages. -KSM's merged pages were originally locked into kernel memory, but can now -be swapped out just like other user pages (but sharing is broken when they -are swapped back in: ksmd must rediscover their identity and merge again). - -KSM only operates on those areas of address space which an application -has advised to be likely candidates for merging, by using the madvise(2) -system call: int madvise(addr, length, MADV_MERGEABLE). - -The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel -that advice and restore unshared pages: whereupon KSM unmerges whatever -it merged in that range. Note: this unmerging call may suddenly require -more memory than is available - possibly failing with EAGAIN, but more -probably arousing the Out-Of-Memory killer. - -If KSM is not configured into the running kernel, madvise MADV_MERGEABLE -and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was -built with CONFIG_KSM=y, those calls will normally succeed: even if the -the KSM daemon is not currently running, MADV_MERGEABLE still registers -the range for whenever the KSM daemon is started; even if the range -cannot contain any pages which KSM could actually merge; even if -MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. - -If a region of memory must be split into at least one new MADV_MERGEABLE -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt). - -Like other madvise calls, they are intended for use on mapped areas of -the user address space: they will report ENOMEM if the specified range -includes unmapped gaps (though working on the intervening mapped areas), -and might fail with EAGAIN if not enough memory for internal structures. - -Applications should be considerate in their use of MADV_MERGEABLE, -restricting its use to areas likely to benefit. KSM's scans may use a lot -of processing power: some installations will disable KSM for that reason. - -The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, -readable by all but writable only by root: - -pages_to_scan - how many present pages to scan before ksmd goes to sleep - e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan" - Default: 100 (chosen for demonstration purposes) - -sleep_millisecs - how many milliseconds ksmd should sleep before next scan - e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" - Default: 20 (chosen for demonstration purposes) - -merge_across_nodes - specifies if pages from different numa nodes can be merged. - When set to 0, ksm merges only pages which physically - reside in the memory area of same NUMA node. That brings - lower latency to access of shared pages. Systems with more - nodes, at significant NUMA distances, are likely to benefit - from the lower latency of setting 0. Smaller systems, which - need to minimize memory usage, are likely to benefit from - the greater sharing of setting 1 (default). You may wish to - compare how your system performs under each setting, before - deciding on which to use. merge_across_nodes setting can be - changed only when there are no ksm shared pages in system: - set run 2 to unmerge pages first, then to 1 after changing - merge_across_nodes, to remerge according to the new setting. - Default: 1 (merging across nodes as in earlier releases) - -run - set 0 to stop ksmd from running but keep merged pages, - set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", - set 2 to stop ksmd and unmerge all pages currently merged, - but leave mergeable areas registered for next run - Default: 0 (must be changed to 1 to activate KSM, - except if CONFIG_SYSFS is disabled) - -use_zero_pages - specifies whether empty pages (i.e. allocated pages - that only contain zeroes) should be treated specially. - When set to 1, empty pages are merged with the kernel - zero page(s) instead of with each other as it would - happen normally. This can improve the performance on - architectures with coloured zero pages, depending on - the workload. Care should be taken when enabling this - setting, as it can potentially degrade the performance - of KSM for some workloads, for example if the checksums - of pages candidate for merging match the checksum of - an empty page. This setting can be changed at any time, - it is only effective for pages merged after the change. - Default: 0 (normal KSM behaviour as in earlier releases) - -max_page_sharing - Maximum sharing allowed for each KSM page. This - enforces a deduplication limit to avoid the virtual - memory rmap lists to grow too large. The minimum - value is 2 as a newly created KSM page will have at - least two sharers. The rmap walk has O(N) - complexity where N is the number of rmap_items - (i.e. virtual mappings) that are sharing the page, - which is in turn capped by max_page_sharing. So - this effectively spread the the linear O(N) - computational complexity from rmap walk context - over different KSM pages. The ksmd walk over the - stable_node "chains" is also O(N), but N is the - number of stable_node "dups", not the number of - rmap_items, so it has not a significant impact on - ksmd performance. In practice the best stable_node - "dup" candidate will be kept and found at the head - of the "dups" list. The higher this value the - faster KSM will merge the memory (because there - will be fewer stable_node dups queued into the - stable_node chain->hlist to check for pruning) and - the higher the deduplication factor will be, but - the slowest the worst case rmap walk could be for - any given KSM page. Slowing down the rmap_walk - means there will be higher latency for certain - virtual memory operations happening during - swapping, compaction, NUMA balancing and page - migration, in turn decreasing responsiveness for - the caller of those virtual memory operations. The - scheduler latency of other tasks not involved with - the VM operations doing the rmap walk is not - affected by this parameter as the rmap walks are - always schedule friendly themselves. - -stable_node_chains_prune_millisecs - How frequently to walk the whole - list of stable_node "dups" linked in the - stable_node "chains" in order to prune stale - stable_nodes. Smaller milllisecs values will free - up the KSM metadata with lower latency, but they - will make ksmd use more CPU during the scan. This - only applies to the stable_node chains so it's a - noop if not a single KSM page hit the - max_page_sharing yet (there would be no stable_node - chains in such case). - -The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: - -pages_shared - how many shared pages are being used -pages_sharing - how many more sites are sharing them i.e. how much saved -pages_unshared - how many pages unique but repeatedly checked for merging -pages_volatile - how many pages changing too fast to be placed in a tree -full_scans - how many times all mergeable areas have been scanned - -stable_node_chains - number of stable node chains allocated, this is - effectively the number of KSM pages that hit the - max_page_sharing limit -stable_node_dups - number of stable node dups queued into the - stable_node chains - -A high ratio of pages_sharing to pages_shared indicates good sharing, but -a high ratio of pages_unshared to pages_sharing indicates wasted effort. -pages_volatile embraces several different kinds of activity, but a high -proportion there would also indicate poor use of madvise MADV_MERGEABLE. - -The maximum possible page_sharing/page_shared ratio is limited by the -max_page_sharing tunable. To increase the ratio max_page_sharing must -be increased accordingly. - -The stable_node_dups/stable_node_chains ratio is also affected by the -max_page_sharing tunable, and an high ratio may indicate fragmentation -in the stable_node dups, which could be solved by introducing -fragmentation algorithms in ksmd which would refile rmap_items from -one stable_node dup to another stable_node dup, in order to freeup -stable_node "dups" with few rmap_items in them, but that may increase -the ksmd CPU usage and possibly slowdown the readonly computations on -the KSM pages of the applications. - -Izik Eidus, -Hugh Dickins, 17 Nov 2009 diff --git a/Documentation/vm/mmu_notifier.rst b/Documentation/vm/mmu_notifier.rst new file mode 100644 index 000000000000..47baa1cf28c5 --- /dev/null +++ b/Documentation/vm/mmu_notifier.rst @@ -0,0 +1,99 @@ +.. _mmu_notifier: + +When do you need to notify inside page table lock ? +=================================================== + +When clearing a pte/pmd we are given a choice to notify the event through +(notify version of \*_clear_flush call mmu_notifier_invalidate_range) under +the page table lock. But that notification is not necessary in all cases. + +For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use +thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a +process virtual address space). There is only 2 cases when you need to notify +those secondary TLB while holding page table lock when clearing a pte/pmd: + + A) page backing address is free before mmu_notifier_invalidate_range_end() + B) a page table entry is updated to point to a new page (COW, write fault + on zero page, __replace_page(), ...) + +Case A is obvious you do not want to take the risk for the device to write to +a page that might now be used by some completely different task. + +Case B is more subtle. For correctness it requires the following sequence to +happen: + + - take page table lock + - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify()) + - set page table entry to point to new page + +If clearing the page table entry is not followed by a notify before setting +the new pte/pmd value then you can break memory model like C11 or C++11 for +the device. + +Consider the following scenario (device use a feature similar to ATS/PASID): + +Two address addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE we assume +they are write protected for COW (other case of B apply too). + +:: + + [Time N] -------------------------------------------------------------------- + CPU-thread-0 {try to write to addrA} + CPU-thread-1 {try to write to addrB} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {read addrA and populate device TLB} + DEV-thread-2 {read addrB and populate device TLB} + [Time N+1] ------------------------------------------------------------------ + CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} + CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+2] ------------------------------------------------------------------ + CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} + CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+3] ------------------------------------------------------------------ + CPU-thread-0 {preempted} + CPU-thread-1 {preempted} + CPU-thread-2 {write to addrA which is a write to new page} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+3] ------------------------------------------------------------------ + CPU-thread-0 {preempted} + CPU-thread-1 {preempted} + CPU-thread-2 {} + CPU-thread-3 {write to addrB which is a write to new page} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+4] ------------------------------------------------------------------ + CPU-thread-0 {preempted} + CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+5] ------------------------------------------------------------------ + CPU-thread-0 {preempted} + CPU-thread-1 {} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {read addrA from old page} + DEV-thread-2 {read addrB from new page} + +So here because at time N+2 the clear page table entry was not pair with a +notification to invalidate the secondary TLB, the device see the new value for +addrB before seing the new value for addrA. This break total memory ordering +for the device. + +When changing a pte to write protect or to point to a new write protected page +with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range +call to mmu_notifier_invalidate_range_end() outside the page table lock. This +is true even if the thread doing the page table update is preempted right after +releasing page table lock but before call mmu_notifier_invalidate_range_end(). diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt deleted file mode 100644 index 23b462566bb7..000000000000 --- a/Documentation/vm/mmu_notifier.txt +++ /dev/null @@ -1,93 +0,0 @@ -When do you need to notify inside page table lock ? - -When clearing a pte/pmd we are given a choice to notify the event through -(notify version of *_clear_flush call mmu_notifier_invalidate_range) under -the page table lock. But that notification is not necessary in all cases. - -For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use -thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a -process virtual address space). There is only 2 cases when you need to notify -those secondary TLB while holding page table lock when clearing a pte/pmd: - - A) page backing address is free before mmu_notifier_invalidate_range_end() - B) a page table entry is updated to point to a new page (COW, write fault - on zero page, __replace_page(), ...) - -Case A is obvious you do not want to take the risk for the device to write to -a page that might now be used by some completely different task. - -Case B is more subtle. For correctness it requires the following sequence to -happen: - - take page table lock - - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify()) - - set page table entry to point to new page - -If clearing the page table entry is not followed by a notify before setting -the new pte/pmd value then you can break memory model like C11 or C++11 for -the device. - -Consider the following scenario (device use a feature similar to ATS/PASID): - -Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume -they are write protected for COW (other case of B apply too). - -[Time N] -------------------------------------------------------------------- -CPU-thread-0 {try to write to addrA} -CPU-thread-1 {try to write to addrB} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {read addrA and populate device TLB} -DEV-thread-2 {read addrB and populate device TLB} -[Time N+1] ------------------------------------------------------------------ -CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} -CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+2] ------------------------------------------------------------------ -CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} -CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+3] ------------------------------------------------------------------ -CPU-thread-0 {preempted} -CPU-thread-1 {preempted} -CPU-thread-2 {write to addrA which is a write to new page} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+3] ------------------------------------------------------------------ -CPU-thread-0 {preempted} -CPU-thread-1 {preempted} -CPU-thread-2 {} -CPU-thread-3 {write to addrB which is a write to new page} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+4] ------------------------------------------------------------------ -CPU-thread-0 {preempted} -CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+5] ------------------------------------------------------------------ -CPU-thread-0 {preempted} -CPU-thread-1 {} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {read addrA from old page} -DEV-thread-2 {read addrB from new page} - -So here because at time N+2 the clear page table entry was not pair with a -notification to invalidate the secondary TLB, the device see the new value for -addrB before seing the new value for addrA. This break total memory ordering -for the device. - -When changing a pte to write protect or to point to a new write protected page -with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range -call to mmu_notifier_invalidate_range_end() outside the page table lock. This -is true even if the thread doing the page table update is preempted right after -releasing page table lock but before call mmu_notifier_invalidate_range_end(). diff --git a/Documentation/vm/numa b/Documentation/vm/numa.rst index a31b85b9bb88..185d8a568168 100644 --- a/Documentation/vm/numa +++ b/Documentation/vm/numa.rst @@ -1,6 +1,10 @@ +.. _numa: + Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com> +============= What is NUMA? +============= This question can be answered from a couple of perspectives: the hardware view and the Linux software view. @@ -106,7 +110,7 @@ to improve NUMA locality using various CPU affinity command line interfaces, such as taskset(1) and numactl(1), and program interfaces such as sched_setaffinity(2). Further, one can modify the kernel's default local allocation behavior using Linux NUMA memory policy. -[see Documentation/vm/numa_memory_policy.txt.] +[see Documentation/admin-guide/mm/numa_memory_policy.rst.] System administrators can restrict the CPUs and nodes' memories that a non- privileged user can specify in the scheduling or NUMA commands and functions diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt deleted file mode 100644 index 622b927816e7..000000000000 --- a/Documentation/vm/numa_memory_policy.txt +++ /dev/null @@ -1,452 +0,0 @@ - -What is Linux Memory Policy? - -In the Linux kernel, "memory policy" determines from which node the kernel will -allocate memory in a NUMA system or in an emulated NUMA system. Linux has -supported platforms with Non-Uniform Memory Access architectures since 2.4.?. -The current memory policy support was added to Linux 2.6 around May 2004. This -document attempts to describe the concepts and APIs of the 2.6 memory policy -support. - -Memory policies should not be confused with cpusets -(Documentation/cgroup-v1/cpusets.txt) -which is an administrative mechanism for restricting the nodes from which -memory may be allocated by a set of processes. Memory policies are a -programming interface that a NUMA-aware application can take advantage of. When -both cpusets and policies are applied to a task, the restrictions of the cpuset -takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. - -MEMORY POLICY CONCEPTS - -Scope of Memory Policies - -The Linux kernel supports _scopes_ of memory policy, described here from -most general to most specific: - - System Default Policy: this policy is "hard coded" into the kernel. It - is the policy that governs all page allocations that aren't controlled - by one of the more specific policy scopes discussed below. When the - system is "up and running", the system default policy will use "local - allocation" described below. However, during boot up, the system - default policy will be set to interleave allocations across all nodes - with "sufficient" memory, so as not to overload the initial boot node - with boot-time allocations. - - Task/Process Policy: this is an optional, per-task policy. When defined - for a specific task, this policy controls all page allocations made by or - on behalf of the task that aren't controlled by a more specific scope. - If a task does not define a task policy, then all page allocations that - would have been controlled by the task policy "fall back" to the System - Default Policy. - - The task policy applies to the entire address space of a task. Thus, - it is inheritable, and indeed is inherited, across both fork() - [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task - to establish the task policy for a child task exec()'d from an - executable image that has no awareness of memory policy. See the - MEMORY POLICY APIS section, below, for an overview of the system call - that a task may use to set/change its task/process policy. - - In a multi-threaded task, task policies apply only to the thread - [Linux kernel task] that installs the policy and any threads - subsequently created by that thread. Any sibling threads existing - at the time a new task policy is installed retain their current - policy. - - A task policy applies only to pages allocated after the policy is - installed. Any pages already faulted in by the task when the task - changes its task policy remain where they were allocated based on - the policy at the time they were allocated. - - VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's - virtual address space. A task may define a specific policy for a range - of its virtual address space. See the MEMORY POLICIES APIS section, - below, for an overview of the mbind() system call used to set a VMA - policy. - - A VMA policy will govern the allocation of pages that back this region of - the address space. Any regions of the task's address space that don't - have an explicit VMA policy will fall back to the task policy, which may - itself fall back to the System Default Policy. - - VMA policies have a few complicating details: - - VMA policy applies ONLY to anonymous pages. These include pages - allocated for anonymous segments, such as the task stack and heap, and - any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. - If a VMA policy is applied to a file mapping, it will be ignored if - the mapping used the MAP_SHARED flag. If the file mapping used the - MAP_PRIVATE flag, the VMA policy will only be applied when an - anonymous page is allocated on an attempt to write to the mapping-- - i.e., at Copy-On-Write. - - VMA policies are shared between all tasks that share a virtual address - space--a.k.a. threads--independent of when the policy is installed; and - they are inherited across fork(). However, because VMA policies refer - to a specific region of a task's address space, and because the address - space is discarded and recreated on exec*(), VMA policies are NOT - inheritable across exec(). Thus, only NUMA-aware applications may - use VMA policies. - - A task may install a new VMA policy on a sub-range of a previously - mmap()ed region. When this happens, Linux splits the existing virtual - memory area into 2 or 3 VMAs, each with it's own policy. - - By default, VMA policy applies only to pages allocated after the policy - is installed. Any pages already faulted into the VMA range remain - where they were allocated based on the policy at the time they were - allocated. However, since 2.6.16, Linux supports page migration via - the mbind() system call, so that page contents can be moved to match - a newly installed policy. - - Shared Policy: Conceptually, shared policies apply to "memory objects" - mapped shared into one or more tasks' distinct address spaces. An - application installs a shared policies the same way as VMA policies--using - the mbind() system call specifying a range of virtual addresses that map - the shared object. However, unlike VMA policies, which can be considered - to be an attribute of a range of a task's address space, shared policies - apply directly to the shared object. Thus, all tasks that attach to the - object share the policy, and all pages allocated for the shared object, - by any task, will obey the shared policy. - - As of 2.6.22, only shared memory segments, created by shmget() or - mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared - policy support was added to Linux, the associated data structures were - added to hugetlbfs shmem segments. At the time, hugetlbfs did not - support allocation at fault time--a.k.a lazy allocation--so hugetlbfs - shmem segments were never "hooked up" to the shared policy support. - Although hugetlbfs segments now support lazy allocation, their support - for shared policy has not been completed. - - As mentioned above [re: VMA policies], allocations of page cache - pages for regular files mmap()ed with MAP_SHARED ignore any VMA - policy installed on the virtual address range backed by the shared - file mapping. Rather, shared page cache pages, including pages backing - private mappings that have not yet been written by the task, follow - task policy, if any, else System Default Policy. - - The shared policy infrastructure supports different policies on subset - ranges of the shared object. However, Linux still splits the VMA of - the task that installs the policy for each range of distinct policy. - Thus, different tasks that attach to a shared memory segment can have - different VMA configurations mapping that one shared object. This - can be seen by examining the /proc/<pid>/numa_maps of tasks sharing - a shared memory region, when one task has installed shared policy on - one or more ranges of the region. - -Components of Memory Policies - - A Linux memory policy consists of a "mode", optional mode flags, and an - optional set of nodes. The mode determines the behavior of the policy, - the optional mode flags determine the behavior of the mode, and the - optional set of nodes can be viewed as the arguments to the policy - behavior. - - Internally, memory policies are implemented by a reference counted - structure, struct mempolicy. Details of this structure will be discussed - in context, below, as required to explain the behavior. - - Linux memory policy supports the following 4 behavioral modes: - - Default Mode--MPOL_DEFAULT: This mode is only used in the memory - policy APIs. Internally, MPOL_DEFAULT is converted to the NULL - memory policy in all policy scopes. Any existing non-default policy - will simply be removed when MPOL_DEFAULT is specified. As a result, - MPOL_DEFAULT means "fall back to the next most specific policy scope." - - For example, a NULL or default task policy will fall back to the - system default policy. A NULL or default vma policy will fall - back to the task policy. - - When specified in one of the memory policy APIs, the Default mode - does not use the optional set of nodes. - - It is an error for the set of nodes specified for this policy to - be non-empty. - - MPOL_BIND: This mode specifies that memory must come from the - set of nodes specified by the policy. Memory will be allocated from - the node in the set with sufficient free memory that is closest to - the node where the allocation takes place. - - MPOL_PREFERRED: This mode specifies that the allocation should be - attempted from the single node specified in the policy. If that - allocation fails, the kernel will search other nodes, in order of - increasing distance from the preferred node based on information - provided by the platform firmware. - - Internally, the Preferred policy uses a single node--the - preferred_node member of struct mempolicy. When the internal - mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and - the policy is interpreted as local allocation. "Local" allocation - policy can be viewed as a Preferred policy that starts at the node - containing the cpu where the allocation takes place. - - It is possible for the user to specify that local allocation is - always preferred by passing an empty nodemask with this mode. - If an empty nodemask is passed, the policy cannot use the - MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described - below. - - MPOL_INTERLEAVED: This mode specifies that page allocations be - interleaved, on a page granularity, across the nodes specified in - the policy. This mode also behaves slightly differently, based on - the context where it is used: - - For allocation of anonymous pages and shared memory pages, - Interleave mode indexes the set of nodes specified by the policy - using the page offset of the faulting address into the segment - [VMA] containing the address modulo the number of nodes specified - by the policy. It then attempts to allocate a page, starting at - the selected node, as if the node had been specified by a Preferred - policy or had been selected by a local allocation. That is, - allocation will follow the per node zonelist. - - For allocation of page cache pages, Interleave mode indexes the set - of nodes specified by the policy using a node counter maintained - per task. This counter wraps around to the lowest specified node - after it reaches the highest specified node. This will tend to - spread the pages out over the nodes specified by the policy based - on the order in which they are allocated, rather than based on any - page offset into an address range or file. During system boot up, - the temporary interleaved system default policy works in this - mode. - - Linux memory policy supports the following optional mode flags: - - MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by - the user should not be remapped if the task or VMA's set of allowed - nodes changes after the memory policy has been defined. - - Without this flag, anytime a mempolicy is rebound because of a - change in the set of allowed nodes, the node (Preferred) or - nodemask (Bind, Interleave) is remapped to the new set of - allowed nodes. This may result in nodes being used that were - previously undesired. - - With this flag, if the user-specified nodes overlap with the - nodes allowed by the task's cpuset, then the memory policy is - applied to their intersection. If the two sets of nodes do not - overlap, the Default policy is used. - - For example, consider a task that is attached to a cpuset with - mems 1-3 that sets an Interleave policy over the same set. If - the cpuset's mems change to 3-5, the Interleave will now occur - over nodes 3, 4, and 5. With this flag, however, since only node - 3 is allowed from the user's nodemask, the "interleave" only - occurs over that node. If no nodes from the user's nodemask are - now allowed, the Default behavior is used. - - MPOL_F_STATIC_NODES cannot be combined with the - MPOL_F_RELATIVE_NODES flag. It also cannot be used for - MPOL_PREFERRED policies that were created with an empty nodemask - (local allocation). - - MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed - by the user will be mapped relative to the set of the task or VMA's - set of allowed nodes. The kernel stores the user-passed nodemask, - and if the allowed nodes changes, then that original nodemask will - be remapped relative to the new set of allowed nodes. - - Without this flag (and without MPOL_F_STATIC_NODES), anytime a - mempolicy is rebound because of a change in the set of allowed - nodes, the node (Preferred) or nodemask (Bind, Interleave) is - remapped to the new set of allowed nodes. That remap may not - preserve the relative nature of the user's passed nodemask to its - set of allowed nodes upon successive rebinds: a nodemask of - 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of - allowed nodes is restored to its original state. - - With this flag, the remap is done so that the node numbers from - the user's passed nodemask are relative to the set of allowed - nodes. In other words, if nodes 0, 2, and 4 are set in the user's - nodemask, the policy will be effected over the first (and in the - Bind or Interleave case, the third and fifth) nodes in the set of - allowed nodes. The nodemask passed by the user represents nodes - relative to task or VMA's set of allowed nodes. - - If the user's nodemask includes nodes that are outside the range - of the new set of allowed nodes (for example, node 5 is set in - the user's nodemask when the set of allowed nodes is only 0-3), - then the remap wraps around to the beginning of the nodemask and, - if not already set, sets the node in the mempolicy nodemask. - - For example, consider a task that is attached to a cpuset with - mems 2-5 that sets an Interleave policy over the same set with - MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the - interleave now occurs over nodes 3,5-7. If the cpuset's mems - then change to 0,2-3,5, then the interleave occurs over nodes - 0,2-3,5. - - Thanks to the consistent remapping, applications preparing - nodemasks to specify memory policies using this flag should - disregard their current, actual cpuset imposed memory placement - and prepare the nodemask as if they were always located on - memory nodes 0 to N-1, where N is the number of memory nodes the - policy is intended to manage. Let the kernel then remap to the - set of memory nodes allowed by the task's cpuset, as that may - change over time. - - MPOL_F_RELATIVE_NODES cannot be combined with the - MPOL_F_STATIC_NODES flag. It also cannot be used for - MPOL_PREFERRED policies that were created with an empty nodemask - (local allocation). - -MEMORY POLICY REFERENCE COUNTING - -To resolve use/free races, struct mempolicy contains an atomic reference -count field. Internal interfaces, mpol_get()/mpol_put() increment and -decrement this reference count, respectively. mpol_put() will only free -the structure back to the mempolicy kmem cache when the reference count -goes to zero. - -When a new memory policy is allocated, its reference count is initialized -to '1', representing the reference held by the task that is installing the -new policy. When a pointer to a memory policy structure is stored in another -structure, another reference is added, as the task's reference will be dropped -on completion of the policy installation. - -During run-time "usage" of the policy, we attempt to minimize atomic operations -on the reference count, as this can lead to cache lines bouncing between cpus -and NUMA nodes. "Usage" here means one of the following: - -1) querying of the policy, either by the task itself [using the get_mempolicy() - API discussed below] or by another task using the /proc/<pid>/numa_maps - interface. - -2) examination of the policy to determine the policy mode and associated node - or node lists, if any, for page allocation. This is considered a "hot - path". Note that for MPOL_BIND, the "usage" extends across the entire - allocation process, which may sleep during page reclaimation, because the - BIND policy nodemask is used, by reference, to filter ineligible nodes. - -We can avoid taking an extra reference during the usages listed above as -follows: - -1) we never need to get/free the system default policy as this is never - changed nor freed, once the system is up and running. - -2) for querying the policy, we do not need to take an extra reference on the - target task's task policy nor vma policies because we always acquire the - task's mm's mmap_sem for read during the query. The set_mempolicy() and - mbind() APIs [see below] always acquire the mmap_sem for write when - installing or replacing task or vma policies. Thus, there is no possibility - of a task or thread freeing a policy while another task or thread is - querying it. - -3) Page allocation usage of task or vma policy occurs in the fault path where - we hold them mmap_sem for read. Again, because replacing the task or vma - policy requires that the mmap_sem be held for write, the policy can't be - freed out from under us while we're using it for page allocation. - -4) Shared policies require special consideration. One task can replace a - shared memory policy while another task, with a distinct mmap_sem, is - querying or allocating a page based on the policy. To resolve this - potential race, the shared policy infrastructure adds an extra reference - to the shared policy during lookup while holding a spin lock on the shared - policy management structure. This requires that we drop this extra - reference when we're finished "using" the policy. We must drop the - extra reference on shared policies in the same query/allocation paths - used for non-shared policies. For this reason, shared policies are marked - as such, and the extra reference is dropped "conditionally"--i.e., only - for shared policies. - - Because of this extra reference counting, and because we must lookup - shared policies in a tree structure under spinlock, shared policies are - more expensive to use in the page allocation path. This is especially - true for shared policies on shared memory regions shared by tasks running - on different NUMA nodes. This extra overhead can be avoided by always - falling back to task or system default policy for shared memory regions, - or by prefaulting the entire shared memory region into memory and locking - it down. However, this might not be appropriate for all applications. - -MEMORY POLICY APIs - -Linux supports 3 system calls for controlling memory policy. These APIS -always affect only the calling task, the calling task's address space, or -some shared object mapped into the calling task's address space. - - Note: the headers that define these APIs and the parameter data types - for user space applications reside in a package that is not part of - the Linux kernel. The kernel system call interfaces, with the 'sys_' - prefix, are defined in <linux/syscalls.h>; the mode and flag - definitions are defined in <linux/mempolicy.h>. - -Set [Task] Memory Policy: - - long set_mempolicy(int mode, const unsigned long *nmask, - unsigned long maxnode); - - Set's the calling task's "task/process memory policy" to mode - specified by the 'mode' argument and the set of nodes defined - by 'nmask'. 'nmask' points to a bit mask of node ids containing - at least 'maxnode' ids. Optional mode flags may be passed by - combining the 'mode' argument with the flag (for example: - MPOL_INTERLEAVE | MPOL_F_STATIC_NODES). - - See the set_mempolicy(2) man page for more details - - -Get [Task] Memory Policy or Related Information - - long get_mempolicy(int *mode, - const unsigned long *nmask, unsigned long maxnode, - void *addr, int flags); - - Queries the "task/process memory policy" of the calling task, or - the policy or location of a specified virtual address, depending - on the 'flags' argument. - - See the get_mempolicy(2) man page for more details - - -Install VMA/Shared Policy for a Range of Task's Address Space - - long mbind(void *start, unsigned long len, int mode, - const unsigned long *nmask, unsigned long maxnode, - unsigned flags); - - mbind() installs the policy specified by (mode, nmask, maxnodes) as - a VMA policy for the range of the calling task's address space - specified by the 'start' and 'len' arguments. Additional actions - may be requested via the 'flags' argument. - - See the mbind(2) man page for more details. - -MEMORY POLICY COMMAND LINE INTERFACE - -Although not strictly part of the Linux implementation of memory policy, -a command line tool, numactl(8), exists that allows one to: - -+ set the task policy for a specified program via set_mempolicy(2), fork(2) and - exec(2) - -+ set the shared policy for a shared memory segment via mbind(2) - -The numactl(8) tool is packaged with the run-time version of the library -containing the memory policy system call wrappers. Some distributions -package the headers and compile-time libraries in a separate development -package. - - -MEMORY POLICIES AND CPUSETS - -Memory policies work within cpusets as described above. For memory policies -that require a node or set of nodes, the nodes are restricted to the set of -nodes whose memories are allowed by the cpuset constraints. If the nodemask -specified for the policy contains nodes that are not allowed by the cpuset and -MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes -specified for the policy and the set of nodes with memory is used. If the -result is the empty set, the policy is considered invalid and cannot be -installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped -onto and folded into the task's set of allowed nodes as previously described. - -The interaction of memory policies and cpusets can be problematic when tasks -in two cpusets share access to a memory region, such as shared memory segments -created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and -any of the tasks install shared policy on the region, only nodes whose -memories are allowed in both cpusets may be used in the policies. Obtaining -this information requires "stepping outside" the memory policy APIs to use the -cpuset information and requires that one know in what cpusets other task might -be attaching to the shared region. Furthermore, if the cpusets' allowed -memory sets are disjoint, "local" allocation is the only valid policy. diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting deleted file mode 100644 index cbfaaa674118..000000000000 --- a/Documentation/vm/overcommit-accounting +++ /dev/null @@ -1,80 +0,0 @@ -The Linux kernel supports the following overcommit handling modes - -0 - Heuristic overcommit handling. Obvious overcommits of - address space are refused. Used for a typical system. It - ensures a seriously wild allocation fails while allowing - overcommit to reduce swap usage. root is allowed to - allocate slightly more memory in this mode. This is the - default. - -1 - Always overcommit. Appropriate for some scientific - applications. Classic example is code using sparse arrays - and just relying on the virtual memory consisting almost - entirely of zero pages. - -2 - Don't overcommit. The total address space commit - for the system is not permitted to exceed swap + a - configurable amount (default is 50%) of physical RAM. - Depending on the amount you use, in most situations - this means a process will not be killed while accessing - pages but will receive errors on memory allocation as - appropriate. - - Useful for applications that want to guarantee their - memory allocations will be available in the future - without having to initialize every page. - -The overcommit policy is set via the sysctl `vm.overcommit_memory'. - -The overcommit amount can be set via `vm.overcommit_ratio' (percentage) -or `vm.overcommit_kbytes' (absolute value). - -The current overcommit limit and amount committed are viewable in -/proc/meminfo as CommitLimit and Committed_AS respectively. - -Gotchas -------- - -The C language stack growth does an implicit mremap. If you want absolute -guarantees and run close to the edge you MUST mmap your stack for the -largest size you think you will need. For typical stack usage this does -not matter much but it's a corner case if you really really care - -In mode 2 the MAP_NORESERVE flag is ignored. - - -How It Works ------------- - -The overcommit is based on the following rules - -For a file backed map - SHARED or READ-only - 0 cost (the file is the map not swap) - PRIVATE WRITABLE - size of mapping per instance - -For an anonymous or /dev/zero map - SHARED - size of mapping - PRIVATE READ-only - 0 cost (but of little use) - PRIVATE WRITABLE - size of mapping per instance - -Additional accounting - Pages made writable copies by mmap - shmfs memory drawn from the same pool - -Status ------- - -o We account mmap memory mappings -o We account mprotect changes in commit -o We account mremap changes in size -o We account brk -o We account munmap -o We report the commit status in /proc -o Account and check on fork -o Review stack handling/building on exec -o SHMfs accounting -o Implement actual limit enforcement - -To Do ------ -o Account ptrace pages (this is hard) diff --git a/Documentation/vm/overcommit-accounting.rst b/Documentation/vm/overcommit-accounting.rst new file mode 100644 index 000000000000..0dd54bbe4afa --- /dev/null +++ b/Documentation/vm/overcommit-accounting.rst @@ -0,0 +1,87 @@ +.. _overcommit_accounting: + +===================== +Overcommit Accounting +===================== + +The Linux kernel supports the following overcommit handling modes + +0 + Heuristic overcommit handling. Obvious overcommits of address + space are refused. Used for a typical system. It ensures a + seriously wild allocation fails while allowing overcommit to + reduce swap usage. root is allowed to allocate slightly more + memory in this mode. This is the default. + +1 + Always overcommit. Appropriate for some scientific + applications. Classic example is code using sparse arrays and + just relying on the virtual memory consisting almost entirely + of zero pages. + +2 + Don't overcommit. The total address space commit for the + system is not permitted to exceed swap + a configurable amount + (default is 50%) of physical RAM. Depending on the amount you + use, in most situations this means a process will not be + killed while accessing pages but will receive errors on memory + allocation as appropriate. + + Useful for applications that want to guarantee their memory + allocations will be available in the future without having to + initialize every page. + +The overcommit policy is set via the sysctl ``vm.overcommit_memory``. + +The overcommit amount can be set via ``vm.overcommit_ratio`` (percentage) +or ``vm.overcommit_kbytes`` (absolute value). + +The current overcommit limit and amount committed are viewable in +``/proc/meminfo`` as CommitLimit and Committed_AS respectively. + +Gotchas +======= + +The C language stack growth does an implicit mremap. If you want absolute +guarantees and run close to the edge you MUST mmap your stack for the +largest size you think you will need. For typical stack usage this does +not matter much but it's a corner case if you really really care + +In mode 2 the MAP_NORESERVE flag is ignored. + + +How It Works +============ + +The overcommit is based on the following rules + +For a file backed map + | SHARED or READ-only - 0 cost (the file is the map not swap) + | PRIVATE WRITABLE - size of mapping per instance + +For an anonymous or ``/dev/zero`` map + | SHARED - size of mapping + | PRIVATE READ-only - 0 cost (but of little use) + | PRIVATE WRITABLE - size of mapping per instance + +Additional accounting + | Pages made writable copies by mmap + | shmfs memory drawn from the same pool + +Status +====== + +* We account mmap memory mappings +* We account mprotect changes in commit +* We account mremap changes in size +* We account brk +* We account munmap +* We report the commit status in /proc +* Account and check on fork +* Review stack handling/building on exec +* SHMfs accounting +* Implement actual limit enforcement + +To Do +===== +* Account ptrace pages (this is hard) diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags.rst index a6714565dbf9..637cc49d1b2f 100644 --- a/Documentation/vm/page_frags +++ b/Documentation/vm/page_frags.rst @@ -1,5 +1,8 @@ +.. _page_frags: + +============== Page fragments --------------- +============== A page fragment is an arbitrary-length arbitrary-offset area of memory which resides within a 0 or higher order compound page. Multiple diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration.rst index 496868072e24..f68d61335abb 100644 --- a/Documentation/vm/page_migration +++ b/Documentation/vm/page_migration.rst @@ -1,5 +1,8 @@ +.. _page_migration: + +============== Page migration --------------- +============== Page migration allows the moving of the physical location of pages between nodes in a numa system while the process is running. This means that the @@ -20,7 +23,7 @@ Page migration functions are provided by the numactl package by Andi Kleen (a version later than 0.9.3 is required. Get it from ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma which provides an interface similar to other numa functionality for page -migration. cat /proc/<pid>/numa_maps allows an easy review of where the +migration. cat ``/proc/<pid>/numa_maps`` allows an easy review of where the pages of a process are located. See also the numa_maps documentation in the proc(5) man page. @@ -56,8 +59,8 @@ description for those trying to use migrate_pages() from the kernel (for userspace usage see the Andi Kleen's numactl package mentioned above) and then a low level description of how the low level details work. -A. In kernel use of migrate_pages() ------------------------------------ +In kernel use of migrate_pages() +================================ 1. Remove pages from the LRU. @@ -78,8 +81,8 @@ A. In kernel use of migrate_pages() the new page for each page that is considered for moving. -B. How migrate_pages() works ----------------------------- +How migrate_pages() works +========================= migrate_pages() does several passes over its list of pages. A page is moved if all references to a page are removable at the time. The page has @@ -142,8 +145,8 @@ Steps: 20. The new page is moved to the LRU and can be scanned by the swapper etc again. -C. Non-LRU page migration -------------------------- +Non-LRU page migration +====================== Although original migration aimed for reducing the latency of memory access for NUMA, compaction who want to create high-order page is also main customer. @@ -164,89 +167,91 @@ migration path. If a driver want to make own pages movable, it should define three functions which are function pointers of struct address_space_operations. -1. bool (*isolate_page) (struct page *page, isolate_mode_t mode); +1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);`` -What VM expects on isolate_page function of driver is to return *true* -if driver isolates page successfully. On returing true, VM marks the page -as PG_isolated so concurrent isolation in several CPUs skip the page -for isolation. If a driver cannot isolate the page, it should return *false*. + What VM expects on isolate_page function of driver is to return *true* + if driver isolates page successfully. On returing true, VM marks the page + as PG_isolated so concurrent isolation in several CPUs skip the page + for isolation. If a driver cannot isolate the page, it should return *false*. -Once page is successfully isolated, VM uses page.lru fields so driver -shouldn't expect to preserve values in that fields. + Once page is successfully isolated, VM uses page.lru fields so driver + shouldn't expect to preserve values in that fields. -2. int (*migratepage) (struct address_space *mapping, - struct page *newpage, struct page *oldpage, enum migrate_mode); +2. ``int (*migratepage) (struct address_space *mapping,`` +| ``struct page *newpage, struct page *oldpage, enum migrate_mode);`` -After isolation, VM calls migratepage of driver with isolated page. -The function of migratepage is to move content of the old page to new page -and set up fields of struct page newpage. Keep in mind that you should -indicate to the VM the oldpage is no longer movable via __ClearPageMovable() -under page_lock if you migrated the oldpage successfully and returns -MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver -can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time -because VM interprets -EAGAIN as "temporal migration failure". On returning -any error except -EAGAIN, VM will give up the page migration without retrying -in this time. + After isolation, VM calls migratepage of driver with isolated page. + The function of migratepage is to move content of the old page to new page + and set up fields of struct page newpage. Keep in mind that you should + indicate to the VM the oldpage is no longer movable via __ClearPageMovable() + under page_lock if you migrated the oldpage successfully and returns + MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver + can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time + because VM interprets -EAGAIN as "temporal migration failure". On returning + any error except -EAGAIN, VM will give up the page migration without retrying + in this time. -Driver shouldn't touch page.lru field VM using in the functions. + Driver shouldn't touch page.lru field VM using in the functions. -3. void (*putback_page)(struct page *); +3. ``void (*putback_page)(struct page *);`` -If migration fails on isolated page, VM should return the isolated page -to the driver so VM calls driver's putback_page with migration failed page. -In this function, driver should put the isolated page back to the own data -structure. + If migration fails on isolated page, VM should return the isolated page + to the driver so VM calls driver's putback_page with migration failed page. + In this function, driver should put the isolated page back to the own data + structure. 4. non-lru movable page flags -There are two page flags for supporting non-lru movable page. + There are two page flags for supporting non-lru movable page. -* PG_movable + * PG_movable -Driver should use the below function to make page movable under page_lock. + Driver should use the below function to make page movable under page_lock:: void __SetPageMovable(struct page *page, struct address_space *mapping) -It needs argument of address_space for registering migration family functions -which will be called by VM. Exactly speaking, PG_movable is not a real flag of -struct page. Rather than, VM reuses page->mapping's lower bits to represent it. + It needs argument of address_space for registering migration + family functions which will be called by VM. Exactly speaking, + PG_movable is not a real flag of struct page. Rather than, VM + reuses page->mapping's lower bits to represent it. +:: #define PAGE_MAPPING_MOVABLE 0x2 page->mapping = page->mapping | PAGE_MAPPING_MOVABLE; -so driver shouldn't access page->mapping directly. Instead, driver should -use page_mapping which mask off the low two bits of page->mapping under -page lock so it can get right struct address_space. - -For testing of non-lru movable page, VM supports __PageMovable function. -However, it doesn't guarantee to identify non-lru movable page because -page->mapping field is unified with other variables in struct page. -As well, if driver releases the page after isolation by VM, page->mapping -doesn't have stable value although it has PAGE_MAPPING_MOVABLE -(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether -page is LRU or non-lru movable once the page has been isolated. Because -LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also -good for just peeking to test non-lru movable pages before more expensive -checking with lock_page in pfn scanning to select victim. - -For guaranteeing non-lru movable page, VM provides PageMovable function. -Unlike __PageMovable, PageMovable functions validates page->mapping and -mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden -destroying of page->mapping. - -Driver using __SetPageMovable should clear the flag via __ClearMovablePage -under page_lock before the releasing the page. - -* PG_isolated - -To prevent concurrent isolation among several CPUs, VM marks isolated page -as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru -movable page, it can skip it. Driver doesn't need to manipulate the flag -because VM will set/clear it automatically. Keep in mind that if driver -sees PG_isolated page, it means the page have been isolated by VM so it -shouldn't touch page.lru field. -PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag -for own purpose. + so driver shouldn't access page->mapping directly. Instead, driver should + use page_mapping which mask off the low two bits of page->mapping under + page lock so it can get right struct address_space. + + For testing of non-lru movable page, VM supports __PageMovable function. + However, it doesn't guarantee to identify non-lru movable page because + page->mapping field is unified with other variables in struct page. + As well, if driver releases the page after isolation by VM, page->mapping + doesn't have stable value although it has PAGE_MAPPING_MOVABLE + (Look at __ClearPageMovable). But __PageMovable is cheap to catch whether + page is LRU or non-lru movable once the page has been isolated. Because + LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also + good for just peeking to test non-lru movable pages before more expensive + checking with lock_page in pfn scanning to select victim. + + For guaranteeing non-lru movable page, VM provides PageMovable function. + Unlike __PageMovable, PageMovable functions validates page->mapping and + mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden + destroying of page->mapping. + + Driver using __SetPageMovable should clear the flag via __ClearMovablePage + under page_lock before the releasing the page. + + * PG_isolated + + To prevent concurrent isolation among several CPUs, VM marks isolated page + as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru + movable page, it can skip it. Driver doesn't need to manipulate the flag + because VM will set/clear it automatically. Keep in mind that if driver + sees PG_isolated page, it means the page have been isolated by VM so it + shouldn't touch page.lru field. + PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag + for own purpose. Christoph Lameter, May 8, 2006. Minchan Kim, Mar 28, 2016. diff --git a/Documentation/vm/page_owner.txt b/Documentation/vm/page_owner.rst index ffff1439076a..0ed5ab8c7ab4 100644 --- a/Documentation/vm/page_owner.txt +++ b/Documentation/vm/page_owner.rst @@ -1,7 +1,11 @@ +.. _page_owner: + +================================================== page owner: Tracking about who allocated each page ------------------------------------------------------------ +================================================== -* Introduction +Introduction +============ page owner is for the tracking about who allocated each page. It can be used to debug memory leak or to find a memory hogger. @@ -34,13 +38,15 @@ not affect to allocation performance, especially if the static keys jump label patching functionality is available. Following is the kernel's code size change due to this facility. -- Without page owner +- Without page owner:: + text data bss dec hex filename - 40662 1493 644 42799 a72f mm/page_alloc.o + 40662 1493 644 42799 a72f mm/page_alloc.o + +- With page owner:: -- With page owner text data bss dec hex filename - 40892 1493 644 43029 a815 mm/page_alloc.o + 40892 1493 644 43029 a815 mm/page_alloc.o 1427 24 8 1459 5b3 mm/page_ext.o 2722 50 0 2772 ad4 mm/page_owner.o @@ -62,21 +68,23 @@ are catched and marked, although they are mostly allocated from struct page extension feature. Anyway, after that, no page is left in un-tracking state. -* Usage +Usage +===== + +1) Build user-space helper:: -1) Build user-space helper cd tools/vm make page_owner_sort -2) Enable page owner - Add "page_owner=on" to boot cmdline. +2) Enable page owner: add "page_owner=on" to boot cmdline. 3) Do the job what you want to debug -4) Analyze information from page owner +4) Analyze information from page owner:: + cat /sys/kernel/debug/page_owner > page_owner_full.txt grep -v ^PFN page_owner_full.txt > page_owner.txt ./page_owner_sort page_owner.txt sorted_page_owner.txt - See the result about who allocated each page - in the sorted_page_owner.txt. + See the result about who allocated each page + in the ``sorted_page_owner.txt``. diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.rst index f609142f406a..7bef6718e3a9 100644 --- a/Documentation/vm/remap_file_pages.txt +++ b/Documentation/vm/remap_file_pages.rst @@ -1,3 +1,9 @@ +.. _remap_file_pages: + +============================== +remap_file_pages() system call +============================== + The remap_file_pages() system call is used to create a nonlinear mapping, that is, a mapping in which the pages of the file are mapped into a nonsequential order in memory. The advantage of using remap_file_pages() diff --git a/Documentation/vm/slub.rst b/Documentation/vm/slub.rst new file mode 100644 index 000000000000..3a775fd64e2d --- /dev/null +++ b/Documentation/vm/slub.rst @@ -0,0 +1,361 @@ +.. _slub: + +========================== +Short users guide for SLUB +========================== + +The basic philosophy of SLUB is very different from SLAB. SLAB +requires rebuilding the kernel to activate debug options for all +slab caches. SLUB always includes full debugging but it is off by default. +SLUB can enable debugging only for selected slabs in order to avoid +an impact on overall system performance which may make a bug more +difficult to find. + +In order to switch debugging on one can add an option ``slub_debug`` +to the kernel command line. That will enable full debugging for +all slabs. + +Typically one would then use the ``slabinfo`` command to get statistical +data and perform operation on the slabs. By default ``slabinfo`` only lists +slabs that have data in them. See "slabinfo -h" for more options when +running the command. ``slabinfo`` can be compiled with +:: + + gcc -o slabinfo tools/vm/slabinfo.c + +Some of the modes of operation of ``slabinfo`` require that slub debugging +be enabled on the command line. F.e. no tracking information will be +available without debugging on and validation can only partially +be performed if debugging was not switched on. + +Some more sophisticated uses of slub_debug: +------------------------------------------- + +Parameters may be given to ``slub_debug``. If none is specified then full +debugging is enabled. Format: + +slub_debug=<Debug-Options> + Enable options for all slabs +slub_debug=<Debug-Options>,<slab name> + Enable options only for select slabs + + +Possible debug options are:: + + F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS + Sorry SLAB legacy issues) + Z Red zoning + P Poisoning (object and padding) + U User tracking (free and alloc) + T Trace (please only use on single slabs) + A Toggle failslab filter mark for the cache + O Switch debugging off for caches that would have + caused higher minimum slab orders + - Switch all debugging off (useful if the kernel is + configured with CONFIG_SLUB_DEBUG_ON) + +F.e. in order to boot just with sanity checks and red zoning one would specify:: + + slub_debug=FZ + +Trying to find an issue in the dentry cache? Try:: + + slub_debug=,dentry + +to only enable debugging on the dentry cache. + +Red zoning and tracking may realign the slab. We can just apply sanity checks +to the dentry cache with:: + + slub_debug=F,dentry + +Debugging options may require the minimum possible slab order to increase as +a result of storing the metadata (for example, caches with PAGE_SIZE object +sizes). This has a higher liklihood of resulting in slab allocation errors +in low memory situations or if there's high fragmentation of memory. To +switch off debugging for such caches by default, use:: + + slub_debug=O + +In case you forgot to enable debugging on the kernel command line: It is +possible to enable debugging manually when the kernel is up. Look at the +contents of:: + + /sys/kernel/slab/<slab name>/ + +Look at the writable files. Writing 1 to them will enable the +corresponding debug option. All options can be set on a slab that does +not contain objects. If the slab already contains objects then sanity checks +and tracing may only be enabled. The other options may cause the realignment +of objects. + +Careful with tracing: It may spew out lots of information and never stop if +used on the wrong slab. + +Slab merging +============ + +If no debug options are specified then SLUB may merge similar slabs together +in order to reduce overhead and increase cache hotness of objects. +``slabinfo -a`` displays which slabs were merged together. + +Slab validation +=============== + +SLUB can validate all object if the kernel was booted with slub_debug. In +order to do so you must have the ``slabinfo`` tool. Then you can do +:: + + slabinfo -v + +which will test all objects. Output will be generated to the syslog. + +This also works in a more limited way if boot was without slab debug. +In that case ``slabinfo -v`` simply tests all reachable objects. Usually +these are in the cpu slabs and the partial slabs. Full slabs are not +tracked by SLUB in a non debug situation. + +Getting more performance +======================== + +To some degree SLUB's performance is limited by the need to take the +list_lock once in a while to deal with partial slabs. That overhead is +governed by the order of the allocation for each slab. The allocations +can be influenced by kernel parameters: + +.. slub_min_objects=x (default 4) +.. slub_min_order=x (default 0) +.. slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) + +``slub_min_objects`` + allows to specify how many objects must at least fit into one + slab in order for the allocation order to be acceptable. In + general slub will be able to perform this number of + allocations on a slab without consulting centralized resources + (list_lock) where contention may occur. + +``slub_min_order`` + specifies a minim order of slabs. A similar effect like + ``slub_min_objects``. + +``slub_max_order`` + specified the order at which ``slub_min_objects`` should no + longer be checked. This is useful to avoid SLUB trying to + generate super large order pages to fit ``slub_min_objects`` + of a slab cache with large object sizes into one high order + page. Setting command line parameter + ``debug_guardpage_minorder=N`` (N > 0), forces setting + ``slub_max_order`` to 0, what cause minimum possible order of + slabs allocation. + +SLUB Debug output +================= + +Here is a sample of slub debug output:: + + ==================================================================== + BUG kmalloc-8: Redzone overwritten + -------------------------------------------------------------------- + + INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc + INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 + INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 + INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 + + Bytes b4 0xc90f6d10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ + Object 0xc90f6d20: 31 30 31 39 2e 30 30 35 1019.005 + Redzone 0xc90f6d28: 00 cc cc cc . + Padding 0xc90f6d50: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ + + [<c010523d>] dump_trace+0x63/0x1eb + [<c01053df>] show_trace_log_lvl+0x1a/0x2f + [<c010601d>] show_trace+0x12/0x14 + [<c0106035>] dump_stack+0x16/0x18 + [<c017e0fa>] object_err+0x143/0x14b + [<c017e2cc>] check_object+0x66/0x234 + [<c017eb43>] __slab_free+0x239/0x384 + [<c017f446>] kfree+0xa6/0xc6 + [<c02e2335>] get_modalias+0xb9/0xf5 + [<c02e23b7>] dmi_dev_uevent+0x27/0x3c + [<c027866a>] dev_uevent+0x1ad/0x1da + [<c0205024>] kobject_uevent_env+0x20a/0x45b + [<c020527f>] kobject_uevent+0xa/0xf + [<c02779f1>] store_uevent+0x4f/0x58 + [<c027758e>] dev_attr_store+0x29/0x2f + [<c01bec4f>] sysfs_write_file+0x16e/0x19c + [<c0183ba7>] vfs_write+0xd1/0x15a + [<c01841d7>] sys_write+0x3d/0x72 + [<c0104112>] sysenter_past_esp+0x5f/0x99 + [<b7f7b410>] 0xb7f7b410 + ======================= + + FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc + +If SLUB encounters a corrupted object (full detection requires the kernel +to be booted with slub_debug) then the following output will be dumped +into the syslog: + +1. Description of the problem encountered + + This will be a message in the system log starting with:: + + =============================================== + BUG <slab cache affected>: <What went wrong> + ----------------------------------------------- + + INFO: <corruption start>-<corruption_end> <more info> + INFO: Slab <address> <slab information> + INFO: Object <address> <object information> + INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by + cpu> pid=<pid of the process> + INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> + pid=<pid of the process> + + (Object allocation / free information is only available if SLAB_STORE_USER is + set for the slab. slub_debug sets that option) + +2. The object contents if an object was involved. + + Various types of lines can follow the BUG SLUB line: + + Bytes b4 <address> : <bytes> + Shows a few bytes before the object where the problem was detected. + Can be useful if the corruption does not stop with the start of the + object. + + Object <address> : <bytes> + The bytes of the object. If the object is inactive then the bytes + typically contain poison values. Any non-poison value shows a + corruption by a write after free. + + Redzone <address> : <bytes> + The Redzone following the object. The Redzone is used to detect + writes after the object. All bytes should always have the same + value. If there is any deviation then it is due to a write after + the object boundary. + + (Redzone information is only available if SLAB_RED_ZONE is set. + slub_debug sets that option) + + Padding <address> : <bytes> + Unused data to fill up the space in order to get the next object + properly aligned. In the debug case we make sure that there are + at least 4 bytes of padding. This allows the detection of writes + before the object. + +3. A stackdump + + The stackdump describes the location where the error was detected. The cause + of the corruption is may be more likely found by looking at the function that + allocated or freed the object. + +4. Report on how the problem was dealt with in order to ensure the continued + operation of the system. + + These are messages in the system log beginning with:: + + FIX <slab cache affected>: <corrective action taken> + + In the above sample SLUB found that the Redzone of an active object has + been overwritten. Here a string of 8 characters was written into a slab that + has the length of 8 characters. However, a 8 character string needs a + terminating 0. That zero has overwritten the first byte of the Redzone field. + After reporting the details of the issue encountered the FIX SLUB message + tells us that SLUB has restored the Redzone to its proper value and then + system operations continue. + +Emergency operations +==================== + +Minimal debugging (sanity checks alone) can be enabled by booting with:: + + slub_debug=F + +This will be generally be enough to enable the resiliency features of slub +which will keep the system running even if a bad kernel component will +keep corrupting objects. This may be important for production systems. +Performance will be impacted by the sanity checks and there will be a +continual stream of error messages to the syslog but no additional memory +will be used (unlike full debugging). + +No guarantees. The kernel component still needs to be fixed. Performance +may be optimized further by locating the slab that experiences corruption +and enabling debugging only for that cache + +I.e.:: + + slub_debug=F,dentry + +If the corruption occurs by writing after the end of the object then it +may be advisable to enable a Redzone to avoid corrupting the beginning +of other objects:: + + slub_debug=FZ,dentry + +Extended slabinfo mode and plotting +=================================== + +The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes: + - Slabcache Totals + - Slabs sorted by size (up to -N <num> slabs, default 1) + - Slabs sorted by loss (up to -N <num> slabs, default 1) + +Additionally, in this mode ``slabinfo`` does not dynamically scale +sizes (G/M/K) and reports everything in bytes (this functionality is +also available to other slabinfo modes via '-B' option) which makes +reporting more precise and accurate. Moreover, in some sense the `-X' +mode also simplifies the analysis of slabs' behaviour, because its +output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it +pushes the analysis from looking through the numbers (tons of numbers) +to something easier -- visual analysis. + +To generate plots: + +a) collect slabinfo extended records, for example:: + + while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done + +b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script:: + + slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] + + The ``slabinfo-gnuplot.sh`` script will pre-processes the collected records + and generates 3 png files (and 3 pre-processing cache files) per STATS + file: + - Slabcache Totals: FOO_STATS-totals.png + - Slabs sorted by size: FOO_STATS-slabs-by-size.png + - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png + +Another use case, when ``slabinfo-gnuplot.sh`` can be useful, is when you +need to compare slabs' behaviour "prior to" and "after" some code +modification. To help you out there, ``slabinfo-gnuplot.sh`` script +can 'merge' the `Slabcache Totals` sections from different +measurements. To visually compare N plots: + +a) Collect as many STATS1, STATS2, .. STATSN files as you need:: + + while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done + +b) Pre-process those STATS files:: + + slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN + +c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the + generated pre-processed \*-totals:: + + slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals + + This will produce a single plot (png file). + + Plots, expectedly, can be large so some fluctuations or small spikes + can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two + options to 'zoom-in'/'zoom-out': + + a) ``-s %d,%d`` -- overwrites the default image width and heigh + b) ``-r %d,%d`` -- specifies a range of samples to use (for example, + in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r + 40,60`` range will plot only samples collected between 40th and + 60th seconds). + +Christoph Lameter, May 30, 2007 +Sergey Senozhatsky, October 23, 2015 diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt deleted file mode 100644 index 84652419bff2..000000000000 --- a/Documentation/vm/slub.txt +++ /dev/null @@ -1,342 +0,0 @@ -Short users guide for SLUB --------------------------- - -The basic philosophy of SLUB is very different from SLAB. SLAB -requires rebuilding the kernel to activate debug options for all -slab caches. SLUB always includes full debugging but it is off by default. -SLUB can enable debugging only for selected slabs in order to avoid -an impact on overall system performance which may make a bug more -difficult to find. - -In order to switch debugging on one can add an option "slub_debug" -to the kernel command line. That will enable full debugging for -all slabs. - -Typically one would then use the "slabinfo" command to get statistical -data and perform operation on the slabs. By default slabinfo only lists -slabs that have data in them. See "slabinfo -h" for more options when -running the command. slabinfo can be compiled with - -gcc -o slabinfo tools/vm/slabinfo.c - -Some of the modes of operation of slabinfo require that slub debugging -be enabled on the command line. F.e. no tracking information will be -available without debugging on and validation can only partially -be performed if debugging was not switched on. - -Some more sophisticated uses of slub_debug: -------------------------------------------- - -Parameters may be given to slub_debug. If none is specified then full -debugging is enabled. Format: - -slub_debug=<Debug-Options> Enable options for all slabs -slub_debug=<Debug-Options>,<slab name> - Enable options only for select slabs - -Possible debug options are - F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS - Sorry SLAB legacy issues) - Z Red zoning - P Poisoning (object and padding) - U User tracking (free and alloc) - T Trace (please only use on single slabs) - A Toggle failslab filter mark for the cache - O Switch debugging off for caches that would have - caused higher minimum slab orders - - Switch all debugging off (useful if the kernel is - configured with CONFIG_SLUB_DEBUG_ON) - -F.e. in order to boot just with sanity checks and red zoning one would specify: - - slub_debug=FZ - -Trying to find an issue in the dentry cache? Try - - slub_debug=,dentry - -to only enable debugging on the dentry cache. - -Red zoning and tracking may realign the slab. We can just apply sanity checks -to the dentry cache with - - slub_debug=F,dentry - -Debugging options may require the minimum possible slab order to increase as -a result of storing the metadata (for example, caches with PAGE_SIZE object -sizes). This has a higher liklihood of resulting in slab allocation errors -in low memory situations or if there's high fragmentation of memory. To -switch off debugging for such caches by default, use - - slub_debug=O - -In case you forgot to enable debugging on the kernel command line: It is -possible to enable debugging manually when the kernel is up. Look at the -contents of: - -/sys/kernel/slab/<slab name>/ - -Look at the writable files. Writing 1 to them will enable the -corresponding debug option. All options can be set on a slab that does -not contain objects. If the slab already contains objects then sanity checks -and tracing may only be enabled. The other options may cause the realignment -of objects. - -Careful with tracing: It may spew out lots of information and never stop if -used on the wrong slab. - -Slab merging ------------- - -If no debug options are specified then SLUB may merge similar slabs together -in order to reduce overhead and increase cache hotness of objects. -slabinfo -a displays which slabs were merged together. - -Slab validation ---------------- - -SLUB can validate all object if the kernel was booted with slub_debug. In -order to do so you must have the slabinfo tool. Then you can do - -slabinfo -v - -which will test all objects. Output will be generated to the syslog. - -This also works in a more limited way if boot was without slab debug. -In that case slabinfo -v simply tests all reachable objects. Usually -these are in the cpu slabs and the partial slabs. Full slabs are not -tracked by SLUB in a non debug situation. - -Getting more performance ------------------------- - -To some degree SLUB's performance is limited by the need to take the -list_lock once in a while to deal with partial slabs. That overhead is -governed by the order of the allocation for each slab. The allocations -can be influenced by kernel parameters: - -slub_min_objects=x (default 4) -slub_min_order=x (default 0) -slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) - -slub_min_objects allows to specify how many objects must at least fit -into one slab in order for the allocation order to be acceptable. -In general slub will be able to perform this number of allocations -on a slab without consulting centralized resources (list_lock) where -contention may occur. - -slub_min_order specifies a minim order of slabs. A similar effect like -slub_min_objects. - -slub_max_order specified the order at which slub_min_objects should no -longer be checked. This is useful to avoid SLUB trying to generate -super large order pages to fit slub_min_objects of a slab cache with -large object sizes into one high order page. Setting command line -parameter debug_guardpage_minorder=N (N > 0), forces setting -slub_max_order to 0, what cause minimum possible order of slabs -allocation. - -SLUB Debug output ------------------ - -Here is a sample of slub debug output: - -==================================================================== -BUG kmalloc-8: Redzone overwritten --------------------------------------------------------------------- - -INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc -INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 -INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 -INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 - -Bytes b4 0xc90f6d10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ - Object 0xc90f6d20: 31 30 31 39 2e 30 30 35 1019.005 - Redzone 0xc90f6d28: 00 cc cc cc . - Padding 0xc90f6d50: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ - - [<c010523d>] dump_trace+0x63/0x1eb - [<c01053df>] show_trace_log_lvl+0x1a/0x2f - [<c010601d>] show_trace+0x12/0x14 - [<c0106035>] dump_stack+0x16/0x18 - [<c017e0fa>] object_err+0x143/0x14b - [<c017e2cc>] check_object+0x66/0x234 - [<c017eb43>] __slab_free+0x239/0x384 - [<c017f446>] kfree+0xa6/0xc6 - [<c02e2335>] get_modalias+0xb9/0xf5 - [<c02e23b7>] dmi_dev_uevent+0x27/0x3c - [<c027866a>] dev_uevent+0x1ad/0x1da - [<c0205024>] kobject_uevent_env+0x20a/0x45b - [<c020527f>] kobject_uevent+0xa/0xf - [<c02779f1>] store_uevent+0x4f/0x58 - [<c027758e>] dev_attr_store+0x29/0x2f - [<c01bec4f>] sysfs_write_file+0x16e/0x19c - [<c0183ba7>] vfs_write+0xd1/0x15a - [<c01841d7>] sys_write+0x3d/0x72 - [<c0104112>] sysenter_past_esp+0x5f/0x99 - [<b7f7b410>] 0xb7f7b410 - ======================= - -FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc - -If SLUB encounters a corrupted object (full detection requires the kernel -to be booted with slub_debug) then the following output will be dumped -into the syslog: - -1. Description of the problem encountered - -This will be a message in the system log starting with - -=============================================== -BUG <slab cache affected>: <What went wrong> ------------------------------------------------ - -INFO: <corruption start>-<corruption_end> <more info> -INFO: Slab <address> <slab information> -INFO: Object <address> <object information> -INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by - cpu> pid=<pid of the process> -INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> - pid=<pid of the process> - -(Object allocation / free information is only available if SLAB_STORE_USER is -set for the slab. slub_debug sets that option) - -2. The object contents if an object was involved. - -Various types of lines can follow the BUG SLUB line: - -Bytes b4 <address> : <bytes> - Shows a few bytes before the object where the problem was detected. - Can be useful if the corruption does not stop with the start of the - object. - -Object <address> : <bytes> - The bytes of the object. If the object is inactive then the bytes - typically contain poison values. Any non-poison value shows a - corruption by a write after free. - -Redzone <address> : <bytes> - The Redzone following the object. The Redzone is used to detect - writes after the object. All bytes should always have the same - value. If there is any deviation then it is due to a write after - the object boundary. - - (Redzone information is only available if SLAB_RED_ZONE is set. - slub_debug sets that option) - -Padding <address> : <bytes> - Unused data to fill up the space in order to get the next object - properly aligned. In the debug case we make sure that there are - at least 4 bytes of padding. This allows the detection of writes - before the object. - -3. A stackdump - -The stackdump describes the location where the error was detected. The cause -of the corruption is may be more likely found by looking at the function that -allocated or freed the object. - -4. Report on how the problem was dealt with in order to ensure the continued -operation of the system. - -These are messages in the system log beginning with - -FIX <slab cache affected>: <corrective action taken> - -In the above sample SLUB found that the Redzone of an active object has -been overwritten. Here a string of 8 characters was written into a slab that -has the length of 8 characters. However, a 8 character string needs a -terminating 0. That zero has overwritten the first byte of the Redzone field. -After reporting the details of the issue encountered the FIX SLUB message -tells us that SLUB has restored the Redzone to its proper value and then -system operations continue. - -Emergency operations: ---------------------- - -Minimal debugging (sanity checks alone) can be enabled by booting with - - slub_debug=F - -This will be generally be enough to enable the resiliency features of slub -which will keep the system running even if a bad kernel component will -keep corrupting objects. This may be important for production systems. -Performance will be impacted by the sanity checks and there will be a -continual stream of error messages to the syslog but no additional memory -will be used (unlike full debugging). - -No guarantees. The kernel component still needs to be fixed. Performance -may be optimized further by locating the slab that experiences corruption -and enabling debugging only for that cache - -I.e. - - slub_debug=F,dentry - -If the corruption occurs by writing after the end of the object then it -may be advisable to enable a Redzone to avoid corrupting the beginning -of other objects. - - slub_debug=FZ,dentry - -Extended slabinfo mode and plotting ------------------------------------ - -The slabinfo tool has a special 'extended' ('-X') mode that includes: - - Slabcache Totals - - Slabs sorted by size (up to -N <num> slabs, default 1) - - Slabs sorted by loss (up to -N <num> slabs, default 1) - -Additionally, in this mode slabinfo does not dynamically scale sizes (G/M/K) -and reports everything in bytes (this functionality is also available to -other slabinfo modes via '-B' option) which makes reporting more precise and -accurate. Moreover, in some sense the `-X' mode also simplifies the analysis -of slabs' behaviour, because its output can be plotted using the -slabinfo-gnuplot.sh script. So it pushes the analysis from looking through -the numbers (tons of numbers) to something easier -- visual analysis. - -To generate plots: -a) collect slabinfo extended records, for example: - - while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done - -b) pass stats file(-s) to slabinfo-gnuplot.sh script: - slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] - -The slabinfo-gnuplot.sh script will pre-processes the collected records -and generates 3 png files (and 3 pre-processing cache files) per STATS -file: - - Slabcache Totals: FOO_STATS-totals.png - - Slabs sorted by size: FOO_STATS-slabs-by-size.png - - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png - -Another use case, when slabinfo-gnuplot can be useful, is when you need -to compare slabs' behaviour "prior to" and "after" some code modification. -To help you out there, slabinfo-gnuplot.sh script can 'merge' the -`Slabcache Totals` sections from different measurements. To visually -compare N plots: - -a) Collect as many STATS1, STATS2, .. STATSN files as you need - while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done - -b) Pre-process those STATS files - slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN - -c) Execute slabinfo-gnuplot.sh in '-t' mode, passing all of the -generated pre-processed *-totals - slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals - -This will produce a single plot (png file). - -Plots, expectedly, can be large so some fluctuations or small spikes -can go unnoticed. To deal with that, `slabinfo-gnuplot.sh' has two -options to 'zoom-in'/'zoom-out': - a) -s %d,%d overwrites the default image width and heigh - b) -r %d,%d specifies a range of samples to use (for example, - in `slabinfo -X >> FOO_STATS; sleep 1;' case, using - a "-r 40,60" range will plot only samples collected - between 40th and 60th seconds). - -Christoph Lameter, May 30, 2007 -Sergey Senozhatsky, October 23, 2015 diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock.rst index 62842a857dab..889b00be469f 100644 --- a/Documentation/vm/split_page_table_lock +++ b/Documentation/vm/split_page_table_lock.rst @@ -1,3 +1,6 @@ +.. _split_page_table_lock: + +===================== Split page table lock ===================== @@ -11,6 +14,7 @@ access to the table. At the moment we use split lock for PTE and PMD tables. Access to higher level tables protected by mm->page_table_lock. There are helpers to lock/unlock a table and other accessor functions: + - pte_offset_map_lock() maps pte and takes PTE table lock, returns pointer to the taken lock; @@ -34,12 +38,13 @@ Split page table lock for PMD tables is enabled, if it's enabled for PTE tables and the architecture supports it (see below). Hugetlb and split page table lock ---------------------------------- +================================= Hugetlb can support several page sizes. We use split lock only for PMD level, but not for PUD. Hugetlb-specific helpers: + - huge_pte_lock() takes pmd split lock for PMD_SIZE page, mm->page_table_lock otherwise; @@ -47,7 +52,7 @@ Hugetlb-specific helpers: returns pointer to table lock; Support of split page table lock by an architecture ---------------------------------------------------- +=================================================== There's no need in special enabling of PTE split page table lock: everything required is done by pgtable_page_ctor() and pgtable_page_dtor(), @@ -73,7 +78,7 @@ NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must be handled properly. page->ptl ---------- +========= page->ptl is used to access split page table lock, where 'page' is struct page of page containing the table. It shares storage with page->private @@ -81,6 +86,7 @@ page of page containing the table. It shares storage with page->private To avoid increasing size of struct page and have best performance, we use a trick: + - if spinlock_t fits into long, we use page->ptr as spinlock, so we can avoid indirect access and save a cache line. - if size of spinlock_t is bigger then size of long, we use page->ptl as diff --git a/Documentation/vm/swap_numa.txt b/Documentation/vm/swap_numa.rst index d5960c9124f5..e0466f2db8fa 100644 --- a/Documentation/vm/swap_numa.txt +++ b/Documentation/vm/swap_numa.rst @@ -1,5 +1,8 @@ +.. _swap_numa: + +=========================================== Automatically bind swap device to numa node -------------------------------------------- +=========================================== If the system has more than one swap device and swap device has the node information, we can make use of this information to decide which swap @@ -7,15 +10,16 @@ device to use in get_swap_pages() to get better performance. How to use this feature ------------------------ +======================= Swap device has priority and that decides the order of it to be used. To make use of automatically binding, there is no need to manipulate priority settings for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and swapB, with swapA attached to node 0 and swapB attached to node 1, are going -to be swapped on. Simply swapping them on by doing: -# swapon /dev/swapA -# swapon /dev/swapB +to be swapped on. Simply swapping them on by doing:: + + # swapon /dev/swapA + # swapon /dev/swapB Then node 0 will use the two swap devices in the order of swapA then swapB and node 1 will use the two swap devices in the order of swapB then swapA. Note @@ -24,32 +28,39 @@ that the order of them being swapped on doesn't matter. A more complex example on a 4 node machine. Assume 6 swap devices are going to be swapped on: swapA and swapB are attached to node 0, swapC is attached to node 1, swapD and swapE are attached to node 2 and swapF is attached to node3. -The way to swap them on is the same as above: -# swapon /dev/swapA -# swapon /dev/swapB -# swapon /dev/swapC -# swapon /dev/swapD -# swapon /dev/swapE -# swapon /dev/swapF - -Then node 0 will use them in the order of: -swapA/swapB -> swapC -> swapD -> swapE -> swapF +The way to swap them on is the same as above:: + + # swapon /dev/swapA + # swapon /dev/swapB + # swapon /dev/swapC + # swapon /dev/swapD + # swapon /dev/swapE + # swapon /dev/swapF + +Then node 0 will use them in the order of:: + + swapA/swapB -> swapC -> swapD -> swapE -> swapF + swapA and swapB will be used in a round robin mode before any other swap device. -node 1 will use them in the order of: -swapC -> swapA -> swapB -> swapD -> swapE -> swapF +node 1 will use them in the order of:: + + swapC -> swapA -> swapB -> swapD -> swapE -> swapF + +node 2 will use them in the order of:: + + swapD/swapE -> swapA -> swapB -> swapC -> swapF -node 2 will use them in the order of: -swapD/swapE -> swapA -> swapB -> swapC -> swapF Similaly, swapD and swapE will be used in a round robin mode before any other swap devices. -node 3 will use them in the order of: -swapF -> swapA -> swapB -> swapC -> swapD -> swapE +node 3 will use them in the order of:: + + swapF -> swapA -> swapB -> swapC -> swapD -> swapE Implementation details ----------------------- +====================== The current code uses a priority based list, swap_avail_list, to decide which swap device to use and if multiple swap devices share the same diff --git a/Documentation/vm/transhuge.rst b/Documentation/vm/transhuge.rst new file mode 100644 index 000000000000..a8cf6809e36e --- /dev/null +++ b/Documentation/vm/transhuge.rst @@ -0,0 +1,197 @@ +.. _transhuge: + +============================ +Transparent Hugepage Support +============================ + +This document describes design principles Transparent Hugepage (THP) +Support and its interaction with other parts of the memory management. + +Design principles +================= + +- "graceful fallback": mm components which don't have transparent hugepage + knowledge fall back to breaking huge pmd mapping into table of ptes and, + if necessary, split a transparent hugepage. Therefore these components + can continue working on the regular pages or regular pte mappings. + +- if a hugepage allocation fails because of memory fragmentation, + regular pages should be gracefully allocated instead and mixed in + the same vma without any failure or significant delay and without + userland noticing + +- if some task quits and more hugepages become available (either + immediately in the buddy or through the VM), guest physical memory + backed by regular pages should be relocated on hugepages + automatically (with khugepaged) + +- it doesn't require memory reservation and in turn it uses hugepages + whenever possible (the only possible reservation here is kernelcore= + to avoid unmovable pages to fragment all the memory but such a tweak + is not specific to transparent hugepage support and it's a generic + feature that applies to all dynamic high order allocations in the + kernel) + +get_user_pages and follow_page +============================== + +get_user_pages and follow_page if run on a hugepage, will return the +head or tail pages as usual (exactly as they would do on +hugetlbfs). Most gup users will only care about the actual physical +address of the page and its temporary pinning to release after the I/O +is complete, so they won't ever notice the fact the page is huge. But +if any driver is going to mangle over the page structure of the tail +page (like for checking page->mapping or other bits that are relevant +for the head page and not the tail page), it should be updated to jump +to check head page instead. Taking reference on any head/tail page would +prevent page from being split by anyone. + +.. note:: + these aren't new constraints to the GUP API, and they match the + same constrains that applies to hugetlbfs too, so any driver capable + of handling GUP on hugetlbfs will also work fine on transparent + hugepage backed mappings. + +In case you can't handle compound pages if they're returned by +follow_page, the FOLL_SPLIT bit can be specified as parameter to +follow_page, so that it will split the hugepages before returning +them. Migration for example passes FOLL_SPLIT as parameter to +follow_page because it's not hugepage aware and in fact it can't work +at all on hugetlbfs (but it instead works fine on transparent +hugepages thanks to FOLL_SPLIT). migration simply can't deal with +hugepages being returned (as it's not only checking the pfn of the +page and pinning it during the copy but it pretends to migrate the +memory in regular page sizes and with regular pte/pmd mappings). + +Graceful fallback +================= + +Code walking pagetables but unaware about huge pmds can simply call +split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by +pmd_offset. It's trivial to make the code transparent hugepage aware +by just grepping for "pmd_offset" and adding split_huge_pmd where +missing after pmd_offset returns the pmd. Thanks to the graceful +fallback design, with a one liner change, you can avoid to write +hundred if not thousand of lines of complex code to make your code +hugepage aware. + +If you're not walking pagetables but you run into a physical hugepage +but you can't handle it natively in your code, you can split it by +calling split_huge_page(page). This is what the Linux VM does before +it tries to swapout the hugepage for example. split_huge_page() can fail +if the page is pinned and you must handle this correctly. + +Example to make mremap.c transparent hugepage aware with a one liner +change:: + + diff --git a/mm/mremap.c b/mm/mremap.c + --- a/mm/mremap.c + +++ b/mm/mremap.c + @@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru + return NULL; + + pmd = pmd_offset(pud, addr); + + split_huge_pmd(vma, pmd, addr); + if (pmd_none_or_clear_bad(pmd)) + return NULL; + +Locking in hugepage aware code +============================== + +We want as much code as possible hugepage aware, as calling +split_huge_page() or split_huge_pmd() has a cost. + +To make pagetable walks huge pmd aware, all you need to do is to call +pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the +mmap_sem in read (or write) mode to be sure an huge pmd cannot be +created from under you by khugepaged (khugepaged collapse_huge_page +takes the mmap_sem in write mode in addition to the anon_vma lock). If +pmd_trans_huge returns false, you just fallback in the old code +paths. If instead pmd_trans_huge returns true, you have to take the +page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the +page table lock will prevent the huge pmd to be converted into a +regular pmd from under you (split_huge_pmd can run in parallel to the +pagetable walk). If the second pmd_trans_huge returns false, you +should just drop the page table lock and fallback to the old code as +before. Otherwise you can proceed to process the huge pmd and the +hugepage natively. Once finished you can drop the page table lock. + +Refcounts and transparent huge pages +==================================== + +Refcounting on THP is mostly consistent with refcounting on other compound +pages: + + - get_page()/put_page() and GUP operate in head page's ->_refcount. + + - ->_refcount in tail pages is always zero: get_page_unless_zero() never + succeed on tail pages. + + - map/unmap of the pages with PTE entry increment/decrement ->_mapcount + on relevant sub-page of the compound page. + + - map/unmap of the whole compound page accounted in compound_mapcount + (stored in first tail page). For file huge pages, we also increment + ->_mapcount of all sub-pages in order to have race-free detection of + last unmap of subpages. + +PageDoubleMap() indicates that the page is *possibly* mapped with PTEs. + +For anonymous pages PageDoubleMap() also indicates ->_mapcount in all +subpages is offset up by one. This additional reference is required to +get race-free detection of unmap of subpages when we have them mapped with +both PMDs and PTEs. + +This is optimization required to lower overhead of per-subpage mapcount +tracking. The alternative is alter ->_mapcount in all subpages on each +map/unmap of the whole compound page. + +For anonymous pages, we set PG_double_map when a PMD of the page got split +for the first time, but still have PMD mapping. The additional references +go away with last compound_mapcount. + +File pages get PG_double_map set on first map of the page with PTE and +goes away when the page gets evicted from page cache. + +split_huge_page internally has to distribute the refcounts in the head +page to the tail pages before clearing all PG_head/tail bits from the page +structures. It can be done easily for refcounts taken by page table +entries. But we don't have enough information on how to distribute any +additional pins (i.e. from get_user_pages). split_huge_page() fails any +requests to split pinned huge page: it expects page count to be equal to +sum of mapcount of all sub-pages plus one (split_huge_page caller must +have reference for head page). + +split_huge_page uses migration entries to stabilize page->_refcount and +page->_mapcount of anonymous pages. File pages just got unmapped. + +We safe against physical memory scanners too: the only legitimate way +scanner can get reference to a page is get_page_unless_zero(). + +All tail pages have zero ->_refcount until atomic_add(). This prevents the +scanner from getting a reference to the tail page up to that point. After the +atomic_add() we don't care about the ->_refcount value. We already known how +many references should be uncharged from the head page. + +For head page get_page_unless_zero() will succeed and we don't mind. It's +clear where reference should go after split: it will stay on head page. + +Note that split_huge_pmd() doesn't have any limitation on refcounting: +pmd can be split at any point and never fails. + +Partial unmap and deferred_split_huge_page() +============================================ + +Unmapping part of THP (with munmap() or other way) is not going to free +memory immediately. Instead, we detect that a subpage of THP is not in use +in page_remove_rmap() and queue the THP for splitting if memory pressure +comes. Splitting will free up unused subpages. + +Splitting the page right away is not an option due to locking context in +the place where we can detect partial unmap. It's also might be +counterproductive since in many cases partial unmap happens during exit(2) if +a THP crosses a VMA boundary. + +Function deferred_split_huge_page() is used to queue page for splitting. +The splitting itself will happen when we get memory pressure via shrinker +interface. diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt deleted file mode 100644 index 4dde03b44ad1..000000000000 --- a/Documentation/vm/transhuge.txt +++ /dev/null @@ -1,527 +0,0 @@ -= Transparent Hugepage Support = - -== Objective == - -Performance critical computing applications dealing with large memory -working sets are already running on top of libhugetlbfs and in turn -hugetlbfs. Transparent Hugepage Support is an alternative means of -using huge pages for the backing of virtual memory with huge pages -that supports the automatic promotion and demotion of page sizes and -without the shortcomings of hugetlbfs. - -Currently it only works for anonymous memory mappings and tmpfs/shmem. -But in the future it can expand to other filesystems. - -The reason applications are running faster is because of two -factors. The first factor is almost completely irrelevant and it's not -of significant interest because it'll also have the downside of -requiring larger clear-page copy-page in page faults which is a -potentially negative effect. The first factor consists in taking a -single page fault for each 2M virtual region touched by userland (so -reducing the enter/exit kernel frequency by a 512 times factor). This -only matters the first time the memory is accessed for the lifetime of -a memory mapping. The second long lasting and much more important -factor will affect all subsequent accesses to the memory for the whole -runtime of the application. The second factor consist of two -components: 1) the TLB miss will run faster (especially with -virtualization using nested pagetables but almost always also on bare -metal without virtualization) and 2) a single TLB entry will be -mapping a much larger amount of virtual memory in turn reducing the -number of TLB misses. With virtualization and nested pagetables the -TLB can be mapped of larger size only if both KVM and the Linux guest -are using hugepages but a significant speedup already happens if only -one of the two is using hugepages just because of the fact the TLB -miss is going to run faster. - -== Design == - -- "graceful fallback": mm components which don't have transparent hugepage - knowledge fall back to breaking huge pmd mapping into table of ptes and, - if necessary, split a transparent hugepage. Therefore these components - can continue working on the regular pages or regular pte mappings. - -- if a hugepage allocation fails because of memory fragmentation, - regular pages should be gracefully allocated instead and mixed in - the same vma without any failure or significant delay and without - userland noticing - -- if some task quits and more hugepages become available (either - immediately in the buddy or through the VM), guest physical memory - backed by regular pages should be relocated on hugepages - automatically (with khugepaged) - -- it doesn't require memory reservation and in turn it uses hugepages - whenever possible (the only possible reservation here is kernelcore= - to avoid unmovable pages to fragment all the memory but such a tweak - is not specific to transparent hugepage support and it's a generic - feature that applies to all dynamic high order allocations in the - kernel) - -Transparent Hugepage Support maximizes the usefulness of free memory -if compared to the reservation approach of hugetlbfs by allowing all -unused memory to be used as cache or other movable (or even unmovable -entities). It doesn't require reservation to prevent hugepage -allocation failures to be noticeable from userland. It allows paging -and all other advanced VM features to be available on the -hugepages. It requires no modifications for applications to take -advantage of it. - -Applications however can be further optimized to take advantage of -this feature, like for example they've been optimized before to avoid -a flood of mmap system calls for every malloc(4k). Optimizing userland -is by far not mandatory and khugepaged already can take care of long -lived page allocations even for hugepage unaware applications that -deals with large amounts of memory. - -In certain cases when hugepages are enabled system wide, application -may end up allocating more memory resources. An application may mmap a -large region but only touch 1 byte of it, in that case a 2M page might -be allocated instead of a 4k page for no good. This is why it's -possible to disable hugepages system-wide and to only have them inside -MADV_HUGEPAGE madvise regions. - -Embedded systems should enable hugepages only inside madvise regions -to eliminate any risk of wasting any precious byte of memory and to -only run faster. - -Applications that gets a lot of benefit from hugepages and that don't -risk to lose memory by using hugepages, should use -madvise(MADV_HUGEPAGE) on their critical mmapped regions. - -== sysfs == - -Transparent Hugepage Support for anonymous memory can be entirely disabled -(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE -regions (to avoid the risk of consuming more memory resources) or enabled -system wide. This can be achieved with one of: - -echo always >/sys/kernel/mm/transparent_hugepage/enabled -echo madvise >/sys/kernel/mm/transparent_hugepage/enabled -echo never >/sys/kernel/mm/transparent_hugepage/enabled - -It's also possible to limit defrag efforts in the VM to generate -anonymous hugepages in case they're not immediately free to madvise -regions or to never try to defrag memory and simply fallback to regular -pages unless hugepages are immediately available. Clearly if we spend CPU -time to defrag memory, we would expect to gain even more by the fact we -use hugepages later instead of regular pages. This isn't always -guaranteed, but it may be more likely in case the allocation is for a -MADV_HUGEPAGE region. - -echo always >/sys/kernel/mm/transparent_hugepage/defrag -echo defer >/sys/kernel/mm/transparent_hugepage/defrag -echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag -echo madvise >/sys/kernel/mm/transparent_hugepage/defrag -echo never >/sys/kernel/mm/transparent_hugepage/defrag - -"always" means that an application requesting THP will stall on allocation -failure and directly reclaim pages and compact memory in an effort to -allocate a THP immediately. This may be desirable for virtual machines -that benefit heavily from THP use and are willing to delay the VM start -to utilise them. - -"defer" means that an application will wake kswapd in the background -to reclaim pages and wake kcompactd to compact memory so that THP is -available in the near future. It's the responsibility of khugepaged -to then install the THP pages later. - -"defer+madvise" will enter direct reclaim and compaction like "always", but -only for regions that have used madvise(MADV_HUGEPAGE); all other regions -will wake kswapd in the background to reclaim pages and wake kcompactd to -compact memory so that THP is available in the near future. - -"madvise" will enter direct reclaim like "always" but only for regions -that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. - -"never" should be self-explanatory. - -By default kernel tries to use huge zero page on read page fault to -anonymous mapping. It's possible to disable huge zero page by writing 0 -or enable it back by writing 1: - -echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page -echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page - -Some userspace (such as a test program, or an optimized memory allocation -library) may want to know the size (in bytes) of a transparent hugepage: - -cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size - -khugepaged will be automatically started when -transparent_hugepage/enabled is set to "always" or "madvise, and it'll -be automatically shutdown if it's set to "never". - -khugepaged runs usually at low frequency so while one may not want to -invoke defrag algorithms synchronously during the page faults, it -should be worth invoking defrag at least in khugepaged. However it's -also possible to disable defrag in khugepaged by writing 0 or enable -defrag in khugepaged by writing 1: - -echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag -echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag - -You can also control how many pages khugepaged should scan at each -pass: - -/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan - -and how many milliseconds to wait in khugepaged between each pass (you -can set this to 0 to run khugepaged at 100% utilization of one core): - -/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs - -and how many milliseconds to wait in khugepaged if there's an hugepage -allocation failure to throttle the next allocation attempt. - -/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs - -The khugepaged progress can be seen in the number of pages collapsed: - -/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed - -for each pass: - -/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans - -max_ptes_none specifies how many extra small pages (that are -not already mapped) can be allocated when collapsing a group -of small pages into one large page. - -/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none - -A higher value leads to use additional memory for programs. -A lower value leads to gain less thp performance. Value of -max_ptes_none can waste cpu time very little, you can -ignore it. - -max_ptes_swap specifies how many pages can be brought in from -swap when collapsing a group of pages into a transparent huge page. - -/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap - -A higher value can cause excessive swap IO and waste -memory. A lower value can prevent THPs from being -collapsed, resulting fewer pages being collapsed into -THPs, and lower memory access performance. - -== Boot parameter == - -You can change the sysfs boot time defaults of Transparent Hugepage -Support by passing the parameter "transparent_hugepage=always" or -"transparent_hugepage=madvise" or "transparent_hugepage=never" -(without "") to the kernel command line. - -== Hugepages in tmpfs/shmem == - -You can control hugepage allocation policy in tmpfs with mount option -"huge=". It can have following values: - - - "always": - Attempt to allocate huge pages every time we need a new page; - - - "never": - Do not allocate huge pages; - - - "within_size": - Only allocate huge page if it will be fully within i_size. - Also respect fadvise()/madvise() hints; - - - "advise: - Only allocate huge pages if requested with fadvise()/madvise(); - -The default policy is "never". - -"mount -o remount,huge= /mountpoint" works fine after mount: remounting -huge=never will not attempt to break up huge pages at all, just stop more -from being allocated. - -There's also sysfs knob to control hugepage allocation policy for internal -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. - -In addition to policies listed above, shmem_enabled allows two further -values: - - - "deny": - For use in emergencies, to force the huge option off from - all mounts; - - "force": - Force the huge option on for all - very useful for testing; - -== Need of application restart == - -The transparent_hugepage/enabled values and tmpfs mount option only affect -future behavior. So to make them effective you need to restart any -application that could have been using hugepages. This also applies to the -regions registered in khugepaged. - -== Monitoring usage == - -The number of anonymous transparent huge pages currently used by the -system is available by reading the AnonHugePages field in /proc/meminfo. -To identify what applications are using anonymous transparent huge pages, -it is necessary to read /proc/PID/smaps and count the AnonHugePages fields -for each mapping. - -The number of file transparent huge pages mapped to userspace is available -by reading ShmemPmdMapped and ShmemHugePages fields in /proc/meminfo. -To identify what applications are mapping file transparent huge pages, it -is necessary to read /proc/PID/smaps and count the FileHugeMapped fields -for each mapping. - -Note that reading the smaps file is expensive and reading it -frequently will incur overhead. - -There are a number of counters in /proc/vmstat that may be used to -monitor how successfully the system is providing huge pages for use. - -thp_fault_alloc is incremented every time a huge page is successfully - allocated to handle a page fault. This applies to both the - first time a page is faulted and for COW faults. - -thp_collapse_alloc is incremented by khugepaged when it has found - a range of pages to collapse into one huge page and has - successfully allocated a new huge page to store the data. - -thp_fault_fallback is incremented if a page fault fails to allocate - a huge page and instead falls back to using small pages. - -thp_collapse_alloc_failed is incremented if khugepaged found a range - of pages that should be collapsed into one huge page but failed - the allocation. - -thp_file_alloc is incremented every time a file huge page is successfully - allocated. - -thp_file_mapped is incremented every time a file huge page is mapped into - user address space. - -thp_split_page is incremented every time a huge page is split into base - pages. This can happen for a variety of reasons but a common - reason is that a huge page is old and is being reclaimed. - This action implies splitting all PMD the page mapped with. - -thp_split_page_failed is incremented if kernel fails to split huge - page. This can happen if the page was pinned by somebody. - -thp_deferred_split_page is incremented when a huge page is put onto split - queue. This happens when a huge page is partially unmapped and - splitting it would free up some memory. Pages on split queue are - going to be split under memory pressure. - -thp_split_pmd is incremented every time a PMD split into table of PTEs. - This can happen, for instance, when application calls mprotect() or - munmap() on part of huge page. It doesn't split huge page, only - page table entry. - -thp_zero_page_alloc is incremented every time a huge zero page is - successfully allocated. It includes allocations which where - dropped due race with other allocation. Note, it doesn't count - every map of the huge zero page, only its allocation. - -thp_zero_page_alloc_failed is incremented if kernel fails to allocate - huge zero page and falls back to using small pages. - -As the system ages, allocating huge pages may be expensive as the -system uses memory compaction to copy data around memory to free a -huge page for use. There are some counters in /proc/vmstat to help -monitor this overhead. - -compact_stall is incremented every time a process stalls to run - memory compaction so that a huge page is free for use. - -compact_success is incremented if the system compacted memory and - freed a huge page for use. - -compact_fail is incremented if the system tries to compact memory - but failed. - -compact_pages_moved is incremented each time a page is moved. If - this value is increasing rapidly, it implies that the system - is copying a lot of data to satisfy the huge page allocation. - It is possible that the cost of copying exceeds any savings - from reduced TLB misses. - -compact_pagemigrate_failed is incremented when the underlying mechanism - for moving a page failed. - -compact_blocks_moved is incremented each time memory compaction examines - a huge page aligned range of pages. - -It is possible to establish how long the stalls were using the function -tracer to record how long was spent in __alloc_pages_nodemask and -using the mm_page_alloc tracepoint to identify which allocations were -for huge pages. - -== get_user_pages and follow_page == - -get_user_pages and follow_page if run on a hugepage, will return the -head or tail pages as usual (exactly as they would do on -hugetlbfs). Most gup users will only care about the actual physical -address of the page and its temporary pinning to release after the I/O -is complete, so they won't ever notice the fact the page is huge. But -if any driver is going to mangle over the page structure of the tail -page (like for checking page->mapping or other bits that are relevant -for the head page and not the tail page), it should be updated to jump -to check head page instead. Taking reference on any head/tail page would -prevent page from being split by anyone. - -NOTE: these aren't new constraints to the GUP API, and they match the -same constrains that applies to hugetlbfs too, so any driver capable -of handling GUP on hugetlbfs will also work fine on transparent -hugepage backed mappings. - -In case you can't handle compound pages if they're returned by -follow_page, the FOLL_SPLIT bit can be specified as parameter to -follow_page, so that it will split the hugepages before returning -them. Migration for example passes FOLL_SPLIT as parameter to -follow_page because it's not hugepage aware and in fact it can't work -at all on hugetlbfs (but it instead works fine on transparent -hugepages thanks to FOLL_SPLIT). migration simply can't deal with -hugepages being returned (as it's not only checking the pfn of the -page and pinning it during the copy but it pretends to migrate the -memory in regular page sizes and with regular pte/pmd mappings). - -== Optimizing the applications == - -To be guaranteed that the kernel will map a 2M page immediately in any -memory region, the mmap region has to be hugepage naturally -aligned. posix_memalign() can provide that guarantee. - -== Hugetlbfs == - -You can use hugetlbfs on a kernel that has transparent hugepage -support enabled just fine as always. No difference can be noted in -hugetlbfs other than there will be less overall fragmentation. All -usual features belonging to hugetlbfs are preserved and -unaffected. libhugetlbfs will also work fine as usual. - -== Graceful fallback == - -Code walking pagetables but unaware about huge pmds can simply call -split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by -pmd_offset. It's trivial to make the code transparent hugepage aware -by just grepping for "pmd_offset" and adding split_huge_pmd where -missing after pmd_offset returns the pmd. Thanks to the graceful -fallback design, with a one liner change, you can avoid to write -hundred if not thousand of lines of complex code to make your code -hugepage aware. - -If you're not walking pagetables but you run into a physical hugepage -but you can't handle it natively in your code, you can split it by -calling split_huge_page(page). This is what the Linux VM does before -it tries to swapout the hugepage for example. split_huge_page() can fail -if the page is pinned and you must handle this correctly. - -Example to make mremap.c transparent hugepage aware with a one liner -change: - -diff --git a/mm/mremap.c b/mm/mremap.c ---- a/mm/mremap.c -+++ b/mm/mremap.c -@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru - return NULL; - - pmd = pmd_offset(pud, addr); -+ split_huge_pmd(vma, pmd, addr); - if (pmd_none_or_clear_bad(pmd)) - return NULL; - -== Locking in hugepage aware code == - -We want as much code as possible hugepage aware, as calling -split_huge_page() or split_huge_pmd() has a cost. - -To make pagetable walks huge pmd aware, all you need to do is to call -pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the -mmap_sem in read (or write) mode to be sure an huge pmd cannot be -created from under you by khugepaged (khugepaged collapse_huge_page -takes the mmap_sem in write mode in addition to the anon_vma lock). If -pmd_trans_huge returns false, you just fallback in the old code -paths. If instead pmd_trans_huge returns true, you have to take the -page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the -page table lock will prevent the huge pmd to be converted into a -regular pmd from under you (split_huge_pmd can run in parallel to the -pagetable walk). If the second pmd_trans_huge returns false, you -should just drop the page table lock and fallback to the old code as -before. Otherwise you can proceed to process the huge pmd and the -hugepage natively. Once finished you can drop the page table lock. - -== Refcounts and transparent huge pages == - -Refcounting on THP is mostly consistent with refcounting on other compound -pages: - - - get_page()/put_page() and GUP operate in head page's ->_refcount. - - - ->_refcount in tail pages is always zero: get_page_unless_zero() never - succeed on tail pages. - - - map/unmap of the pages with PTE entry increment/decrement ->_mapcount - on relevant sub-page of the compound page. - - - map/unmap of the whole compound page accounted in compound_mapcount - (stored in first tail page). For file huge pages, we also increment - ->_mapcount of all sub-pages in order to have race-free detection of - last unmap of subpages. - -PageDoubleMap() indicates that the page is *possibly* mapped with PTEs. - -For anonymous pages PageDoubleMap() also indicates ->_mapcount in all -subpages is offset up by one. This additional reference is required to -get race-free detection of unmap of subpages when we have them mapped with -both PMDs and PTEs. - -This is optimization required to lower overhead of per-subpage mapcount -tracking. The alternative is alter ->_mapcount in all subpages on each -map/unmap of the whole compound page. - -For anonymous pages, we set PG_double_map when a PMD of the page got split -for the first time, but still have PMD mapping. The additional references -go away with last compound_mapcount. - -File pages get PG_double_map set on first map of the page with PTE and -goes away when the page gets evicted from page cache. - -split_huge_page internally has to distribute the refcounts in the head -page to the tail pages before clearing all PG_head/tail bits from the page -structures. It can be done easily for refcounts taken by page table -entries. But we don't have enough information on how to distribute any -additional pins (i.e. from get_user_pages). split_huge_page() fails any -requests to split pinned huge page: it expects page count to be equal to -sum of mapcount of all sub-pages plus one (split_huge_page caller must -have reference for head page). - -split_huge_page uses migration entries to stabilize page->_refcount and -page->_mapcount of anonymous pages. File pages just got unmapped. - -We safe against physical memory scanners too: the only legitimate way -scanner can get reference to a page is get_page_unless_zero(). - -All tail pages have zero ->_refcount until atomic_add(). This prevents the -scanner from getting a reference to the tail page up to that point. After the -atomic_add() we don't care about the ->_refcount value. We already known how -many references should be uncharged from the head page. - -For head page get_page_unless_zero() will succeed and we don't mind. It's -clear where reference should go after split: it will stay on head page. - -Note that split_huge_pmd() doesn't have any limitation on refcounting: -pmd can be split at any point and never fails. - -== Partial unmap and deferred_split_huge_page() == - -Unmapping part of THP (with munmap() or other way) is not going to free -memory immediately. Instead, we detect that a subpage of THP is not in use -in page_remove_rmap() and queue the THP for splitting if memory pressure -comes. Splitting will free up unused subpages. - -Splitting the page right away is not an option due to locking context in -the place where we can detect partial unmap. It's also might be -counterproductive since in many cases partial unmap happens during exit(2) if -a THP crosses a VMA boundary. - -Function deferred_split_huge_page() is used to queue page for splitting. -The splitting itself will happen when we get memory pressure via shrinker -interface. diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.rst index e14718572476..fdd84cb8d511 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.rst @@ -1,37 +1,13 @@ - ============================== - UNEVICTABLE LRU INFRASTRUCTURE - ============================== - -======== -CONTENTS -======== - - (*) The Unevictable LRU - - - The unevictable page list. - - Memory control group interaction. - - Marking address spaces unevictable. - - Detecting Unevictable Pages. - - vmscan's handling of unevictable pages. - - (*) mlock()'d pages. - - - History. - - Basic management. - - mlock()/mlockall() system call handling. - - Filtering special vmas. - - munlock()/munlockall() system call handling. - - Migrating mlocked pages. - - Compacting mlocked pages. - - mmap(MAP_LOCKED) system call handling. - - munmap()/exit()/exec() system call handling. - - try_to_unmap(). - - try_to_munlock() reverse map scan. - - Page reclaim in shrink_*_list(). +.. _unevictable_lru: +============================== +Unevictable LRU Infrastructure +============================== -============ -INTRODUCTION +.. contents:: :local: + + +Introduction ============ This document describes the Linux memory manager's "Unevictable LRU" @@ -46,8 +22,8 @@ details - the "what does it do?" - by reading the code. One hopes that the descriptions below add value by provide the answer to "why does it do that?". -=================== -THE UNEVICTABLE LRU + +The Unevictable LRU =================== The Unevictable LRU facility adds an additional LRU list to track unevictable @@ -66,17 +42,17 @@ completely unresponsive. The unevictable list addresses the following classes of unevictable pages: - (*) Those owned by ramfs. + * Those owned by ramfs. - (*) Those mapped into SHM_LOCK'd shared memory regions. + * Those mapped into SHM_LOCK'd shared memory regions. - (*) Those mapped into VM_LOCKED [mlock()ed] VMAs. + * Those mapped into VM_LOCKED [mlock()ed] VMAs. The infrastructure may also be able to handle other conditions that make pages unevictable, either by definition or by circumstance, in the future. -THE UNEVICTABLE PAGE LIST +The Unevictable Page List ------------------------- The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list @@ -118,7 +94,7 @@ the unevictable list when one task has the page isolated from the LRU and other tasks are changing the "evictability" state of the page. -MEMORY CONTROL GROUP INTERACTION +Memory Control Group Interaction -------------------------------- The unevictable LRU facility interacts with the memory control group [aka @@ -144,7 +120,9 @@ effects: the control group to thrash or to OOM-kill tasks. -MARKING ADDRESS SPACES UNEVICTABLE +.. _mark_addr_space_unevict: + +Marking Address Spaces Unevictable ---------------------------------- For facilities such as ramfs none of the pages attached to the address space @@ -152,15 +130,15 @@ may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE address space flag is provided, and this can be manipulated by a filesystem using a number of wrapper functions: - (*) void mapping_set_unevictable(struct address_space *mapping); + * ``void mapping_set_unevictable(struct address_space *mapping);`` Mark the address space as being completely unevictable. - (*) void mapping_clear_unevictable(struct address_space *mapping); + * ``void mapping_clear_unevictable(struct address_space *mapping);`` Mark the address space as being evictable. - (*) int mapping_unevictable(struct address_space *mapping); + * ``int mapping_unevictable(struct address_space *mapping);`` Query the address space, and return true if it is completely unevictable. @@ -177,12 +155,13 @@ These are currently used in two places in the kernel: ensure they're in memory. -DETECTING UNEVICTABLE PAGES +Detecting Unevictable Pages --------------------------- The function page_evictable() in vmscan.c determines whether a page is -evictable or not using the query function outlined above [see section "Marking -address spaces unevictable"] to check the AS_UNEVICTABLE flag. +evictable or not using the query function outlined above [see section +:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`] +to check the AS_UNEVICTABLE flag. For address spaces that are so marked after being populated (as SHM regions might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate @@ -202,7 +181,7 @@ flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED. -VMSCAN'S HANDLING OF UNEVICTABLE PAGES +Vmscan's Handling of Unevictable Pages -------------------------------------- If unevictable pages are culled in the fault path, or moved to the unevictable @@ -233,8 +212,7 @@ extra evictabilty checks should not occur in the majority of calls to putback_lru_page(). -============= -MLOCKED PAGES +MLOCKED Pages ============= The unevictable page list is also useful for mlock(), in addition to ramfs and @@ -242,7 +220,7 @@ SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in NOMMU situations, all mappings are effectively mlocked. -HISTORY +History ------- The "Unevictable mlocked Pages" infrastructure is based on work originally @@ -263,7 +241,7 @@ replaced by walking the reverse map to determine whether any VM_LOCKED VMAs mapped the page. More on this below. -BASIC MANAGEMENT +Basic Management ---------------- mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable @@ -304,10 +282,10 @@ mlocked pages become unlocked and rescued from the unevictable list when: (4) before a page is COW'd in a VM_LOCKED VMA. -mlock()/mlockall() SYSTEM CALL HANDLING +mlock()/mlockall() System Call Handling --------------------------------------- -Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() +Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup() for each VMA in the range specified by the call. In the case of mlockall(), this is the entire active address space of the task. Note that mlock_fixup() is used for both mlocking and munlocking a range of memory. A call to mlock() @@ -351,7 +329,7 @@ mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle it later if and when it attempts to reclaim the page. -FILTERING SPECIAL VMAS +Filtering Special VMAs ---------------------- mlock_fixup() filters several classes of "special" VMAs: @@ -379,8 +357,9 @@ VM_LOCKED flag. Therefore, we won't have to deal with them later during munlock(), munmap() or task exit. Neither does mlock_fixup() account these VMAs against the task's "locked_vm". +.. _munlock_munlockall_handling: -munlock()/munlockall() SYSTEM CALL HANDLING +munlock()/munlockall() System Call Handling ------------------------------------------- The munlock() and munlockall() system calls are handled by the same functions - @@ -426,7 +405,7 @@ This is fine, because we'll catch it later if and if vmscan tries to reclaim the page. This should be relatively rare. -MIGRATING MLOCKED PAGES +Migrating MLOCKED Pages ----------------------- A page that is being migrated has been isolated from the LRU lists and is held @@ -451,7 +430,7 @@ list because of a race between munlock and migration, page migration uses the putback_lru_page() function to add migrated pages back to the LRU. -COMPACTING MLOCKED PAGES +Compacting MLOCKED Pages ------------------------ The unevictable LRU can be scanned for compactable regions and the default @@ -461,7 +440,7 @@ unevictable LRU is enabled, the work of compaction is mostly handled by the page migration code and the same work flow as described in MIGRATING MLOCKED PAGES will apply. -MLOCKING TRANSPARENT HUGE PAGES +MLOCKING Transparent Huge Pages ------------------------------- A transparent huge page is represented by a single entry on an LRU list. @@ -483,7 +462,7 @@ to unevictable LRU and the rest can be reclaimed. See also comment in follow_trans_huge_pmd(). -mmap(MAP_LOCKED) SYSTEM CALL HANDLING +mmap(MAP_LOCKED) System Call Handling ------------------------------------- In addition the mlock()/mlockall() system calls, an application can request @@ -514,7 +493,7 @@ memory range accounted as locked_vm, as the protections could be changed later and pages allocated into that region. -munmap()/exit()/exec() SYSTEM CALL HANDLING +munmap()/exit()/exec() System Call Handling ------------------------------------------- When unmapping an mlocked region of memory, whether by an explicit call to @@ -568,16 +547,18 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim, holepunching, and truncation of file pages and their anonymous COWed pages. -try_to_munlock() REVERSE MAP SCAN +try_to_munlock() Reverse Map Scan --------------------------------- - [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the - page_referenced() reverse map walker. +.. warning:: + [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the + page_referenced() reverse map walker. -When munlock_vma_page() [see section "munlock()/munlockall() System Call -Handling" above] tries to munlock a page, it needs to determine whether or not -the page is mapped by any VM_LOCKED VMA without actually attempting to unmap -all PTEs from the page. For this purpose, the unevictable/mlock infrastructure +When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call +Handling <munlock_munlockall_handling>` above] tries to munlock a +page, it needs to determine whether or not the page is mapped by any +VM_LOCKED VMA without actually attempting to unmap all PTEs from the +page. For this purpose, the unevictable/mlock infrastructure introduced a variant of try_to_unmap() called try_to_munlock(). try_to_munlock() calls the same functions as try_to_unmap() for anonymous and @@ -595,7 +576,7 @@ large region or tearing down a large address space that has been mlocked via mlockall(), overall this is a fairly rare event. -PAGE RECLAIM IN shrink_*_list() +Page Reclaim in shrink_*_list() ------------------------------- shrink_active_list() culls any obviously unevictable pages - i.e. diff --git a/Documentation/vm/z3fold.txt b/Documentation/vm/z3fold.rst index 38e4dac810b6..224e3c61d686 100644 --- a/Documentation/vm/z3fold.txt +++ b/Documentation/vm/z3fold.rst @@ -1,5 +1,8 @@ +.. _z3fold: + +====== z3fold ------- +====== z3fold is a special purpose allocator for storing compressed pages. It is designed to store up to three compressed pages per physical page. @@ -7,6 +10,7 @@ It is a zbud derivative which allows for higher compression ratio keeping the simplicity and determinism of its predecessor. The main differences between z3fold and zbud are: + * unlike zbud, z3fold allows for up to PAGE_SIZE allocations * z3fold can hold up to 3 compressed pages in its page * z3fold doesn't export any API itself and is thus intended to be used diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.rst index 64ed63c4f69d..6e79893d6132 100644 --- a/Documentation/vm/zsmalloc.txt +++ b/Documentation/vm/zsmalloc.rst @@ -1,5 +1,8 @@ +.. _zsmalloc: + +======== zsmalloc --------- +======== This allocator is designed for use with zram. Thus, the allocator is supposed to work well under low memory conditions. In particular, it @@ -31,40 +34,49 @@ be mapped using zs_map_object() to get a usable pointer and subsequently unmapped using zs_unmap_object(). stat ----- +==== With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via -/sys/kernel/debug/zsmalloc/<user name>. Here is a sample of stat output: +``/sys/kernel/debug/zsmalloc/<user name>``. Here is a sample of stat output:: -# cat /sys/kernel/debug/zsmalloc/zram0/classes + # cat /sys/kernel/debug/zsmalloc/zram0/classes class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage - .. - .. + ... + ... 9 176 0 1 186 129 8 4 10 192 1 0 2880 2872 135 3 11 208 0 1 819 795 42 2 12 224 0 1 219 159 12 4 - .. - .. + ... + ... + +class + index +size + object size zspage stores +almost_empty + the number of ZS_ALMOST_EMPTY zspages(see below) +almost_full + the number of ZS_ALMOST_FULL zspages(see below) +obj_allocated + the number of objects allocated +obj_used + the number of objects allocated to the user +pages_used + the number of pages allocated for the class +pages_per_zspage + the number of 0-order pages to make a zspage -class: index -size: object size zspage stores -almost_empty: the number of ZS_ALMOST_EMPTY zspages(see below) -almost_full: the number of ZS_ALMOST_FULL zspages(see below) -obj_allocated: the number of objects allocated -obj_used: the number of objects allocated to the user -pages_used: the number of pages allocated for the class -pages_per_zspage: the number of 0-order pages to make a zspage +We assign a zspage to ZS_ALMOST_EMPTY fullness group when n <= N / f, where -We assign a zspage to ZS_ALMOST_EMPTY fullness group when: - n <= N / f, where -n = number of allocated objects -N = total number of objects zspage can store -f = fullness_threshold_frac(ie, 4 at the moment) +* n = number of allocated objects +* N = total number of objects zspage can store +* f = fullness_threshold_frac(ie, 4 at the moment) Similarly, we assign zspage to: - ZS_ALMOST_FULL when n > N / f - ZS_EMPTY when n == 0 - ZS_FULL when n == N + +* ZS_ALMOST_FULL when n > N / f +* ZS_EMPTY when n == 0 +* ZS_FULL when n == N diff --git a/Documentation/vm/zswap.txt b/Documentation/vm/zswap.rst index 0b3a1148f9f0..1444ecd40911 100644 --- a/Documentation/vm/zswap.txt +++ b/Documentation/vm/zswap.rst @@ -1,4 +1,11 @@ -Overview: +.. _zswap: + +===== +zswap +===== + +Overview +======== Zswap is a lightweight compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a @@ -7,32 +14,34 @@ for potentially reduced swap I/O. This trade-off can also result in a significant performance improvement if reads from the compressed cache are faster than reads from a swap device. -NOTE: Zswap is a new feature as of v3.11 and interacts heavily with memory -reclaim. This interaction has not been fully explored on the large set of -potential configurations and workloads that exist. For this reason, zswap -is a work in progress and should be considered experimental. +.. note:: + Zswap is a new feature as of v3.11 and interacts heavily with memory + reclaim. This interaction has not been fully explored on the large set of + potential configurations and workloads that exist. For this reason, zswap + is a work in progress and should be considered experimental. + + Some potential benefits: -Some potential benefits: * Desktop/laptop users with limited RAM capacities can mitigate the - performance impact of swapping. + performance impact of swapping. * Overcommitted guests that share a common I/O resource can - dramatically reduce their swap I/O pressure, avoiding heavy handed I/O - throttling by the hypervisor. This allows more work to get done with less - impact to the guest workload and guests sharing the I/O subsystem + dramatically reduce their swap I/O pressure, avoiding heavy handed I/O + throttling by the hypervisor. This allows more work to get done with less + impact to the guest workload and guests sharing the I/O subsystem * Users with SSDs as swap devices can extend the life of the device by - drastically reducing life-shortening writes. + drastically reducing life-shortening writes. Zswap evicts pages from compressed cache on an LRU basis to the backing swap device when the compressed pool reaches its size limit. This requirement had been identified in prior community discussions. Zswap is disabled by default but can be enabled at boot time by setting -the "enabled" attribute to 1 at boot time. ie: zswap.enabled=1. Zswap +the ``enabled`` attribute to 1 at boot time. ie: ``zswap.enabled=1``. Zswap can also be enabled and disabled at runtime using the sysfs interface. An example command to enable zswap at runtime, assuming sysfs is mounted -at /sys, is: +at ``/sys``, is:: -echo 1 > /sys/module/zswap/parameters/enabled + echo 1 > /sys/module/zswap/parameters/enabled When zswap is disabled at runtime it will stop storing pages that are being swapped out. However, it will _not_ immediately write out or fault @@ -43,7 +52,8 @@ pages out of the compressed pool, a swapoff on the swap device(s) will fault back into memory all swapped out pages, including those in the compressed pool. -Design: +Design +====== Zswap receives pages for compression through the Frontswap API and is able to evict pages from its own compressed pool on an LRU basis and write them back to @@ -53,12 +63,12 @@ Zswap makes use of zpool for the managing the compressed memory pool. Each allocation in zpool is not directly accessible by address. Rather, a handle is returned by the allocation routine and that handle must be mapped before being accessed. The compressed memory pool grows on demand and shrinks as compressed -pages are freed. The pool is not preallocated. By default, a zpool of type -zbud is created, but it can be selected at boot time by setting the "zpool" -attribute, e.g. zswap.zpool=zbud. It can also be changed at runtime using the -sysfs "zpool" attribute, e.g. +pages are freed. The pool is not preallocated. By default, a zpool +of type zbud is created, but it can be selected at boot time by +setting the ``zpool`` attribute, e.g. ``zswap.zpool=zbud``. It can +also be changed at runtime using the sysfs ``zpool`` attribute, e.g.:: -echo zbud > /sys/module/zswap/parameters/zpool + echo zbud > /sys/module/zswap/parameters/zpool The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which means the compression ratio will always be 2:1 or worse (because of half-full @@ -83,14 +93,16 @@ via frontswap, to free the compressed entry. Zswap seeks to be simple in its policies. Sysfs attributes allow for one user controlled policy: + * max_pool_percent - The maximum percentage of memory that the compressed - pool can occupy. + pool can occupy. -The default compressor is lzo, but it can be selected at boot time by setting -the “compressor” attribute, e.g. zswap.compressor=lzo. It can also be changed -at runtime using the sysfs "compressor" attribute, e.g. +The default compressor is lzo, but it can be selected at boot time by +setting the ``compressor`` attribute, e.g. ``zswap.compressor=lzo``. +It can also be changed at runtime using the sysfs "compressor" +attribute, e.g.:: -echo lzo > /sys/module/zswap/parameters/compressor + echo lzo > /sys/module/zswap/parameters/compressor When the zpool and/or compressor parameter is changed at runtime, any existing compressed pages are not modified; they are left in their own zpool. When a @@ -106,11 +118,12 @@ compressed length of the page is set to zero and the pattern or same-filled value is stored. Same-value filled pages identification feature is enabled by default and can be -disabled at boot time by setting the "same_filled_pages_enabled" attribute to 0, -e.g. zswap.same_filled_pages_enabled=0. It can also be enabled and disabled at -runtime using the sysfs "same_filled_pages_enabled" attribute, e.g. +disabled at boot time by setting the ``same_filled_pages_enabled`` attribute +to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and +disabled at runtime using the sysfs ``same_filled_pages_enabled`` +attribute, e.g.:: -echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled + echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled When zswap same-filled page identification is disabled at runtime, it will stop checking for the same-value filled pages during store operation. However, the diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt index 71c30984e94d..a16aa2113840 100644 --- a/Documentation/x86/intel_rdt_ui.txt +++ b/Documentation/x86/intel_rdt_ui.txt @@ -17,12 +17,14 @@ MBA (Memory Bandwidth Allocation) - "mba" To use the feature mount the file system: - # mount -t resctrl resctrl [-o cdp[,cdpl2]] /sys/fs/resctrl + # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl mount options are: "cdp": Enable code/data prioritization in L3 cache allocations. "cdpl2": Enable code/data prioritization in L2 cache allocations. +"mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA + bandwidth in MBps L2 and L3 CDP are controlled seperately. @@ -270,10 +272,11 @@ and 0xA are not. On a system with a 20-bit mask each bit represents 5% of the capacity of the cache. You could partition the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. -Memory bandwidth(b/w) percentage --------------------------------- -For Memory b/w resource, user controls the resource by indicating the -percentage of total memory b/w. +Memory bandwidth Allocation and monitoring +------------------------------------------ + +For Memory bandwidth resource, by default the user controls the resource +by indicating the percentage of total memory bandwidth. The minimum bandwidth percentage value for each cpu model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth @@ -285,7 +288,47 @@ to the next control step available on the hardware. The bandwidth throttling is a core specific mechanism on some of Intel SKUs. Using a high bandwidth and a low bandwidth setting on two threads sharing a core will result in both threads being throttled to use the -low bandwidth. +low bandwidth. The fact that Memory bandwidth allocation(MBA) is a core +specific mechanism where as memory bandwidth monitoring(MBM) is done at +the package level may lead to confusion when users try to apply control +via the MBA and then monitor the bandwidth to see if the controls are +effective. Below are such scenarios: + +1. User may *not* see increase in actual bandwidth when percentage + values are increased: + +This can occur when aggregate L2 external bandwidth is more than L3 +external bandwidth. Consider an SKL SKU with 24 cores on a package and +where L2 external is 10GBps (hence aggregate L2 external bandwidth is +240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20 +threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3 +bandwidth of 100GBps although the percentage value specified is only 50% +<< 100%. Hence increasing the bandwidth percentage will not yeild any +more bandwidth. This is because although the L2 external bandwidth still +has capacity, the L3 external bandwidth is fully used. Also note that +this would be dependent on number of cores the benchmark is run on. + +2. Same bandwidth percentage may mean different actual bandwidth + depending on # of threads: + +For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 +thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although +they have same percentage bandwidth of 10%. This is simply because as +threads start using more cores in an rdtgroup, the actual bandwidth may +increase or vary although user specified bandwidth percentage is same. + +In order to mitigate this and make the interface more user friendly, +resctrl added support for specifying the bandwidth in MBps as well. The +kernel underneath would use a software feedback mechanism or a "Software +Controller(mba_sc)" which reads the actual bandwidth using MBM counters +and adjust the memowy bandwidth percentages to ensure + + "actual bandwidth < user specified bandwidth". + +By default, the schemata would take the bandwidth percentage values +where as user can switch to the "MBA software controller" mode using +a mount option 'mba_MBps'. The schemata format is specified in the below +sections. L3 schemata file details (code and data prioritization disabled) ---------------------------------------------------------------- @@ -308,13 +351,20 @@ schemata format is always: L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... -Memory b/w Allocation details ------------------------------ +Memory bandwidth Allocation (default mode) +------------------------------------------ Memory b/w domain is L3 cache. MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... +Memory bandwidth Allocation specified in MBps +--------------------------------------------- + +Memory bandwidth domain is L3 cache. + + MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;... + Reading/writing the schemata file --------------------------------- Reading the schemata file will show the state of all resources @@ -358,6 +408,15 @@ allocations can overlap or not. The allocations specifies the maximum b/w that the group may be able to use and the system admin can configure the b/w accordingly. +If the MBA is specified in MB(megabytes) then user can enter the max b/w in MB +rather than the percentage values. + +# echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata +# echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata + +In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w +of 1024MB where as on socket 1 they would use 500MB. + Example 2 --------- Again two sockets, but this time with a more realistic 20-bit mask. diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt index b297c48389b9..8d109ef67ab6 100644 --- a/Documentation/x86/x86_64/boot-options.txt +++ b/Documentation/x86/x86_64/boot-options.txt @@ -187,9 +187,9 @@ PCI IOMMU (input/output memory management unit) - Currently four x86-64 PCI-DMA mapping implementations exist: + Multiple x86-64 PCI-DMA mapping implementations exist, for example: - 1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all + 1. <lib/dma-direct.c>: use no hardware/software IOMMU at all (e.g. because you have < 3 GB memory). Kernel boot message: "PCI-DMA: Disabling IOMMU" @@ -208,7 +208,7 @@ IOMMU (input/output memory management unit) Kernel boot message: "PCI-DMA: Using Calgary IOMMU" iommu=[<size>][,noagp][,off][,force][,noforce][,leak[=<nr_of_leak_pages>] - [,memaper[=<order>]][,merge][,forcesac][,fullflush][,nomerge] + [,memaper[=<order>]][,merge][,fullflush][,nomerge] [,noaperture][,calgary] General iommu options: @@ -235,14 +235,7 @@ IOMMU (input/output memory management unit) (experimental). nomerge Don't do scatter-gather (SG) merging. noaperture Ask the IOMMU not to touch the aperture for AGP. - forcesac Force single-address cycle (SAC) mode for masks <40bits - (experimental). noagp Don't initialize the AGP driver and use full aperture. - allowdac Allow double-address cycle (DAC) mode, i.e. DMA >4GB. - DAC is used with 32-bit PCI to push a 64-bit address in - two cycles. When off all DMA over >4GB is forced through - an IOMMU or software bounce buffering. - nodac Forbid DAC mode, i.e. DMA >4GB. panic Always panic when IOMMU overflows. calgary Use the Calgary IOMMU if it is available |