summaryrefslogtreecommitdiffstats
path: root/Documentation/x86/x86_64
diff options
context:
space:
mode:
authorJonathan Corbet <corbet@lwn.net>2023-03-15 00:06:44 +0100
committerJonathan Corbet <corbet@lwn.net>2023-03-30 20:58:51 +0200
commitff61f0791ce969d2db6c9f3b71d74ceec0a2e958 (patch)
treefe32be44aaf65f9c436a8f37cd4a18f6ec47c3cb /Documentation/x86/x86_64
parentDocumentation: kernel-parameters: Remove meye entry (diff)
downloadlinux-ff61f0791ce969d2db6c9f3b71d74ceec0a2e958.tar.xz
linux-ff61f0791ce969d2db6c9f3b71d74ceec0a2e958.zip
docs: move x86 documentation into Documentation/arch/
Move the x86 documentation under Documentation/arch/ as a way of cleaning up the top-level directory and making the structure of our docs more closely match the structure of the source directories it describes. All in-kernel references to the old paths have been updated. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20230315211523.108836-1-corbet@lwn.net/ Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/x86/x86_64')
-rw-r--r--Documentation/x86/x86_64/5level-paging.rst67
-rw-r--r--Documentation/x86/x86_64/boot-options.rst319
-rw-r--r--Documentation/x86/x86_64/cpu-hotplug-spec.rst24
-rw-r--r--Documentation/x86/x86_64/fake-numa-for-cpusets.rst78
-rw-r--r--Documentation/x86/x86_64/fsgs.rst199
-rw-r--r--Documentation/x86/x86_64/index.rst17
-rw-r--r--Documentation/x86/x86_64/machinecheck.rst33
-rw-r--r--Documentation/x86/x86_64/mm.rst157
-rw-r--r--Documentation/x86/x86_64/uefi.rst58
9 files changed, 0 insertions, 952 deletions
diff --git a/Documentation/x86/x86_64/5level-paging.rst b/Documentation/x86/x86_64/5level-paging.rst
deleted file mode 100644
index b792bbdc0b01..000000000000
--- a/Documentation/x86/x86_64/5level-paging.rst
+++ /dev/null
@@ -1,67 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============
-5-level paging
-==============
-
-Overview
-========
-Original x86-64 was limited by 4-level paging to 256 TiB of virtual address
-space and 64 TiB of physical address space. We are already bumping into
-this limit: some vendors offer servers with 64 TiB of memory today.
-
-To overcome the limitation upcoming hardware will introduce support for
-5-level paging. It is a straight-forward extension of the current page
-table structure adding one more layer of translation.
-
-It bumps the limits to 128 PiB of virtual address space and 4 PiB of
-physical address space. This "ought to be enough for anybody" ©.
-
-QEMU 2.9 and later support 5-level paging.
-
-Virtual memory layout for 5-level paging is described in
-Documentation/x86/x86_64/mm.rst
-
-
-Enabling 5-level paging
-=======================
-CONFIG_X86_5LEVEL=y enables the feature.
-
-Kernel with CONFIG_X86_5LEVEL=y still able to boot on 4-level hardware.
-In this case additional page table level -- p4d -- will be folded at
-runtime.
-
-User-space and large virtual address space
-==========================================
-On x86, 5-level paging enables 56-bit userspace virtual address space.
-Not all user space is ready to handle wide addresses. It's known that
-at least some JIT compilers use higher bits in pointers to encode their
-information. It collides with valid pointers with 5-level paging and
-leads to crashes.
-
-To mitigate this, we are not going to allocate virtual address space
-above 47-bit by default.
-
-But userspace can ask for allocation from full address space by
-specifying hint address (with or without MAP_FIXED) above 47-bits.
-
-If hint address set above 47-bit, but MAP_FIXED is not specified, we try
-to look for unmapped area by specified address. If it's already
-occupied, we look for unmapped area in *full* address space, rather than
-from 47-bit window.
-
-A high hint address would only affect the allocation in question, but not
-any future mmap()s.
-
-Specifying high hint address on older kernel or on machine without 5-level
-paging support is safe. The hint will be ignored and kernel will fall back
-to allocation from 47-bit address space.
-
-This approach helps to easily make application's memory allocator aware
-about large address space without manually tracking allocated virtual
-address space.
-
-One important case we need to handle here is interaction with MPX.
-MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
-need to make sure that MPX cannot be enabled we already have VMA above
-the boundary and forbid creating such VMAs once MPX is enabled.
diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
deleted file mode 100644
index cbd14124a667..000000000000
--- a/Documentation/x86/x86_64/boot-options.rst
+++ /dev/null
@@ -1,319 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================
-AMD64 Specific Boot Options
-===========================
-
-There are many others (usually documented in driver documentation), but
-only the AMD64 specific ones are listed here.
-
-Machine check
-=============
-Please see Documentation/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
-
- mce=off
- Disable machine check
- mce=no_cmci
- Disable CMCI(Corrected Machine Check Interrupt) that
- Intel processor supports. Usually this disablement is
- not recommended, but it might be handy if your hardware
- is misbehaving.
- Note that you'll get more problems without CMCI than with
- due to the shared banks, i.e. you might get duplicated
- error logs.
- mce=dont_log_ce
- Don't make logs for corrected errors. All events reported
- as corrected are silently cleared by OS.
- This option will be useful if you have no interest in any
- of corrected errors.
- mce=ignore_ce
- Disable features for corrected errors, e.g. polling timer
- and CMCI. All events reported as corrected are not cleared
- by OS and remained in its error banks.
- Usually this disablement is not recommended, however if
- there is an agent checking/clearing corrected errors
- (e.g. BIOS or hardware monitoring applications), conflicting
- with OS's error handling, and you cannot deactivate the agent,
- then this option will be a help.
- mce=no_lmce
- Do not opt-in to Local MCE delivery. Use legacy method
- to broadcast MCEs.
- mce=bootlog
- Enable logging of machine checks left over from booting.
- Disabled by default on AMD Fam10h and older because some BIOS
- leave bogus ones.
- If your BIOS doesn't do that it's a good idea to enable though
- to make sure you log even machine check events that result
- in a reboot. On Intel systems it is enabled by default.
- mce=nobootlog
- Disable boot machine check logging.
- mce=monarchtimeout (number)
- monarchtimeout:
- Sets the time in us to wait for other CPUs on machine checks. 0
- to disable.
- mce=bios_cmci_threshold
- Don't overwrite the bios-set CMCI threshold. This boot option
- prevents Linux from overwriting the CMCI threshold set by the
- bios. Without this option, Linux always sets the CMCI
- threshold to 1. Enabling this may make memory predictive failure
- analysis less effective if the bios sets thresholds for memory
- errors since we will not see details for all errors.
- mce=recovery
- Force-enable recoverable machine check code paths
-
- nomce (for compatibility with i386)
- same as mce=off
-
- Everything else is in sysfs now.
-
-APICs
-=====
-
- apic
- Use IO-APIC. Default
-
- noapic
- Don't use the IO-APIC.
-
- disableapic
- Don't use the local APIC
-
- nolapic
- Don't use the local APIC (alias for i386 compatibility)
-
- pirq=...
- See Documentation/x86/i386/IO-APIC.rst
-
- noapictimer
- Don't set up the APIC timer
-
- no_timer_check
- Don't check the IO-APIC timer. This can work around
- problems with incorrect timer initialization on some boards.
-
- apicpmtimer
- Do APIC timer calibration using the pmtimer. Implies
- apicmaintimer. Useful when your PIT timer is totally broken.
-
-Timing
-======
-
- notsc
- Deprecated, use tsc=unstable instead.
-
- nohpet
- Don't use the HPET timer.
-
-Idle loop
-=========
-
- idle=poll
- Don't do power saving in the idle loop using HLT, but poll for rescheduling
- event. This will make the CPUs eat a lot more power, but may be useful
- to get slightly better performance in multiprocessor benchmarks. It also
- makes some profiling using performance counters more accurate.
- Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
- CPUs) this option has no performance advantage over the normal idle loop.
- It may also interact badly with hyperthreading.
-
-Rebooting
-=========
-
- reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old]
- bios
- Use the CPU reboot vector for warm reset
- warm
- Don't set the cold reboot flag
- cold
- Set the cold reboot flag
- triple
- Force a triple fault (init)
- kbd
- Use the keyboard controller. cold reset (default)
- acpi
- Use the ACPI RESET_REG in the FADT. If ACPI is not configured or
- the ACPI reset does not work, the reboot path attempts the reset
- using the keyboard controller.
- efi
- Use efi reset_system runtime service. If EFI is not configured or
- the EFI reset does not work, the reboot path attempts the reset using
- the keyboard controller.
- pci
- Use a write to the PCI config space register 0xcf9 to trigger reboot.
-
- Using warm reset will be much faster especially on big memory
- systems because the BIOS will not go through the memory check.
- Disadvantage is that not all hardware will be completely reinitialized
- on reboot so there may be boot problems on some systems.
-
- reboot=force
- Don't stop other CPUs on reboot. This can make reboot more reliable
- in some cases.
-
- reboot=default
- There are some built-in platform specific "quirks" - you may see:
- "reboot: <name> series board detected. Selecting <type> for reboots."
- In the case where you think the quirk is in error (e.g. you have
- newer BIOS, or newer board) using this option will ignore the built-in
- quirk table, and use the generic default reboot actions.
-
-NUMA
-====
-
- numa=off
- Only set up a single NUMA node spanning all memory.
-
- numa=noacpi
- Don't parse the SRAT table for NUMA setup
-
- numa=nohmat
- Don't parse the HMAT table for NUMA setup, or soft-reserved memory
- partitioning.
-
- numa=fake=<size>[MG]
- If given as a memory unit, fills all system RAM with nodes of
- size interleaved over physical nodes.
-
- numa=fake=<N>
- If given as an integer, fills all system RAM with N fake nodes
- interleaved over physical nodes.
-
- numa=fake=<N>U
- If given as an integer followed by 'U', it will divide each
- physical node into N emulated nodes.
-
-ACPI
-====
-
- acpi=off
- Don't enable ACPI
- acpi=ht
- Use ACPI boot table parsing, but don't enable ACPI interpreter
- acpi=force
- Force ACPI on (currently not needed)
- acpi=strict
- Disable out of spec ACPI workarounds.
- acpi_sci={edge,level,high,low}
- Set up ACPI SCI interrupt.
- acpi=noirq
- Don't route interrupts
- acpi=nocmcff
- Disable firmware first mode for corrected errors. This
- disables parsing the HEST CMC error source to check if
- firmware has set the FF flag. This may result in
- duplicate corrected error reports.
-
-PCI
-===
-
- pci=off
- Don't use PCI
- pci=conf1
- Use conf1 access.
- pci=conf2
- Use conf2 access.
- pci=rom
- Assign ROMs.
- pci=assign-busses
- Assign busses
- pci=irqmask=MASK
- Set PCI interrupt mask to MASK
- pci=lastbus=NUMBER
- Scan up to NUMBER busses, no matter what the mptable says.
- pci=noacpi
- Don't use ACPI to set up PCI interrupt routing.
-
-IOMMU (input/output memory management unit)
-===========================================
-Multiple x86-64 PCI-DMA mapping implementations exist, for example:
-
- 1. <kernel/dma/direct.c>: use no hardware/software IOMMU at all
- (e.g. because you have < 3 GB memory).
- Kernel boot message: "PCI-DMA: Disabling IOMMU"
-
- 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
- Kernel boot message: "PCI-DMA: using GART IOMMU"
-
- 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
- e.g. if there is no hardware IOMMU in the system and it is need because
- you have >3GB memory or told the kernel to us it (iommu=soft))
- Kernel boot message: "PCI-DMA: Using software bounce buffering
- for IO (SWIOTLB)"
-
-::
-
- iommu=[<size>][,noagp][,off][,force][,noforce]
- [,memaper[=<order>]][,merge][,fullflush][,nomerge]
- [,noaperture]
-
-General iommu options:
-
- off
- Don't initialize and use any kind of IOMMU.
- noforce
- Don't force hardware IOMMU usage when it is not needed. (default).
- force
- Force the use of the hardware IOMMU even when it is
- not actually needed (e.g. because < 3 GB memory).
- soft
- Use software bounce buffering (SWIOTLB) (default for
- Intel machines). This can be used to prevent the usage
- of an available hardware IOMMU.
-
-iommu options only relevant to the AMD GART hardware IOMMU:
-
- <size>
- Set the size of the remapping area in bytes.
- allowed
- Overwrite iommu off workarounds for specific chipsets.
- fullflush
- Flush IOMMU on each allocation (default).
- nofullflush
- Don't use IOMMU fullflush.
- memaper[=<order>]
- Allocate an own aperture over RAM with size 32MB<<order.
- (default: order=1, i.e. 64MB)
- merge
- Do scatter-gather (SG) merging. Implies "force" (experimental).
- nomerge
- Don't do scatter-gather (SG) merging.
- noaperture
- Ask the IOMMU not to touch the aperture for AGP.
- noagp
- Don't initialize the AGP driver and use full aperture.
- panic
- Always panic when IOMMU overflows.
-
-iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
-implementation:
-
- swiotlb=<slots>[,force,noforce]
- <slots>
- Prereserve that many 2K slots for the software IO bounce buffering.
- force
- Force all IO through the software TLB.
- noforce
- Do not initialize the software TLB.
-
-
-Miscellaneous
-=============
-
- nogbpages
- Do not use GB pages for kernel direct mappings.
- gbpages
- Use GB pages for kernel direct mappings.
-
-
-AMD SEV (Secure Encrypted Virtualization)
-=========================================
-Options relating to AMD SEV, specified via the following format:
-
-::
-
- sev=option1[,option2]
-
-The available options are:
-
- debug
- Enable debug messages.
diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec.rst b/Documentation/x86/x86_64/cpu-hotplug-spec.rst
deleted file mode 100644
index 8d1c91f0c880..000000000000
--- a/Documentation/x86/x86_64/cpu-hotplug-spec.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===================================================
-Firmware support for CPU hotplug under Linux/x86-64
-===================================================
-
-Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
-know in advance of boot time the maximum number of CPUs that could be plugged
-into the system. ACPI 3.0 currently has no official way to supply
-this information from the firmware to the operating system.
-
-In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the
-ACPI 3.0 specification). ACPI already has the concept of disabled LAPIC
-objects by setting the Enabled bit in the LAPIC object to zero.
-
-For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable
-CPU is already available in the MADT. If the CPU is not available yet
-it should have its LAPIC Enabled bit set to 0. Linux will use the number
-of disabled LAPICs to compute the maximum number of future CPUs.
-
-In the worst case the user can overwrite this choice using a command line
-option (additional_cpus=...), but it is recommended to supply the correct
-number (or a reasonable approximation of it, with erring towards more not less)
-in the MADT to avoid manual configuration.
diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
deleted file mode 100644
index ff9bcfd2cc14..000000000000
--- a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=====================
-Fake NUMA For CPUSets
-=====================
-
-:Author: David Rientjes <rientjes@cs.washington.edu>
-
-Using numa=fake and CPUSets for Resource Management
-
-This document describes how the numa=fake x86_64 command-line option can be used
-in conjunction with cpusets for coarse memory management. Using this feature,
-you can create fake NUMA nodes that represent contiguous chunks of memory and
-assign them to cpusets and their attached tasks. This is a way of limiting the
-amount of system memory that are available to a certain class of tasks.
-
-For more information on the features of cpusets, see
-Documentation/admin-guide/cgroup-v1/cpusets.rst.
-There are a number of different configurations you can use for your needs. For
-more information on the numa=fake command line option and its various ways of
-configuring fake nodes, see Documentation/x86/x86_64/boot-options.rst.
-
-For the purposes of this introduction, we'll assume a very primitive NUMA
-emulation setup of "numa=fake=4*512,". This will split our system memory into
-four equal chunks of 512M each that we can now use to assign to cpusets. As
-you become more familiar with using this combination for resource control,
-you'll determine a better setup to minimize the number of nodes you have to deal
-with.
-
-A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::
-
- Faking node 0 at 0000000000000000-0000000020000000 (512MB)
- Faking node 1 at 0000000020000000-0000000040000000 (512MB)
- Faking node 2 at 0000000040000000-0000000060000000 (512MB)
- Faking node 3 at 0000000060000000-0000000080000000 (512MB)
- ...
- On node 0 totalpages: 130975
- On node 1 totalpages: 131072
- On node 2 totalpages: 131072
- On node 3 totalpages: 131072
-
-Now following the instructions for mounting the cpusets filesystem from
-Documentation/admin-guide/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory
-address spaces) to individual cpusets::
-
- [root@xroads /]# mkdir exampleset
- [root@xroads /]# mount -t cpuset none exampleset
- [root@xroads /]# mkdir exampleset/ddset
- [root@xroads /]# cd exampleset/ddset
- [root@xroads /exampleset/ddset]# echo 0-1 > cpus
- [root@xroads /exampleset/ddset]# echo 0-1 > mems
-
-Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
-memory allocations (1G).
-
-You can now assign tasks to these cpusets to limit the memory resources
-available to them according to the fake nodes assigned as mems::
-
- [root@xroads /exampleset/ddset]# echo $$ > tasks
- [root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
- [1] 13425
-
-Notice the difference between the system memory usage as reported by
-/proc/meminfo between the restricted cpuset case above and the unrestricted
-case (i.e. running the same 'dd' command without assigning it to a fake NUMA
-cpuset):
-
- ======== ============ ==========
- Name Unrestricted Restricted
- ======== ============ ==========
- MemTotal 3091900 kB 3091900 kB
- MemFree 42113 kB 1513236 kB
- ======== ============ ==========
-
-This allows for coarse memory management for the tasks you assign to particular
-cpusets. Since cpusets can form a hierarchy, you can create some pretty
-interesting combinations of use-cases for various classes of tasks for your
-memory management needs.
diff --git a/Documentation/x86/x86_64/fsgs.rst b/Documentation/x86/x86_64/fsgs.rst
deleted file mode 100644
index 50960e09e1f6..000000000000
--- a/Documentation/x86/x86_64/fsgs.rst
+++ /dev/null
@@ -1,199 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Using FS and GS segments in user space applications
-===================================================
-
-The x86 architecture supports segmentation. Instructions which access
-memory can use segment register based addressing mode. The following
-notation is used to address a byte within a segment:
-
- Segment-register:Byte-address
-
-The segment base address is added to the Byte-address to compute the
-resulting virtual address which is accessed. This allows to access multiple
-instances of data with the identical Byte-address, i.e. the same code. The
-selection of a particular instance is purely based on the base-address in
-the segment register.
-
-In 32-bit mode the CPU provides 6 segments, which also support segment
-limits. The limits can be used to enforce address space protections.
-
-In 64-bit mode the CS/SS/DS/ES segments are ignored and the base address is
-always 0 to provide a full 64bit address space. The FS and GS segments are
-still functional in 64-bit mode.
-
-Common FS and GS usage
-------------------------------
-
-The FS segment is commonly used to address Thread Local Storage (TLS). FS
-is usually managed by runtime code or a threading library. Variables
-declared with the '__thread' storage class specifier are instantiated per
-thread and the compiler emits the FS: address prefix for accesses to these
-variables. Each thread has its own FS base address so common code can be
-used without complex address offset calculations to access the per thread
-instances. Applications should not use FS for other purposes when they use
-runtimes or threading libraries which manage the per thread FS.
-
-The GS segment has no common use and can be used freely by
-applications. GCC and Clang support GS based addressing via address space
-identifiers.
-
-Reading and writing the FS/GS base address
-------------------------------------------
-
-There exist two mechanisms to read and write the FS/GS base address:
-
- - the arch_prctl() system call
-
- - the FSGSBASE instruction family
-
-Accessing FS/GS base with arch_prctl()
---------------------------------------
-
- The arch_prctl(2) based mechanism is available on all 64-bit CPUs and all
- kernel versions.
-
- Reading the base:
-
- arch_prctl(ARCH_GET_FS, &fsbase);
- arch_prctl(ARCH_GET_GS, &gsbase);
-
- Writing the base:
-
- arch_prctl(ARCH_SET_FS, fsbase);
- arch_prctl(ARCH_SET_GS, gsbase);
-
- The ARCH_SET_GS prctl may be disabled depending on kernel configuration
- and security settings.
-
-Accessing FS/GS base with the FSGSBASE instructions
----------------------------------------------------
-
- With the Ivy Bridge CPU generation Intel introduced a new set of
- instructions to access the FS and GS base registers directly from user
- space. These instructions are also supported on AMD Family 17H CPUs. The
- following instructions are available:
-
- =============== ===========================
- RDFSBASE %reg Read the FS base register
- RDGSBASE %reg Read the GS base register
- WRFSBASE %reg Write the FS base register
- WRGSBASE %reg Write the GS base register
- =============== ===========================
-
- The instructions avoid the overhead of the arch_prctl() syscall and allow
- more flexible usage of the FS/GS addressing modes in user space
- applications. This does not prevent conflicts between threading libraries
- and runtimes which utilize FS and applications which want to use it for
- their own purpose.
-
-FSGSBASE instructions enablement
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- The instructions are enumerated in CPUID leaf 7, bit 0 of EBX. If
- available /proc/cpuinfo shows 'fsgsbase' in the flag entry of the CPUs.
-
- The availability of the instructions does not enable them
- automatically. The kernel has to enable them explicitly in CR4. The
- reason for this is that older kernels make assumptions about the values in
- the GS register and enforce them when GS base is set via
- arch_prctl(). Allowing user space to write arbitrary values to GS base
- would violate these assumptions and cause malfunction.
-
- On kernels which do not enable FSGSBASE the execution of the FSGSBASE
- instructions will fault with a #UD exception.
-
- The kernel provides reliable information about the enabled state in the
- ELF AUX vector. If the HWCAP2_FSGSBASE bit is set in the AUX vector, the
- kernel has FSGSBASE instructions enabled and applications can use them.
- The following code example shows how this detection works::
-
- #include <sys/auxv.h>
- #include <elf.h>
-
- /* Will be eventually in asm/hwcap.h */
- #ifndef HWCAP2_FSGSBASE
- #define HWCAP2_FSGSBASE (1 << 1)
- #endif
-
- ....
-
- unsigned val = getauxval(AT_HWCAP2);
-
- if (val & HWCAP2_FSGSBASE)
- printf("FSGSBASE enabled\n");
-
-FSGSBASE instructions compiler support
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE
-instructions. Clang 5 supports them as well.
-
- =================== ===========================
- _readfsbase_u64() Read the FS base register
- _readfsbase_u64() Read the GS base register
- _writefsbase_u64() Write the FS base register
- _writegsbase_u64() Write the GS base register
- =================== ===========================
-
-To utilize these instrinsics <immintrin.h> must be included in the source
-code and the compiler option -mfsgsbase has to be added.
-
-Compiler support for FS/GS based addressing
--------------------------------------------
-
-GCC version 6 and newer provide support for FS/GS based addressing via
-Named Address Spaces. GCC implements the following address space
-identifiers for x86:
-
- ========= ====================================
- __seg_fs Variable is addressed relative to FS
- __seg_gs Variable is addressed relative to GS
- ========= ====================================
-
-The preprocessor symbols __SEG_FS and __SEG_GS are defined when these
-address spaces are supported. Code which implements fallback modes should
-check whether these symbols are defined. Usage example::
-
- #ifdef __SEG_GS
-
- long data0 = 0;
- long data1 = 1;
-
- long __seg_gs *ptr;
-
- /* Check whether FSGSBASE is enabled by the kernel (HWCAP2_FSGSBASE) */
- ....
-
- /* Set GS base to point to data0 */
- _writegsbase_u64(&data0);
-
- /* Access offset 0 of GS */
- ptr = 0;
- printf("data0 = %ld\n", *ptr);
-
- /* Set GS base to point to data1 */
- _writegsbase_u64(&data1);
- /* ptr still addresses offset 0! */
- printf("data1 = %ld\n", *ptr);
-
-
-Clang does not provide the GCC address space identifiers, but it provides
-address spaces via an attribute based mechanism in Clang 2.6 and newer
-versions:
-
- ==================================== =====================================
- __attribute__((address_space(256)) Variable is addressed relative to GS
- __attribute__((address_space(257)) Variable is addressed relative to FS
- ==================================== =====================================
-
-FS/GS based addressing with inline assembly
--------------------------------------------
-
-In case the compiler does not support address spaces, inline assembly can
-be used for FS/GS based addressing mode::
-
- mov %fs:offset, %reg
- mov %gs:offset, %reg
-
- mov %reg, %fs:offset
- mov %reg, %gs:offset
diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst
deleted file mode 100644
index a56070fc8e77..000000000000
--- a/Documentation/x86/x86_64/index.rst
+++ /dev/null
@@ -1,17 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============
-x86_64 Support
-==============
-
-.. toctree::
- :maxdepth: 2
-
- boot-options
- uefi
- mm
- 5level-paging
- fake-numa-for-cpusets
- cpu-hotplug-spec
- machinecheck
- fsgs
diff --git a/Documentation/x86/x86_64/machinecheck.rst b/Documentation/x86/x86_64/machinecheck.rst
deleted file mode 100644
index cea12ee97200..000000000000
--- a/Documentation/x86/x86_64/machinecheck.rst
+++ /dev/null
@@ -1,33 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===============================================================
-Configurable sysfs parameters for the x86-64 machine check code
-===============================================================
-
-Machine checks report internal hardware error conditions detected
-by the CPU. Uncorrected errors typically cause a machine check
-(often with panic), corrected ones cause a machine check log entry.
-
-Machine checks are organized in banks (normally associated with
-a hardware subsystem) and subevents in a bank. The exact meaning
-of the banks and subevent is CPU specific.
-
-mcelog knows how to decode them.
-
-When you see the "Machine check errors logged" message in the system
-log then mcelog should run to collect and decode machine check entries
-from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
-
-Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
-(N = CPU number).
-
-The directory contains some configurable entries. See
-Documentation/ABI/testing/sysfs-mce for more details.
-
-TBD document entries for AMD threshold interrupt configuration
-
-For more details about the x86 machine check architecture
-see the Intel and AMD architecture manuals from their developer websites.
-
-For more details about the architecture
-see http://one.firstfloor.org/~andi/mce.pdf
diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/x86/x86_64/mm.rst
deleted file mode 100644
index 35e5e18c83d0..000000000000
--- a/Documentation/x86/x86_64/mm.rst
+++ /dev/null
@@ -1,157 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=================
-Memory Management
-=================
-
-Complete virtual memory map with 4-level page tables
-====================================================
-
-.. note::
-
- - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
- from the top of the 64-bit address space. It's easier to understand the layout
- when seen both in absolute addresses and in distance-from-top notation.
-
- For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the
- 64-bit address space (ffffffffffffffff).
-
- Note that as we get closer to the top of the address space, the notation changes
- from TB to GB and then MB/KB.
-
- - "16M TB" might look weird at first sight, but it's an easier way to visualize size
- notation than "16 EB", which few will recognize at first sight as 16 exabytes.
- It also shows it nicely how incredibly large 64-bit address space is.
-
-::
-
- ========================================================================================================================
- Start addr | Offset | End addr | Size | VM area description
- ========================================================================================================================
- | | | |
- 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
- __________________|____________|__________________|_________|___________________________________________________________
- | | | |
- 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
- | | | | virtual memory addresses up to the -128 TB
- | | | | starting offset of kernel mappings.
- __________________|____________|__________________|_________|___________________________________________________________
- |
- | Kernel-space virtual memory, shared between all processes:
- ____________________________________________________________|___________________________________________________________
- | | | |
- ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
- ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI
- ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
- ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole
- ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
- ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
- ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
- ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
- ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
- __________________|____________|__________________|_________|____________________________________________________________
- |
- | Identical layout to the 56-bit one from here on:
- ____________________________________________________________|____________________________________________________________
- | | | |
- fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
- | | | | vaddr_end for KASLR
- fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
- fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
- ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
- ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
- ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
- ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
- ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
- ffffffff80000000 |-2048 MB | | |
- ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
- ffffffffff000000 | -16 MB | | |
- FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
- ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
- ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
- __________________|____________|__________________|_________|___________________________________________________________
-
-
-Complete virtual memory map with 5-level page tables
-====================================================
-
-.. note::
-
- - With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
- from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
- offset and many of the regions expand to support the much larger physical
- memory supported.
-
-::
-
- ========================================================================================================================
- Start addr | Offset | End addr | Size | VM area description
- ========================================================================================================================
- | | | |
- 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
- __________________|____________|__________________|_________|___________________________________________________________
- | | | |
- 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
- | | | | virtual memory addresses up to the -64 PB
- | | | | starting offset of kernel mappings.
- __________________|____________|__________________|_________|___________________________________________________________
- |
- | Kernel-space virtual memory, shared between all processes:
- ____________________________________________________________|___________________________________________________________
- | | | |
- ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
- ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
- ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
- ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
- ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
- ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole
- ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base)
- ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
- ffdf000000000000 | -8.25 PB | fffffbffffffffff | ~8 PB | KASAN shadow memory
- __________________|____________|__________________|_________|____________________________________________________________
- |
- | Identical layout to the 47-bit one from here on:
- ____________________________________________________________|____________________________________________________________
- | | | |
- fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
- | | | | vaddr_end for KASLR
- fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
- fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
- ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
- ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
- ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
- ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
- ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
- ffffffff80000000 |-2048 MB | | |
- ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
- ffffffffff000000 | -16 MB | | |
- FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
- ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
- ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
- __________________|____________|__________________|_________|___________________________________________________________
-
-Architecture defines a 64-bit virtual address. Implementations can support
-less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
-through to the most-significant implemented bit are sign extended.
-This causes hole between user space and kernel addresses if you interpret them
-as unsigned.
-
-The direct mapping covers all memory in the system up to the highest
-memory address (this means in some cases it can also include PCI memory
-holes).
-
-We map EFI runtime services in the 'efi_pgd' PGD in a 64GB large virtual
-memory window (this size is arbitrary, it can be raised later if needed).
-The mappings are not part of any other kernel PGD and are only available
-during EFI runtime calls.
-
-Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
-physical memory, vmalloc/ioremap space and virtual memory map are randomized.
-Their order is preserved but their base will be offset early at boot time.
-
-Be very careful vs. KASLR when changing anything here. The KASLR address
-range must not overlap with anything except the KASAN shadow area, which is
-correct as KASAN disables KASLR.
-
-For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB
-hole: ffffffffffff4111
diff --git a/Documentation/x86/x86_64/uefi.rst b/Documentation/x86/x86_64/uefi.rst
deleted file mode 100644
index fbc30c9a071d..000000000000
--- a/Documentation/x86/x86_64/uefi.rst
+++ /dev/null
@@ -1,58 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=====================================
-General note on [U]EFI x86_64 support
-=====================================
-
-The nomenclature EFI and UEFI are used interchangeably in this document.
-
-Although the tools below are _not_ needed for building the kernel,
-the needed bootloader support and associated tools for x86_64 platforms
-with EFI firmware and specifications are listed below.
-
-1. UEFI specification: http://www.uefi.org
-
-2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
- support. Elilo with x86_64 support can be used.
-
-3. x86_64 platform with EFI/UEFI firmware.
-
-Mechanics
----------
-
-- Build the kernel with the following configuration::
-
- CONFIG_FB_EFI=y
- CONFIG_FRAMEBUFFER_CONSOLE=y
-
- If EFI runtime services are expected, the following configuration should
- be selected::
-
- CONFIG_EFI=y
- CONFIG_EFIVAR_FS=y or m # optional
-
-- Create a VFAT partition on the disk
-- Copy the following to the VFAT partition:
-
- elilo bootloader with x86_64 support, elilo configuration file,
- kernel image built in first step and corresponding
- initrd. Instructions on building elilo and its dependencies
- can be found in the elilo sourceforge project.
-
-- Boot to EFI shell and invoke elilo choosing the kernel image built
- in first step.
-- If some or all EFI runtime services don't work, you can try following
- kernel command line parameters to turn off some or all EFI runtime
- services.
-
- noefi
- turn off all EFI runtime services
- reboot_type=k
- turn off EFI reboot runtime service
-
-- If the EFI memory map has additional entries not in the E820 map,
- you can include those entries in the kernels memory map of available
- physical RAM by using the following kernel command line parameter.
-
- add_efi_memmap
- include EFI memory map of available physical RAM